Session+6.2

Short read sequencing
The most important thing to understand about modern sequencing technology is that it produces //short reads//, typically between 50 to 150 base pairs. This is in contrast to Sanger sequencing, which could produce much longer reads (up to 1kb). And time goes on, technologies to read very long sequences are being developed (e.g. Pacific Biosciences) but they are not yet mainstream. So we'll focus on short reads.

The general idea behind short-read sequencing is that you take your long stretch of nucleic acids and chop it up:



These short fragments are easy to sequence in large batches. Peter will talk about some more details of the chemistry of short read sequencing, but it's important for now that you realize that what you get off the machine are "randomly" chopped up fragments of DNA.

What to do with short reads
The challenge with short read data is figuring out how it goes together. One option is //de novo// assembly. In this case, you chop up a bunch of copies of the same sequence and hope that the different fragments from different copies have some overlap, so you can figure out how to put them back together. We won't talk about this method today, although there are many approaches to //de novo// assembly. Many of these are based on a mathematical structure called a [|De Bruijn graph].

We will focus on aligning reads to a known reference genome. This means that there is an important requirement:

=__//**YOU MUST HAVE A REFERENCE GENOME**//__=

Now that we have that out of the way...

Note that the reference genome does not necessarily have to be a member of the same species, but it's helpful if the reference genome that you are mapping to is not too distantly related to our favorite organism.

Getting acquainted with the data
The data you get back from an Illumina sequencer comes in a format called //fastq//. Note the similarity to fasta: it's essentially a fasta file with a bit of extra information (specifically, the base //q//uality). Each entry in a fastq file consists of four lines. Why four lines when it really makes sense to parse a line at a time? Who knows. An entry in a fastq file looks like this: code @HS2:202:D0YYKACXX:6:1106:16588:78868 1:N:0:GGCTAC GTGTTGGCACTTCACGTGGACAGGGCGGGCAACCTGTCCCGAGTGACAGA + CBCFFFFFHHHHHIJJIHJJJJJJJJJJJGGHFFFFDEEEDDDBDCCCC3 code The first line (with the "@" symbol) gives some information about the sequencer and a unique identifier for the read. The second line gives the actual sequence read. The third line can be a little variable: sometimes, after the "+" you'll have the same thing that followed the "@" in the first line, but in other cases (like the example here), you might not have anything. The fourth line are the so-called quality scores for each of the bases, and they bear a bit more discussion.

Sequencing machines don't read the sequence of DNA the same way we read the words on a page. Instead, they actually look at hundreds of pictures, each with millions of little colored dots, and try to guess the nucleotide at that position in the read based on the color. But it's only a guess! Thus, they try to provide some measure of the certainty of the base call. They are given by a simple formula: math Q = -10\log_{10} p math where //p// is the probability that a base is called incorrectly (this number is somewhat mysterious). Thus the quality score divided by 10 gives the order of magnitude of the error probability: Q = 20 means that there is a 10^-2 = .01 probability of the base being called incorrectly, Q = 30 means that there is a 10^-3 = .001 probability that the base is called incorrectly, etc.

But wait! The quality score line has //letters//, not numbers! The reason for this is simple: 30, for example, takes two characters, but if we want a 1-to-1 correspondence between quality and a nucleotide, this won't work. So they apply one final transformation: first they add 64 to the quality value and then translate that into an [|ASCII character]. Thus, to translate "C" (the quality of the first nucleotide) into a quality score, look up on the table the "Dec" (or DECimal) corresponding to C, and subtract 64. Thus, since C corresponds to 67, and 67-64=3, that first nucleotide is very, very untrustworthy.

A quick note on how to read fastq files in python. The normal "for line in file:" type of loop won't work here, because each entry corresponds to four lines. So perhaps a better structure is an while loop where you break at the end of the file: code format="python" while True: id1 = fastq_file.readline.strip if id1=="": break seq = fastq_file.readline.strip id2 = fastq_file.readline.strip qual = fastq_file.readline.strip code

Mapping by string matching
There are two things you need to start mapping to a reference genome: But what do you do with them? Think about what you would do if I gave you this reference sequence (in fasta format): code >the_best_chromosome_ever TTAGTGTAAGTTCAGACCAATTCGTACTTCGTTCAGAACTCACATTTTAACAACAGAGGACACATGCCCTACCTCCATGATCTACTGACGTCCCTGAGGC code and this read: code @my_read ATTCGTACTT + ZWTGBDSWEG code and asked you to align the read to the reference by hand. The first thing I would do is simply start at the beginning and see if it fits there: code ATTCGTACTT TTAGTGTAAGTTCAGACCAATTCGTACTTCGTTCAGAACTCACATTTTAACAACAGAGGACACATGCCCTACCTCCATGATCTACTGACGTCCCTGAGGC code doesn't work, nor does code .ATTCGTACTT TTAGTGTAAGTTCAGACCAATTCGTACTTCGTTCAGAACTCACATTTTAACAACAGAGGACACATGCCCTACCTCCATGATCTACTGACGTCCCTGAGGC code and so on, until I find that it fits somewhere: code ...................ATTCGTACTT TTAGTGTAAGTTCAGACCAATTCGTACTTCGTTCAGAACTCACATTTTAACAACAGAGGACACATGCCCTACCTCCATGATCTACTGACGTCCCTGAGGC code We can implement this in python using the .find method for strings and a try-except loop. For example, if the whole genome is a single chromosome, code format="python" my_genome = read_genome_fasta("genome.fa") try: print my_read, my_genome.find(my_read) except ValueError: print my_read, "unmapped"
 * 1) A bunch of short reads
 * 2) A reference genome

code However, this is a very labor intensive process! What if the sequence of the read had come from the very end of the chromosome? Clearly there must be a better way to do this.

Mapping by hashing
Remember how dictionaries can be used to look up the value associated with a particular key very quickly? We can use that to our advantage. Suppose that instead of checking through the entire genome every time, we instead put every k-mer (that is, sequence of k nucleotides) into a dictionary and for our read asked what position it had in that dictionary. But first we have to chose a good k to use. The first thought might be to make k equal to the length of your read; however, this is often unnecessary and dangerous, because later parts of the read might be more prone to errors. A good rule of thumb is to chose k so that 4^k is about equal to the genome size: this will make sure that most k-mers appear at most once in the genome.For example, if we have made a function called make_genome_dict that takes as arguments a genome and a k and returns a dictionary of all k-mers and their positions, code format="python" my_genome_dict = make_genome_dict(genome, 10) if my_read in my_genome_dict: print my_read[0:10], my_genome_dict[my_read] else: print my_read, "unmapped" code which should print code ATTCGTACTT 19 code so it found that our read maps to position 19 (remember: 0 offset!) in the reference genome.One problem here is that there could be some issues with polymorphism: what if the second position in our read at a C instead of a T?

Mapping by indexing (a.k.a. what you'll do in real life)
The way that most mainstream read-mapping programs work nowadays is to do something even fancier than mapping by hashing, but instead build an index of the genome in a special way. One of the most common ways is called a "Burrows-Wheeler index", and is used by [|BWA] (//B//urrows-//W//heeler //A//ligner) as well as [|BoWTie] (//B//urrows-//W//heeler //T//ransform, capitalization mine). Instead of thinking about how to build an aligner that works by indexing, we'll instead focus on using one of these. For this course, we'll use Bowtie.

One of the most useful things is the [|Bowtie getting started page], which walks you through some basic usage. In short, there are 2 essential steps to mapping reads using Bowtie: To build the index, you need to use the program bowtie-build. This program takes very simple inputs:
 * 1) Build an index of the reference genome
 * 2) Map the reads, using the index


 * $ bowtie-build path/to/reference.fasta path/to/index**

The results of a bowtie-build command will be a series of files. If I am in a folder with the //E. coli// genome in a file called E_coli_genome.fasta, then I would build the index like so:

//E_coli_genome.fasta// //>F dna:plasmid plasmid:EB1_e_coli_k12:F:1:99159:1// //TGATCTTACCCAGCAATAGTGGACACGCGGCTAAGTGAGTAAACTCTCAGTCAGAGGTGA// //CTCACATGACAAAAACAGTATCAACCAGTAAAAAACCCCGTAAACAGCATTCGCCTGAAT// //TTCGCAGTGAAGCCCTGAAGCTTGCTGAACGCATCGGTGTTACTGCCGCAGCCCGTGAAC// //TCAGCCTGTATGAATCACAACTCTACAACTGGCGCAGTAAACAGCAAAATCAGCAGACGT// //CTTCTGAACGTGAACTGGAGATGTCTACCGAGATTGCACGTCTCAAACGCCAGCTGGCAG// //AACGGGATGAAGAGCTGGCTATCCTCCAAAAGGCCGCGACATACTTCGCGAAGCGCCTGA// //AATGAAGTATGTCTTTATTGAAAAACATCAGGCTGAGTTCAGCATCAAAGCAATGTGCCG// //CGTGCTCCGGGTGGCCCGCAGCGGCTGGTATACGTGGTGTCAGCGGCGGACAAGGATAAG// //CACGCGTCAGCAGTTCCGCCAACACTGCGACAGCGTTGTCCTCGCGGCTTTTACCCGGTC// // // //E_coli_genome.fasta E_coli_index.1.ebwt E_coli_index.2.ebwt E_coli_index.3.ebwt E_coli_index.4.ebwt E_coli_index.rev.1.ebwt E_coli_index.rev.2.ebwt//
 * $ ls**
 * $ head E_coli_genome.fasta**
 * $ bowtie-build E_coli_genome.fasta E_coli_index**
 * $ ls**

All of those files that end in .ebwt are the indices that bowtie built. These files are //not// human-readable.

Finally, you need to map the reads using the index. A standard usage of bowtie to map the reads is


 * $ bowtie -S --un path/for/unmapped_reads.fastq path/to/index path/to/reads.fastq > path/for/mapped_reads.sam**

Breaking that down, there are a few things: There are a few extra options: for example, if you are working on a server or a computer with multiple cores you can put "-p" followed by a number to tell Bowtie how many cores to use. If you just type "bowtie" on the command line you will get a list of options.
 * -S tells Bowtie to output the reads in SAM format, which is a standard format for mapped reads. We'll talk about it later
 * --un path/for/unmapped_reads.fastq tells Bowtie where to put reads that it fails to map. It will simply puke their fastq entries to the path that is given after --un
 * path/to/index tells Bowtie where to find the index that you built using bowtie-build
 * reads.fastq is the file containing all your reads
 * > path/for/mapped_reads.sam tells Bowtie where to put all the reads that it maps.

For example, in a directory with a bunch of //E. coli// reads (that happens to be one directory deeper than the folder in which I built my index),

//E_coli_reads.fastq// //@HS2:202:D0YYKACXX:6:1101:1158:2193 1:N:0:TTAGGC// //GTCCGTTGATGCACGTAATCGCCGGTAAAGCGGTTGCTCTGAAAGAAGCG// //+// //CCCFFFFFHHHHHJIJJJJIJIIJJGHIJIJJJFHGGIIIJIJJJFHHIG// //@HS2:202:D0YYKACXX:6:1101:1264:2152 1:N:0:TTAGGC// //NGGCGGTCACAACGCAGGCCATACTCTCGTAATCAACGGTGAAAAAACCG// //+// //#114AD@D?FHDD:@FCDHIDGIEHGHFHGCHEEBHEHH;@CFCCCEBA?// //@HS2:202:D0YYKACXX:6:1101:1464:2207 1:N:0:TTAGGC// //ACTGAAAGCAAAATTTGCCGTAAGCAACGTTAACGGCCCAATCTCGCTGG// //# reads processed: 25000// //# reads with at least one reported alignment: 24705 (98.82%)// //# reads that failed to align: 295 (1.18%)// //Reported 24705 alignments to 1 output stream(s)//
 * $ ls**
 * $ head E_coli_reads.fastq**
 * $ bowtie -p 5 -S --un unmapped.fastq ../E_coli_index E_coli_reads.fastq > E_coli_mapped_reads.sam**

Bowtie outputs some very important statistics: the fraction of aligned reads and the fraction of unaligned reads. Here we can see that we probably mapped the right reads to the right organism: almost 99% of reads aligned. Why did 1% of reads fail to align? It could be a number of reasons, including (but not limited to) extremely error-filled reads, contamination, or an excess of polymorphism compared to the reference.

Let's take a quick look inside the SAM file:

media type="custom" key="20502242"

The lines tagged "@SQ" give the names of the chromosomes and their lengths. In this case, we have two chromosomes: the F plasmid and the main //E. coli// chromosome. On lines 3 (0 offset!) and higher, we have the information we're really interested in: where the reads mapped.

The first field is the read name, from the "@" line of the fastq file. The second field is an arcane bitwise flag field, which I will explain shortly. The third field is the name of the chromosome to which the read mapped: in //E. coli// this is a kind of useless but in organisms with more than one chromosome this is a very important thing to know! Then comes the position on the chromosome. It's important to note that this is always the position relative to the first position in the reference fasta file. The next few fields aren't terribly important, until you get to the sequence of the read and then the quality scores, followed by another set of relatively unimportant fields. If you are interested, you can check out the [|full specification of the SAM file format].

A few words on the bitwise flag. Basically, it is a compact way to express a bunch of things that are either true or false about the read; for example, whether the read aligned only once or not, whether the read is on the + strand or not, etc. Since something that is true or false can be expressed as either 1 (if true) or 0 (if false) you can represent the status of a bunch of these kinds of things as a binary number.

Because this isn't a course on binary numbers, we'll focus on one of the important things, which is whether the read mapped to the plus strand or the minus strand. Basically, and simply, for a read that maps to the minus strand, then when you take the bit-wise AND of the flag for that read and the number 16, it should be greater than 0. Don't worry about what that means, but in python it is easy to implement. For example, if the flag is "16" for a given read, then code format="python" >>> 16&16 16 code while if the flag is, say, 4 code format="python" >>> 16&4 0 code Because 0 is always False and a number greater than 0 is always True, you can use a simple if statement if you need to do different things to the plus and minus strand. code format="python" if flag & 16: #do stuff that you would do if the read mapped to the minus strand else: #do stuff that you would do if the read mapped to the plus strand. code

Feel free to ask about what is actually going on here, but for this course you should only need to use it to this complexity.

And that is the basics of mapping reads!

Exercises
[] []

Before you even map any reads, it's important to understand some interesting properties about Illumina data. Instead of working on all the data (which could take a long time), download the data from the above link and extract it. Use that data for all of the following Does what you see surprise you?
 * 1. Properties of reads**
 * 1) Compute the average quality score for each position in a read. Hint: you will need to use the function ord to turn ASCII characters into decimal numbers
 * 2) Compute the average frequency of each of the 4 nucleotides, as well as the missing character "N" at each position in the read

Build a basic mapping tool that will search for each of the reads in the //E. coli// genome and output the position where they match the //E. coli// genome. You will need to
 * 2. Mapping by string searching**
 * 1) Read in the genome in fasta format (you should be able to do this with a module from last week)
 * 2) Search each chromosome in the genome for the read

When you run the program, make sure to use the time command on the terminal to see how long it takes, e.g.


 * $ time python my_mapper.py**

How long does it take?

WARNING: make sure to search both the + and - strands!

Build a basic mapping tool that will hash the //E. coli// genome into a dictionary (of arbitrary size k-mers) and then use that dictionary to map the reads to the //E. coli// genome. Using k = 11, time the program. How long did it take? How does it come to mapping by string searching? You will need to
 * 3. Mapping by hashing**
 * 1) Read in the genome in fasta format
 * 2) Make a dictionary with every k-mer in the genome
 * 3) Search the dictionary for the first k nucleotides of every read

WARNING: make sure to hash both the + and - strands!

First, install bowtie. Do this by following a similar set of commands to that you did to install blast.
 * 4. Mapping by indexing**

Using bowtie-build and bowtie, map the reads to the //E. coli// genome. How long did it take? Count the number of reads that mapped to the plus strand vs. the minus strand. Is it what you expect?

[]


 * CHALLENGE:** Download the linked set of reads (supposedly) coming from //E. coli//. What fraction of reads mapped? Do you think something went wrong? What could have gone wrong (hint: RNAseq experiments are prepared by humans...)? Figure out a way to test your hypothesis and test it!

Solutions (for the programming problems):
code format="python" import sys import collections
 * 1.**

reads = open(sys.argv[1],"r")
 * 1) open the fastq file

initialized = 0 while True: #read the details of the reads id = reads.readline.strip if id == "": break seq = reads.readline.strip id2 = reads.readline.strip qual = reads.readline.strip #learn the length of the reads if initialized == 0: quality = [] nucs = [] for i in range(len(seq)): quality.append(0) nucs.append(collections.Counter) initialized = 1 #get the nucleotides and the quality scores for i, nuc in enumerate(seq): nucs[i][nuc]+=1 qual_score = ord(qual[i])-33 quality[i] += qual_score
 * 1) loop through the fastq file

print "A\tC\tT\tG\tN\tqual" for i in range(len(nucs)): total_count = sum(nucs[i].values) for nuc in ["A","C","G","T","N"]: print str(float(nucs[i][nuc])/total_count) + "\t", print float(quality[i])/total_count

code code format="python" import sys
 * 2.**

complement = {"A": "T", "C": "G", "G":"C", "T":"A"}

def reverse_complement(seq): return(''.join(map(lambda x: complement[x], seq))[::-1])

def read_fasta(file): fasta_file = open(file,"r") sequences = {} for line in fasta_file: if line[0] == ">": cur_name = line.strip[1:] sequences[cur_name]=[] else: sequences[cur_name].append(line.strip) for key in sequences: sequences[key] = ''.join(sequences[key]) return sequences

fasta_file_path = sys.argv[1] reads_path = sys.argv[2]

my_genome = read_fasta(fasta_file_path) my_reverse_genome = {} for chrom in my_genome: my_reverse_genome[chrom] = reverse_complement(my_genome[chrom]) sys.stderr.write("Done reading genome!\n")
 * 1)   print my_reverse_genome[chrom][:10]
 * 2)   print my_genome[chrom][-10:]

reads = open(reads_path,"r")

read_num = 0

while True: read_num += 1 if read_num % 10000 == 0: sys.stderr.write('%i\n'%read_num) id1 = reads.readline.strip if id1 == "": break seq = reads.readline.strip id2 = reads.readline.strip qual = reads.readline.strip #go over every chromosome and find if it's in there mapped = 0 for chrom in my_genome: forward = my_genome[chrom].find(seq) if forward != -1: print seq, "+", chrom, forward mapped = 1 break reverse = my_reverse_genome[chrom].find(seq) if reverse != -1: print seq, "-", chrom, reverse mapped = 1 break

if mapped == 0: print seq, ".", ".", "."

code


 * 3.**

code format="python" import sys

complement = {"A":"T","C":"G","G":"C","T":"A"}

def reverse_complement(seq): return(''.join(map(lambda x: complement[x], seq))[::-1])

def read_fasta(file): fasta_file = open(file,"r") sequences = {} for line in fasta_file: if line[0] == ">": cur_name = line.strip[1:] sequences[cur_name]=[] else: sequences[cur_name].append(line.strip) for key in sequences: sequences[key] = ''.join(sequences[key]) return sequences

def hash_genome(genome,k): genome_hash = {} #loop over chromosomes for chrom in genome: chrom_len = len(genome[chrom]) genome_hash[chrom] = {} for i in range(chrom_len-k): if genome[chrom][i:(i+k)] in genome_hash: genome_hash[chrom][genome[chrom][i:(i+k)]].append(i) else: genome_hash[chrom][genome[chrom][i:(i+k)]]=[i] return genome_hash

fasta_file_path = sys.argv[1] reads_path = sys.argv[2] k = int(sys.argv[3])

my_genome = read_fasta(fasta_file_path) my_reverse_genome = {} for chrom in my_genome: my_reverse_genome[chrom] = reverse_complement(my_genome[chrom])

sys.stderr.write("Done reading genome!\n")

my_hashed_genome = hash_genome(my_genome,k) my_hashed_reverse_genome = hash_genome(my_reverse_genome,k)

#print "Yo" #print chrom, my_hashed_genome[chrom].keys[:10] #print chrom, my_hashed_reverse_genome[chrom].keys[:10] #raw_input
 * 1) for chrom in my_hashed_genome:

sys.stderr.write("Done hashing genome!\n")

reads = open(reads_path,"r")

while True: id1 = reads.readline.strip if id1 == "": break seq = reads.readline.strip id2 = reads.readline.strip qual = reads.readline.strip

#go over every chromosome and find if it's in there... mapped = 0 for chrom in my_hashed_genome: if seq[0:k] in my_hashed_genome[chrom]: print seq, "+", chrom, my_hashed_genome[chrom][seq[0:k]] mapped = 1 break elif seq[0:k] in my_hashed_reverse_genome[chrom]: print seq, "-", chrom, my_hashed_reverse_genome[chrom][seq[0:k]] mapped = 1 break if mapped == 0: print seq, ".", ".", "."

code