Session+7.1

SNP calling: finding polymorphism
Although in our current project, we are not interested //per se// in finding where the sample sequence differs from the reference sequence. However, in many kinds of experiments and analyses, identifying variable sites is a key step in the analysis pipeline. For example, if you might be conducting an experiment to find the genetic basis of a mutant phenotype in which some samples have a mutant phenotype (let's call these guys //cases//) and other samples don't (let's call these guys //controls//). One option in this case is to do an association study, which looks for all the variable sites where one allele is enriched in the cases and the other allele is enriched in the controls. Another, increasing common application of sequencing data is to look at sequence variation in natural populations. This kind of data can be extremely informative about the evolutionary and demographic history of a species, although we we won't be talking about that here.

Errors or SNPs?
If you were guaranteed that every read were 100% accurate, there would be no SNP calling problem. However, because some of the calls that map to a particular site may be errors, we need some way of figuring out whether or not the site is truly a SNP.

Let's first consider a haploid organism. This is both most relevant to this class, since the data we have is from //E. coli//, but also an easy place to start. Let's take a look at a little bit of a reference sequence and then the reads that map to it. I'm going to write the data in a similar form to what's called a pileup file; you'll see a proper one later. code A AAAAACAAA G CCCCCGCCC C CCCAGCAGG code The first column is the reference allele while the second column summarizes the nucleotides from the reads. For example, in the first row, the reference allele is an A and you obtained 8 reads that had an A at that position and 1 read with a C at that position.

The first row is relatively easy: the reference allele is an A, and 8 out of 9 reads are A. Probably that C is an error, and we would think that this site does not have a SNP in this individual.

The second row is the opposite: the reference allele is G, but only 1 out of the 9 reads is a G. Instead, because every other read has a C, we might think that this site is a SNP and this sample has a C, instead of a G.

The last row is a hard case: the reference is a C, but only 4 out of the 9 reads is a C. However, unlike before, the other reads are roughly evenly split between 3 Gs and 2 As. What do you do in this case?

One approach is to adopt a statistical framework that attempts to estimate error rates and makes use of all the data. This is relatively complicated and we won't talk about it here. Another approach makes implicit use of prior knowledge you may have. First of all, in a haploid you expect only 1 allele at any site; so if there aren't enough reads with the same nucleotide, it might be wise to ignore that site. Second, assuming that your reference and your sample are not too diverged, you should expect SNPs to be relatively rare. Even between humans and chimp you expect only about 1 site in 100 to be different! So you probably want to put more weight on a site not being a SNP. Then, you can simply choose an arbitrary cutoff and declare a site a SNP or not.

A diploid organism can be much harder, because you have to deal with heterozygous sites. In a perfect world, your reads would come equally from both copies of the allele, but that isn't always the case. Consider the following sites: code G GCGCCGGCCC T GGGTGGGTGG C CCATGCGATG code The first site has 4 Gs and 6 Cs. This is reasonably close to 50% G and 50% C, exactly what you would expect from a heterozygous site. But what about the second site? It has only two Ts but eight Gs. Is that possibly a heterozygous site, or a site homozygous for the non-reference allele? The final site is once again a mess.

Again, a statistical framework can estimate error rates, sequencing bias, and model the process that assigns alleles to reads. However, another option is to again make arbitrary, but informed, cutoffs.

Pre-packaged SNP calling
There are a number of software packages that will perform SNP calling for you. One of the simplest to use is samtools. Samtools uses a relatively simple command line that takes as an input a //indexed// fasta file and a sorted bam file (we'll talk about that in a moment). The general framework can be found [|here].

The first thing you need to do to use samtools is index your fasta file using samtools faidx:

//F 99159 50 60 61// //Chromosome 4639675 100938 60 61//
 * $ samtools faidx E_coli_genome.fasta**
 * $ head E_coli_genome.fasta.fai**

As you can see, running samtools faidx results in a file with the same name as the input fasta but with ".fai" tagged on the end of it. The faidx file simply shows the name and size of the sequence, the place in the file where the sequence begins and the number of characters per line. Using this information it is quite easy to get an arbitrary sub-sequence if you know the start and end positions of that subsequence.

The other task is to generate a sorted bam file with your reads. A bam file is simply a binary version of a sam file and samtools plays more nicely with them. To generate a bam file from a sam file of mapped reads, we have to first convert our sam file file into a bam file and then sort it. To conver the sam file into a bam file, use samtools view:

//[samopen] SAM header is present: 2 sequences.//
 * $ samtools view -Sb E_coli_mapped_reads.sam > E_coli_mapped_reads.bam**

These bam files are NOT human-readable. the -Sb option tells it that the input is in //Sam// format and that it should output in //b//am format. To sort the bam file, use samtools sort:


 * $ samtools sort E_coli_mapped_reads.bam E_coli_mapped_reads.sorted**

which results in a file E_coli_mapped_reads.sorted.bam. The first argument is the input bam file and the second argument is the prefix for the output bam file (in other words, the output will be called prefix.bam).

Now we can finally call SNPs! We do this using a series of programs. The first one, samtools mpileup, will be used to generate bcf file. However, first it's instructive to take a brief look at a pileup file:

//[mpileup] 1 samples in 1 input files// // Set max per-file depth to 8000//
 * $ samtools mpileup -f ../E_coli_genome.fasta E_coli_mapped_reads.sorted.bam > E_coli_mapped_reads.pileup**

the -f option simply means that the reference genome is given as a fasta file. Let's take a brief look inside E_coli_mapped_reads.pileup:

code Chromosome 21150 G 17 c,,,,,,,,,,,,,,,^~, +JIIJJGGJJH@JCJHC Chromosome 21151 T 17 ,,,,,,,,,,,,,,,,, AIJGIJBFJJE;J>JGF Chromosome 21152 C 17 ,,,,,,,,,,,,,,,,, GGJJJJGHJJJDJEJHH Chromosome 21153 A 17 ,,,,,,,,,,,,,,,,, EJJJJJIEJJJIJHJCH Chromosome 21154 A 17 ,,,,,,,,,,,,,,,,, >IIIJJIJJIJEJDJFE Chromosome 21155 T 17 ,,,,,,,,,,,,,,,,, FJIIJJJIJIJEJ@IGF code

The first column is the chromosome, followed by the position, the reference allele and the coverage. The next two columns tell you about the reads at that position:, signifies matching reference on forward strand while. is matching reference on reverse strand; similarly upper and lower case letters indicate a read from a non-reference allele on either the forward or reverse strand, respectively. There are a few other characters, for example indicating the beginning of a read, but this was just meant to be a short detour rather than a full exploration of the pileup format.

However for actual SNP calling we need this to be in bcf format. This is done using an almost identical command line, the only difference being that the option should be "-Suf" instead of simply "-f". The "S" will tell samtools to compute per-site strand bias values, which we'll come back to later.

//[mpileup] 1 samples in 1 input files// // Set max per-file depth to 8000//
 * $ samtools mpileup -Suf ../E_coli_genome.fasta E_coli_mapped_reads.sorted.bam > E_coli_mapped_reads.bcf**

Now, we use bcftools view to do all the genotype calling. bcftools uses a statistical method to call SNPs.

//[afs] 0:279.924 1:4.392 2:3.684//
 * $ bcftools view -vg E_coli_mapped_reads.bcf > E_coli.vcf**

The options -vg tell it to output only variable sites and to call genotypes. Calling genotypes is a bit weird for //E. coli//, since it's haploid, but this can be a good practice: sites that are called heterozygous might be indicative of low quality sites or that you are not sequencing a clonal culture. The output is stored in a vcf. "vcf" stands for //v//ariant //c//all //f//ormat, and is a compressed way to store variant information. Taking a look inside, first there is a long header: code code and then some lines indicating the variants code Chromosome 2816123. T C 6.02. DP=2;AF1=1;AC1=2;DP4=0,0,0,2;MQ=20;FQ=-33 GT:PL:GQ 1/1:36,6,0:6 code This looks complicated but it's rather simple. First let's look at the line describing the variant. The first column is the chromosome, then the position. The next column is an ID for that SNP, for example if you had a list of known SNPs. Finally you get the good stuff: the reference allele and the alternative allele. In this case, samtools thinks that while the reference is a T at position 2816123 in the //E. coli// chromosome, our sample actually has a C. The rest of the fields give you some information about the SNP. For example, the field that starts with "DP=2" is called the info field. If you want to know what all the things there mean, take a look up at the header. For example, this line code code means that "DP=2" indicates that 2 reads mapped to this position in the genome. The last column tells you the actual genotype, as well as the genotype likelihood scores (essentially -log(probabilty of a given genotype)). So here it thinks that the E. coli is heterozygous for the alternative allele.
 * 1) fileformat=VCFv4.1
 * 2) samtoolsVersion=0.1.17 (r973:277)
 * 3) INFO=
 * 4) INFO=
 * 5) INFO=
 * 6) INFO=
 * 7) INFO=
 * 8) INFO=
 * 9) INFO=
 * 10) INFO=
 * 11) INFO=
 * 12) INFO=
 * 13) INFO=
 * 14) INFO=
 * 15) INFO=
 * 16) INFO=
 * 17) INFO=
 * 18) INFO=
 * 19) INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
 * 20) FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 * 21) FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
 * 22) FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
 * 23) FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
 * 24) FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
 * 25) FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">
 * 1) CHROM POS ID REF ALT QUAL FILTER INFO FORMAT E_coli_mapped_reads.sorted.bam
 * 1) INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">

Exercises

 * 1.** Using the reads you mapped yesterday, write a SNP caller than takes as input the SAM file and outputs a list of possible SNPs. Use a cut-off method to determine if something is a SNP in the sample. Do this straight from the mapped reads by keeping track of what nucleotides mapped to each position in the genome. Remember that the location stated for a mapped read is always the 5' most end of the read.


 * 2.** Using samtools, call SNPs in the reads from yesterday. How many heterozygous sites are called? Why might there be "heterozygous sites" in a haploid genome? Make a vcf file with just the variable sites in it.


 * 3.** Compare the output of your SNP caller to the samtools SNP caller. What fraction of sites that you called a SNP did samtools also call a SNP? How about the other way around?