Session 6.1

=**System Calls and Scripting External Programs**= By Matt Davis, Updated 7/25/11 by Mike Lawson

**Topics:**

 * Manipulating the file system and UNIX environment from Python
 * Executing other external user space programs with os and sys modules
 * Executing external processes with the subprocess module
 * Running BLAST from a Python script

**Introduction:**
This morning we will learn to use the **os** and **sys** modules to interface with the UNIX environment by running external programs to manipulate the filesystem and memory space. Then we will move on to the more sophisticated **subprocess** modules.

**Useful tools from the os module**
The UNIX environment that you have access to when you type in the terminal is also available to Python. That means that you don't have to open a separate terminal window just to change directories, create a new file, or see what operating system your program is running on. It also means that you can script what would otherwise be exceedingly mundane tasks, such as renaming lots of files or creating various subdirectories for your output from different data sets.
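For instance, here's a minimal sketch of batch renaming with **os.rename**. The directory and file names below are invented for illustration, and the script creates them first so it can be run as-is:

```python
import os

# set up a throwaway directory with some hypothetical ".tmp" files
os.mkdir('scratch_dir')
for name in ['a.txt.tmp', 'b.txt.tmp']:
    open(os.path.join('scratch_dir', name), 'w').close()

# now rename every file in the directory, stripping the ".tmp" suffix
for fname in os.listdir('scratch_dir'):
    if fname.endswith('.tmp'):
        os.rename(os.path.join('scratch_dir', fname),
                  os.path.join('scratch_dir', fname[:-4]))

print(sorted(os.listdir('scratch_dir')))
```

Renaming two files is hardly worth a script, but the same loop handles two thousand just as easily.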

You've already seen one key tool for working with files, the built-in function used to open filehandles, **open**. The **os** module adds many more. Let's import it and look at some of them.

//os.py//
code format="python"
import os

# Let's make sure we know which working directory we are in
cwd = os.getcwd()
print cwd, 'is the current working directory.'

# And which files are present in the cwd? Note that the '.' represents
# the current working directory
cwdfiles = os.listdir('.')

# The output of listdir is stored as a list, so let's print it item by item.
for cwdfile in cwdfiles:
    print cwdfile

# And we can look in other directories too. Note that you'll have to change
# the dir below to one that exists on your filesystem.
homefiles = os.listdir('/Users/¡yournamehere!')
for file in homefiles:
    print file

# We could also first change our cwd to a different directory,
# and then get the file list.
os.chdir('/')
print os.getcwd(), 'is now the current working directory'
cwdfiles = os.listdir('.')
for file in cwdfiles:
    print file
code

We can also use the **os** module to manipulate the filesystem.

code format="python"
import os

# Let's make some new directories and files
numbers = [1, 2, 3]
letters = ['a', 'b', 'c']
for num in numbers:
    for let in letters:
        os.mkdir(str(num) + let)

# Let's get all our new directories in a list
diritems = os.listdir('.')

# And let's make some files in them,
# using what we've learned about exceptions and testing
teachers = ['Mike', 'Peter', 'Aaron']

for diritem in diritems:
    try:
        os.chdir(diritem)
        for teach in teachers:
            for num in numbers:
                try:
                    open(teach + '_' + str(num), 'w')
                except IOError:
                    print "caught an error attempting to open the new file", \
                        teach + '_' + str(num)
        os.chdir('../')
    except OSError:
        print "caught an error from module OS, possible that", diritem, \
            "isn't a directory after all."
    except:
        print "caught a general exception trying to change to directory", diritem
code

Whew, what a mess. Let's clean up all those unimportant files, leaving only the important ones.

code format="python"
import os

diritems = os.listdir('.')
print diritems
for diritem in diritems:
    try:
        os.chdir(diritem)
        # print "diritem is " + str(diritem)
        # print "type is " + str(type(diritem))
        subdiritems = os.listdir('.')
        for subitem in subdiritems:
            if subitem.startswith('Mike'):
                continue
            else:
                os.remove(subitem)
        os.chdir('../')
    except OSError:
        print "caught an error from module OS, possible that", diritem, \
            "isn't a directory after all."
    except:
        print "caught a general exception trying to change to directory", diritem
code

Okay, enough with files for now. Let's look at how we can execute UNIX commands from Python. We'll start with the simplest, dullest, most boringest way: using **os.system**.

code format="python"
import os

os.system('ls -l')
code

$ **./scratch.py**
total 16
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1a
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1b
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1c
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2a
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2b
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2c
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3a
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3b
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3c
-rw-r--r-- 1 lawson staff 731 Jul 28 22:52 61_cleanupfiles.py
-rw-r--r-- 1 lawson staff 1126 Jul 28 22:49 61_makefiles.py

Okay, so we can see the output of the UNIX command **ls -l** from the script, and compare it to the regular terminal:

$ **ls -l**
total 16
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1a/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1b/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1c/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2a/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2b/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2c/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3a/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3b/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3c/
-rw-r--r-- 1 lawson staff 731 Jul 28 22:52 61_cleanupfiles.py
-rw-r--r-- 1 lawson staff 1126 Jul 28 22:49 61_makefiles.py

Trailing slashes on the directories are the only difference in my case. But this is not the same as using **os.listdir('.')**. The two differences are that **os.listdir('.')** does not have access to the detailed information that the regular command line **ls -l** does, and that **os.listdir('.')** returns a list of the directory items, not merely a string of the command line output. But this was a truly trivial example, so let's move on to, well, another truly trivial example.
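To make the contrast concrete, here's a small sketch: **os.listdir** hands us a real Python list, while **os.system** only hands back the command's exit status (the output itself goes straight to the screen; it's redirected to /dev/null below just to keep things quiet).

```python
import os

items = os.listdir('.')               # a list we can sort, slice, and loop over
status = os.system('ls > /dev/null')  # just an integer exit status

print(type(items))
print(status)
```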

code format="python"
import os

# remember our script from the first of the week?
os.system('../S1.2/hello.py')
code

$ **./scratch.py**
//hello "world", if that is your real name.//
//That's World, to you//

Yep, it's still there. And we can use another program to run it. We could run it hundreds of times this way, if we were truly trivial people. Hrmmm...

Just like opening files, it's a good idea to look for exceptions in the process of executing system calls. This is implemented very simply with **os.system**, as the function will return a non-zero value if the call fails to execute.

code format="python"
import os

dirs = ['.', 'asdjfaklsdfjlksadjf']
for d in dirs:
    if os.system('ls ' + d) != 0:
        print "Whoa, that didn't work out, which ls was happy to tell us about."
    else:
        print "Well, okay, great. Thanks ls."
code

$ **./scratch.py**
//1a 1b 1c 2a 2b 2c 3a 3b 3c Matt_2 S5.2 test//
//Well, okay, great. Thanks ls.//
//ls: asdjfaklsdfjlksadjf: No such file or directory//
//Whoa, that didn't work out, which ls was happy to tell us about.//

But this really is the simple way. Often when we use Python to run other programs, the programs we are executing may finish at different times. When that's the case, we may want to control the order in which they are executed, or merely the number of subprocesses running at once. These days, even laptops have multiple CPUs in them, but we still might not want to dump hundreds of subprocesses on our computer at once. The **subprocess** module allows us to be a bit more intelligent about how we execute system calls via a class called **subprocess.Popen**. This is somewhat different, however, in that the subprocess we want to execute must be created as an **object**, which we can then monitor with the **methods** of that object.

code format="python"
# let's chop the module name down to sp for the purposes of typing fewer characters
import subprocess as sp

# let's create a subprocess without assigning it to a variable, and see what happens
sp.Popen(['ls', '-l'])
code

We're going to use a little UNIX tool called **time**, which will tell you how long it took to run your script.

$ **time python scratch.py**
//total 24//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1a//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1b//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1c//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2a//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2b//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2c//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3a//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3b//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3c//
//-rw-r--r--@ 1 matt matt 3586 Jun 3 20:07 S5.2//
//-rwxr-xr-x 1 matt matt 219 Jun 3 23:01 scratch.py//
//-rw-r--r-- 1 matt matt 529 Jun 3 22:57 test//

//real 0m0.149s//
//user 0m0.114s//
//sys 0m0.028s//

And we see that the script's output is the same as it was with **os.system**, but with the added "time" info at the end.

code format="python"
import subprocess as sp

for each in range(0, 100):
    a = sp.Popen(['ls', '-l'])
code

When we run this code, we see that it takes about half a second of real-life clock time to run, and about the same for the "system" clock. The number in the middle, the "user" time, is about half, because my laptop has two CPU cores, over which the system time has been roughly divided.

$ **time python scratch.py**

//real 0m0.510s//
//user 0m0.269s//
//sys 0m0.490s//

code format="python"
import subprocess as sp

for each in range(0, 100):
    a = sp.Popen(['ls', '-l'])
    a.wait()
code

The **wait** method of the subprocess object tells the system to wait until this subprocess has completed before doing anything else in the program. When we look at the time output for this code, we see that it takes much more time on the clock (almost three quarters of a second), but the same amount of user and system time. This is because while the program waits for the process, the second CPU core cannot be crunching away on another process. While this may seem inefficient (it certainly is), it is also courteous to the other processes on your computer. In other words, if you're using your laptop for anything other than running your program, you may want to not grind everything to a halt for the duration of your program's runtime (which for me is often hours and hours). And if you're using a server to run your program on, your coworkers may appreciate it if you do not make use of every CPU on the server while they're trying to get their work done. In truth, the operating system will make its own attempt at balancing the needs of processes, but especially in the case where you're trying to use your laptop or PC as a workstation, this can save you some frustration.

$ **time python scratch.py**

//real 0m0.735s//
//user 0m0.272s//
//sys 0m0.459s//

And just to show that it really is the CPU sharing that's driving the effect:

code format="python"
import subprocess as sp

# cut the number of subprocesses in half, then run two sets of them
for each in range(0, 50):
    a = sp.Popen(['ls', '-l'])
    b = sp.Popen(['ls', '-l'])
    a.wait()
code

And we see that our runtime is almost exactly what it was in the first case.

$ **time python scratch.py**

//real 0m0.506s//
//user 0m0.270s//
//sys 0m0.491s//
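As a side note, **wait** also hands back the subprocess's exit status once it finishes, which gives us the same kind of error checking we had with **os.system**. A minimal sketch (stdout is sent to /dev/null just to keep the screen quiet):

```python
import os
import subprocess as sp

devnull = open(os.devnull, 'w')
p = sp.Popen(['ls', '-l'], stdout=devnull)
status = p.wait()   # blocks until the child exits, then returns its status
devnull.close()

print(status)       # 0 means the command succeeded
```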

Okay, now that we know how to run the subprocess, and we know that we have to create a subprocess object if we want to do things like **wait**, let's look at other advantages of using this type of object.

code format="python"
import subprocess as sp

# let's say we want to store our output to a file instead of seeing it on the screen
fh = open('outfile', 'a')
proc = sp.Popen(['ls', '-l'], stdout=fh)
code

And if we run the script and then look at the contents of the file:

$ **python scratch.py**
$ **cat outfile**
//total 16//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1a//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1b//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1c//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2a//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2b//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2c//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3a//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3b//
//drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3c//
//-rw-r--r--@ 1 matt matt 3586 Jun 3 20:07 S5.2//
//-rw-r--r-- 1 matt matt 0 Jun 3 23:45 outfile//
//-rwxr-xr-x 1 matt matt 250 Jun 3 23:45 scratch.py//

There's all our output, which can be much more convenient than having it scroll down the screen.

However, what can be even //more// convenient is having it stored in a data structure in memory.

code format="python"
import subprocess as sp

# The sp.PIPE is a mechanism that allows the output to be redirected to
# wherever you point it. In this case, to our subprocess object named proc.
# The next step is to get the data out of the object and into a data
# structure that we can organize however we see fit.
proc = sp.Popen(['ls', '-l'], stdout=sp.PIPE)

# we use the method below to get a tuple of two things: the stdout and the stderr
proc_output = proc.communicate()
print ""
print "We got a tuple\nWith the first part being:\n", proc_output[0], \
    "\nand the second being:\n", proc_output[1]
print ""

# Once we use the .communicate() method, the stdout and stderr are gone
# from the proc object.
print ""
print "and if we don't format anything, we see the tuple like this:\n", proc_output
print ""

# But we could easily convert this into a friendlier structure
stdoutlist = proc_output[0].split('\n')
print ""
print "And now we see the first half of the tuple as a list of lines " \
    + "of output from our subprocess:\n", stdoutlist
print ""
code

I've intentionally kept this **ls -l** example set very simple, in terms of what command we are actually executing, because it's important to see how this object type works. However, typically this tool would be used to run things that are going to generate very large sets of data. For you, that might be a systematic search of a database like the PDB, or large-scale BLAST searches, multiple sequence alignments, searches for transcription factor binding sites in genomes, or a host of other common but computationally expensive bioinformatic problems. Or, for the foreseeable future, it may just be the most reliable way to organize small sets of output data into a data structure or log file.

**BLAST**
Good ol' Basic Local Alignment Search Tool, now that's bioinformatics even Grandma can get down with. In its most basic form, BLAST takes a short sequence of either amino acids or nucleotides, searches a database of reference sequences, and returns the most likely matches for the query sequence. It's brilliant, and the only trouble is that we're often interested in BLASTing lotsa query sequences, and much less interested in the mundanity of parsing the output that comes back to us. Fortunately, we can easily script the execution of BLAST such that we can run queries many times with different sequences and settings, and also parse the output back into dictionaries to manipulate and even store long-term on disk.

The first part of this section will show us how to install a new program in our UNIX environment, and the second part will let us practice using **Popen** to script the execution of BLAST.

**Installing BLAST**
I suspect most everyone here is accustomed to using the NCBI BLAST web interface, but BLAST is just another program, whether it's running on an NCBI web server or on our laptops. Before we get back to Python, let's take a minute to install BLAST the old-fashioned way, such that we have ready access to it from anywhere on our computers.

In the **PythonCourse/programFiles** directory, you should find a tar.gz file containing the BLAST executable for your platform.

The **.tar.gz** file extension tells us that the file has been archived (the **.tar** part) and compressed (the **.gz** part). The UNIX command to uncompress and expand the file is:

$ **tar zxvf [filename]**

For example:

$ **tar zxvf ncbi-blast-2.2.25+-universal-macosx.tar.gz**
//ncbi-blast-2.2.25+///
//ncbi-blast-2.2.25+/bin///
//ncbi-blast-2.2.25+/bin/blast_formatter//
//ncbi-blast-2.2.25+/bin/blastdb_aliastool//
//ncbi-blast-2.2.25+/bin/blastdbcheck//
//ncbi-blast-2.2.25+/bin/blastdbcmd//
//ncbi-blast-2.2.25+/bin/blastn//
//ncbi-blast-2.2.25+/bin/blastp//
//ncbi-blast-2.2.25+/bin/blastx//
//ncbi-blast-2.2.25+/bin/convert2blastmask//
//ncbi-blast-2.2.25+/bin/dustmasker//
//ncbi-blast-2.2.25+/bin/legacy_blast.pl//
//ncbi-blast-2.2.25+/bin/makeblastdb//
//ncbi-blast-2.2.25+/bin/makembindex//
//ncbi-blast-2.2.25+/bin/psiblast//
//ncbi-blast-2.2.25+/bin/rpsblast//
//ncbi-blast-2.2.25+/bin/rpstblastn//
//ncbi-blast-2.2.25+/bin/segmasker//
//ncbi-blast-2.2.25+/bin/tblastn//
//ncbi-blast-2.2.25+/bin/tblastx//
//ncbi-blast-2.2.25+/bin/update_blastdb.pl//
//ncbi-blast-2.2.25+/bin/windowmasker//
//ncbi-blast-2.2.25+/ncbi_package_info//
//ncbi-blast-2.2.25+/ChangeLog//
//ncbi-blast-2.2.25+/README//
//ncbi-blast-2.2.25+/LICENSE//
//ncbi-blast-2.2.25+/doc///
//ncbi-blast-2.2.25+/doc/user_manual.pdf//
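Incidentally, this un-tar step is itself scriptable from Python, via the standard **tarfile** module. The sketch below builds a tiny throwaway archive first so it is self-contained; the file names are invented, but the extraction call is the same one you'd point at the BLAST tarball.

```python
import os
import tarfile

# build a small demo archive (a stand-in for the BLAST tar.gz)
fh = open('demo.txt', 'w')
fh.write('hello')
fh.close()
tar = tarfile.open('demo.tar.gz', 'w:gz')
tar.add('demo.txt')
tar.close()

# the equivalent of `tar zxvf demo.tar.gz`, into a subdirectory
tar = tarfile.open('demo.tar.gz', 'r:gz')
tar.extractall('unpacked')
tar.close()

print(os.listdir('unpacked'))
```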

We now have a new directory with all those uncompressed files in it, so let's change to it:

$ **cd ncbi-blast-2.2.25+**
$ **ls**
//ChangeLog LICENSE README bin/ doc/ ncbi_package_info//

We've got a **ChangeLog** file that will tell us what version of BLAST we've got, a **doc** directory with documentation, and the **bin** directory that contains the actual binary executables for the programs. If we change to the **bin** directory and take a look, we'll see several files:

$ **cd bin**
$ **ls**
//blast_formatter*    blastdbcmd*  blastx*             legacy_blast.pl*  psiblast*    segmasker*  update_blastdb.pl*
blastdb_aliastool*  blastn*      convert2blastmask*  makeblastdb*      rpsblast*    tblastn*    windowmasker*
blastdbcheck*       blastp*      dustmasker*         makembindex*      rpstblastn*  tblastx*//

These are the executable programs that BLAST uses to do its dirty work. You might recognize some of these names from NCBI web site options (e.g. **blastn** vs. **tblastx**, etc). We want to set up a local directory to install these program files into, which we'll do beneath our home directory.

First, let's change back to our home directory:

$ **cd**
$ **pwd**
///Users/matt//

Now, we can simply move the un-tarred BLAST directory into our home directory:

$ **mv PythonCourse/programFiles/ncbi-blast-2.2.25+/ .**
$ **ls ncbi-blast-2.2.25+**
//ChangeLog LICENSE README bin doc ncbi_package_info//

There we go, now the files are present in a directory just beneath our home directory.

Now, in order to be able to use these programs from anywhere in our directory structure, we need to add them to our executable **PATH**. We did something very similar to this during our installation of Aquamacs and Chimera, so this may look familiar. Copy these lines one at a time to the terminal (make sure to exclude the **$**; it's just there to symbolize your prompt).

$ **echo 'PATH=$PATH:~/ncbi-blast-2.2.25+/bin' >> ~/.bash_profile**
$ **echo 'export PATH' >> ~/.bash_profile**

Two more steps until we're ready to roll. First, check your bash profile, which should now look something like this:

$ **less .bash_profile**
//alias aquamacs='/Applications/Aquamacs.app/Contents/MacOS/Aquamacs'//
//alias chimera='/Applications/Chimera.app/Contents/MacOS/chimera'//
//alias blast_formatter='/ncbi-blast-2.2.25+/bin/blast_formatter'//
//export EDITOR=aquamacs//
//PATH=$PATH:/usr/local/git/bin//
//export PATH//
//PATH=$PATH:~/ncbi-blast-2.2.25+/bin//
//export PATH//

Finally, close the terminal window you were just working in, open a new one, and enter the following:

$ **source ~/.bash_profile**

Now, if all went well, we should be able to execute one of our BLAST programs and get some feedback:

$ **blastn**
//BLAST query/options error: Either a BLAST database or subject sequence(s) must be specified//

If you see this message, then you've properly installed BLAST. Good job.
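We can also confirm from inside Python that the shell will find a program, with a minimal re-implementation of the UNIX **which** command. **ls** is used below so the sketch runs anywhere, but you could just as well pass it 'blastn' to check your new install.

```python
import os

def which(progname):
    # walk the directories on PATH, just as the shell does when it
    # resolves a command name, looking for an executable file
    for d in os.environ.get('PATH', '').split(os.pathsep):
        candidate = os.path.join(d, progname)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

print(which('ls'))
```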

Okay, now that we have these programs available, let's start using them.

Since BLAST is going to compare a query sequence to a database of sequences, collected from, say, a genome or a group of genomes, we first need to create the reference database to BLAST against. Many of these are available for download, but we're going to create our own from the //S. cerevisiae// genome. Go ahead and navigate to the FTP server for BLAST at __ [] __, which you can connect to through your favorite web browser; log in as a guest, and copy **yeast.aa.gz** and **yeast.nt.gz** from the FASTA folder into your working directory. Since these files came compressed with **gzip**, we're going to uncompress them into our working directory, which we'll return to first (i.e. our **S6.1** directory).

$ **cd ~/PythonCourse/S6.1**

And now to uncompress the files into our working directory:

$ **gunzip ~/PythonCourse/data/yeastFASTA/yeast.nt.gz**
$ **gunzip ~/PythonCourse/data/yeastFASTA/yeast.aa.gz**
$ **mv ~/PythonCourse/data/yeastFASTA/yeast.nt .**
$ **mv ~/PythonCourse/data/yeastFASTA/yeast.aa .**

And one more step before we can use our new BLAST installation: we have to format the sequence file we just moved into a database usable by BLAST, using the **makeblastdb** command. The **-dbtype** argument specifies the type of sequence we're formatting: the **.nt** extension stands for nucleotide, so we pass **'nucl'** to tell the program to format the database as nucleic acid sequence. The **-in** argument specifies the name of the input file, in this case **yeast.nt**.

$ **makeblastdb -dbtype='nucl' -in yeast.nt**

Now we have a fully operational BLAST installation and a yeast nucleotide database to search against. We can see that BLAST has created some additional files for our yeast database in our working directory:

$ **ls**
//1b 2a 2c 3b os.py yeast.aa yeast.nt.nhr yeast.nt.nsq//
//1a 1c 2b 3a 3c outfile yeast.nt yeast.nt.nin//

Now let's make a file to query the database with by opening a new text file and pasting in these lines:

>test seq
ATGGTTCATTTAGGTCCAAAGAAACCACAGGCTAGAAAGGGTTCCATGGCTGATGTGCCC
AAGGAATTGATGGATGAAATTCATCAGTTGGAAGATATGTTTACAGTTGACAGCGAGACC
TTGAGAAAGGTTGTTAAGCACTTTATCGACGAATTGAATAAAGGTTTGACAAAGAAGGGA
GGTAACATTCCAATGATTCCCGGTTGGGTCATGGAATTCCCAACAGGTAAAGAATCTGGT
AACTATTTGGCCATTGATTTGGGTGGTACTAACTTAAGAGTCGTGTTGGTCAAGTTGAGC
GGTAACCATACCTTTGACACCACTCAATCCAAGTATAAACTACCACATGACATGAGA

And then save the file as **query.fasta**

And.... here we go a scriptin'.

code format="python"
import subprocess as SP

# some arguments for running BLAST
program = 'blastn'
queryseq = 'query.fasta'
database = 'yeast.nt'

# When we create our subprocess below, we're going to supply a list as
# the first argument. The list will contain the name of our program, in
# this case blastn. We'll also use the list to supply the argument flags
# and their arguments for the program.
# This is equivalent to running the following command on the command line:
# blastn -query query.fasta -db yeast.nt
proc = SP.Popen([program, '-query', queryseq, '-db', database], stdout=SP.PIPE)

# Since we used the PIPE to store the blastn output, we can retrieve it
# with .communicate()
output = proc.communicate()

# note that this is returned to us as a tuple, even though we only
# requested to communicate with the stdout when we called Popen.
# To see the output in a reasonable format, try:
outlist = output[0].split('\n')

for line in outlist:
    print line
code

You can imagine how you would set up a list of query sequences, or a series of databases to query against, or even vary some of the BLAST settings (for example, the minimum E-value to report), and run this again. You also might care which species your hit came from (if you were searching against a larger database), or which chromosome hits were on (to determine synteny, for example).
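A sketch of that idea: loop over several queries, launching one subprocess per query and collecting each one's output. The query file names below are hypothetical, and **echo** stands in for blastn so the sketch runs even on a machine without BLAST installed.

```python
import subprocess as sp

queries = ['query1.fasta', 'query2.fasta']
database = 'yeast.nt'

# launch one subprocess per query; with BLAST installed you would
# swap 'echo' for 'blastn', and the flags would be interpreted for real
procs = []
for q in queries:
    procs.append(sp.Popen(['echo', '-query', q, '-db', database],
                          stdout=sp.PIPE))

# collect each run's stdout as it finishes
results = [p.communicate()[0] for p in procs]
print(len(results))
```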

In this case, we've used the PIPE to capture the output, which might be helpful when we want to systematically sort through the results in each BLAST hit, only storing the results in certain circumstances. However, we can also write the results to an output file instead of the PIPE. Let's try that:

code format="python"
import subprocess as SP

# some arguments for running BLAST
program = 'blastn'
queryseq = 'query.fasta'
database = 'yeast.nt'

# open a filehandle for our output file; we'll use the append flag in case
# we want to collect multiple queries' worth of output
try:
    outfh = open('blastn_output', 'a')
except IOError:
    print "Error opening output file: blastn_output"

proc = SP.Popen([program, '-query', queryseq, '-db', database], stdout=outfh)
code

Now you can **cat** the **blastn_output** file and see the results.

**Exercises**
1) Modify your Popen call to set a minimum E-value of 1e-08. Parse the output of your BLAST hits into a dictionary keyed by chromosome, containing the length of the hit, the E-value, and the percent identity of the hit.

HINT: check out the BLAST options by issuing the command **blastn -help**. There are output options that will greatly facilitate your parsing efforts.

2) Write a script using the genetic code (provided here in parse-able text format) to translate your nucleotide sequence into proteins. Then use blastp instead of blastn to search against the amino acid sequence file we downloaded earlier. Remember to make the blast database first! (protein flag = "prot")

code
Ala/A    GCU, GCC, GCA, GCG
Leu/L    UUA, UUG, CUU, CUC, CUA, CUG
Arg/R    CGU, CGC, CGA, CGG, AGA, AGG
Lys/K    AAA, AAG
Asn/N    AAU, AAC
Met/M    AUG
Asp/D    GAU, GAC
Phe/F    UUU, UUC
Cys/C    UGU, UGC
Pro/P    CCU, CCC, CCA, CCG
Gln/Q    CAA, CAG
Ser/S    UCU, UCC, UCA, UCG, AGU, AGC
Glu/E    GAA, GAG
Thr/T    ACU, ACC, ACA, ACG
Gly/G    GGU, GGC, GGA, GGG
Trp/W    UGG
His/H    CAU, CAC
Tyr/Y    UAU, UAC
Ile/I    AUU, AUC, AUA
Val/V    GUU, GUC, GUA, GUG
STOP/*   UAG, UGA, UAA
code

**Solutions**
code format="python"
# 1)
import subprocess as SP

# some arguments for running BLAST
program = 'blastn'
queryseq = 'query.fasta'
database = 'yeast.nt'

# Note the addition of arguments to the blastn call
proc = SP.Popen([program, '-query', queryseq, '-db', database,
                 '-evalue', '1e-08', '-outfmt', '6'], stdout=SP.PIPE)

output = proc.communicate()

outlist = output[0].split('\n')

chrs = {}  # set up our dictionary of chromosomes

for line in outlist:
    line = line.strip()
    if line:  # only process non-blank lines
        fields = line.split('\t')  # -outfmt 6 is tab-delimited
        chr = fields[1]
        pident = float(fields[2])
        matchlen = int(fields[3])
        evalue = float(fields[10])
        if chr in chrs:
            chrs[chr].append((pident, matchlen, evalue))
        else:
            chrs[chr] = [(pident, matchlen, evalue)]

print chrs
code

code format="python"
#!/usr/bin/env python
# 2)

import subprocess as SP
from fastaParser import fastaParser

def parse_nuctable(filename):
    fh = open(filename)

    codon_table = {}
    for line in fh:
        line = line.strip()
        if line:  # skips blank lines
            data = line.split()

            aa = data[0]  # something like 'Ala/A'
            oneletter = aa.split('/')[1]
            codons = data[1:]
            for codon in codons:
                # turns 'GCU,' into 'GCU'
                codon = codon.strip().replace(',', '')
                codon_table[codon] = oneletter

    return codon_table

def seq_to_codons(sequence):
    codons = []
    i = 0
    while i < len(sequence):
        codons.append(sequence[i: i+3])
        i += 3
    return codons

def translate_sequence(nt_sequence):
    aa_sequence = ""
    codon_table = parse_nuctable('nuctable.txt')

    # nuctable uses the RNA sequence, not the DNA
    nt_sequence = nt_sequence.replace('T', 'U')
    for codon in seq_to_codons(nt_sequence):
        aa_sequence += codon_table[codon]

    return aa_sequence

def translate_file(filename):
    """ Translates the given file from a fasta with DNA genes to the
    corresponding aa sequence, outputs it to a file, then outputs the
    name of the file you output to."""
    # note that we don't preserve order of genes in the original file
    gene_dict = fastaParser(filename)

    outname = filename + '.aa.fasta'
    outfile = open(outname, 'w')
    for gene in gene_dict:
        outfile.write('>%s\n' % gene)
        # doesn't break up the lines nicely... oh well
        outfile.write('%s\n' % translate_sequence(gene_dict[gene]))
    outfile.close()

    return outname

program = 'blastp'
queryseq = 'query.fasta'
database = 'yeast.aa'

queryseq = translate_file(queryseq)

# This was copy/pasted directly from the lecture. Your path to the BLAST
# executables may be different than mine; I have the old blast installed.
# Use legacy_blast.pl as appropriate.
proc2 = SP.Popen([program, '-query', queryseq, '-db', database], stdout=SP.PIPE)

output = proc2.communicate()

print output[0]
code