Session+4.2

=Interactive Python: Running code as we go along=

By now, you've been writing code for 3 days, and it should be pretty apparent that it's easy to make python crash. Whether through syntax errors (which make Python crash before it even runs your code), or index or key errors, or a host of other errors that you'll run into, there's just a lot that can go wrong. Oddly enough, having your program crash is actually the //good// kind of error! Far more insidious is when your code happily runs and does something not entirely what you meant.

Tomorrow morning, Diana's going to go through a couple "Strategic Initiatives" to help us get bugs out of our code, and keep it bug free, but it will bear repeating that one of those is "Test Early, Test Often".



One of the ways that we can fight against this is to have tests for our code. (There's even a style of programming, called Test Driven Development, where you write the test first, then write the code to pass the test.) Writing tests for your code is hard, but it's also a worthwhile skill to have.

Just like in bench science, it's always a good idea to have both positive and negative controls.

One of the ways to make your tests as compact and easy to understand as possible is to have your functions that you're testing also be small and logically compact. In a highly contrived example, let's imagine a function that checks whether a list has the proper number of entries for a GFF file (9), then if it does, finds something that looks like a FlyBase gene identifier, and prints it.

code format="python" def get_flybase_id(gff_list): if len(gff_list) == 9: fbgn_start = gff_list[8].find('FBgn') fbgn_end = gff_list[8].find(';', fbgn_start) return gff_list[8][fbgn_start:fbgn_end]

code

Now logically, this function has two separate ideas, each of which relies on a bit of domain specific knowledge. For instance, both the 8 and the 9 are what programmers call "magic numbers". While these aren't necessarily bad, someone else who's looking at your code might wonder why 9, and not 10, or 8. And the 8 in the next couple lines //might// be because it's 1 less than 9 (which is the maximum index), or it could just be a separate number that //happens// to be close to 9. Basically, for a 5 line function, it will be pretty confusing for the next person who looks at it, and that next person might be you in a couple months, which is more than long enough to forget why you wrote something the way you did.

What if we compare it to this set of functions:

code format="python" def get_flybase_id(gff_list): if is_valid_gff(gff_list): return find_fbgn(gff_list)

def is_valid_gff(gff_list): return len(gff_list) == 9 # There are 9 entries in the GFF specification: # http://www.sanger.ac.uk/resources/software/gff/spec.html

def find_fbgn(gff_list): annotation = gff_list[-1] fbgn_start = annotation.find('FBgn') fbgn_end = annotation.find(';', fbgn_start) return annotation[fbgn_start: fbgn_end]

code

This has a couple distinct advantages: by breaking it up, we can re-use each of the functions later. Second, we can test each of the functions separately, and if it turns out that there's a bug in one of them, then we can fix it once, and everywhere it gets reusued, it'll be fixed there. Finally, as a wise man once said, there's two ways to write software: one is to write code so simple that there are obviously no bugs, and the second is to write code so complex that there are no obvious bugs. By breaking things up into sub-functions, it's easier to do the former than the latter. It's also easier to try different inputs to the function, and see what results it gives.

Now, we can do this by making a separate file that tests the output to make sure it gives what we want, but another quick and dirty way to do it is to use an interactive interpreter.

The basic interpreter
In the last 3 days, has anyone tried typing just $ **python**

on the command line? If you haven't, go ahead and try it now. Rather than complaining about the fact that there's no file for it to run, it prints some stuff out, and gives us a totally different command line:

//Enthought Python Distribution -- www.enthought.com// //Version: 7.3-1 (32-bit)//

//Python 2.7.3 |EPD 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34)// //[GCC 4.0.1 (Apple Inc. build 5493)] on darwin// //Type "credits", "demo" or "enthought" for more information.// //>>>//

The >>> is a prompt to enter commands, like in the shell, but now we can enter any Python code that we want. I'll be using it from now on to indicate anything that will go into an interactive interpreter, as opposed to into a file. Let's start by making a variable

code format="python" >>> people = ['Aisha', 'Diana', 'Peter'] code

Nothing happened... or did it? code format="python" >>> print people ['Aisha', 'Diana', 'Peter'] >>> people ['Aisha', 'Diana', 'Peter'] code As it turns out, we so often want to see the value of something from the interactive interpreter that we can just type the name of the variable (or many other expressions without an assignment in them), and the interpreter will print them out for us.

code format="python" >>> people*3 ['Aisha', 'Diana', 'Peter', 'Aisha', 'Diana', 'Peter', 'Aisha', 'Diana', 'Peter'] >>> people # is left unchanged ['Aisha', 'Diana', 'Peter'] code

One key thing you might like to do from here is to get **help** about what else you can do

code format="python" >>> help

Welcome to Python 2.7! This is the online help utility.

If this is your first time using Python, you should definitely check out the tutorial on the Internet at http://docs.python.org/tutorial/.

Enter the name of any keyword, or topic to get help on writing Python programs and using Python. To quit this help utility and return to the interpreter, just type "quit".

To get a list of available modules, keywords, or topics, type "keywords", or "topics". code

As the help function has just informed us, you can use **help** to get help on specific modules, keywords, or topics. Let's use it to get the low-down on the other very useful interpreter function, **dir**.

code Help on built-in function dir in module __builtin__:

dir(...) dir([object]) -> list of strings

If called without an argument, return the names in the current scope. Else, return an alphabetized list of names comprising (some of) the attributes of the given object, and of attributes reachable from it. If the object supplies a method named __dir__, it will be used; otherwise the default dir logic is used and returns: for a module object: the module's attributes. for a class object: its attributes, and recursively the attributes of its bases. for any other object: its attributes, its class's attributes, and recursively the attributes of its class's base classes. (END) code

This shows us that dir can tell us about objects in our namespace, which in non-jargon means that dir can tell us about things like variables and modules and functions, and which ones we have access to at whatever point we're at in our program. To get out of the help mode and back into the main interpreter, we just keep hitting "q" for quit until we get back to our familiar >>>

code format="python" help> q

You are now leaving help and returning to the Python interpreter. If you want to ask for help on a particular object directly from the interpreter, you can type "help(object)". Executing "help('string')" has the same effect as typing a particular string at the help> prompt. code

This gives us some useful information about how better to use help in the future.

Now, what does dir do by itself?

code format="python" >>> dir ['__builtins__', '__doc__', '__name__', '__package__', 'people'] code

So, we should recognize our list variable, "people", from before. So, what does dir have to say about that?

code format="python" >>> dir(people) ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__eq__','__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__str__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort'] code

This command shows us ALL the things that belong to our list variable, including the methods that we can call to manipulate it, such as .append, etc. It also shows us a whole bunch of internal objects that belong to our list variable (which don't belong to us, and we don't get to use them -- at least not in this class).

So, hopefully you can see how to use this power for good: dir can tell you all the methods and functions available to an object. For those of us that prefer to work in the interpreter, we find ourselves only rarely consulting exterior documentation. We can use dir to find out what is available to an object, and then we can use help to figure out what things are:

code format="python" >>> help(people.append)

Help on built-in function append:

append(...) L.append(object) -- append object to end (END) code

Which very succinctly tells us what we've already learned: people.append will add whatever object we put in the parentheses to the end of the people list.

Using IPython, an enhanced Python interpreter
The default Python interpreter is a very basic interface. There are things that you'd like to be able to do, such as enter more than one line of code at a time, or perhaps look back through your history of commands. The enhanced **iPython** interpreter provides these perks and more.

You can start off by launching iPython from the command line:

$ **ipython**

code format="python" Enthought Python Distribution -- www.enthought.com

Python 2.7.3 |EPD 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34) Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python. ?        -> Introduction and overview of IPython's features. %quickref -> Quick reference. help     -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details.

In [1]: code

IPython launches and tells us what version we're running, and gives us a couple of tips. Just typing a question mark and pressing enter will get us a comprehensive introduction to IPython, which I'll leave to you to peruse in your leisure time. You'll notice that the prompt for IPython includes a line number for each line of your code, signified on the first line by ln [1]:. The %quickref guide gives us a succint description of the fanciness of iPython. For starters, you have instant access to the system prompt for commands like ls and cd.

code format="python" In [2]: ls data/               pythons_of_the_world.txt

code

A number of magic functions, all of which start with a percent sign %, provide functionality that the regular interpretter cannot provide. One for regular use is %history, which will show you the lines of code you have previously entered in this session:

code format="python" In [6]: %hist -nt 1: get_ipython.magic(u'quickref') 2: get_ipython.system(u'ls -F ') 3: get_ipython.magic(u'history ') 4: get_ipython.magic(u'hist ') 5: get_ipython.magic(u'pinfo history') 6: get_ipython.magic(u'hist -nt') code

The -n flag means "show line numbers", which is actually usually more of a hindrance, so the default is to do it without. The -t flag means to translate what you wrote into what IPython is feeding directly into Python. This generally applies only to the "magic" commands that you type in.

code format="python" In [7]: mystr = 'a;dlska;'

In [8]: for i in mystr: ...:    print i ...: a d l s k a

In [9]: %hist %quickref %ls %history %hist %history? %hist -nt mystr = 'a;dlska;' for i in mystr: print i %hist code

See how the regular Python code got left alone? Now one of the primary use cases for IPython is to try out code that you're not sure how it's going to work, then add it into a .py file that you'll run over and over, once you're mostly confident that what you have is pretty good. If this is the first time you're saving into a given file, IPython makes it even easier!

code format="python" In [11]: %save histfile.py 7-8 The following commands were written to file `histfile.py`: mystr = 'a;dlska;' for i in mystr: print i code

And now any time you want to run that script, you can do it from IPython:

code format="python" In [12]: run histfile  # The .py is optional a d l s k a code

One of the nice things about this approach is that all the variables from the global namespace of your script also get added to the interpreter's namespace, so you can then use a script to generate or load in some data, play around with it in IPython, then go back and update your script. It's also possible to edit pre-existing files from within IPython!

code format="python" In [10]: savefile histfile.py 7-8 Editing...

1 # coding: utf-8 2 mystr = 'a;dlska;' 3 another_str = "This one doesn't get printed" 4 for i in mystr: 5    print i 6

Editing... done. Executing edited code... a d l s k a

In [14]: print another_str This one doesn't get printed code

Note that **edit** assumes you want to run the code immediately afterwards. If this isn't the case, you can call **edit -x**.

Okay, so there are lots of nice little magic tricks in iPython, and you can use the %quickref guide to explore them more on your own. Meanwhile, lemme show you a couple of other handy tricks that don't involve magic functions.

Remember in the basic Python interpretter, we could use dir to find out what objects belonged to a given object? Well, in iPython, all we have to do to capture the same information is type the object and period, then press the tab.

code format="python" In [16]: mystr. mystr.capitalize mystr.isalnum     mystr.lstrip      mystr.splitlines mystr.center     mystr.isalpha     mystr.partition   mystr.startswith mystr.count      mystr.isdigit     mystr.replace     mystr.strip mystr.decode     mystr.islower     mystr.rfind       mystr.swapcase mystr.encode     mystr.isspace     mystr.rindex      mystr.title mystr.endswith   mystr.istitle     mystr.rjust       mystr.translate mystr.expandtabs mystr.isupper     mystr.rpartition  mystr.upper mystr.find       mystr.join        mystr.rsplit      mystr.zfill mystr.format     mystr.ljust       mystr.rstrip mystr.index      mystr.lower       mystr.split code

This shows us all the things we can use or modify that belong to our variable mystr, from our for loop. If you know what letter(s) a particular function you wants ends with, you can type them and it narrows the list:

code format="python" In [16]: mystr.l mystr.ljust  mystr.lower   mystr.lstrip

In [16]: mystr.l code

If you give enough of the name that it's unique, you can just type that and hit tab, which will save you time.

And what if you don't know what one of those functons or methods is? Try this: code format="python" In [16]: mystr.endswith? Type:      builtin_function_or_method Base Class:  String Form: Namespace: Interactive Docstring: S.endswith(suffix[, start[, end]]) -> bool

Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.

In [17]:

code

One question mark at the end of the line brings up a help dialog for the object, including the docstring for a function or method. This is nearly the same as doing **help(mystr.endswith)**, but 5 fewer keystrokes, or more if you need to scroll around to put the "help(" at the beginning after typing mystr.endswith.

If you call it with two question marks, and the source is available, it will display that as well! A lot of Python is written in Python, so one way to learn about things is to see how other people have written their modules.

What if there's a large block of code you want to paste in? You may have noticed, as we were typing in the for loop, it automatically gave us the standard 4 spaces to indicate the block. To get around this, use the **%cpaste** magic command, which lets you paste things in blocks of code from elsewhere, without trying to add indentation. To stop pasting, have a line with only two hyphens on it ('--').

code format="python" In [25]: cpaste Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
 * def get_flybase_id(gff_list):
 * if is_valid_gff(gff_list):
 * return find_fbgn(gff_list)
 * def is_valid_gff(gff_list):
 * return len(gff_list) == 9
 * # There are 9 entries in the GFF specification:
 * # http://www.sanger.ac.uk/resources/software/gff/spec.html
 * def find_fbgn(gff_list):
 * annotation = gff_list[-1]
 * fbgn_start = annotation.find('FBgn')
 * fbgn_end = annotation.find(';', fbgn_start)
 * return annotation[fbgn_start: fbgn_end]
 * fbgn_end = annotation.find(';', fbgn_start)
 * return annotation[fbgn_start: fbgn_end]

In [26]: is_valid_gff([1,2,3]) Out[26]: False

In [27]: is_valid_gff(range(9)) Out[27]: True code

Now, let's see how well that code we wrote before stands up:

code format="python" In [36]: gffline = ('2L\tmE1_TFBS_cad\tTF_binding_site\t5126\t6402\t.\t.\t.\t'  ....: 'ID=FBsf0000240997;Name=TFBS_cad_000012')

In [37]: is_valid_gff(gffline.split) Out[37]: True

In [38]: find_fbgn(gffline.split[-1]) Out[38]: ''

code

We can look at the string that I took from a FlyBase gff file and see that there's a FBgn in it, but before, I assumed there would be a semicolon after that, and there isn't. We can certainly go through and correct the code, although as an example, I think you all see why you need to test code, because even things that are relatively simple are still complex enough to contain no obvious error... (The really great part is that when I was writing this lecture, I was expecting this positive control to come back with the name, and then demonstrate a more pathological case where things went wrong with the code)

Profiling
One thing that's pretty valuable, that IPython lets us do easily is "profiling" our code, which is a way to see where all the time is being spent. Now, I've said earlier that it's better to do something slow and simply, than (potentially) super-fast and complicated, but if you find that you're doing something a lot, it may be worthwhile to spend some time figuring out how to make it faster (especially if that comes at a low complexity cost).

As an example, let's look at our fasta parser:

code format="python" def fastaParser(filename): current_gene = "" genes = {} fh = open(filename, 'r')

for line in fh: line = line.strip if line.startswith('>'): current_gene = line[1:] genes[current_gene] = '' else: join(genes, current_gene, line)

return genes

def join(gene_dict, gene, line): gene_dict[gene] += line

code

Now, this is probably okay for short little genes, but what happens if we try to parse the yeast genome, which has lots of long sequences?

When we run this, it takes a really long time (something like 30 seconds on my computer). Now, the yeast genome isn't that big (not to knock those mighty single-celled fermenters, but 12megabases is nothing to fly's 130MB, or a human's 3 GB, or //Paris japonica//, a canopy plant with a whopping 150 GB genome), so it would be nice if we could find a faster way to run it. To do that, though, we'd need to know where to focus our efforts. To do that, from within IPython, we run the code with the '-p' flag:

code format="python" In [1]: run -p yeast_genome.py

607964 function calls in 33.858 seconds

Ordered by: internal time

ncalls tottime  percall  cumtime  percall filename:lineno(function) 202628  33.176    0.000   33.176    0.000 exercises.py:17(join) 1   0.484    0.484   33.856   33.856 exercises.py:1(fastaParser) 202645   0.112    0.000    0.112    0.000 {method 'startswith' of 'str' objects} 202645   0.084    0.000    0.084    0.000 {method 'strip' of 'str' objects} 1   0.001    0.001   33.857   33.857 yeast_genome.py:1 1   0.000    0.000   33.857   33.857 {execfile} 1   0.000    0.000    0.000    0.000 {posix.getcwdu} 1   0.000    0.000   33.858   33.858 interactiveshell.py:2257(safe_execfile) ... code

So I broke up the fastaParser function in a bit odd of a way, but that's because I happened to know where it would be spending almost all of its time: the join function. (By the way, if you're profiling your own code and you think there may be a basic operation that you can't profile (something like +), feel free to break it out into it's own function when necessary).

What's happening here is that since a string is an immutable type, whenever you + two strings together, you have to make a new one. This is generally okay for short strings, but if you're doing it a lot of times, with increasingly longer strings, it gets really slow (in fact, it slows down with the square of the number of things you're adding together).

A different way to do this would be:

code format="python" def fastaParser(filename): current_gene = "" genes = {} fh = open(filename, 'r')

for line in fh: line = line.strip if line.startswith('>'): current_gene = line[1:] genes[current_gene] = [] else: join(genes, current_gene, line)

for gene in genes: genes[gene] = ''.join(genes[gene])

return genes

def join(gene_dict, gene, line): gene_dict[gene].append(line) code

So what have I done here? Instead of building up the string as we go along, I add them to a list. Lists have been optimized such that you can add to them, and they don't slow down as they get longer. Then, once everything is in a list, we go through for all of the keys and use the string "join" method (which is different from the join function we wrote), which will combine all the items in the list, putting an empty string between them.

code format="python" 810608 function calls in 0.605 seconds

Ordered by: internal time

ncalls tottime  percall  cumtime  percall filename:lineno(function) 1   0.310    0.310    0.605    0.605 exercises.py:1(fastaParser) 202628   0.113    0.000    0.137    0.000 exercises.py:20(join) 202645   0.070    0.000    0.070    0.000 {method 'strip' of 'str' objects} 202645   0.063    0.000    0.063    0.000 {method 'startswith' of 'str' objects} 17   0.025    0.001    0.025    0.001 {method 'join' of 'str' objects} 202633   0.024    0.000    0.024    0.000 {method 'append' of 'list' objects} 1   0.000    0.000    0.605    0.605 {execfile} 1   0.000    0.000    0.605    0.605 yeast_genome.py:1 1   0.000    0.000    0.605    0.605 interactiveshell.py:2257(safe_execfile) 1   0.000    0.000    0.000    0.000 {posix.getcwdu} 2   0.000    0.000    0.000    0.000 {open} 1   0.000    0.000    0.000    0.000 syspathcontext.py:55(__enter__) 1   0.000    0.000    0.000    0.000 posixpath.py:312(normpath) 1   0.000    0.000    0.000    0.000 posixpath.py:341(abspath) 1   0.000    0.000    0.605    0.605 py3compat.py:170(execfile) code

So now we see that our own join function goes about 300 times faster, and the we spend almost no time at all in the extra string joining. Plus, the code is not really that much more complicated.

A general piece of advice is not to try to optimize your code until *after* you've profiled it. Otherwise, you might spend a lot of time making something largely irrelevant twice as fast, but only improve your total speed by a very tiny amount.

=Exercises= Exercise 1 should be considered "required". Work on it tonight if you're unable to finish it during the time provided. Nevertheless, it may be helpful to do some of the other exercises first if you want to get a better handle on some of the things we've talked about.

1. One of the things we'll need next week during the Project portion of the class is a GFF file with all of the //E. coli operons//, or as EcoCyc might call them "[|transcription units]". They call it something different, since in their mind an operon **must** have more than one gene in it. Instead, we're interested in everything that's likely to appear on a single transcript, whether that's one gene or several.

In the middle of the last decade, the Arkin lab put together [|a set of predictions] about which genes in the E. coli genome (as well as lots of other bacterial genomes), were likely to be on the same transcript, and thus fit a rough definition of "operon". Their predictions were based on :
 * The distance between them in nucleotides
 * Whether the genes are conserved near each other in other genomes, based on MicrobesOnline Ortholog Groups
 * The correlation of their expression patterns, if gene expression data is available
 * Whether they both belong to a narrow GO category
 * Whether they share a COG functional category

While their E. coli data is online, it's in a format that is not standard and not terribly convenient. What we'd like to do here is take the tab-delimited version of the table, as well as the tab-delimited list of all genes that's also provided, and make a [|GFF file] that has one operon per line.

A GFF is also tab delimited, and has the following fields:
 * 1) This is the name of the reference sequence. If we had multiple chromosomes, it could change, but for //E. coli//, it will just be "Chromosome"
 * 2) This is some information on where this annotation came from. If you're combining the results of multiple programs, it might change, but for us, it can just be "MicrobesOnline"
 * 3) This is the type of feature. A GFF file can have multiple kinds of features, like genes, exons, introns, tRNAs, and more all in the same file.
 * 4) An integer from 1 to the size of the reference, inclusive. Note that, unlike Python, sequences start counting from 1.
 * 5) An integer from to the size of the reference. By convention, even if it's on the - strand, this will be a greater or equal coordinate than the . Often times, you'll have to switch these by hand when your feature is on the minus strand. I'm not terribly pleased by this compromise, but it does seem to be standard, and you'll likely break other software if you go against it.
 * 6) Often times, this is just a "." (without the quotes), meaning no score was assigned. Extra credit if you want to take the pOp column from the table and combine them in a sensible way.
 * 7) This will be either "+" or "-" (without the quotes). Some feature types won't have a direction, and will be ".", but really very few.
 * 8) This will be "." for our purposes. The most common case you'll see a number here is for exons. The first one will start at 0, but if other exons get spliced in the middle of a codon, it might be 1 or 2.
 * 9) [attributes] This is a semicolon separated list of whatever else you want to put in. The format is something like: Name1 "Value1"; Name2 "Value2". For our purposes, let's have one attribute, named "operonName", with the value a hyphen separated list of genes in that operon. For instance, we should have an operon named: "lacZ-lacY-lacA". (This is not the usual way of naming things, but easier, and clear enough. For bonus points, figure out how to get the canonical "lacZYA")

As for the tab separated operon calls file, here are the rules I can glean by spending some time looking at it:
 * 1) If two adjacent genes are in the same direction, determine whether they are part of the same operon. If they are, print TRUE in the bop column, otherwise, print FALSE. For reasons I won't speculate on, the cutoff seems to be if the probability is 0.570 or greater.
 * 2) If two adjacent genes are in opposite directions, then they are clearly not on the same operon, so no line is necessary.
 * 3) Genes on the minus strand look like they can appear in either the transcriptional order (like the trp operon) or in order along the '+' strand (like the lac operon)

For reasons that are, like the cutoff of 0.570, incomprehensible to me, the gene rlmE is referred to by its synonym rrmJ in the genome information file, and as rlmE in the operon prediction file. Feel free to edit one of them in the data file, or to have a special case at some point in your program.

It's worth bearing in mind that depending on whether you try to collate the names into the operon calls, or the operon calls into the names, this can be a lot harder or a lot easier.

The first few lines of my operons.gff file looks like this:

code Chromosome   MicrobesOnline    operon    190    255. +   .    operonName "thrL"; Chromosome   MicrobesOnline    operon    337    5020. +   .    operonName "thrA-thrB-thrC"; Chromosome   MicrobesOnline    operon    5234    5530. +   .    operonName "yaaX"; Chromosome   MicrobesOnline    operon    5683    6459. -   .    operonName "yaaA"; Chromosome   MicrobesOnline    operon    6529    7959. -   .    operonName "yaaJ"; Chromosome   MicrobesOnline    operon    8238    9191. +   .    operonName "talB"; Chromosome   MicrobesOnline    operon    9306    9893. +   .    operonName "mog"; Chromosome   MicrobesOnline    operon    9928    10494. -   .    operonName "yaaH"; Chromosome   MicrobesOnline    operon    11382    11786. -   .    operonName "yaaI"; Chromosome   MicrobesOnline    operon    12163    15298. +   .    operonName "dnaK-dnaJ"; code

2) From ipython, type **%magic** to read about more magic functions that we did not discuss. Usually from inside of iPython, if you type **ls** you will get the directory listing of the current working directory. Change this behavior so that it displays the long directory listing (i.e. like using **ls -l** in the terminal).

3) One of the things we're teaching you in this course is how to learn more on your own. Give a man a fish and all that... In this exercise, we'll give you some guidance on exploring the BioPython module. One of the more useful sub-modules is SeqIO, which does input/output for sequences (and, I'm sorry to say, probably does it better than the FASTA parsers we've written thus far). In IPython, go to the directory with the yeast chromosomes, then simply type (or cpaste):

code format="python" from Bio import SeqIO chromosomes = [rec for rec in SeqIO.parse('yeast_genome.fa', 'fasta')] # or whatever your genome is named rec0 = chromosomes[0] code

a) Now, use one of the methods we taught today to figure out what attributes this chromosome has. Some of these are methods, and some are just plain variables. b) Play around with the reverse_complement method. How does it handle ambiguous bases? Capitalization? You may find it helpful to create your own Seq objects, like this:

code format="python" from Bio import Seq my_seq = Seq.Seq('CAGGTAG') code

c) Get the help information on SeqIO.write, and figure out how to write out the reverse-complement of all of the yeast genome to a new file.

d) Write a function that will convert a one-letter amino acid sequence (e.g. A = alanine, K = lysine, etc) into a three-letter amino acid sequence (alanine is Ala, lysine is Lys). You may find the following dictionary helpful:

code format="python" threecode = {'A':'Ala', 'B':'Asx', 'C':'Cys', 'D':'Asp', 'E':'Glu', 'F':'Phe', 'G':'Gly', 'H':'His', 'I':'Ile', 'K':'Lys', 'L':'Leu', 'M':'Met', 'N':'Asn', 'P':'Pro', 'Q':'Gln', 'R':'Arg', 'S':'Ser', 'T':'Thr', 'V':'Val', 'W':'Trp', 'Y':'Tyr', 'Z':'Glx', 'X':'Xaa', '*':'Ter', 'U':'Sel', 'O':'Pyl', 'J':'Xle', } code

e) Now, import Bio.SeqUtils, and look at SeqUtils.seq3. How similar is this to what you wrote in part d?

=Solutions=

Exercise 1
Okay, this one was kind of hard. Here's how I thought about it, although there were certainly other (possibly more correct) valid solutions.

It looks to me like the genome listing was in order, so I made the assumption that if two genes are not contiguous in the gene listing, then they won't be in the same operon. (The converse, that if they *are* contiguous they will be in the same operon is clearly false).

Therefore, I made a dictionary of all of the pairs of genes that are called in the same operon, and then looped through the genome listing. If adjacent genes are in the same operon, I keep track of them in a "current operon" list.

code format="python" from collections import defaultdict

def get_operon_pairs(operon_filename): operons = open(operon_filename) operons.readline # This skips the header line

calls = defaultdict(bool) # If no call, it is not an operon calls[('rrmJ', 'ftsH')] = True calls[('ftsH', 'rrmJ')] = True

for line in operons: line = line.strip elements = line.split('\t') gene1 = elements[4] gene2 = elements[5] is_operon = (elements[6] == "TRUE") calls[(gene1, gene2)] = is_operon calls[(gene2, gene1)] = is_operon

return calls

operons = get_operon_pairs("operonCalls.txt") gene_info = open('geneInfo.txt') gene_info.readline
 * 1) Skip the header line

current_name = '' current_start = -1 current_stop = -1 current_strand = '.' current_operon_names = []

GFF_base = '\t'.join(['Chromosome', 'MicrobesOnline', 'operon',                     '%d', # low end position                      '%d', # high end position                      '.', '%s', # strand                      '.', 'operonName "%s";\n']) outfh = open('operons2.gff', 'w')

genes_thus_far = set

for line in gene_info: next_line = line.strip

if 'CRISPR' in line or 'insE' in line or 'insC' in line: continue

next_elements = line.split('\t') next_start = int(next_elements[4]) next_stop = int(next_elements[5]) next_strand = next_elements[6] next_name = next_elements[8]

if not operons[(current_name, next_name)]: if current_name: # current_name starts out empty, so on the first valid line, this # won't write out to the file GFF_line = GFF_base % (min(current_start, current_stop),                                  max(current_start, current_stop),                                   current_strand,                                   '-'.join(current_operon_names)) outfh.write(GFF_line) current_operon_names = [next_name] current_start = next_start current_stop = next_stop current_strand = next_strand

else: # Genes are in the same operon if current_strand != next_strand: print line # This is just a consistency check. if current_strand == "+": current_start = min(current_start, next_start) current_stop = max(current_stop, next_stop) current_operon_names.append(next_name) elif current_strand == "-": current_start = max(current_start, next_start) current_stop = min(current_stop, next_stop) current_operon_names.insert(0, next_name)

current_strand = next_strand current_name = next_name

GFF_line = GFF_base % (min(current_start, current_stop),                      max(current_start, current_stop),                       current_strand,                       '-'.join(current_operon_names)) outfh.write(GFF_line) outfh.close
 * 1) Clear out the last one, which has no "next"

code

Exercise 2
code format="python"

In [13]: alias ls ls -l

code

Exercise 3
c) Get the help information on SeqIO.write, and figure out how to write out the reverse-complement of all of the yeast genome to a new file.

code format="python" from Bio import Seq, SeqIO my_seq = Seq.Seq('CAGGTAG') SeqIO.write(my_seq.reverse_complement, 'new_file.fasta') my_rc = my_seq.reverse_complement SeqIO.write(my_rc, open('new_file.fasta', 'a') ) code
 * 1) or

d) Write a function that will convert a one-letter amino acid sequence (e.g. A = alanine, K = lysine, etc) into a three-letter amino acid sequence (alanine is Ala, lysine is Lys).

code format="python" threecode = {'A':'Ala', 'B':'Asx', 'C':'Cys', 'D':'Asp', 'E':'Glu', 'F':'Phe', 'G':'Gly', 'H':'His', 'I':'Ile', 'K':'Lys', 'L':'Leu', 'M':'Met', 'N':'Asn', 'P':'Pro', 'Q':'Gln', 'R':'Arg', 'S':'Ser', 'T':'Thr', 'V':'Val', 'W':'Trp', 'Y':'Tyr', 'Z':'Glx', 'X':'Xaa', '*':'Ter', 'U':'Sel', 'O':'Pyl', 'J':'Xle', }

def seq3(aa_seq): aa_list = [] for aa in aa_seq: aa_list.append(threecode[aa]) return "".join(aa_list)


 * or

def seq3(aa_seq): aa_list = [threecode[aa] for aa in aa_list] return "".join(aa_list)

code