Session+5.1

Topics:

 * Common Errors
 * Error messages from Python
 * Exception Handling (try, except)
 * Print statement debugging

Introduction
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." – Brian W. Kernighan

Let's be honest, most of us aren't perfect. And in our zen self-awareness, we are well-served to be sensitive to our predispositions to err in particular ways. We are probably even better served to listen to the computer when it tells us we've made a big mistake. Python can be quite good at communicating our flaws to us, if only we're receptive to the constructive criticism in its spartanly high-contrast black and white print.

This morning we're going to look at common sources of error, see what to look for as feedback from Python, and learn a couple of tricks to both obviate, or if necessary, track down bugs, should they occur.

A few commonly encountered sources of errors include:

 * **mis-typed variable names**
 * **mis-typed variables**
 * **non-existent variables (or indices in a variable)**
 * **unexpected input**

If you're very lucky, an error will cause Python to give up right away, while others will cause insidious bugs that sneak through unnoticed until you present your work in lab meeting and someone calls you on your exciting, but seemingly impossible (and ultimately bogus) result.

Strategic Initiative 1: Test Early, Test Often
There's an approach to software engineering called "Test Driven Development". In that way of doing things, before you write a function that does something, you write code that tests it, often in a separate file. It's really convenient to use modules to do this, where you could have a test.py file that imports the relevant functions from your other modules and tests them on as many cases as it can. This also has the advantage that you notice fairly quickly when your new code breaks something that used to work. Especially in scientific programming, this isn't always practical to do, but the general ideas hold: testing is good.

Even if you don't do TDD, you can save yourself lots of time by testing frequently. Debugging 100 lines of code is often more than 10 times as hard as debugging 10 lines. Writing a ton of code that generates output without checking if each component works individually does NOT make you a coding rock star; it makes you sloppy.

Strategic Initiative 2: Be verbose
"Errors should never pass silently." -- The Zen of Python, by Tim Peters

One of the easiest ways to debug code is to print out the value of variables at the point things start going wrong. Of course, if you knew where things were going wrong, you would probably know what was going wrong in the first place, so to find this, a divide-and-conquer approach is often fastest. Start out by putting a bunch of print statements in throughout your code, then narrow things down gradually until you're right at the broken line of code.

It's not a bad idea to include these print statements in moderation before you need them. If you think there will be a subset of your data that will fail a logical test, setup your if and else statements to report the incidence of failure.

code format="python"
 * 1) lab_members = {name : [status]}
 * 2) pretend this dictionary is already filled with data

for name in lab_members: if lab_members[name] == 'Post-doc': get_a_job(name) print "Time for this post-doc to get a job..." elif lab_members[name] == 'Grad Student': do_research print "Be patient, this one's just a grad student." else: print "Warning:", name, "is not a post-doc nor a grad student!" code

Strategic Initiative 3: Be Boring, Be Obvious
"Programs must be written for people to read, and only incidentally for machines to execute." --Structure and Interpretation of Computer Programs Hal Abelson and Gerald Jay Sussman

As you get better at coding, you will start to take shortcuts and combine lines. As soon as something doesn't behave as you expect, you should decompose your compound statements, as this is a common source of error.

Compound:

code format="python" helix_aa.extend(range(int(line[21:25].strip),int(line[33:37].strip)+1)) code

Expanded:

code format="python" start_str = line[21:25] end_str = line[33:37] start_stripped = start_str.strip end_stripped = end_str.strip start = int(start_stripped) end = int(end_stripped)+1 helix_res = range(start,end) helix_aa.extend(helix_res) code

These two code snippets do the same thing, and Python doesn't much care which you use. On the other hand, it's much easier to look at each line of code in the expanded case and see that it does what it should be doing, whereas in the compound form, you could very easily have parentheses that are matched in unexpected ways that throws the whole meaning of things off.

On a related note, it's good practice to comment your code. The Python Style Guide has some good recommendations about comments. Anything you did that required some thinking on your part should ideally be commented, since you might come back to the code weeks or months later and have forgotten why you did it that way. Also try to keep your comments up to date with the code.

Error Messages from Python: Syntactical Mistakes and Others
Much like in spoken languages, Python has rules for what is and is not grammatically allowed. Before running your script, Python will check it to make sure that it does follow all these rules, and will let you know if it doesn't. Keep in mind, though, that it can't catch errors in meaning, so long as your code follows the rules of the language (as a natural language equivalent of things Python can't catch, consider Groucho Marx: "One morning I shot an elephant in my pajamas. How he got into my pajamas I'll never know.").

code format="python"
 * 1) !/usr/bin/python

while True print 'Bioinformatics is AWESOME!' code

$ **./errors.py**

code format="python" Traceback (most recent call last): File "./errors.py", line 3 while True ^ SyntaxError: invalid syntax code

This error message, for its brevity, is quite helpful. We quickly learn the following:


 * 1) The problem is on **line 3** of the file **errors.py**. This may seem trivial in this case, but when you have nested functions conjured from various modules in various directories, knowing which file to look in is a big, big help.
 * 2) There is a problem after True. Note that, for syntax errors, the caret will often point somewhere **after** the place where the actual error is, and sometimes can point quite a bit after, though in this case it's pretty close.
 * 3) The problem is that there is invalid syntax: we haven't followed the rules of the language. As you've likely seen already, and will no doubt see much more of in the future, there are all kinds of different errors that can show up, usually with a terse message explaining what went wrong.

What is the problem and how is it fixable?

There are lots of other kinds of errors that will almost inevitably crop up in your code. You've even already seen some in our lecture on data structures.

code format="python" instructors = ['Aisha', 'Peter', 'Diana'] print instructors[10] code

$**./errors.py**

code format="python" Traceback (most recent call last): File "errors.py", line 4, in    print instructors[10] IndexError: list index out of range code

Again, let's see what we can work out from this error message:
 * 1) The error is now on **line 4** of **errors.py**.
 * 2) There's some problem with the expression **print instructors[10]** (presumably in this case you could have looked that up yourself, but Python does try to be helpful).
 * 3) That problem is of the type **IndexError**, and in particular there is a **list index out of range**.

By now, it should be clear what the problem is (assuming it wasn't clear to you before I even ran it): **instructors** isn't long enough to have an item at index 10, so Python throws up its hands and gives up. Sometimes, though, we don't want Python to surrender at the first sign of trouble. If only there were some way to help it deal with problems that we can anticipate...

Easier to Ask Forgiveness than Permission: Try and Except
"Errors should never pass silently. Unless explicitly silenced." -- The Zen of Python, by Tim Peters

The try and except statements let us attempt to do something that may or may not work, and then respond appropriately if or when something goes wrong. For example, sometimes the file we want to work with isn't available for some reason (and if you're really lucky, it's only because you forgot to plug in an external hard drive).

code format="python" fh = open('AwesomeDataFile.txt', 'r') code

$ **./errors.py**

code format="python" Traceback (most recent call last): File "badsyntax.py", line 4, in    fh = open('AwesomeDataFile.txt', 'r') IOError: [Errno 2] No such file or directory: 'AwesomeDataFile.txt' code

code format="python" try: f1 = open('AwesomeDataFile.txt', 'r') except IOError: print "The file 'AwesomeDataFile.txt' doesn't exist, falling back to 'CrummyDataFile.txt'..." f1 = open('CrummyDataFile.txt', 'r') code

Sometimes you might have a section of code where multiple things could go wrong. In that case, you can respond to each of them appropriately using multiple except statements. It's even possible to have an except statement without listing the type of error, which will deal with anything not handled by one of the previous except statements.

code format="python" try: do_something except IOError: handle_ioerror except IndexError: handle_indexerror except: print "Some other error happened" code

It's considered slightly bad form, however, to have a catch-all except block. If you didn't anticipate the error, it will probably be helpful to you to get the default error message, even if it is a little arcane, rather than a message like the above, which is pretty darn vague.

For a listing of the default kinds of errors (also called "exceptions") that can happen, see the list of Built-in Exceptions in the Python Documentation.

Else and Finally
Similar to the use of else in for and while loops, anything in an else block will run *only if* no errors happened. On the other hand, anything in a finally block runs after the try/except blocks, and will run regardless of whether an exception was thrown; if there was an exception, it doesn't stop it from being thrown, it just waits a bit before it does.

code format="python" def divide(x,y): try: result = x/y except ZeroDivisionError: print "Warning: divide by zero!" return float('NaN') #Not a number else: return result finally: print "executing final clause"

print divide(2,1) print print divide(2,0) print print divide('A','B') code

$**./errors.py** code format="python" executing final clause 2

Warning: divide by zero! executing final clause nan

executing final clause Traceback (most recent call last): File "errors.py", line 16, in    print divide('A','B') File "errors.py", line 3, in divide result = x/y TypeError: unsupported operand type(s) for /: 'str' and 'str' code

The first case made it through the division just fine.

The second case raised an exception (a division by zero error), but we anticipated that, so we replace it with a special Not a Number value, and move on.

The third case we found an exception that we didn't anticipate. The exception came up, so we still ran the finally clause, but then raised the exception. Any questions?

1) Exception handling we have known.
In lesson 2.1, we learned about two functions that can be applied to remove an item from set: remove and discard. We looked at the functions using the following example:

code format="python"
 * 1) !/usr/bin/env python

list_of_letters = ['a', 'a', 'b', 'c','c','c','d','e'] print 'ORIGINAL' set_of_letters = set(list_of_letters) print set_of_letters

print 'DISCARD' set_of_letters.discard('q') print set_of_letters

print 'REMOVE' set_of_letters.remove('q') print set_of_letters code

a) Now that we've learned more about exception handling, explain what is happening here.

b) Create a script that contains a list of 5 of your favorite beers or wines or soft drinks (depending on your preference) for a tasting party stored as a set. Each time someone drinks a beverage, it is removed from your fridge and cannot be drunk again. Ben drinks a beverage. Adjust the set as appropriate. Mel sees Ben drinking his beverage and wants the same one. Tell her you're out of that beverage.

2) Modify the FASTA parser from Session 4.1 to handle:
1) non-existent filenames 2) files that do not conform to the FASTA format (i.e. >gene for IDs, and strings of A,T,G, or C for sequence). 3) sequences that are in both cases

3) Modify the FASTA parser to identify file compression format and read compressed files.
We need to add functionality to our FASTA parser for use next week. We will be using very large FASTA files (> 3GB) when they are uncompressed. In order to save disk space, we'll be leaving them compressed (~1 GB) while we use them. However, this will require us to use the gzip module to read a compressed file directly. We will also need the mimetypes module, which can identify the type of file we are using.

First, write a new function called open_file_by_mimetype that determines whether a file is gzipped or not (using the mimetypes modules) and returns an open file handle. Then make a function called read_fasta using your FASTA parser script, but modify it to call this new function such that your parser can read both types of FASTA files automatically.

When you're done, store your new file-opening function in a module called file_tools.py and your improved read_fasta parser in a module called seqeunce_tools.py.

4) All code is bug-free until your first user.
You have another coworker who heard about an AMAZING secondary structure analysis code (solution to problem 6 from last year. She asks if you will analyze her protein, interleukin-19, as well (HINT: use PDB code 1N1F). Crud! This protein breaks your code. Why? Rewrite your code to work on both interleukin-19 and on the original H1N1 neuraminidase example.

1. Exception handling we have known.
a) Discard something that's not there doesn't result in an error; removing something that's not there does.

b) code format="python" print listOfBeers = ['Racer 5', 'Dogfishhead 90 Minute IPA', 'Chimay', 'Unibroue Maudite', 'Josephsbrau PLZNR'] print 'The party is all stocked with my favorite beers:' setOfBeers = set(listOfBeers) for beer in setOfBeers:   print '\t', beer print def BxxrHour(x):    print 'Craving some %s.  Goes to get one.  ' % (x)    try:        setOfBeers.remove(x)    except:        print "\"Hey! Someone totally pwned my %s!!\"" % (x)        print 'Beverage FAIL!!!'    else:        print 'Epic beverage WINS!'    finally:        print "The remaining beverages are:"        for beer in setOfBeers:            print '\t', beer        print        return setOfBeers        print print 'Ben wants a beverage:' setOfBeers = BxxrHour('Racer 5') print 'Mel wants a beverage:' BxxrHour('Racer 5') code
 * 1) !/usr/bin/env python
 * 1) a) We converted the original list_of_letters to a set
 * 2) called set_of_letters, which got rid of all duplicates
 * 3) in the set.  Then we discarded 'q'.  Discard can handle
 * 4) values that don't exist in sets, so discard does not
 * 5) return an error because there's no q, but remove can't
 * 6) handle values that don't exist in sets, so it returns a
 * 7) KeyError.

2. Modify the FASTA parser from Session 4.1 to handle:
callparser.py code format="python" from parser import fastaParser parsed = fastaParser('seq.FASTA') print parsed code
 * 1) !/usr/bin/env python

fastaParser.py

code format="python" def fastaParser(fastafilename): current_gene = ""  # Start with an empty string, just in case genes = { }        # Make an empty dictionary of genes try: fh = open(fastafilename, 'r') except IOError: print 'Could not find file with filename %s' % (fastafilename) result = 'Please verify that your filename is correct and try again.' return result for lineInd, line in enumerate(fh.readlines): if lineInd == 0: if not line[0] == '>': print 'File does not conform to FASTA format.' result = 'Please try again with FASTA formatted file.' fh.close return result else: pass else: pass line = line.strip # Clear out leading/trailing whitespace line = line.upper # Deals with whatever case the # sequence is by making it all upper case if len(line) > 0 and line[0] == ">":  # This one is a new gene current_gene = line[1:] genes[current_gene] = "" else:               # Add onto the current gene genes[current_gene] += line fh.close return genes code
 * 1) !/usr/bin/env python

3. Modify the FASTA parser to identify file compression format and read compressed files.
callparser2.py code format="python" from sequencetools import * import sys fastafilename = sys.argv[1] x = read_fasta(fastafilename) print x code
 * 1) !/usr/bin/env python

sequencetools.py code format="python" def read_fasta(fastaFilename): from filetools import open_file_by_mimetype as ofbm current_gene = ""  # Start with an empty string, just in case geneDict = { }     # Make an empty dictionary of genes fh = ofbm(fastaFilename) for lineInd, line in enumerate(fh.readlines): if lineInd == 0: if not line[0] == '>': print 'File does not conform to FASTA format.' result = 'Please try again with a FASTA formatted file.' fh.close return result else: pass else: pass line = line.strip # Clear out leading/trailing whitespace if lineInd % 10000 == 0: print "At line %s." % (lineInd) else: pass line = line.upper # Deals with whatever case the # sequence is by making it all upper case if len(line) > 0 and line[0] == ">":  # This one is a new gene current_gene = line[1:] geneDict[current_gene] = "" else:               # Add onto the current gene geneDict[current_gene] += line fh.close return geneDict code filetools.py code format="python" def open_file_by_mimetype(filename): import mimetypes if 'gzip' in mimetypes.guess_type(filename): try: import gzip fh = gzip.open(filename, 'r') print 'File is a gzipped file.' print return fh       except IOError: print 'Could not find file with filename %s' % (filename) result = 'Please verify that your filename is correct and try again.' return result else: try: fh = open(filename, 'r') print 'File is not a gzipped file.' return fh       except IOError: print 'Could not find file with filename %s' % (filename) result = 'Please verify that your filename is correct and try again.' return result code
 * 1) !/usr/bin/env python
 * 1) !/usr/bin/env python

4. All code is bug-free until your first user.
code format="python"
 * 1) !/usr/bin/env python

import sys, os full_seq = [] helix_aa = [] sheet_aa = [] atoms = [] f1 = open('1N1F.pdb' ,'r') for next in f1: tmp = next.strip.split if tmp[0] == 'SEQRES': if tmp[2] == 'A': full_seq.extend(tmp[4:]) elif tmp[0] == 'HELIX': try: int(tmp[5]) except: tmp[5] = tmp[5][:-1] helix_aa.append(tmp[:9]) elif tmp[0] == 'SHEET': sheet_aa.append(tmp[:10]) elif tmp[0] == 'ATOM': if len(tmp) < 12: begin = tmp[0:2] end = tmp[3:] middle = [tmp[2][:3], tmp[2][4:]] tmp = begin + middle + end try: int(tmp[5]) except: continue atoms.append(tmp) num_helix_res = 0.0 print "There are %s residues in the sequence" % len(full_seq) feature = ['Other']*(10000) for aa in helix_aa: # We add 1 because there are b-a+1 residues between a and b, inclusive num_helix_res += float(aa[8]) - float(aa[5]) + 1 for i in range(int(aa[5]), int(aa[8])+1): feature[i] = 'Helix' num_sheet_res = 0.0 for sheet in sheet_aa: num_sheet_res += float(sheet[9]) - float(sheet[6]) + 1 for i in range(int(sheet[6]), int(sheet[9])+1): feature[i] = 'Sheet' helix_bfactors = {} sheet_bfactors = {} other_bfactors = {} for atom in atoms: Chain = atom[4] BFactor = float(atom[10]) ResidueNum = int(atom[5]) if feature[ResidueNum] == 'Helix': if Chain not in helix_bfactors: helix_bfactors[Chain] = [] helix_bfactors[Chain].append(BFactor) elif feature[ResidueNum] == 'Sheet': if Chain not in sheet_bfactors: sheet_bfactors[Chain] = [] sheet_bfactors[Chain].append(BFactor) else: if Chain not in sheet_bfactors: other_bfactors[Chain] = [] other_bfactors[Chain].append(BFactor)
 * 1) Peter's code starts here
 * 1) Peter's code starts here
 * 1) Set up a listing of features by residue, then fill it in as we go along
 * 1) atom[4] == chain id
 * 2) atom[5] == residue #
 * 3) atom[10] == b-factor


 * 1) This is my code to modify the program in order to make it robust to different .pdb's
 * 1) This is my code to modify the program in order to make it robust to different .pdb's

bfactors = {'Helix':helix_bfactors, 'Beta sheet':sheet_bfactors, 'Other':other_bfactors}

for y in bfactors: for x in bfactors[y]: if len(x) > 0: bfactordict = bfactors[y] for chain in bfactordict: average = sum(bfactordict[chain])/len(bfactordict[chain]) print 'Chain: %s\t %s: %5f\t' % (chain, y, average) code