Session+5.2

=Beyond learning to code: Maintaining and Writing programs=

Introduction
This week, we've shown you a pretty large fraction of the core Python language. With enough patience, you could read through most of the Python documentation on your own and write code to do whatever you want. However, just as there's more to being a scientist than learning how to pipette (important a skill though that may be), there's more to writing software than learning the syntax of a language. This afternoon, I'll introduce you to a couple important skills that will serve you no matter what language you ultimately decide to program in.

The Project
Historically, the second week of the course has been dedicated to a specific project. Not only does this allow us to naturally talk about more science-specialized aspects of programming, but it also gives people in the class an opportunity to see how larger programs are structured. In years past, we've tried both coming up with our own analyses of published data and replicating the results of a moderately computation-heavy experimental paper. Both of these have ended up being a lot of work on the instructors' part for not a whole lot of payoff: mostly uninteresting results, and when replicating someone else's paper, a lot of inconsistencies with no clear origin. So this year we're trying something new again:

Next week, we'll be going over some RNA-seq data that we've collected specifically for the course. These are experiments that one of us (Mike) wanted to do anyways. We'd like to stress that this is pre-publication data, so we'd appreciate you not sharing the data beyond this course. On the other hand, it's worth noting that a lot of the analysis we'll be going through next week is exactly the analysis we wanted to do on the data. You now know enough Python to do real science (although we'll be using some more modules that make things easier).

Now, here's the project: In bacterial transcription, there are two major ways that transcripts are terminated. In the first, intrinsic termination, a hairpin forms in the elongating RNA that destabilizes the elongation complex. These hairpins can be located using RNA secondary structure predictors. The second major mechanism for termination is factor-dependent. Approximately half of the factor-dependent termination sites depend on the protein Rho. Rho is a hexameric ATPase that binds to the elongating RNA and disrupts translocation of the elongation complex. Rho binds to a pyrimidine-rich (C/T/U) region, but no binding motif has been identified.

Some genomics work has been done on bicyclomycin (BCM) treatment of //E. coli//, which inhibits Rho. In particular, there are expression microarrays and ChIP-chip studies that have been done, but each of these has distinct flaws that limit our ability to draw the conclusions we'd like. The microarray study was performed using a pre-designed Affymetrix chip that is focused on the gene transcription, rather than the UTRs. While there are some conclusions to be drawn from this, it misses the most interesting part of the effect of Rho inhibition. The ChIP-chip study attempts to identify Rho-dependent genes, but due to the lack of a good antibody for Rho, instead looks at RNAP binding, and uses that as a proxy for Rho binding. Furthermore, microarray studies in general have problems with linearity: twice as bright a spot on the array does not necessarily mean that there's twice as much RNA.

What we decided to do was the simplest thing that could possibly work: Look at the transcriptome in three different concentrations of BCM, at two different time points after treatment. Then, we do RNAseq on each of those samples.

Homework
First, it's critically important that you understand what RNAseq is. If you're not familiar with Illumina high throughput sequencing, you might try reading this page from Oregon HSU: http://www.ohsu.edu/xd/research/research-cores/mpssr/project-design/mpssr_sequencing_technology.cfm. The technology we used is the Illumina HiSeq, which is fundamentally similar to the GA IIx, but with about 10 times more reads per lane. For more on RNAseq in particular, this review is pretty straightforward: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949280/

Source code control
Good record-keeping is of the utmost importance in science, and it turns out to be really, really helpful in programming too. Sometimes, the "improvements" you make to a large piece of software actually break something that you weren't thinking about, so it's nice to have a record of what you did when, and easily be able to go back to previous versions. Alternatively, you could have multiple different versions of the same piece of software floating around lab, and if you're not careful, it's easy to make one set of modifications to one version, and a different set of modifications to another version.

This problem isn't specific to scientists, and the software engineering community has come up with a number of different software tools to help keep track of the changes that get made to source code. These are called Version Control Systems, and today I'll be showing you a brief overview of one, called Git.

git init
The first thing you'll want to do is **init**ialize a new repository. A repository is just the term for a collection of files that Git will keep track of. From the command line, the way to do this is straightforward.
code
$ git init
Initialized empty Git repository in /Users/pcombs/Documents/PythonCourse2011/.git/
code
What it's saying is that it's created a directory called ".git" inside of the PythonCourse2011 directory. By convention, Unix-derived operating systems (like Linux and Mac OS X) hide things that begin with a "." by default, though it is possible to get them to show up using **ls -a** instead of just **ls**. For the most part, though, you won't need to muck about in .git directly anyways, so don't worry too much about it.

Also, in the event that you already had a git repository set up in the current folder, doing **git init** won't overwrite it; it just "reinitializes". I haven't figured out what this means, exactly, aside from the slightly different message that shows up. But basically, you don't need to worry if you think you might already have a repository there: you can init one anyways and it won't break anything. In fact, all but a very few git commands are safe, and won't //destroy// data.

git add
So now that we have our shiny new repository, what do we do with it? Git will only keep track of things that we tell it to track, and the way we do that is by using **git add**. I'm first going to make a really simple file, and then **git add** it.

//hello.py//
code format="python"
print "Hello, world!"
code

$ **git add hello.py**

git commit
At this point, we're almost tracking hello.py, but not quite! Git uses a 2-step process: first, you **add** files to a staging area (called the index); then, once you've added the files you want to the index, you **commit** them to the repository. A **commit** (computer scientists aren't the best at grammar, so what used to be a verb is now a noun) is a snapshot of your code at a particular time. Each **commit** has an associated message that can be as long or as short as you'd like. Traditionally, the first line is a brief, one-line summary of the changes you've made; after a blank line, you can explain the changes in as much detail as you want. This is like your lab notebook, so be as verbose as you need to be to explain why you did what you did.

$ **git commit**

So now let's make some more changes to our code:
//hello.py//
code format="python"
print "Hello, world!"
x = 3
print x
code

status and diff
Now let's say we made those changes last night right before going home, and we don't remember if we added them to the index and/or committed them. There are a couple commands you can use to check on them:

$ **git status**
code
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	modified:   hello.py
#
no changes added to commit (use "git add" and/or "git commit -a")
code

In this case, we see that we've modified hello.py, but we haven't added it to the index. If we want to find out what changes we've made, exactly, we can do:
$ **git diff**
code
diff --git a/hello.py b/hello.py
index 4351743..2aae829 100644
--- a/hello.py
+++ b/hello.py
@@ -1 +1,3 @@
 print "Hello, world!"
+x = 3
+print x
code

Now by default, **git diff** will tell you the difference between what is in the working directory and what is in the index. That is, it's the changes that we //could// **add**. If you want to find out what changes we've already **add**ed (the difference between the index and the most recent commit), you can give it the --staged flag, so:

$ **git add hello.py**
$ **git diff**
//(nothing gets displayed, so there aren't any more changes we can add)//
$ **git diff --staged**
code
diff --git a/hello.py b/hello.py
index 4351743..2aae829 100644
--- a/hello.py
+++ b/hello.py
@@ -1 +1,3 @@
 print "Hello, world!"
+x = 3
+print x
code

$ **git commit**
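To make the three places a file can live a bit more concrete, here is a minimal sketch that mimics the two diffs in plain Python, using the standard library's difflib module. It is only an illustration, not how Git works internally: the variables `committed`, `index`, and `working` and the helper `unified` are made-up names standing in for the last commit, the index, and the working directory, and the file contents are just strings.

```python
import difflib

def unified(old, new, name="hello.py"):
    """Return a unified diff of two file contents as a single string."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="a/" + name, tofile="b/" + name, lineterm="",
    )
    return "\n".join(lines)

# Hypothetical snapshots of hello.py (the contents are data, not code we run).
committed = 'print "Hello, world!"\n'
working = 'print "Hello, world!"\nx = 3\nprint x\n'

# Before `git add`: the index still matches the last commit.
index = committed
unstaged_before = unified(index, working)   # plays the role of `git diff`
staged_before = unified(committed, index)   # plays the role of `git diff --staged`

# After `git add hello.py`: the index matches the working directory.
index = working
unstaged_after = unified(index, working)
staged_after = unified(committed, index)

print(unstaged_before)
```

Before the add, all of the changes show up in the first diff and the second one is empty; after the add, it's the other way around, which matches what we saw from Git above.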

Branches
So now let's say your lab mate (let's call him "Aaron") comes up to you after you show off the results of your program in lab meeting and says, "That program's really cool, but to use it for my project, I'd want to print out 3 squared instead of 3." Now, your project relies on plain 3, so you'd need to either
 * 1) Print out both 3 and 3 squared and rely on the user to figure out which one to use. That might work in this case, but maybe Aaron asked for modifications that aren't compatible with that approach.
 * 2) Copy the whole folder full of code elsewhere, and then make the change there. The problem with that approach is that if you discover a bug in the original program, you have to fix it in both places, which won't necessarily be trivial or obvious, and then you're never quite sure whether you've actually made the fix in both places, and ...
 * 3) Make a new **branch** of the repository. The code is allowed to diverge, but by storing the two branches in the same repository, you can keep track of the changes, and merge the changes from one to the other.

$ **git branch xsquared**
$ **git checkout xsquared**

//hello.py//
code format="python"
print "Hello, world!"
x = 3
print x**2
code

$ **git add hello.py**
$ **git commit -m "Prints x^2 instead of x"**

Now we have both branches of code running in parallel to each other, and we can make changes in one without affecting the other. If you're ever not sure what branch you're on, you can do:
$ **git branch** # Note the lack of a name for the branch
code
  master
* xsquared
code

Merging
As we work some more, we realize perhaps that something is wrong. Our program isn't nearly excited enough. That's an easy change, though:

$ **git checkout master**
//hello.py// -- on branch master
code format="python"
print "Hello, world!!!!"
x = 3
print x
code
$ git add hello.py
$ git commit -m "Getting excited"

We are //really// excited and want to make this change apply to both branches, though, so it would be nice to have some way to **merge** the changes into the //xsquared// branch.
$ **git checkout xsquared** # First, we switch back over to xsquared
$ **git merge master** # We say what branch we want to merge the changes from.
code
Auto-merging hello.py
Merge made by recursive.
 hello.py |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
code

Now, when we take a look at the code, we see that the program has automatically done the Right Thing™, and made the changes it was supposed to.

//hello.py// -- on branch xsquared
code format="python"
print "Hello, world!!!!"
x = 3
print x**2
code

Advanced Merging
Sometimes, though, it's not possible for git to know what changes to make, and sometimes it does guess wrong. Let's work through an example where that happens.

Let's say that in two different branches, we make a change to the same line of code:

$ **git checkout master**
//hello.py// -- on branch master
code format="python"
print "Hello, world!!!! Let's print x"
x = 3
print x
code
$ git add hello.py
$ git commit -m "More descriptive message on master"

$ **git checkout xsquared**
//hello.py// -- on branch xsquared
code format="python"
print "Hello, world!!!! Let's print x**2."
x = 3
print x**2

print "Goodbye, cruel world..."
code

Now we add this in with two separate commits, one for the introductory message, and one for the sign-off message:
$ **git add -p** # The -p flag lets us do things piecewise
code
diff --git a/hello.py b/hello.py
index e9692b1..57e3f45 100644
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,6 @@
-print "Hello, world!!!!"
+print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
Stage this hunk [y,n,q,a,d,/,s,e,?]? ?
y - stage this hunk
n - do not stage this hunk
q - quit; do not stage this hunk nor any of the remaining ones
a - stage this hunk and all later hunks in the file
d - do not stage this hunk nor any of the later hunks in the file
g - select a hunk to go to
/ - search for a hunk matching the given regex
j - leave this hunk undecided, see next undecided hunk
J - leave this hunk undecided, see next hunk
k - leave this hunk undecided, see previous undecided hunk
K - leave this hunk undecided, see previous hunk
s - split the current hunk into smaller hunks
e - manually edit the current hunk
? - print help
@@ -1,3 +1,6 @@
-print "Hello, world!!!!"
+print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
Stage this hunk [y,n,q,a,d,/,s,e,?]? s
Split into 2 hunks.
@@ -1,3 +1,3 @@
-print "Hello, world!!!!"
+print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]? y
@@ -2,2 +2,5 @@
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
Stage this hunk [y,n,q,a,d,/,K,g,e,?]? n
code
$ **git commit -m "More descriptive intro message on xsquared"**
$ git diff
code
diff --git a/hello.py b/hello.py
index e5df48d..57e3f45 100644
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,6 @@
 print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
code
$ **git add hello.py**
$ **git commit -m "Added sign-off message"**

So now we have some code (the sign-off message) from xsquared that we want to merge back into the master branch.

$ **git checkout master**
$ **git merge xsquared**
code
Auto-merging hello.py
CONFLICT (content): Merge conflict in hello.py
Automatic merge failed; fix conflicts and then commit the result.
code

So let's take a look at the **diff**erence between the code now and our last commit:
$ **git diff**
code
diff --cc hello.py
index 0a78149,57e3f45..0000000
--- a/hello.py
+++ b/hello.py
@@@ -1,3 -1,6 +1,10 @@@
++<<<<<<< HEAD
 +print "Hello, world!!!! Let's print x"
++=======
+ print "Hello, world!!!! Let's print x**2."
++>>>>>>> xsquared
  x = 3
- print x
+ print x**2
+
+ print "Goodbye, cruel world..."
+
code

So we see a few things here:
 * The first line has two different versions. Because the same line was changed, it has no way to know what the Right Thing™ is, so it just gives us both options and makes us manually make the change.
 * It's been a little overzealous with the changes, and turned the "print x" into "print x**2". This is easy to fix by hand.
 * It added in the sign-off message. That we'll just leave there.

Once we make those changes, we can add them to the index and then commit them.
$ **git add hello.py**
$ **git commit -m "Resolved merge"**
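If you're curious why Git behaved this way, here is a rough sketch of the rule behind a three-way merge, written as a toy in Python. It makes several simplifying assumptions that real Git does not: the three versions have the same number of lines and are compared position by position (so the sign-off lines are left out), the common ancestor is taken to be the "Getting excited" version of master with plain `print x`, and the names `merge_line` and `merge_lines` are made up for this example.

```python
def merge_line(base, ours, theirs):
    """Three-way merge of a single line. Returns (text, conflicted)."""
    if ours == theirs:      # both sides agree (or neither changed it)
        return ours, False
    if ours == base:        # only their side changed it: take theirs
        return theirs, False
    if theirs == base:      # only our side changed it: keep ours
        return ours, False
    return None, True       # both sides changed it differently: conflict

def merge_lines(base, ours, theirs):
    """Merge three equal-length lists of lines, marking any conflicts."""
    merged = []
    conflicts = 0
    for b, o, t in zip(base, ours, theirs):
        text, conflicted = merge_line(b, o, t)
        if conflicted:
            conflicts += 1
            merged += ["<<<<<<< ours", o, "=======", t, ">>>>>>> theirs"]
        else:
            merged.append(text)
    return merged, conflicts

# The file contents are just strings here; nothing in them gets executed.
base = ['print "Hello, world!!!!"', 'x = 3', 'print x']        # assumed ancestor
ours = ['print "Hello, world!!!! Let\'s print x"', 'x = 3', 'print x']
theirs = ['print "Hello, world!!!! Let\'s print x**2."', 'x = 3', 'print x**2']

merged, conflicts = merge_lines(base, ours, theirs)
print("\n".join(merged))
```

The first line was changed on both sides, so it comes out as a conflict. The last line was changed only on xsquared, so the toy merge quietly takes `print x**2`, which is the same "overzealous" change we had to fix by hand.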

By the way, this style of having an "experimental" branch and a "master" branch can be a good way to go about things. That way, you always have a branch that works, but you still have a place to add in new features and whatnot.

Collaboration
Even if you're going to be the only person touching your code, some kind of version control will likely be helpful, but if you're going to be working on it with other people, it's nearly essential. Git was designed by Linus Torvalds to help with the development of Linux, which has hundreds of individual contributors. (He also named it after himself: "I'm an egotistical bastard, and I name all my projects after myself. First Linux, now git.")

Teaching you how to do this is outside the scope of this course, but Git is able to deal with it. Unlike some other Version Control Systems, Git is distributed, meaning that there is no central copy that everyone agrees on. Each copy of a repository is just as valid as any other, and they can be merged at will. If you do find yourself collaborating with someone else (and maybe even if you don't), I'd encourage you to look at GitHub, a Git-based code server. At the free level, all your repositories are openly displayed (though only you can modify them, unless you give other users permission), but there are also relatively cheap options for having closed-source repositories, if you're concerned about getting scooped on something. It's also possible to set up a Git server on a central lab server, but setting that up is //way// outside the scope of this course.

There's also a really nice visual guide to what lots of the most common git commands do: http://marklodato.github.com/visual-git-guide/index-en.html I'd encourage you to check it out if you ever get confused by git (don't worry, sometimes it happens to me too!).

Stubbing and the 'pass' statement
When we write complicated code, we need to decompose it into simpler parts. This is an intuitive concept, and one that we've touched on before. Let's say that we want to make a program that gambles online and makes money for you so that you are free to pursue the standard academic career path of postdoctoral positions ad infinitum.

//Stubbing// is writing what your program should be doing, without actually getting around to filling in the details. It's like writing an outline of a paper. In this case:

//gambler.py//

code format="python"
#!/usr/bin/env python
import sys
import internet
import gambling

# Account details come from the command line
accountID = sys.argv[1]
password = sys.argv[2]

[balance,sessionInfo] = internet.loginToIllegalGamblingServer(accountID,password)
# Keep playing until we're broke or rich
while balance > 0:
    balance = gambling.playGame('poker',sessionInfo,balance)
    if balance > 1000000000:
        print 'Congratulations: you are a billionaire.'
        internet.logoffFromIllegalGamblingServer()
        sys.exit()
print 'Darn!'
internet.logoffFromIllegalGamblingServer()
code

//internet.py//
code format="python"
def loginToIllegalGamblingServer(accountID,password):
    pass

def logoffFromIllegalGamblingServer():
    pass
code

//gambling.py//
code format="python"
def playGame(gameType,sessionInfo,balance):
    if gameType == 'poker':
        hand = requestHand(sessionInfo)
        if handIsGood(hand):
            amountWon = goAllIn(hand,balance,sessionInfo)
            if amountWon:
                return amountWon + balance
            else:
                return 0
        else:
            return balance
    else:
        print "I don't know how to play that game yet"
        return balance

def requestHand(sessionInfo):
    pass

def handIsGood(hand):
    pass

def goAllIn(hand,balance,sessionInfo):
    pass
code

We're free to write the easy parts first, saving the hard parts for later. We've already created the logical flow of the program, and by doing this early, we can keep it organized.
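One thing worth knowing when you run stubbed code: a function whose body is only **pass** returns None, and None counts as false in an **if**. The sketch below is a self-contained copy of the gambling.py stubs that shows what that means for the flow through playGame. The only liberty taken is writing print with parentheses so that the example also runs on Python 3; the argument values passed in at the bottom are arbitrary placeholders.

```python
def requestHand(sessionInfo):
    pass  # stub: implicitly returns None

def handIsGood(hand):
    pass  # stub: implicitly returns None, which counts as False

def goAllIn(hand, balance, sessionInfo):
    pass  # stub: never reached while handIsGood is still a stub

def playGame(gameType, sessionInfo, balance):
    if gameType == 'poker':
        hand = requestHand(sessionInfo)
        if handIsGood(hand):
            amountWon = goAllIn(hand, balance, sessionInfo)
            if amountWon:
                return amountWon + balance
            else:
                return 0
        else:
            return balance
    else:
        print("I don't know how to play that game yet")
        return balance

# With every helper stubbed out, the balance comes back unchanged.
result = playGame('poker', None, 100)
print(result)
```

So the outline already runs from top to bottom, even though it doesn't do anything interesting yet; each stub you fill in changes which branch gets taken.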

The only thing new that we've covered here is the statement **pass**. It's pretty simple: it does nothing. Although this sounds somewhat pointless, in this case it allows you to write little function stubs without Python (or your text editor) complaining. However, it pops up in other places as well, usually as a shortcut where you mean to write more code later. This could be in an **if** or **else** statement or while raising an **exception**. Each of those cases requires something after the colon for it to be valid Python, and **pass** is a valid way to put in something that does nothing. We won't cover those applications here, but keep them in mind while you're programming: we encourage you to try it out.
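For the curious, here is a small sketch of what those other uses might look like. The function and class names (`classify_read`, `parse_count`, `PlaceholderError`) are made up for illustration, and reading "raising an exception" as covering both an **except** clause and a custom exception class is our interpretation rather than anything the course defines.

```python
def classify_read(length):
    """Hypothetical helper: label a sequencing read by its length."""
    if length < 20:
        pass  # placeholder: we haven't decided what to do with short reads
    else:
        return "usable"

def parse_count(text):
    """Return text as an int, or None if it isn't a number."""
    try:
        return int(text)
    except ValueError:
        pass  # ignore bad input for now and fall through, returning None

class PlaceholderError(Exception):
    pass  # a custom exception type that needs no body of its own

print(classify_read(50))
print(parse_count("42"))
```

In each case the **pass** is only there to give Python something after the colon; once you know what the real behavior should be, you replace it with actual code.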