Session+8.1

=Numerical Python (NumPy)=

Topics

 * **array** data types
 * Conversion between standard Python data types and numpy data types
 * Introduction to **numpy** and **scipy** packages
 * Basic statistical analysis with numpy tools
 * Compare performance using vector math and **for** loops.

Introduction
As Python matured into the multifunctional language it is today, more and more features were added. We've seen most of the original functionality of the language, which was originally intended for scripting of UNIX and web processes, and writing user interfaces. But as Python became popular as a scripting language, it found its way into the arms of scientists, who generally speaking, care more about efficient data manipulation than web programming. The big piece that was missing in Python was a numerical library, which means a collection of tools for dealing with large amounts of numbers at once. After a few efforts, an integrated one was developed. Numerical Python, or NumPy provided a large collection of mathematical functions similar to those found in a language like MATLAB or R, and also similarly to those languages, provides datatypes that are manipulatable as vectors and matrices. We'll introduce these data types, then their associated functions.

NumPy Basics
Numerical Python is a powerful library of functions, methods, and data types we can used to analyze our data. Unforunately for those of us whose heads continue to spin in a crash-course of syntax, it also uses a different set of rules. I hope you'll understand why when you see the power and speed NumPy's data types afford us. Let's start off creating some empty arrays, which look sorta like lists, and are in fact vectors.

They differ in a few fundamental ways from lists:


 * 1) Arrays cannot be of mixed types. They can be all **integers, floats, strings, logical** (or **boolean**) values, or other //immutable// values. But they cannot be some characters, some numbers, or any other olio of data types. They also cannot contain mutable types such as lists. So, we can have a list of lists, but not an array of lists. We can, however, have an array of arrays (sortof). Which brings us to:
 * 2) Arrays can be multidimensional, but they must be rectangular. You can have a list of lists, where the first interior list is 3 elements long, the second 5, and the third 12, but your multidimensional array must be expressible as "a m by n (by j by k by...) array". I have never encountered a situation where Python says there are too many dimensions (but I've never had need beyond, maybe, 4 dimensions).
 * 3) We can perform vector operations on them, which can be algebraic functions (like a dot product), or simple replacements of values in slice of the array.

For the some of the lecture today, I'll be using the format code format="python" >>> code goes here code

to indicate things that I'm doing in IPython, as opposed to in a code file. It's the same format that the regular python interactive interpreter uses, (rather than In[] and Out[]), but it lets us use **cpaste**, which strips out the >>> when interpreting code.

Arrays
Here's one way: start with a list and change it:

code format="python" >>> import numpy as np

>>> a = [0] * 40 >>> a = np.array(a) code

Or this can be shortened:

code format="python" >>> a = np.array([0] * 40) code

But there's a better way to get a vector of zeros: code format="python" >>> a = np.zeros(40) code

Note that the default type when declaring an **array** is **float64**: code format="python" >>> type(a[0])  code

And here's the simplest way to change that: code format="python" >>> a = np.zeros(40, int) >>> type(a[0])  code

And here's how to declare something that's not all zeros: code format="python" >>> a = np.arange(40) >>> a array([ 0, 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,       34, 35, 36, 37, 38, 39]) >>> type(a[0])  code Notice the **int** type.

What if we want a float? There's a couple ways to do it: code format="python" >>> a = np.arange(40, dtype=float) #Explicitly tell it to make a float >>> type(a[0])  >>> a = np.arange(40.0) #Implicitly give a float, and it will still work >>> type(a[0])  code

Like with **range**, you can also give **arange** more parameters: code format="python" >>> np.arange(40, 50) # Start and Stop array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49]) >>> np.arange(40, 50, 1) # Start, Stop, and increment array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49]) >>> np.arange(40,50,.25) array([ 40.,  40.25,  40.5 ,  40.75,  41.  ,  41.25,  41.5 ,  41.75,        42.  ,  42.25,  42.5 ,  42.75,  43.  ,  43.25,  43.5 ,  43.75,        44.  ,  44.25,  44.5 ,  44.75,  45.  ,  45.25,  45.5 ,  45.75,        46.  ,  46.25,  46.5 ,  46.75,  47.  ,  47.25,  47.5 ,  47.75,        48.  ,  48.25,  48.5 ,  48.75,  49.  ,  49.25,  49.5 ,  49.75]) code
 * 1) If any of the parameters are floats, the output will be a float

As I said above, you can have arrays with more than one dimension

code format="python" >>> a = np.zeros( (10, 10) ) # Note the inner set of parentheses >>> a array( 0., 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.) code

And you can even modify a particular element with the same syntax, or a subtly different syntax, as our list-of-lists:

code format="python" >>> a[5][5] = 3 # row, then column >>> a[6,6] = 42 # still row, column >>> a array( 0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.) code

So far, the coolest thing I've shown you isn't really that exciting: a range function that can have floats. The real power of arrays is the ability to have one statement affect a large chunk of an array:

code format="python" >>> a[1,:] = 1 >>> a array( 0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.) >>> a[:,0] = 7 >>> a array( 7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  7.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.],       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  7.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],       [  7.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.) code

Let us pause for a moment and think about how we would do this with a for loop in lists:

code format="python" LoL = [[0]*10 for i in range(10)]

for i, elem in enumerate(LoL[1]): LoL[1][i] = 1

for L in LoL: L[0] = 7 code

We can also take slices of arrays, just as if they were lists: code format="python" >>> a = np.arange(10) >>> a[2:5] array([2, 3, 4]) >>> a[-1] 9 >>> a[::-1] array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0]) code

Maybe you can see the advantage of the array syntax. But wait, there's more! Act now, and we'll throw in math operations for free!

Vector Math with Arrays
We can do math on many values at once with arrays, no for loop required.

code format="python" >>> a = np.arange(0,100,2) >>> b = np.arange(50) >>> a array([ 0, 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,       34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66,       68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]) >>> a + 10 array([ 10, 12,  14,  16,  18,  20,  22,  24,  26,  28,  30,  32,  34,        36,  38,  40,  42,  44,  46,  48,  50,  52,  54,  56,  58,  60,        62,  64,  66,  68,  70,  72,  74,  76,  78,  80,  82,  84,  86,        88,  90,  92,  94,  96,  98, 100, 102, 104, 106, 108]) >>> b array([ 0, 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]) >>> b/2.0 array([ 0.,   0.5,   1. ,   1.5,   2. ,   2.5,   3. ,   3.5,   4. ,         4.5,   5. ,   5.5,   6. ,   6.5,   7. ,   7.5,   8. ,   8.5,         9. ,   9.5,  10. ,  10.5,  11. ,  11.5,  12. ,  12.5,  13. ,        13.5,  14. ,  14.5,  15. ,  15.5,  16. ,  16.5,  17. ,  17.5,        18. ,  18.5,  19. ,  19.5,  20. ,  20.5,  21. ,  21.5,  22. ,        22.5,  23. ,  23.5,  24. ,  24.5])
 * 1) Simple math across the whole array:

>>> a * b array([  0,    2,    8,   18,   32,   50,   72,   98,  128,  162,  200,        242,  288,  338,  392,  450,  512,  578,  648,  722,  800,  882,        968, 1058, 1152, 1250, 1352, 1458, 1568, 1682, 1800, 1922, 2048,       2178, 2312, 2450, 2592, 2738, 2888, 3042, 3200, 3362, 3528, 3698,       3872, 4050, 4232, 4418, 4608, 4802]) >>> >>> a = np.arange(0,100,2) >>> b = np.arange(50) >>> >>> sum(a * b) # alternatively, the function np.dot(a,b) 80850 code
 * 1) pairwise multiplication, consider the for loop alternative

In general, you'll want your arrays to be the same shape and size when doing math with them, although there are arcane rules for what to do when they aren't (it may or may not just give an error).

Boolean (logical) Values with Arrays
code format="python" >>> a = np.zeros(10, dtype=bool) >>> a array([False, False, False, False, False, False, False,      False, False, False], dtype=bool)

>>> a[2:5] = True >>> a array([False, False, True,  True,  True, False, False,       False, False, False], dtype=bool)
 * 1) Slicing and mass-assignment still works

>>> b = (a == False) >>> b array([ True, True, False, False, False,  True,  True,  True,        True,  True], dtype=bool) >>> b = ~a >>> b array([ True, True, False, False, False,  True,  True,  True,        True,  True], dtype=bool) >>> b = not a # It would be nice if this worked, but no such luck Traceback (most recent call last): File " ", line 1, in ValueError: The truth value of an array with more than one element is ambiguous. Use a.any or a.all code
 * 1) Comparison and assignment

Basic Statistics with Numpy
NumPy is freaking huge, with around 1200 pages of reference documentation, but all of you will, at some point, use some basic statistics to get a feel for your data. So let's make sure we hit some of those functions:

code format="python" >>> a = np.arange(99) >>> np.mean(a) 49.0 >>> np.std(a) 28.577380332470412 >>> np.var(a) 816.66666666666674 code
 * Summary Statistics**
 * 1) Standard Deviation
 * 1) Variance

code format="python" >>> a = np.random.uniform(0, 100, 10) # Low, High, Size of Output >>> a array([ 59.29058916,  6.92921008,  91.93188435,  45.38511999,        34.46457059,  41.29478046,  33.51351871,  42.5493656,        50.54698303,  19.79408663]) >>> a = np.random.uniform(0, 100, (3,3)) >>> a array( 12.06197457, 90.90232851,  51.27616458],       [ 61.12517983,  88.14112687,  29.36606864],       [ 77.35499074,  53.7975086,  72.63088545)
 * Random distributions**

>>> a = np.random.uniform(0,100,1000000) >>> np.mean(a) 50.022576176522485

>>> np.mean(np.random.exponential(np.exp(1),1000000)) 2.7188135956330033 code
 * 1) exponential sample with mean = e

Pylab
In the early exploratory phases of your data analysis, sometimes every second counts when you're trying to convert some idea in your head into code that you can try out. IPython has a mode called Pylab that has lots of things already imported into the main namespace, so you can start playing around with your data without having to import things, and without having to type out numpy. (or even np.) in front of each and every Numpy function you want to use. To get it, simply start up IPython with a flag:

$ **ipython --pylab**

This starts it up with all the numpy functions imported, as well as lots of Scipy (which I'll tell you about next) and matplotlib (more on this on Friday) set up so that you can do pretty heavy math without a lot of mental overhead remembering which package everything is in. The way I often like to go about things is to figure out what I want to do, try it in IPython with Pylab on, then use the **history** or **save** magic words in IPython to get what I did, put it into a file, trim out all the failed approaches, and turn key parts into functions. In that file, I still usually do an **import numpy as np**, and then go to all the numpy functions i used and change them from, for example, **x = array(...)** to **x = np.array(...**). This is because there are a few functions (like min and max) that numpy overrides the built-in versions, and it's good to be explicit about which one you want.

SciPy and Fitting
SciPy (pronounced "Sigh Pie") is a collection of libraries that builds on NumPy, and has lots of convenient, fast functions for working with large amounts of scientific data. It's slightly smaller than NumPy, with only 900-odd pages of documentation. That includes sections on integrating C or Fortran code into Python, which is way outside the scope of this course, but if you ever do get to the point where you need a super-efficient implementation of something, you're covered. Especially in the one-off nature of academic science, you're often better served spending less time writing code that takes longer to run, compared to spending lots and lots of time writing code that runs slightly faster.

The stats module of SciPy has functions for even more statistical distributions, statistical tests, and other assorted functions that a good statistician might need. As an example, let's see how we might use the linregress function, which does a linear regression on some data. Linear regression is the process of finding a line that minimizes the sum of the square of the vertical distances from each point to the line.

First, we'll set up some noisy data:

code format="python" import numpy as np

slope = .5 intercept = -10

x = np.arange(0,100) y = slope*x + intercept noise = 5 * np.random.randn(len(x))

y = y + noise code

I'll show you how to generate plots like this in the afternoon, but if you know any Matlab, it's eerily similar.

This isn't a math class, so we're going to start with the equations for slope, intercept, and correlation coefficient of the best-fit line as given:

Now we just translate the math to Python code: code format="python" n = len(x)

m = (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x**2) - (sum(x))**2) b = (sum(y) - m * sum(x))/n r = (n * sum(x * y) - sum(x) * sum(y)) / np.sqrt((n*sum(x**2) - sum(x)**2)                                               * (n * sum(y**2) - sum(y)**2))

print m, b, r

code

//0.486677343735 -9.05040165994 0.928979337505//

This gives us pretty much the right result, but it was kind of a pain to type in. If only the libraries had some sort of function that could do linear regression for us...

code format="python" from scipy import stats

r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)

print "Regression Slope: ", r_slope print "Regression Intercept: ", r_int print "Regression correlation: ", r_rval print "R^2:, ", r_rval**2 print "p(slope is 0): ", r_pval code //Regression Slope: 0.486677343735// //Regression Intercept: -9.05040165994// //Regression correlation: 0.928979337505// //R^2:, 0.86300260951// //p(slope is 0): 4.31945319634e-44//



One other function that you might find useful is the **corrcoef** function, which gives you a correlation matrix between two data sets. For two 1-D sets like we have (x and y), this will give a 2x2 matrix.

code format="python" >>> corrcoef(x, y) array( 1.      ,  0.92897934],       [ 0.92897934,  1.        ) code

Note that the 0,1 and 1,0 (correlation of x with y, and y with x, respectively) entries of this matrix are the same as the correlation coefficient from before. This two dimensional version of the output is kind of a pain for 2 sets of 1-D data, but it really does make sense with datasets with more variables.

There's a lot more to the stats module alone than we can cover in just one morning, so I'll just point you to the documentation for Scipy: [] You might also look into the cluster, interpolate, and optimize modules, depending on what exactly you want to do. The others could be helpful as well, but those are among the most likely for biologists.

Performance Enhancing Arrays
Remember how I told you that it's better to write something that's kind of good enough, rather than spending your time optimizing the code, as long as the "kind of good enough" version is a lot faster to write? Well, it turns out with arrays we get to have our cake, and eat it too! Not only is the code a lot faster to write, since you don't need to do explicit for loops, it also //runs// a lot faster. Let's investigate this by generating two lists (or arrays) with 10 million random numbers, then add them together:

//no_numpy_randsum.py//

code format="python" import random

list1 = [random.randrange(0, 100) for i in range(int(1e7))] list2 = [random.randrange(0, 100) for i in range(int(1e7))] list3 = [a + b for a, b in zip(list1, list2)]

code

Note that this is the //fast// "pure Python" version, since we're using list comprehensions. If you had to append onto a list, it would be //even slower//!

//with_numpy_randsum.py//

code format="python" import numpy as np

list1 = np.random.randint(0,101, 1e7) list2 = np.random.randint(0,101, 1e7) list3 = list1 + list2

code

Because numpy is able to know that everything is going to be an integer, it can do a lot of optimizations to the arrays that it wouldn't be able to do if each element could, conceivably be a different type. Furthermore, a lot of the time is spent doing checks on the input to the randrange function that, because we're using an array, can be done twice, rather than 20 million times.

Exercises
a) Write a function that accepts an array of floats. Return an array where every value of the input array is divided by 1.5. b) Use the random functions to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation from the mean of the array. c) Write a function that evaluate the function y = e^x for values from 0 to 10 and stores the values in an array. The function should also generate a random exponential sample of the same length. Sort the random sample, then return the sum of the distances at each point from the random sample to the exponential model.
 * 1) Writing Mathematical Functions**

So we had this idea that we might be able to find a periodicity in the spacing of pyrimidine residues downstream of the termination site in Rho dependent genes (by and large, we don't). Nevertheless: a) Make a function that finds all locations of a certain substring (like 'C', or 'CT', or whatever) and returns it as an array. b) Find the **diff**erence between each pair of adjacent substrings c) Use the BioPython module to read in a fasta file that has lots of 3' UTRs (we'll calculate these later) and uses the previous functions to generate a full list of the spacings between the pyrimidines. d) Compute the histogram of these spacings (we'll show you how to plot them later).
 * 2) Strings to arrays**

Solutions
code format="python" def two_thirds(an_array): return an_array/1.5
 * a)

from numpy import mean, std from pylab import normal def reject_outliers(an_array): middle = mean(an_array) spread = std(an_array) return an_array[((middle - spread) > an_array) | (an_array > (middle + spread))] # the | combines the two arrays # OR# ret = [] for el in an_array: if -spread < el - middle < spread: ret.append(el) return ret
 * b)

print reject_outliers(normal(0, 1, 1000))


 * c)

from pylab import arange, exp, uniform

x = arange(0,10,.01) y = exp(x)

z = uniform(0,10, len(y)) z = exp(z) z.sort # note that z.sort returns None

print sum(y - z) / len(y)

code

2) Histogram of spacings

code format="python" import numpy as np import sys

def find_all(string, substring): loc = string.find(substring) locs = [] while loc != -1: locs.append(loc) loc = string.find(substring, loc + 1) return np.array(locs)

def find_all_from_list(string, list): all_locs = [] for substring in list: all_locs.extend(find_all(string, substring)) return array(sorted(all_locs))

all_diffs = []

for rec in SeqIO.parse(sys.argv[1], 'fasta'): hits = find_all(rec.seq, 'C') diffs = np.diff(hits) all_diffs.extend(diffs)

print np.histogram(all_diffs, range(0,30)) code