Session+10.2

=How to make fancy figures=

Okay, there's a lot more to this than we can realistically cover in an afternoon, but there are lots of things you can do to make really cool figures in Python. We're going to be using matplotlib, which is a plotting library that took a lot of the plotting functionality from the popular MATLAB software, re-wrote it in Python, and (in my opinion) made it about 10 times saner and easier to use.

Nevertheless, we'll try to cover the basics, and give you a good enough understanding of how everything is set up so you can look at the [|extensive gallery of examples] and figure out how to make similar plots with your own data.

Basic plotting
We're going to do a lot of this plotting in the "pylab" mode of IPython. This is meant to be set up pretty close to the MATLAB environment, in terms of plotting and doing other math heavy stuff. Again, the way to start this is:

$ **ipython --pylab**

Let's say we have some data that's approximately a line, but there's some noise in it:

code format="python"
 * 1) Note that arange and rand are already in the namespace.
 * 2) ipython --pylab has, among other things, an implicit
 * 3) from pylab import *
 * 4) in its setup

x = arange(0,100) y = 0.5 * x + 5 + 10*rand(len(x)) plot(x,y) code

That was pretty easy! Now, let's say we don't want to have lines connecting each data point, but instead just a marker. We can look at the documentation for plot like we would anything else in IPython:

code In [6]: plot?

Type: function Base Class:  String Form: Namespace: Interactive File: /Library/Frameworks/.../site-packages/matplotlib/pyplot.py Definition: plot(*args, **kwargs) Docstring: Plot lines and/or markers to the
 * class:`~matplotlib.axes.Axes`. *args* is a variable length

argument, allowing for multiple *x*, *y* pairs with an optional format string. For example, each of the following is legal:: plot(x, y) # plot x and y using default line style and color plot(x, y, 'bo') # plot x and y using blue circle markers plot(y) # plot y using x as index array 0..N-1 plot(y, 'r+') # ditto, but with red plusses

If *x* and/or *y* is 2-dimensional, then the corresponding columns will be plotted.

An arbitrary number of *x*, *y*, *fmt* groups can be specified, as in:: a.plot(x1, y1, 'g^', x2, y2, 'g-') Return value is a list of lines that were added. The following format string characters are accepted to control the line style or marker:

=
=== =============================== character description

=
=== =============================== ``'-'`` solid line style ``'--'`` dashed line style ``'-.'`` dash-dot line style ``':'`` dotted line style ``'.'`` point marker ``','`` pixel marker ``'o'`` circle marker ``'v'`` triangle_down marker ``'^'`` triangle_up marker ... code

You can see that there's a lot of different things you can do for something as simple as plotting... Markers, colors, lines. If you keep reading, you can even incorporate labels for the lines. Let's try this code, now, and see what it looks like:

code format="python"
 * 1) Note that arange and rand are already in the namespace.
 * 2) ipython --pylab has, among other things, an implicit
 * 3) from pylab import *
 * 4) in its setup

x = arange(0,100) y = 0.5 * x + 5 + 10*rand(len(x)) plot(x,y, 'bo', label="Blue circles")
 * 1) Alternatively:
 * 2) scatter(x,y,label="Circles")

from scipy import stats

r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y) plot(x, x * r_slope + r_int, 'g-.', label='Dash-dotted Line')

code

Hey, what about the labels?

code format="python" legend code

Or, if we decide that we don't like the labels that we gave it before:

code format="python" legend(["Noisy data", "Linear regression"]) code

Or, with some other tweaks in the call to the **legend** function

code format="python" legend(["Noisy data", "Linear regression"],      loc='lower right',       numpoints=1) code

=Making our own plotting functions=

You know how in papers, they will sometimes have a kind of fancy figure, and then they'll have things in the same style, but for a bunch of different ways of slicing and dicing their data? It's really pretty effective scientific story-telling. It allows them to connect all those figures together conceptually, and readers only have to look for the relevant differences.

The thing is, if you're going to actually **make** those figures, it can be annoying to tweak the plots in the same way every time. Fortunately, we've spent almost two weeks now taking boring things that a person could do and making the computer automate them.

Plotting the average gene coverage
Now yesterday, we just looked at the average level of reads upstream, downstream, and within the genes. But there's a lot more data there than just number of reads: there's also the positioning of those reads. If the transcriptional machinery is falling off of the DNA, it would be nice to look at the plot and estimate what the rate of falling off is. So what we'll do is take the upstream, gene_cov, and downstream variables from CalcFalloff yesterday and plot them all together, giving us a sort of "average" gene.

code format="python" x_upstream = arange(-len(upstream), 0) plot(x_upstream, upstream) x_gene = arange(0, len(gene_cov)) plot(x_gene, gene_cov) x_downstream = arange(len(gene_cov), len(gene_cov) + len(downstream)) plot(x_downstream, downstream) code

Now, that will give us what we want, but it's a bit tedious to type that out, or even to scroll up in IPython every time we want to do that, so let's make ourselves a nice function that does it. We'll be putting it in a PlotUtils.py module, so let's go ahead and start that as well.

code format="python" import numpy as np from matplotlib import pyplot as mpl

def plot_averaged_genes(upstream, gene_cov, downstream) x_upstream = np.arange(-len(upstream), 0) mpl.plot(x_upstream, upstream) x_gene = np.arange(0, len(gene_cov)) mpl.plot(x_gene, gene_cov) x_downstream = np.arange(len(gene_cov), len(gene_cov) + len(downstream)) mplplot(x_downstream, downstream) code

Informative Interlude: Differences between IPython with and without --pylab
If you start IPython without the --pylab flag, and you discover that you want to plot things, you have a couple options. By far the best of these is to use the **%pylab** magic word, which will load all the pylab related things that you need. Next best is to use **from pylab import** *, which will load all the plotting functions, but they won't work quite properly for interactive plotting. Instead, every time you want to see what you've plotted, you'd need to use the **show** function, which will block anything else you do until you close the window. If you want to have a program that does its plotting, then saves the figures out, then you can do something like the above, where you have

code format="python" import numpy as np import matplotlib.pyplot as mpl


 * 1) Plotting code here

mpl.savefig('myfile.jpg') code

I think that in a perfect world, someone ought to be able to run a script and have almost all their figures for a paper just pop out, with only relatively minor tweaking. Then, if that code were available too, it should be possible for a reasonably savvy reviewer to see that all the data is on the up-and-up.

Now one thing that's kind of funny here is that the different plots are all different colors, despite the fact that they're from the same sample. Now, we could take the color as an argument to the function, but that means we have to specify the color every time, which is also kind of a hassle. So instead, we'll plot the first one using whatever color matplotlib thinks is best (internally, it keeps a track of the last few colors it plotted, and rotates through a list of about 10), and then plot the rest using that color.

code format="python" def plot_averaged_genes(upstream, gene_cov, downstream) x_upstream = np.arange(-len(upstream), 0) x_gene = np.arange(0, len(gene_cov), color=color) x_downstream = np.arange(len(gene_cov), len(gene_cov) + len(downstream), color=color) result = mpl.plot(x_upstream, upstream) color = result[0].get_color mpl.plot(x_gene, gene_cov) mpl.plot(x_downstream, downstream) code

By the way, notice how we actually stored the result of plot in a variable? If we look at the documentation on plot again, we see that its "Return value is a list of lines that were added". What if, after we're done with this function, we still want to tweak the results. It's polite to return a list of everything that we've plotted, just the way that plot does. Therefore, we'll keep extending our list:

code format="python" def plot_averaged_genes(upstream, gene, downstream): retval = []

x_upstream = np.arange(-len(upstream), 0) x_gene = np.arange(0, len(gene), color=color) x_downstream = np.arange(len(gene), len(gene) + len(downstream), color=color)

retval.extend(mpl.plot(x_upstream, upstream)) color = retval[0].get_color retval.extend(mpl.plot(x_gene, gene, color=color)) retval.extend(mpl.plot(x_downstream, downstream, color=color))

# Place a separator between the averaged upstream region and the gene, # and the gene and the downstream region ymin, ymax = mpl.gca.get_ybound retval.extend(mpl.vlines([0, len(gene)], ymin, ymax, linestyles='dashed')

return retval code

In the code here, I've done something more: I put in some vertical dashed lines to visually separate out the gene region from the UTRs. It also illustrates an important point: we can ask the stuff we've plotted things about how big it is, etc. Because we never explicitly said how tall to make the y axis, it could be anything, so it's useful to be able to go in and programmatically pull that out. If we switch back over to the interactive interpreter, we can see a lot of get_ methods on the current axis.

Informative Interlude: How Matplotlib is laid out
As long as we're grabbing information from the axes, it's worth spending a few moments talking about how Matplotlib is organized. We're going to use the code from the [|Log plots example] to make a pretty looking set of pixels:

The window that this is being plotted in corresponds to a **Figure**. This is everything inside the window (but not the tools on the bottom), and when you want to save an image to the disk (so you can include it in your manuscript), this is what actually gets saved. Figures control things like the size of the image if you print it out, and the resolution (for on screen, something like 72 dpi is fine, but if you're printing, you want it to be more like 300).

A **Figure** can contain zero or more sets of **Axes**, which are the subplots. In this case, we have four. A set of Axes is usually what you'll want to be trying to modify. Axes have properties like x and y limits, a set of major (and sometimes minor) ticks for the x and y axes, an optional legend, and lots more data. Furthermore, each axis has its own plotting commands, which are very similar to the top-level commands, but lets you be certain that they're going within the same Axes. This is particularly important if you have several subplots going on, each of which could conceivably receive the plotting data.

This is one of those places where we can't spend all our time talking about every single feature available, but the inline documentation is pretty good, and the gallery of examples is also really helpful if you kinda know what you want to do.

So now that we've got everything plotted together nicely, it would be nice to have the x-axis not be labelled wrong. The upstream region is fine, but it's not clear that the gene and the downstream are actually measured in two different units (percent of the gene, for the gene region, and bp for the downstream). So let's make a helper function:

code format="python" def relabel(tick, gene_length): if tick < 0: return "%d bp" % tick elif tick == 0: return "-0 bp/0%" elif tick < gene_length: return "%d%%" % (float(tick) / gene_length * 100) elif tick == gene_length: return '100%/+0 bp' else: return "+%d bp" % (tick - gene_length)

code

For this function, you pass in an x-value and the length of the gene, and it will determine whether that x-value is in the upstream, gene, or downstream region, and give you the appropriate string label in either basepairs or % of the gene.

Now that we have our helper function, we're just going to tack a bit of code on to the end of plot_averaged_genes:

code format="python" def plot_averaged_genes(upstream, gene, downstream): ax = mpl.gca # Get current Axes ...   ticks = list(ax.get_xticks) # Ensure we have at least the full gene labelled ticks.extend([0, len(gene) / 2.0, len(gene)]) ticks = np.unique(ticks) ticklabels = [relabel(tick, len(gene)) for tick in ticks] ax.set_xticks(ticks) ax.set_xticklabels(ticklabels) mpl.draw_if_interactive code

The final **mpl.draw_if_interactive** is because when we use the **ax.set_** functions, nothing actually gets updated on the screen. That tells matplotlib that it should be redrawn. However, if it's not in interactive mode, then it's okay that it doesn't get updated on the screen. Whenever someone calls **mpl.show** or **mpl.savefig**, it will actually calculate the pixels that need to be drawn. Before that, it's just represented internally as a collection of things that need to be drawn somehow.

Making a histogram of the 3' UTR coverages
So the other figure we said we'd make was a comparison of the different cases to see what the up-regulation of 3' UTRs looks like on a gene-by-gene basis. Earlier, we calculated the ratio of the number of reads in each of the drug conditions to the no-drug condition (and we used the no drug, 60 minute sample as a negative control). Making a histogram of these is really very easy:

code format="python" >>> hist(log10(ratios_0to0)) code

As we can see, that made a histogram with 10 bars. Of course, those 10 bars are all the same width, and the boundaries are just arbitrarily placed based on the highest and lowest data points. More often, what we want to do is have bars that come at regularly spaced, sane intervals:

code format="python" >>> clf >>> hist(log10(ratios_0to0), arange(-3, 4, .1)) code

From here, if we want to add more histograms on top of the current one, we can just call hist again with the new data.

It is possible to go back through the returned values from each of those histogram objects and make sure that at each position, the highest one is in the back and the shortest is in the front, but it's a pain in the butt. Having tried it before, I'm also unconvinced that it leads to clear, intelligible plots. You're free to check out my personal version over at [|GitHub].

Another approach would be to calculate the histogram values, then **plot** them as a sort of density function. You can use the return values of either histogram (which only calculates the histogram) or hist (which calculates **and draws** the histogram) to get out what you need.

Finally, you can draw partially transparent histogram bars. If you pass in the keyword argument "alpha" to **hist**, then the bars are drawn partially transparent, depending on what that argument is.

code format="python" hist(log10(ratios100to0), arange(-3, 4, .1), alpha=.4) code

Alpha can be any number between 0 and 1, where 0 is fully transparent (in which case why are you even drawing it?), and 1 is fully opaque (this is the default). Most of the matplotlib functions know how to deal with this alpha property, so that can sometimes be useful when your plots start getting visually busy.

=Exercises=


 * 1) Create two sets of paired data from a Poisson distribution with lamba=5 and n=1000: one that is correlated and one that is not. In the correlated set of data, the correlation should not be perfect. Then make a plot that looks like the following for each paired data set:
 * 1) Calculate a z-score for the values in each data set and color the points red if the absolute value of the z-score is less than 0.05. Then use the YlOrRd cmap to make the points with the largest z-scores Yellow and the smallest z-scores Red. Z-scores can be calculated using scipy.
 * 2) Any other analyses you can think of to do with our data? Come talk with us!