Session 3.2

=**Manipulating Files and Processing Text**=

**Topics:**

 * Basic text processing with **split**, **join**, and **partition**
 * Text testing with **endswith**, **startswith**, **find**
 * Text conversion with **swapcase**, **replace**, **upper**, and **lower**
 * Opening and closing filehandles
 * Reading from the filehandle with **read**, **readline**, and **readlines**
 * Reading from the filehandle iterable
 * Writing or appending to a file with **write** and **writelines**
 * Writing to a file with a loop

Introduction:
We've learned so far how we can write programs to make many, many decisions with an ordered logic to process information. What we've lacked thus far is a way to input and output large tomes of data. In addition to manipulating large amounts of data with functions that open, read, write, and close files, we'll also benefit from learning about Python's marvelously powerful abilities to process text. Not to malign the now-dead king of text-processing languages, Perl (The King is Dead! Long Live the King!), Python really cleans house with its unparalleled text-processing abilities with respect to both speed and ease of use.

**Basic Text Processing**
Systematically manipulating large text files is one of the most common tasks you will encounter. The most basic tools for this task are the built-in Python string **methods**. These allow us to convert between strings and lists, test the properties of strings, and modify strings.

Informative Interlude: getting ahead of ourselves with methods vs. functions
Tomorrow, we're going to learn all about writing our own **functions** to process information. These will be sets of logic that consider variables and manipulate them according to the logic that we assign. In a sense, the functions are formally encapsulated manifestations of the sorts of things we've been writing with our scripts all week.

But, as we're going to see with strings, many types of objects have special built-in functions. We call these endemic functions **methods**, and in a broader discussion of object-oriented programming practice and theory, we would have much, much more to say about them. However, we're not getting into the object-oriented universe or philosophy here, so you'll have to take as explanation simply that some objects are so routinely manipulated with the same sorts of operations that it pays to have functions dedicated to their processing. In the case of strings and files today, we'll see the **methods** that routinely operate on these types.

Whereas a function is written to accept variables and arguments to manipulate those variables with, a method already exists for the object under manipulation and is called differently. Whereas a **function** such as **print** is called by typing **print(string_variable)**, etc., a **method** is called by typing a period and the name of the method at the end of the object. For example, if **print** were a method, it would be called like this: **string_variable.print()**. Notice that there are still **()** at the end of the name of the method, and methods can accept arguments just like functions. If all this seems eerily familiar, it may be because we've already seen the **list** methods **append** and **extend** earlier in the week. All apologies if this seems out of order and confusing, but we'll see how these concepts interoperate in more detail as the week progresses. This is why these paragraphs are in an **I.I.** after all...
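To make the calling convention concrete, here is a small sketch contrasting a function call with a few method calls. The variable names are made up for illustration, and **print** is written with parentheses so the snippet behaves the same under Python 2 and Python 3.

```python
# A function is called with the object handed to it as an argument...
string_variable = "spam and eggs"
length = len(string_variable)          # len is a built-in function

# ...while a method is called from the object itself, after a period.
shouted = string_variable.upper()      # a string method taking no arguments
words = string_variable.split(" ")     # a string method taking one argument

# Lists have methods too, such as append.
shopping = ["spam"]
shopping.append("eggs")

print(length)
print(shouted)
print(words)
print(shopping)
```

Note that the parentheses are what actually run the method, even when there is nothing to put between them.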

Split
Let's consider the task of converting a character string of a sentence into a list of words separated by spaces and punctuation marks:

code format="python"
#!/usr/bin/python

delimiter = ","
string_to_split = "I am a well-written sentence, and so I \
dependably have punctuation. "
list_from_string = string_to_split.split(delimiter)
print "clause one %s" % list_from_string[0]
print "clause two %s" % list_from_string[1]
code

Note that as we've split with a comma, the comma doesn't appear in our list. We can try out what happens with different arguments to **split**.

code format="python"
# we don't need to specify the delimiter in a different variable
list_from_string = string_to_split.split(' ')
for word in list_from_string:
    print word

list_from_string = string_to_split.split('a')
for vowel_handicapped_lump in list_from_string:
    print vowel_handicapped_lump
code

You might also want to take a string and turn it letter-by-letter into a list. Although this isn't done by **split**, it fits nicely here:

code format="python"
list_from_string = list(string_to_split)
for letter in list_from_string:
    print letter
code


**split** can also take a second argument (see, as always, the string methods documentation): you can specify how many times you want to split.

code format="python"
string_to_split = "I am a well-written sentence, and so I \
dependably have punctuation. "
list_from_string = string_to_split.split(' ', 3)
for item in list_from_string:
    print item
code

Now let's see what happens when two delimiters are next to each other:

code format="python"
list_from_string = string_to_split.split('t')
for consonant_crippled_lump in list_from_string:
    print consonant_crippled_lump
code

We can see that we have an empty string in our list: "written," in particular, was split into three parts: ["...wri","","en..."]. If delimiters are adjacent to each other, **split** will find the empty string between them and give it to you at the appropriate spot. It's a very one-hand-clapping-in-a-forest sort of thing.

However, there is an exception to this. If you glanced at the split documentation, you might have noticed that all of its arguments are, in fact, in brackets. That means that it doesn't need arguments to run: it has a default behavior.

code format="python"
list_from_string = string_to_split.split()
for item in list_from_string:
    print item
# this should look the same as splitting by spaces

string_to_split = "  this      is    a   different                         string"
list_from_string = string_to_split.split()
for item in list_from_string:
    print item
# this is not the same as splitting by spaces -- no empty items!

string_to_split = '''  complete \t\t whitespace                     chaos !!!!!!!!!!!        '''
list_from_string = string_to_split.split()
for item in list_from_string:
    print item
code

We see that the default behavior of **split** is to:
 * 1) Remove all kinds of whitespace from the beginning and end of the string.
 * 2) Condense all adjacent whitespaces to single space characters.
 * 3) Split on those spaces.

This turns out to be really handy. For instance, if you're using someone else's table, and, as happens more often than you might want to think, they've done a poor job delimiting their fields systematically with whitespace, this cleans things up quickly and easily in just one line.
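As a rough sketch of that table-cleaning scenario, here is a made-up row (the field values are invented for illustration) split both ways, so you can compare what a single-space delimiter leaves behind with what the default behavior gives you.

```python
# A made-up table row with inconsistent spaces and tabs between fields.
messy_row = "  gene1 \t 1532    +   \t chr2  \n"

# Splitting on a single space leaves empty strings and stray tabs behind.
by_space = messy_row.split(' ')

# Splitting with no arguments trims both ends and treats any run of
# whitespace (spaces, tabs, newlines) as a single delimiter.
fields = messy_row.split()

print(by_space)
print(fields)
```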

You'll learn to extend this power of whitespace to other characters, sets of characters, and all sorts of exotic delimiters.

The **split** method being popular, it has a few hangers-on:

code format="python"
toes = '''went to the market
stayed home
had roast beef
had none
cried wee wee wee all the way home'''

# splitlines splits on linebreaks
list_from_string = toes.splitlines()
for toe in list_from_string:
    print "this little piggy %s" % toe

# when given a second argument, reverse split counts
# from the end of the string
last_toe = "and _this_ little piggy went wee wee wee all the way home"
list_from_string = last_toe.rsplit(' ',7)
for item in list_from_string:
    print item
code

Though the **partition** method isn't named after **split**, it's a very similar method. **partition** works a lot like **split(delimiter,1)**, taking a delimiter and splitting at the first instance. However, while **split(delimiter,1)** will return either a list of length two (if it split successfully) or a list of length one (if it didn't), **partition** will always return a tuple of length three. Let's look at the output.

code format="python"
rhyme = '''There was a crooked man
Who walked a crooked mile.
He found a crooked sixpence
Against a crooked stile.
He bought a crooked cat
Which caught a crooked mouse,
And they all lived together
In a crooked little house.'''

# you can split on words as well as single letters and symbols
split_list = rhyme.split('crooked',1)

print "List output:"
for item in split_list:
    print item

partition_list = rhyme.partition('crooked')
print "Partition output:"
for item in partition_list:
    print item
code

What if the delimiter doesn't occur within the string?

code format="python"
split_list = rhyme.split('happiness',1)
print "List output:"
for item in split_list:
    print item
# I mean, this is like the nursery-rhyme
# equivalent of hangin' under the BART tracks in
# west Oakland.

partition_list = rhyme.partition('happiness')
print "Partition output:"
for item in partition_list:
    print item
code

This can be useful if you are looking for that second item, but you're not sure if it's going to be there. The string could be user generated or read in from a file, and you want to gracefully do one thing if it's there and another if it's not. **split** can be less than graceful about this:

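code format="python"
# split: indexing the result raises an IndexError if the delimiter is missing
if rhyme.split('happiness')[1]:
    pass  # if it's there you're all good
else:
    pass  # if it isn't your program will crash (before it ever gets here)

# vs

# partition: the last item is just an empty string if the delimiter is missing
if rhyme.partition('happiness')[2]:
    pass  # parse the wanted information out of it
else:
    pass  # wait until the next line
code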
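Here is one way that graceful handling might look in practice: a minimal sketch that parses "key=value" lines that may or may not contain the "=". The function name parse_setting and the sample lines are hypothetical, invented for this example. It tests the middle item of the tuple (the delimiter itself) rather than the last one, since a line like "key=" has the delimiter but an empty value.

```python
def parse_setting(line):
    """Return (key, value) for 'key=value'; value is None if there is no '='."""
    key, separator, value = line.partition('=')
    if separator:            # '' (which counts as False) when '=' was not found
        return key.strip(), value.strip()
    return key.strip(), None

print(parse_setting("threads = 4"))
print(parse_setting("verbose"))
```

Because **partition** always hands back three items, the three-way assignment on the first line of the function can never fail, whatever the line contains.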

Join
So now we're pretty good at splitting things up, but how do we put things together again? **join** takes care of that: **it turns lists into strings**. Surprisingly enough, it's not a method of lists. It's a string method, and it relies on the delimiter to know how to put lists together. This little surprise renders the syntax of **join** to be among the most unintuitive of all syntactic trifles, but we will persevere if we concentrate on the fact that just like **split**, **join** is a method of strings.

code format="python"
broken = ['hu','m','pty',' du','mpty']
all_the_kings_horses = 'n~n*^'
all_the_kings_men = '>+O'
first_try = all_the_kings_horses.join(broken)
second_try = all_the_kings_men.join(broken)
if (first_try == 'humpty dumpty') or (second_try =='humpty dumpty'):
    print 'hooray!'
else:
    print '''All the king's horses and all the king's men couldn't put Humpty together again'''
code

**join** can also usefully use the empty string as its delimiter: it glues the components of the list directly together.

code format="python"
third_try = ''.join(broken)
print third_try
# Paradoxically, 'nothing' can put poor Humpty together again
# To summarize, the syntax of join is variable=''.join(list)
code

This is in fact the usual way to use **join** -- you don't need to declare a separate variable to act as the glue.

code format="python"
fairy_tale_characters = ['witch','rapunzel','prince']
plot = 'hair'.join(fairy_tale_characters)
print plot
code
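Putting **split** and **join** together gives a quick way to change delimiters. The sketch below uses made-up data. It also shows one wrinkle worth knowing: **join** only accepts strings, so a list of numbers has to be converted with **str** first.

```python
# Convert a comma-separated line to a tab-separated one: split, then join.
csv_line = "gene1,1532,+,chr2"
fields = csv_line.split(',')
tsv_line = '\t'.join(fields)

# join only glues strings together, so numbers must be converted first.
counts = [3, 14, 15]
count_line = ' '.join([str(count) for count in counts])

print(tsv_line)
print(count_line)
```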

Testing Text: startswith, endswith, and find
We just saw how you can use an if statement to test for the presence of a delimiter with **partition**. There are other tests you will often be interested in, for example asking if a string begins with, ends with, or contains a substring of interest.

code format="python"
#!/usr/bin/env python

id_number = '1131431a'

# let's see if the id_number string starts with the number one
if (id_number[0] == '1'):
    print "this id starts with a 1!"

# now let's use the string method startswith()
if ( id_number.startswith('1') ):
    print "this id starts with a 1!"

# and here's the endswith method
if ( id_number.endswith('1') ):
    print "This id number ends with a 1!"
else:
    print "This id number doesn't end with a 1 at all!"

# and these methods can get a little fancier by having multiple things to
# test for if you provide a tuple of characters
if ( id_number.endswith( ('1', 'a') ) ):
    print "this id number ended with either an 'a' or a '1' "
else:
    pass
code

Or maybe we don't care what the string starts or ends with as long as it contains a substring of interest. For this, we can use the **find** method, which returns the index at which the substring begins. But be careful when you write if tests using **find**: a match at the very start of the string returns the index 0, which an **if** test treats as **False**. __If the substring isn't found at all,__ **__find__** __returns the integer -1, which is not a zero, and thus will pass the **if** test as **True**__.

code format="python"
beatles = "johnpaulgeorgeandringo"

# the wrong way
if ( beatles.find('paul')):
    print "At least we've got a bassist."
else:
    print "Anyone here play bass?"

# let's do a comparison for -1 instead
if not (beatles.find('paul') == -1):
    print "At least we've got a bassist"
else:
    print "Well, I guess we're a three piece."
code
Text Conversions
Systematically replacing the instances of a substring with a replacement substring may be a familiar task of tedium. Python has several methods for systematically converting characters in strings. The most general is the method **replace.**

code format="python"
beatles = 'johnpaulgeorgeandringo'
beatles = beatles.replace('george', 'JUSTIN')
print beatles
# YES! Justin's in!

beatles = beatles + "MOREJUSTIN!"
print beatles.replace("JUSTIN", "DIANA!")
print beatles

# and we can tell replace how many replacements to make, starting at the beginning
print beatles.replace("JUSTIN", "DIANA!", 1)
print beatles

# but notice that replace does not change the string in place; you have
# to reassign the variable to "save" the change
code

Since Python is case sensitive, as are most UNIX-based bioinformatics programs you'll be interested in using, you may also find yourself wishing that all the text in your data was the same case. There are methods for both testing and converting cases.

code format="python"
# why not use something a touch relevant for a change
blast_hit = 'ACTGTCAGTACGTAGCATCGAaaatCGATCGACTGAatacgatCG'

if ( blast_hit.isupper() ):
    pass
else:
    blast_hit = blast_hit.upper()
print blast_hit

# or if you prefer lower case
blast_hit = blast_hit.lower()
print blast_hit

# or if you are (or the program you're writing is) indecisive
blast_hit = blast_hit.swapcase()
print blast_hit

# and we might also be interested in these methods
if ( blast_hit.isalpha() ):
    print "we got all letters here"
else:
    print "whoa, something doesn't look like nucleotides!"
code
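A common reason to convert case is to compare two strings without caring how they were typed. Here is a minimal sketch with two made-up reads. It also exercises **isalpha** on a string containing a gap character, to show a case where the test fails.

```python
# Two made-up reads that differ only in case.
read_one = 'ACTGaaatCG'
read_two = 'actgAAATcg'

# A direct comparison is case sensitive...
same_as_typed = (read_one == read_two)

# ...so convert both to one case before comparing.
same_ignoring_case = (read_one.upper() == read_two.upper())

# The test methods need their parentheses too; without them you get the
# method object itself, which always counts as True in an if test.
is_all_letters = read_one.isalpha()
has_non_letter = not 'ACTG-N'.isalpha()

print(same_as_typed)
print(same_ignoring_case)
```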

Files and Filehandles
Now that we can process text, all we need is... more text. And odds are, that text is going to come in the form of a file, so it's high time that we start using files.

Opening filehandles
A **filehandle** is an object that controls the stream of information between your program and a file stored somewhere on the computer. Filehandles are not filenames, and they are not the files themselves. They are a tool that your program uses to interact with files, nothing more (for instance, deleting a filehandle in your script using the del command does nothing to the file that handle refers to).

We create filehandles in the simplest sense with the open command:

fh = open('some_file')

where some_file is the path to a file (i.e. the filename) on your filesystem. In general, it is good practice to use absolute path nomenclature (e.g. /Users/aaron/some_file or /home/aaron/some_file), but you can be lazy if you know the file you want is going to be in the same directory as your program.

$ **touch hello.py**

code format="python"
#!/usr/bin/env python

fh = open('hello.py')
contents = fh.read()
print contents
fh.close()
code

$ **./hello.py**
//#!/usr/bin/env python//

//fh = open('hello.py')//
//contents = fh.read()//
//print contents//
//fh.close()//

As you can see, the **read** method of the filehandle just sucks in the whole file in a single string, newlines and all! This is quick and easy, for sure, but it's not necessarily the most orderly way to deal with the contents of a file.

**readline**, **readlines**, and strip
Copy the contents of the following snippet to a text file in your directory for this session, and save the file as **pdb_head**.

HEADER OXIDOREDUCTASE 08-JUL-97 1AOP
TITLE SULFITE REDUCTASE STRUCTURE AT 1.6 ANGSTROM RESOLUTION
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: SULFITE REDUCTASE HEMOPROTEIN;
COMPND 3 CHAIN: A;

Then try the following:

code format="python"
#!/usr/bin/env python

filename = 'pdb_head'
fh = open(filename, 'r')
# the 'r' is for 'read-only', which will keep us from being able to alter
# this file with the filehandle we just created

print fh.readline()
print fh.readline()

lines = fh.readlines()

fh.close()

print lines
code

$ **./hello.py**
//HEADER OXIDOREDUCTASE 08-JUL-97 1AOP//

//TITLE SULFITE REDUCTASE STRUCTURE AT 1.6 ANGSTROM RESOLUTION//

//['COMPND MOL_ID: 1; \n', 'COMPND 2 MOLECULE: SULFITE REDUCTASE HEMOPROTEIN; \n', 'COMPND 3 CHAIN: A; \n']//

While this is a bit of a mess, a few things should become apparent:
 * 1) fh.readline() takes in one line (and since **print** also supplies a newline, we've got an extra linebreak after each of the first two print statements).
 * 2) fh.readlines() (plural!) takes the rest of the file, from the current read position all the way to the end, giving back a list of lines (again, with newlines intact).
 * 3) This file has a bunch of whitespace cluttering things up at the end of each line.

All of these complications are easily resolved with the use of the **strip** method whenever we actually make use of the lines we read:

code format="python"
#!/usr/bin/env python

filename = 'pdb_head'
fh = open(filename, 'r')

print fh.readline().strip()
print fh.readline().strip()

lines = fh.readlines()

fh.close()

lines[0] = lines[0].strip()

print lines
code


$ **./hello.py**
//HEADER OXIDOREDUCTASE 08-JUL-97 1AOP//
//TITLE SULFITE REDUCTASE STRUCTURE AT 1.6 ANGSTROM RESOLUTION//
//['COMPND MOL_ID: 1;', 'COMPND 2 MOLECULE: SULFITE REDUCTASE HEMOPROTEIN; \n', 'COMPND 3 CHAIN: A; \n']//

Now the spaces and newlines are gone from the first two, and from the 0th element of the list I printed in the last print statement (since I only bothered to **strip** and put back the 0th element).

One crucially important concept of file input in Python is that each time you read something by any of the three methods I've described, you advance the position of the filehandle in the file, which means that you never get the same character or characters twice (unless of course they're in the file twice!)

This is why reading from the filehandle with fh.readline() twice in a row gave two different values; as soon as the line is read, the filehandle has moved to the next line, awaiting another read request. This is an example of an **iterable** type, meaning that the filehandle is a type of object that knows how to advance itself in anticipation of the next request. That means that to get back to the beginning of the file, you must either close the file with the **close** method and reopen it, or use the **seek** method of the filehandle (which we don't have time to go into -- Google is your friend!)
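For the curious, here is a minimal sketch of the filehandle advancing and then being sent back to the start with **seek**. It writes its own throwaway file using the standard **tempfile** module, so the file name and contents are just stand-ins, not part of the lesson's data.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
handle, path = tempfile.mkstemp()
os.close(handle)
fh = open(path, 'w')
fh.write("line one\nline two\n")
fh.close()

fh = open(path)
first = fh.readline()     # reads 'line one\n' and advances
second = fh.readline()    # reads 'line two\n' and advances
at_end = fh.readline()    # nothing left: returns the empty string

fh.seek(0)                # jump back to the very start of the file
first_again = fh.readline()
fh.close()
os.remove(path)

print(first.strip())
print(first_again.strip())
```

Notice that reading past the end doesn't crash; it just hands back an empty string.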

While potentially a bit odd now, this behavior will be essential when we discuss reading file contents with loops.... oh, speaking of...

**Reading files in a loop**
Certainly one of the most common contexts in which you'll encounter for loops is in working your way through a file. You can just put together two things we've already seen to get to where we need to be:

code format="python"
#!/usr/bin/env python

fh = open('pdb_head')
lines = fh.readlines()
for line in lines:
    fields = []
    fields.append(line[0:6].strip())
    fields.append(line[6:10].strip())
    print '0th field: %s, 1st field: %s' % (fields[0],fields[1])
fh.close()
code

$ **./hello.py**
//0th field: HEADER, 1st field: OXI//
//0th field: TITLE, 1st field: SULF//
//0th field: COMPND, 1st field: MOL//
//0th field: COMPND, 1st field: 2 M//
//0th field: COMPND, 1st field: 3 C//

This is starting to get a little fancier, but we're only doing things you've seen before: read all the lines in a file into a list, then iterate over the list, looking for a couple of different parts of the line, stripping off leading and trailing whitespace, then printing the first and second elements of the resulting list.

We can simplify this one more step using the fact that filehandles are **iterable**, and know what's being asked of them. So we can replace this:

code format="python"
lines = fh.readlines()
for line in lines:
code

with

code format="python"
for line in fh:
code

to exactly the same end.
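Here is that shortcut in a complete, runnable form. This is a sketch only: it writes a two-line stand-in file with the standard **tempfile** module rather than assuming pdb_head is present, and it uses **split** to pull out the first word of each line.

```python
import os
import tempfile

# Made-up two-line input, written to a temporary file for the demo.
handle, path = tempfile.mkstemp()
os.close(handle)
fh = open(path, 'w')
fh.write("HEADER OXIDOREDUCTASE\nTITLE SULFITE REDUCTASE\n")
fh.close()

first_words = []
fh = open(path)
for line in fh:                      # one line per trip through the loop
    fields = line.strip().split()
    first_words.append(fields[0])
fh.close()
os.remove(path)

print(first_words)
```

Looping over the filehandle directly also means the whole file never has to sit in a list at once, which starts to matter when your files get big.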

**Writing to Files**
Writing output is sorta like doing the dishes. You just did all this work to cook up a fancy program and analyze some data, and the last thing you want to do is put all your answers away into clean little output files. Fortunately, we'll learn about **pickle** files later, but for now, we'd best make sure you know how to write output to a file.

The default behavior of the filehandle is to open the file supplied in read mode. However, by giving an additional argument, you can either add lines to the bottom of the specified file, or overwrite it entirely:

code format="python"
#!/usr/bin/env python

filename = 'test_out'
fh = open(filename, 'w')
# 'w' flag means "writeable"

fh.write('Historically, this lesson was used as a medium to hurl insults between')
fh.write(' Matt and our former labmate Brant.\n')
# note that we have to add the '\n' if we want it at the end of the line;
# this is in contrast to the print command's behavior.

fh.close()

filename = 'test_out2'
fh = open(filename, 'a')
# 'a' flag means "append"

fh.write("Unfortunately, I have no beef with Peter, so this section is a bit mundane.\n")

fh.close()
code

While this script doesn't print anything to the screen, if you run it a few times and look at the contents of test_out vs test_out2, the distinction between the 'w' and 'a' arguments to **open** should become clear.
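If you'd rather see the difference without running anything by hand, the sketch below imitates three runs of such a script inside a loop and then counts the lines in each file. The temporary directory and the "hello" text are stand-ins chosen for this example.

```python
import os
import shutil
import tempfile

directory = tempfile.mkdtemp()
write_path = os.path.join(directory, 'test_out')
append_path = os.path.join(directory, 'test_out2')

# Pretend to run the same script three times.
for run in range(3):
    fh = open(write_path, 'w')      # 'w' starts from an empty file each time
    fh.write("hello\n")
    fh.close()

    fh = open(append_path, 'a')     # 'a' adds to whatever is already there
    fh.write("hello\n")
    fh.close()

fh = open(write_path)
written = fh.readlines()
fh.close()
fh = open(append_path)
appended = fh.readlines()
fh.close()
shutil.rmtree(directory)

print(len(written))
print(len(appended))
```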

When reading files, the **close** method is a good thing to keep in mind, but if you forget it, Python will close the file at the end of the program's execution. With writing files, however, Python may not make the changes you stipulate right away (writes are buffered), so if you plan to evaluate the contents of the file you're writing in the same script (or for instance use that file for something else during the run of that script) it is wise to close the filehandle to ensure that all the write operations you've requested are performed.
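One way to make sure a filehandle gets closed is Python's **with** statement, which this lesson doesn't otherwise cover; treat the following as an optional sketch rather than the approach used in the rest of the session. The temporary file is a stand-in created just for the demo.

```python
import os
import tempfile

handle, path = tempfile.mkstemp()
os.close(handle)

# The with block closes the filehandle as soon as the indented code
# finishes, so any pending writes are flushed out to the file.
with open(path, 'w') as fh:
    fh.write("written and closed\n")

closed_after_block = fh.closed      # True once the block has ended

# Because the first handle is closed, reading the file back is safe.
with open(path) as fh:
    contents = fh.read()

os.remove(path)
print(contents.strip())
```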

While Python has no writeline method, the other two read methods are mirrored for writing to files. The first, **write**, you've already seen. It takes a string, and puts it in a file. The only difference between this and **writelines** is that **writelines** takes a list of strings, and writes them all. (But beware! If you want those strings to appear on separate lines, they had best all end with a \n!)

code format="python"
#!/usr/bin/env python

filename = 'test_out'
fh = open(filename, 'w') # 'w' flag means "writeable"

lines = ["Justin is a friendly dude.\n", "You'd better be one too.\n"]
lines.extend(["Or next year, he might use this space\n",
              "to write a phish song about you.\n"])

fh.writelines(lines)

fh.close()
code

And check out the contents of test_out to see your many-line-writing machine in action!
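To see why those newlines matter, here is a small sketch that writes the same list twice: once as-is with **writelines**, and once glued together with **join**. The list of names and the temporary file are stand-ins for this example.

```python
import os
import tempfile

handle, path = tempfile.mkstemp()
os.close(handle)

names = ['Nate', 'Matt', 'Aaron']    # no trailing newlines

fh = open(path, 'w')
fh.writelines(names)                 # written back to back on one line
fh.write('\n')
fh.write('\n'.join(names) + '\n')    # one name per line
fh.close()

fh = open(path)
lines = fh.readlines()
fh.close()
os.remove(path)

print(lines)
```

Joining on '\n' is a handy trick when your list items don't already carry their own line endings.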

**Exercises**

**1. Pile of basic split drills:**


 * Turn 'Humpty Dumpty sat on a wall' into ['Humpty','Dumpty','sat','on','a', 'wall']
 * Turn 'Humpty Dumpty had a great fall' into ['Humpty Dumpty had a ', ' fall']
 * Turn "All the King's horses" into ["All the King's hor",'e',''] (note: there is still an "s" at the end of "King's")
 * Turn "and all the King's men" into ['and a',''," the King's men"] (note: there is a space at the beginning of " the King's men")
 * Turn "couldn't put Humpty together again" into 'again' (using one line)


**2. Pile of basic split, join, and replacement drills:**


 * Turn 'Terry RichPrice Matt\n' into 'Chris\tAdamRoberts\tPeter'
 * Turn 'Matt,Nate,Aaron' into 'MATT\tNATE\tAARON\t'


**3. Using the names of all seven instructors and TA's (Nate, Matt, Aaron, Peter, Adam, Aisha, Chris), write each possible pair of names to a file, separated by a line of hyphens (i.e. '-')**


**4. Reopen the last output file, and read in the file, then write the lines back out (to a new file) in reverse order, in all capital letters.**


**5. Parse a FASTA file**

Copy the text below into a text file and save it as //seq.FASTA//

code format="python"
>gene1
ATGAGACGTAGTGCCAGTAGCGCGATGTAGCG
ATGACGCATGACGCGCGACGCGCGAGTGAGCC
ATACGCACGCATTGGCA
>gene2
ATGTTCGACGCATACGACGCGCAGTACCAGCA
ATGACGCACCGGGATACACGACGCGGATTTTT
ACGCACCGAGATAGCATAAAAGACCATTAG
>gene3
TTATGGCACCCACTAGAGCCAGATTATTTTAAA
code

Write a script called //read_fasta.py// that will open this file, read the lines, and store the data as a dictionary keyed by gene with values of the sequence. Make sure the sequences are contiguous (i.e. contain no endline characters), and make sure to remove the > from the names of the genes.

**Solutions**

**1. Pile of basic split drills:**
 * Turn 'Humpty Dumpty sat on a wall' into ['Humpty','Dumpty','sat','on','a', 'wall']
 * Turn 'Humpty Dumpty had a great fall' into ['Humpty Dumpty had a ', ' fall']
 * Turn "All the King's horses" into ["All the King's hor",'e',''] (note: there is still an "s" at the end of "King's")
 * Turn "and all the King's men" into ['and a',''," the King's men"] (note: there is a space at the beginning of " the King's men")
 * Turn "couldn't put Humpty together again" into 'again' (using one line)

code format="python"
# 1 Pile of basic split drills:
# Turn 'Humpty Dumpty sat on a wall' into ['Humpty','Dumpty','sat','on','a', 'wall']

s='Humpty Dumpty sat on a wall'
split_string=s.split()
print split_string

# Turn 'Humpty Dumpty had a great fall' into ['Humpty Dumpty had a ', ' fall']

s2='Humpty Dumpty had a great fall'
print s2
split2=s2.split('great')
print split2

# Turn "All the King's horses" into ["All the King's hor",'e',''] (note: there is still an "s" at the end of "King's")

s3="All the King's horses"
split3=s3.rsplit('s',2)
print split3

# Turn "and all the King's men" into ['and a',''," the King's men"] (note: there is a space at the beginning of " the King's men")

s4="and all the King's men"
split4=s4.split('l')
print split4

# Turn "couldn't put Humpty together again" into 'again' (using one line)

s5="couldn't put Humpty together again"
split5=s5.partition('again')[1]
print split5
code


**2. Pile of basic split, join, and replacement drills:**


 * Turn 'Terry RichPrice Matt\n' into 'Chris\tAdamRoberts\tPeter'
 * Turn 'Matt,Nate,Aaron' into 'MATT\tNATE\tAARON\t'

code format="python"
string2='Terry RichPrice Matt\n'
print string2

string2=string2.replace('Terry ','Chris\t')
string2=string2.replace('RichPrice ','AdamRoberts\t') # include the trailing space so it doesn't linger before 'Peter'
string2=string2.replace('Matt\n','Peter')
print string2

# of course, this replacement could also be done in one step, but that sort of feels like cheating, doesn't it?:

string2=string2.replace(string2,'Chris\tAdamRoberts\tPeter')
print string2

# For second part: Turn 'Matt,Nate,Aaron' into 'MATT\tNATE\tAARON\t'

names='Matt,Nate,Aaron'
print names
names=names.upper()
names=names.replace(',','\t')+'\t'
print names
code


**3. Using the names of all seven instructors and TA's (Nate, Matt, Aaron, Peter, Adam, Aisha, Chris), write each possible pair of names to a file, separated by a line of hyphens (i.e. '-').**

code format="python"
fh=open('names','w')
list_of_names=['Nate','Matt','Aaron','Peter','Adam','Aisha','Chris']
# Use a for loop within a for loop to pair one name with every other name in the list.
for name in list_of_names:
    for name2 in list_of_names:
        if name==name2:
            pass
        else:
            line=name+'-'+name2
            fh.write(line+'\n')

fh.close()
code


**4. Reopen the last output file, and read in the file, then write the lines back out (to a new file) in reverse order, in all capital letters.**

code format="python"
fh=open('names','r') #Open the previous file 'names' with read-only status
fh2=open('names_reverse','w') #Open a new file with the 'w' flag, in which you will write the reversed names
lines=fh.readlines() #Store the lines of fh into a list using readlines
lines.reverse() #Reverse the order of these lines using the reverse method of lists
for line in lines:  #Loop through the list and rewrite each item as a new line; don't forget the '\n'!
    line=line.strip().upper() #strip removes the '\n' that readlines kept, so exactly one gets added back below
    fh2.write(line+'\n')

fh.close()
fh2.close()
code


**5. Parse a FASTA file**

code format="python"
fh=open('seq.FASTA','r') #Open the fasta file for reading
genes={} #Create an empty dictionary that will be populated with 'genes' as keys and sequences as values
for line in fh:    #Parse each line of the fasta file by looping through the file
    line=line.strip() #Strip to remove all whitespace and newline characters
    if line.startswith('>'): #Look for the '>' at the start of the line to tell you if you're dealing with a gene or a sequence
        gene_name=line[1:]  #Define a new variable, gene_name, that is equal to everything in the line except the '>'.
        genes[gene_name]='' #Make gene_name a key in your dictionary, and use an empty string as a value placeholder
    else:
        genes[gene_name]+=line #Add the subsequent sequence to the value
fh.close()

print genes
code