I. File input and output

In this section we deal with reading from and writing to files. The first thing we need to do is make a file object that allows us to read from a file, write to it, or both.

A. Open a file for writing & write to the file

>>>f = open('tmp_file.txt', 'w')

This little code snippet creates a file object named f that refers to a file on your computer named 'tmp_file.txt'. The built-in function open takes two arguments: the string name of the file, and a string describing the way in which the file will be used.

For our file object f, the file name is 'tmp_file.txt' and the second argument 'w' indicates that we can write to the file. Other mode strings you can pass to open instead of 'w' include:

'r'         opens a file for reading
'r+'       opens a file for reading and writing
'a'         appends to a file
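
As a rough sketch of how these mode strings differ, the following writes a file, appends to it, and reads it back. The file name 'modes_demo.txt', the temporary directory, and the use of f.read() to fetch the whole file at once are assumptions made for this illustration; they are not part of the example above.

```python
import os
import tempfile

# Hypothetical file name, placed in a temporary directory so that
# nothing in your working directory is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

f = open(path, 'w')      # 'w' creates the file (or empties an existing one)
f.write('first line\n')
f.close()

f = open(path, 'a')      # 'a' adds to the end of the existing contents
f.write('second line\n')
f.close()

f = open(path, 'r')      # 'r' opens the file for reading only
contents = f.read()      # read the whole file as one string
f.close()

print(contents)
```

Note that opening the same file again with 'w' would erase both lines, which is why 'a' is used for the second write.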

B. Using file object methods (special functions for files)

Once you open a file for writing, the next thing you probably want to do is write something to it! To write to a file, you simply use the built-in Python file object method called write().

>>>f.write('My name is George Castanza. \n')

Note: the write method only accepts strings, as in f.write(string), not other data types or objects. For instance, f.write(5) would not work. Of course, sometimes you might want to write a number to a file, or something else like a list. In that case, check out the built-in repr() function, which returns a string representation of its argument.
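
A minimal sketch of the repr() approach, assuming a hypothetical file name 'repr_demo.txt' in a temporary directory: repr() turns a number or a list into a string that write() will accept.

```python
import os
import tempfile

# Hypothetical file name used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'repr_demo.txt')

f = open(path, 'w')
f.write(repr(5) + '\n')           # repr(5) is the string '5'
f.write(repr([1, 2, 3]) + '\n')   # repr of a list is the string '[1, 2, 3]'
f.close()

# Read the file back to see exactly what was written.
f = open(path, 'r')
written = f.read()
f.close()

print(written)
```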

Other useful file object methods:

f.readline()         reads a single line from the file
f.readlines()       returns a list containing all the lines of a file
f.close()              closes a file (after reading and writing)

[Note: Remember to close the file after you are done writing and reading from it.]
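
To see how readline() and readlines() differ, here is a small sketch. It assumes a hypothetical three-line file, 'lines_demo.txt', created in a temporary directory just for the demonstration.

```python
import os
import tempfile

# Hypothetical file name used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')

f = open(path, 'w')
f.write('alpha\nbeta\ngamma\n')
f.close()

f = open(path, 'r')
first = f.readline()     # one line, including its trailing '\n'
rest = f.readlines()     # list of the lines that have not been read yet
f.close()                # always close the file when you are done

print(repr(first))
print(rest)
```

Each line keeps its newline character, so strip it off (for example with strip) before converting the text to a number.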

File Exercise:
(1) Make a file object called outfile that opens a file called 'my_file.txt' for writing.

(2) Use the built-in functions range() and repr() to write the numbers 1 to 20 to my_file.txt.

(3) Close the file.

(4) Open my_file.txt for reading and print out every line of the file multiplied by 5.

[Hint: for line in input.readlines():]

What happens when you multiply the lines of the file by 5?

II. String and Re modules

Python has a whole library of modules with useful functions for all sorts of tasks, such as processing strings, opening and retrieving web pages, calling system commands and so forth. All these modules and descriptions are known as the Python Standard Library, and details are found in the Python Library Reference on python.org (documents link).

Two of the most useful modules for processing biological data, and any text information, are the string and re modules.

string           processes string objects
re                regular expression module (pattern searching)

A. String Module

Using the functions included in these modules (also known as "attributes" and "methods") is just like using the attributes for file objects or list objects. The only difference is that you must first use the import statement to bring in the modules.

Here is an example of using the string module to change a lowercase string to uppercase:

import string

sequence='agtttccagat'

new_seq=string.upper(sequence)

Not bad, eh? All the modules work this way - you use the "." operator after the name to access the methods of the module.

IMPORTANT: Before using an attribute of a module (or any function you use) there are three key things you need to know about the attribute. (1) How many arguments does it take? (2) What types of data do the arguments have to be? (3) What types of values does the function return?

In the above case the string.upper method takes one argument of type string and returns a string (uppercase). Here are a few useful string methods, using the terminology in the library reference:

strip(s)         Takes one string argument, returns a copy of s without leading or trailing whitespace characters.

replace(str, old, new)  Takes three arguments, all strings. Returns a copy of str with all occurrences of substring old replaced with new.

split(s [, sep[,maxsplit]]) Takes up to three arguments: two strings and a number. This function splits up a string and returns a list of strings. Unless the default sep is changed, the function splits on whitespace (spaces, tabs, return characters).
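
The descriptions above can be tried on a short string. This sketch uses the equivalent string methods (s.strip(), s.replace(old, new), s.split()) rather than the string module functions, because module-level functions such as string.upper were removed in Python 3 while the methods work in both old and new versions. The variable names are made up for the example.

```python
raw = '  agt ttc\tcag \n'

stripped = raw.strip()                 # leading/trailing whitespace removed
replaced = stripped.replace('t', 'u')  # every 't' becomes 'u'
pieces = stripped.split()              # split on any run of whitespace
limited = stripped.split(' ', 1)       # split on a space, at most once

print(repr(stripped))   # 'agt ttc\tcag'
print(repr(replaced))   # 'agu uuc\tcag'
print(pieces)           # ['agt', 'ttc', 'cag']
print(limited)          # ['agt', 'ttc\tcag']
```

In each case the original string raw is left unchanged; the functions return new strings or lists.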

Let's try out some of these functions to see how they work:

>>>sent = '   George Castanza is an idiot \n'

>>>new_sent=string.split(sent)

After these operations, what is the value of new_sent?
What happens if you include a second argument 'is'?

>>>new_sent = string.split(sent,'is')

String Exercise 1:
Try the following string attributes on the variable sent (above):

rstrip(s)
lower(s)
ljust(s, width)

String Exercise 2:
Use the string module to get the genus names from the following sequence titles and append them to a list called genus_names:

Titles = ['> Homo sapiens AC34550', '> Dendroctonus micans AC45560']


B. RE module

Another very useful tool from the Python standard library is the re (regular expression) module. The re module is an extremely useful collection of methods for text processing. (You can imagine how useful this can be in molecular biology for getting sequence data, or other types of data, out of files.)

Using re, you can easily find, search, or replace strings or portions of strings. But the real flexibility of the re module is the complex types of patterns it can search for, and some examples of that are shown below.

First, let's just try using a few basic re method calls. Just like the string module, you first have to import the module (import re). Here are a few useful re method calls:

sub(pattern, repl, str[,count=0])  This method takes three arguments (all strings), searches str for substrings that match pattern, and returns a copy of str in which each match is replaced by the replacement string (repl).

split(pattern, str[,maxsplit = 0]) This method splits up a string every place it finds a particular pattern and returns a list.

match(pattern, str[,flags])  This method searches the beginning of a string for a pattern. It returns a match object (treated as true) if the pattern is found, and None (treated as false) if not.

search(pattern, str) This method searches the entire string for a pattern. It returns a match object (treated as true) if the pattern is found, and None (treated as false) if not.
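
A brief sketch of these four calls on a made-up string (the variable names and the sample text are assumptions for illustration). On success, re.match and re.search give back a match object, and on failure they give back None, which is why they can be used directly as true/false tests.

```python
import re

text = 'ACGT-ACGT-TTGA'

at_start = re.match('ACG', text)     # match object: 'ACG' begins the string
not_start = re.match('TTGA', text)   # None: 'TTGA' is not at the beginning
anywhere = re.search('TTGA', text)   # match object: 'TTGA' occurs later on

parts = re.split('-', text)                     # ['ACGT', 'ACGT', 'TTGA']
one_sub = re.sub('ACGT', 'N', text, count=1)    # only the first match replaced

print(anywhere.group())   # the text that matched: TTGA
print(parts)
print(one_sub)            # N-ACGT-TTGA
```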

Here is an example of using the re module to replace part of a string (a substring) with another string using sub(pattern, repl, str):

>>>sent2 = "My dog is a hog."
>>>new_sent2=re.sub('dog','cat',sent2)
>>>new_sent2=re.sub('hog','rat',new_sent2)

What is new_sent2 now?

The re.search() and re.match() methods are especially useful with if/else statements:

if re.search('cat', new_sent2):  #searches entire string for substring 'cat' - true if found
        print 'Cat found!'
else:
        print 'Stupid dog...'
        

C. Regular Expression Syntax

As mentioned above, the really useful part of the re module is the flexibility in the types of expressions it can search for. You can use it to find exact matches of simple strings (like 'cat' in the above example), but there is also a long list of special characters you can use to search, match and substitute with the re module.

Here are a few examples of special characters:

'.' (Dot) This matches any single character except a newline.

        Ex: re.search('GA.T', sequence) will find any substring with the first two characters 'GA',
        anything for the third character, followed by 'T'
        In this case, 'GATT', 'GACT', 'GA9T' would all be found in a string.

'^' (Caret) Matches only at the start of a string.

        Ex: re.search('^My',  new_sent2) will return true because 'My' is at the beginning of
        new_sent2

'$' Matches at the end of a string

'\D' Matches any non-digit character
'\d' Matches any digit character
'\s' Matches any whitespace character (e.g., \t, \n, \f, \v)

'[]' Matches a set of characters included in the brackets. For example '[AG]' will match 'A' or 'G' and nothing else.

        Ex: re.search('c[agt9]t', new_sent2) will return true because it will find 'cat'
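
The special characters above can be checked on a short made-up sequence. This is only a sketch; the variable names and the sequence itself are assumptions chosen for the illustration.

```python
import re

seq = 'GACCATTTACACTTCCGA'

dot = re.search('CA.T', seq)        # '.' stands for any one character
start = re.search('^GAC', seq)      # '^' anchors the pattern at the start
end = re.search('CGA$', seq)        # '$' anchors the pattern at the end
either = re.search('C[AG]A', seq)   # '[AG]' allows 'A' or 'G' in the middle
digit = re.search(r'\d', seq)       # no digits in seq, so this is None

print(dot.group())      # CATT
print(either.group())   # CGA
print(digit)            # None
```

Patterns that contain a backslash, such as '\d', are usually written as raw strings (r'\d') so that Python passes the backslash through to the re module unchanged.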
        
        
There is a long list of special character syntax in the library reference that you can refer to (also see pg. 227 of Learning Python).

Re and String Exercise:

First define: sequence = "GACCATTTACACTTCCGACATTACCA"

(1) Write a function with 2 arguments (both strings) that uses the re module to search for a pattern (first argument) in a sequence (second argument). The function should return 1 if found and 0 if not found.

        find_motif('GAAT', sequence)

(2) Alter the function so that it can work with either DNA or RNA. For instance, if the motif is the RNA sequence 'GAAU' but the sequence is DNA, you need to change the motif to DNA. [Hint: use re.search (with 'U' or 'T') and re.sub.] Make sure that it can handle lowercase strings ('gaau').

(3) Use your function to determine if the sequence motif ACANTW is in the above sequence, where N can be any base and W is either A or T or U.