I. File input and output
In this section we deal with reading from and writing to files. The
first thing we need to do is to make a file object that allows us to
read or write (or both) from a file.
A. Open a file for writing & write to the file
>>>f = open('tmp_file.txt', 'w')
This little code snippet creates a file object named f that refers to
a file on your computer named 'tmp_file.txt'. The built-in function
open takes two arguments: the string name of the file, and a string
describing the way in which the file will be used.
For our file object f, the file name is 'tmp_file.txt' and the
second argument 'w' indicates that we can write to the file. (Be
careful: if a file with that name already exists, opening it with 'w'
erases its contents.) The other arguments you can use for file
objects instead of 'w' include:
'r'   opens a file for reading
'r+'  opens a file for reading and writing
'a'   appends to a file
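As a rough illustration of the mode strings above, the sketch below opens the same file in each mode in turn. It assumes Python 3 and writes into a throwaway directory created with the tempfile module (not part of the original example) so that no real files are touched. The names path, contents and first are made up for this illustration.

```python
import os
import tempfile

# Hypothetical location: a temporary directory, so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'tmp_file.txt')

f = open(path, 'w')       # 'w' creates the file (or empties an existing one)
f.write('first line\n')
f.close()

f = open(path, 'a')       # 'a' keeps the contents and adds to the end
f.write('second line\n')
f.close()

f = open(path, 'r')       # 'r' opens the file for reading only
contents = f.read()       # the whole file as one string
f.close()

f = open(path, 'r+')      # 'r+' allows both reading and writing
first = f.readline()      # reading starts at the beginning of the file
f.close()
```

Opening the file a second time with 'w' instead of 'a' would have discarded 'first line'.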
B. Using file object methods (special functions for files)
Once you open a file for writing, the next thing you probably want to
do is write something to the file! To write to a file, you simply use
the built-in Python file object method called write().
>>>f.write('My name is George Castanza. \n')
Note: the write method only writes strings to a file,
f.write(string), not other data types or objects. For instance,
f.write(5) would not work (it raises a TypeError). Of course,
sometimes you might want to write a number to a file, or something
else like a list. In this case check out the built-in repr()
function, which returns a string representation of its argument.
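A minimal sketch of the point about strings, assuming Python 3 and a temporary file location chosen only for this illustration (path, write_failed and written are made-up names). It converts a number and a list with repr() before writing, and shows that passing a bare number to write() is rejected.

```python
import os
import tempfile

# Hypothetical location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'tmp_file.txt')

f = open(path, 'w')
f.write(repr(5) + '\n')           # repr(5) is the string '5'
f.write(repr([1, 2, 3]) + '\n')   # repr of a list is the string '[1, 2, 3]'

try:
    f.write(5)                    # not a string, so write() refuses it
    write_failed = False
except TypeError:
    write_failed = True
f.close()

f = open(path, 'r')
written = f.read()
f.close()
```

Note that write() does not add a newline for you, which is why '\n' is appended by hand.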
Other useful file object methods:
f.readline()   reads a single line from the file
f.readlines()  returns a list containing all the lines of a file
f.close()      closes a file (after reading and writing)
[Note: Remember to close the file after you are done writing to and
reading from it.]
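The three methods listed above can be tried together as follows. This is only a sketch, assuming Python 3; the file contents and the names path, first and rest are invented for the illustration.

```python
import os
import tempfile

# Hypothetical file with three short lines, created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
f = open(path, 'w')
f.write('line one\nline two\nline three\n')
f.close()

f = open(path, 'r')
first = f.readline()    # one line, including its trailing '\n'
rest = f.readlines()    # a list of the lines that have not been read yet
f.close()               # f.closed is True from here on
```

Because readline() has already consumed the first line, readlines() returns only the remaining two. Each line keeps its newline character and is still a string, not a number.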
File Exercise:
(1) Make a file object called outfile that
opens a file called 'my_file.txt' for writing.
(2) Use the built-in functions range() and
repr() to write the numbers 1 to 20 to my_file.txt.
(3) Close the file.
(4) Open my_file.txt for reading and print
out every line of the file multiplied by 5.
[Hint: for line in input.readlines():]
What happens when you multiply the lines of
the file by 5?
II. String and Re modules
Python has a whole library of modules with
useful functions for all sorts of tasks such as processing strings,
opening and retrieving web pages, calling system commands and so
forth. All these modules and descriptions are known as the Python
Standard Library and details are found in the Python Library
Reference on python.org (documents link).
Two of the most useful modules for
processing biological data, and any text information, are the string
and re modules.
string   processes string objects
re       regular expression module (pattern searching)
A. String Module
Using the functions included in these
modules (also known as "attributes" and "methods") is just like using
the attributes for file objects or list objects. The only difference
is that you must first use the import statement to bring in the
modules.
Here is an example of using the string
module to change a lowercase string to uppercase:
import string
sequence='agtttccagat'
new_seq=string.upper(sequence)
Not bad, eh? All the modules work this way -
you use the "." operator after the name to access the methods of the
module.
IMPORTANT: Before using an
attribute of a module (or any function you use) there are three key
things you need to know about the attribute. (1) How many arguments
does it take? (2) What types of data do the arguments have to be? (3)
What types of values does the function return?
In the above case the string.upper method
takes one argument of type string and returns a string (uppercase).
Here are a few useful string methods, using the terminology in the
library reference:
strip(s)
    Takes one string argument. Returns a copy of s without leading or
    trailing whitespace characters.
replace(str, old, new)
    Takes three arguments, all strings. Returns a copy of str with all
    occurrences of substring old replaced with new.
split(s [, sep[, maxsplit]])
    Takes up to three arguments: two strings and a number. This
    function splits up a string and returns a list of strings. Unless
    the default sep is changed, the function splits by whitespace
    (spaces, tabs, return characters).
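The functions described above can be tried out as in the sketch below. One assumption to flag: the module-level forms such as string.strip(s) belong to older Python versions. This sketch assumes Python 3, where the same operations are written as methods on the string itself (s.strip(), s.replace(old, new), s.split()). The variable names and the sample strings are made up for the illustration.

```python
# Assumes Python 3: string methods stand in for the string-module functions.
sequence = 'agtttccagat'
upper_seq = sequence.upper()             # uppercase copy of the string

record = '  agtt ccag\tatgc \n'
stripped = record.strip()                # leading/trailing whitespace removed
replaced = stripped.replace('t', 'u')    # every 't' becomes 'u'
pieces = record.split()                  # default: split on any whitespace

fields = 'a,b,c'.split(',', 1)           # custom sep, at most one split
```

Each call returns a new object. Strings cannot be changed in place, so record itself still has its original whitespace afterwards.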
Let's try out some of these functions to see
how they work:
>>>sent = ' George Castanza is an idiot \n'
>>>new_sent=string.split(sent)
After these operations, what is the value of
new_sent?
What happens if you include a second
argument 'is'?
>>>new_sent = string.split(sent,'is')
String Exercise 1:
Try the following string attributes on the
variable sent (above):
rstrip(s)
lower(s)
ljust(s, width)
String Exercise 2:
Use the string module to get the genus names
from the following sequences and append them to a list called
genus_names:
Titles = ['> Homo sapiens AC34550',
'> Dendroctonus micans AC45560']
B. RE module
Another very useful tool from the Python
standard library is the re (regular expression) module. The re module
is an extremely useful collection of methods for text processing.
(You can imagine how useful this can be in molecular biology for
getting sequence data, or other types of data, out of files.)
Using re, you can easily find, search, or
replace strings or portions of strings. But the real flexibility of
the re module is the complex types of patterns it can search for, and
some examples of that are shown below.
First, let's just try using a few basic re
method calls. Just like the string module, you first have to import
the module (import re). Here are a few useful re method calls:
sub(pattern, repl, str[, count=0])
    This method takes at least three arguments (all strings). It
    searches for the pattern in str and returns a copy of str with
    each match replaced by the replacement (repl) string.
split(pattern, str[, maxsplit=0])
    This method splits up a string every place it finds a particular
    pattern and returns a list.
match(pattern, str[, flags])
    This method checks only the beginning of a string for a pattern.
    It returns a match object (which counts as true) if found, and
    None (which counts as false) if not found.
search(pattern, str)
    This method searches the entire string for a pattern. It returns
    a match object (which counts as true) if found, and None (which
    counts as false) if not found.
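To make the return values of these four calls concrete, here is a small sketch, assuming Python 3. The sample strings and variable names are invented for the illustration. Note in particular the difference between match (start of the string only) and search (anywhere in the string).

```python
import re

rna = re.sub('T', 'U', 'GATTACA')                 # replace every 'T'
rna_once = re.sub('T', 'U', 'GATTACA', count=1)   # replace only the first 'T'

words = re.split(r'\s+', 'GATTACA  GATTTACA')     # split on runs of whitespace

at_start = re.match('GAT', 'GATTACA')      # match object: 'GAT' is at the start
not_at_start = re.match('TAC', 'GATTACA')  # None: 'TAC' is not at the start
anywhere = re.search('TAC', 'GATTACA')     # match object: found further along

found_text = anywhere.group()              # the text that matched
found_at = anywhere.start()                # index where the match begins
```

A match object is treated as true and None as false, which is why these calls can be used directly as the condition of an if statement.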
Here is an example of using the re module to
replace part of a string (a substring) with another string using
sub(pattern, repl, str):
>>>sent2 = "My dog is a hog."
>>>new_sent2=re.sub('dog','cat',sent2)
>>>new_sent2=re.sub('hog','rat',new_sent2)
What is new_sent2 now?
The re.search() and re.match() methods are
especially useful with if/else statements:
if re.search('cat', new_sent2):  # searches entire string for substring 'cat' - true if found
    print('Cat found!')
else:
    print('Stupid dog...')
C. Regular Expression Syntax
As mentioned above, the really useful part
of the re module is the flexibility in the types of expressions it
can search for. You can use it to find exact matches of
simple strings (like 'cat' in the above example), but there is also
a long list of special characters you can search, match and
substitute using the re module.
Here are a few examples of special
characters:
'.'   (Dot) Matches any single character (except a newline).
      Ex: re.search('GA.T', sequence) will find any substring with
      the first two characters 'GA', anything for the third
      character, followed by 'T'. In this case, 'GATT', 'GACT',
      'GA9T' would all be found in a string.
'^'   (Caret) Matches only at the start of a string.
      Ex: re.search('^My', new_sent2) will find a match because 'My'
      is at the beginning of new_sent2.
'$'   Matches at the end of a string.
'\D'  Matches any non-digit character.
'\d'  Matches any digit character.
'\s'  Matches any whitespace character (e.g., space, \t, \n, \f, \v).
'[]'  Matches a set of characters included in the brackets. For
      example '[AG]' will match 'A' or 'G' and nothing else.
      Ex: re.search('c[agt9]t', new_sent2) will find a match because
      it will find 'cat'.
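The special characters listed above can be checked one at a time, as in this sketch (assuming Python 3). The test string is invented for the illustration and is not the sequence used in the exercise below. Patterns containing a backslash are written as raw strings (r'...') so that Python passes the backslash through to re unchanged.

```python
import re

sequence = 'CCGACTAA9'   # made-up example string

dot = re.search('GA.T', sequence).group()       # 'GA', any character, then 'T'
caret_hit = re.search('^CC', sequence)          # 'CC' is at the start
caret_miss = re.search('^GA', sequence)         # 'GA' occurs, but not at the start
end_digit = re.search(r'\d$', sequence).group() # a digit at the very end
non_digit = re.search(r'\D', sequence).group()  # first non-digit character
space = re.search(r'\s', sequence)              # no whitespace in the string
char_set = re.search('[AG]A', sequence).group() # 'A' or 'G', followed by 'A'
```

search always reports the leftmost match, so char_set picks up the 'GA' near the start of the string rather than the 'AA' further along.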
There is a long list of special character
syntax in the library reference that you can refer to (also see pg.
227 of Learning Python).
Re and String Exercise:
First define: sequence = "GACCATTTACACTTCCGACATTACCA"
(1) Write a function with 2 arguments (both
strings) that uses the re module to search for a pattern (first
argument) in a sequence (second argument). The function should return
1 if found and 0 if not found.
find_motif('GAAT', sequence)
(2) Alter the function so that it can work
with either DNA or RNA. For instance, if the motif is RNA sequence
'GAAU', but the sequence is DNA you need to change the motif to be
DNA. [Hint: use the re.search (with 'U' or 'T') and re.sub].
Make sure that it can handle lowercase strings ('gaau').
(3) Use your function to determine if the
sequence motif ACANTW is in the above sequence, where N can be any
base and W is either A or T or U.