Problem 6.9
Summarize FASTA file
Write a program fasta_summary.py to summarize a FASTA file by printing the length of the sequence and the description for every record in the file.
The program is expected to take a FASTA file as a command-line argument and print the summary.
$ cat files/sample1.fasta
> SEQUENCE.1
AAGGTTCC
> SEQUENCE.2
AGTC
AGTC
AGTC
AGTC
Here is the expected output when the program is called with the above file as its argument.
$ python fasta_summary.py files/sample1.fasta
8 SEQUENCE.1
16 SEQUENCE.2
Hint:
You can read a FASTA file using the SeqIO.parse function from Biopython.
Solution
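import sys


def read_fasta_file(filename):
    """Return a list of (description, sequence) pairs from a FASTA file."""
    description = ""
    seq = ""
    records = []
    with open(filename) as f:
        for line in f:
            if line.startswith(">"):
                # A new header: save the previous record, if any.
                if description:
                    records.append((description, seq))
                description = line[1:].strip()
                seq = ""
            else:
                # Sequence data may span several lines; join them.
                seq = seq + line.strip()
    # Save the last record in the file.
    if description:
        records.append((description, seq))
    return records


filename = sys.argv[1]
for desc, seq in read_fasta_file(filename):
    print(len(seq), desc)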
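
# Alternative using a regular expression:
# import re
# text = open(filename).read()
# rx = re.compile(r"> (.*)\n((?:[^>].*\n?)+)", re.M)
# for desc, seq in rx.findall(text):
#     # The matched sequence spans several lines; drop whitespace before counting.
#     print(len("".join(seq.split())), desc)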
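
Following the hint, the same summary can also be produced with Biopython. The sketch below is one possible version, not part of the original solution: it assumes Biopython is installed (pip install biopython), and the helper name summarize is made up for illustration. The description is stripped because the sample headers have a space after ">", which may otherwise end up in the description.

```python
import sys

from Bio import SeqIO  # third-party package: pip install biopython


def summarize(filename):
    """Yield (length, description) for each record in a FASTA file.

    `summarize` is a hypothetical helper name used for this sketch.
    """
    for record in SeqIO.parse(filename, "fasta"):
        # Strip the description in case the header has a space after ">".
        yield len(record.seq), record.description.strip()


# Run as: python fasta_summary.py files/sample1.fasta
if __name__ == "__main__" and len(sys.argv) == 2:
    for length, desc in summarize(sys.argv[1]):
        print(length, desc)
```

SeqIO.parse reads records one at a time, so this version does not need to hold the whole file in memory.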