Problem 6.9

Summarize FASTA file

Write a program fasta_summary.py that summarizes a FASTA file by printing, for every record in the file, the length of its sequence followed by its description.

The program is expected to take a FASTA file as a command-line argument and print the summary.

$ cat files/sample1.fasta
> SEQUENCE.1
AAGGTTCC
> SEQUENCE.2
AGTC
AGTC
AGTC
AGTC

Here is the expected output when the program is called with the above file as its argument.

$ python fasta_summary.py files/sample1.fasta
8 SEQUENCE.1
16 SEQUENCE.2

Hint:

You can read a FASTA file using the SeqIO.parse function from Biopython.
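
As a rough illustration of the hint, the sketch below assumes Biopython is installed (it is a third-party package, not part of the standard library). The function name fasta_summary is made up for this example. SeqIO.parse yields one record per FASTA entry, with the header text in record.description and the joined sequence lines in record.seq.

```python
def fasta_summary(path):
    """Return a list of (length, description) pairs for a FASTA file."""
    # Imported inside the function so the definition still loads on a
    # machine without Biopython (assumption: installed via "pip install biopython").
    from Bio import SeqIO

    summary = []
    for record in SeqIO.parse(path, "fasta"):
        # strip() drops the space that follows ">" in the sample file's headers.
        summary.append((len(record.seq), record.description.strip()))
    return summary
```

Printing each pair with print(length, description) should give the same two-column summary as the expected output above.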

Solution

import sys

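def read_fasta_file(filename):
    """Return a list of (description, sequence) tuples from a FASTA file."""
    description = None  # None until the first ">" header is seen
    seq = ""
    records = []
    # The with block closes the file even if an error occurs while reading.
    with open(filename) as f:
        for line in f:
            if line.startswith(">"):
                # A new header ends the previous record, if there was one.
                if description is not None:
                    records.append((description, seq))
                description = line[1:].strip()
                seq = ""
            else:
                # Sequence data may span several lines; join them.
                seq = seq + line.strip()

    # The last record has no header after it, so add it here.
    if description is not None:
        records.append((description, seq))
    return records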

filename = sys.argv[1]
for desc, seq in read_fasta_file(filename):
    print(len(seq), desc)

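# Alternative using a regular expression:
# import re
# text = open(filename).read()
# rx = re.compile(r"> (.*)\n((?:[^>].*\n?)+)", re.M)
# for desc, seq in rx.findall(text):
#     # seq still contains line breaks; remove whitespace before measuring.
#     print(len("".join(seq.split())), desc)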
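
The solution above collects every record, including the full sequence text, in a list before printing. Since the summary only needs lengths, a streaming variant can keep a running count instead. The following is a minimal sketch of that idea, not a replacement for the solution. The name iter_fasta_lengths is hypothetical, and the function accepts any iterable of lines, such as an open file.

```python
def iter_fasta_lengths(lines):
    """Yield (length, description) for each record in an iterable of FASTA lines."""
    description = None  # None until the first ">" header is seen
    length = 0
    for line in lines:
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if description is not None:
                yield length, description
            description = line[1:].strip()
            length = 0
        elif description is not None:
            # Count residues only; strip() removes the trailing newline.
            length += len(line.strip())
    # Emit the final record, which has no header after it.
    if description is not None:
        yield length, description


# Usage with the sample data from the problem statement, given as lines.
sample = [
    "> SEQUENCE.1\n",
    "AAGGTTCC\n",
    "> SEQUENCE.2\n",
    "AGTC\n",
    "AGTC\n",
    "AGTC\n",
    "AGTC\n",
]
for length, description in iter_fasta_lengths(sample):
    print(length, description)
```

To run it on a real file, pass an open file object in place of the list, for example inside a with open(filename) block.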