Problem 7.4

Sort Sequences by GC Fraction

Sort sequences in a FASTA file by their GC fraction.

The program should take a fasta file as argument, sort the sequences in that file using their GC fraction and write them in FASTA format to file gcsort.fasta.

The progran should also support the following flags.

-o, --output to specify the file to write the output instead of writing to gcsort.fasta.

-r, --reverse to sort the sequences in the reverse order (descending order of gc fraction)

$ cat files/multi.fasta
> Seq1
AGTAGG
> Seq2
AGTCAGTC
AGTCATA
> Seq3
AGTCAGTCGC

$ python gcsort.py files/multi.fasta
Wrote the sorted sequences to gcsort.fasta

$ cat gcsort.fasta
> Seq2
AGTCAGTCAGTCATA
> Seq1
AGTAGG
> Seq3
AGTCAGTCGC

The -r option is used to sort in the reverse order.

$ python gcsort.py -r files/multi.fasta
Wrote the sorted sequences to gcsort.fasta

$ cat gcsort.fasta
> Seq3
AGTCAGTCGC
> Seq1
AGTAGG
> Seq2
AGTCAGTCAGTCATA

The -o flag is used to specify the output path.

$ python gcsort.py  files/multi.fasta -o a.fasta
Wrote the sorted sequences to a.fasta

$ cat a.fasta
> Seq2
AGTCAGTCAGTCATA
> Seq1
AGTAGG
> Seq3
AGTCAGTCGC

Solution

import argparse
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

p = argparse.ArgumentParser()
p.add_argument("filename", help="FASTA file to sort")
p.add_argument("-r", "--reverse", help="sort in reverse order", action="store_true", default=False)
p.add_argument("-o", "--output", help="path of the output file", default="gcsort.fasta")
args = p.parse_args()

def gcsort(source_path, dest_path, reverse=False):
    records = SeqIO.parse(source_path, "fasta")
    records = sorted(records, key=gc_fraction, reverse=reverse)
    SeqIO.write(records, dest_path, "fasta")
    print("Wrote the sorted sequences to", dest_path)

gcsort(args.filename, args.output, reverse=args.reverse)