Problem 7.4
Sort Sequences by GC Fraction
Sort sequences in a FASTA file by their GC fraction.
The program should take a fasta file as argument, sort the sequences in that file using their GC fraction and write them in FASTA format to file gcsort.fasta
.
The progran should also support the following flags.
-o
, --output
to specify the file to write the output instead of writing to gcsort.fasta
.
-r
, --reverse
to sort the sequences in the reverse order (descending order of gc fraction)
$ cat files/multi.fasta
> Seq1
AGTAGG
> Seq2
AGTCAGTC
AGTCATA
> Seq3
AGTCAGTCGC
$ python gcsort.py files/multi.fasta
Wrote the sorted sequences to gcsort.fasta
$ cat gcsort.fasta
> Seq2
AGTCAGTCAGTCATA
> Seq1
AGTAGG
> Seq3
AGTCAGTCGC
The -r
option is used to sort in the reverse order.
$ python gcsort.py -r files/multi.fasta
Wrote the sorted sequences to gcsort.fasta
$ cat gcsort.fasta
> Seq3
AGTCAGTCGC
> Seq1
AGTAGG
> Seq2
AGTCAGTCAGTCATA
The -o
flag is used to specify the output path.
$ python gcsort.py files/multi.fasta -o a.fasta
Wrote the sorted sequences to a.fasta
$ cat a.fasta
> Seq2
AGTCAGTCAGTCATA
> Seq1
AGTAGG
> Seq3
AGTCAGTCGC
Solution
import argparse
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction
p = argparse.ArgumentParser()
p.add_argument("filename", help="FASTA file to sort")
p.add_argument("-r", "--reverse", help="sort in reverse order", action="store_true", default=False)
p.add_argument("-o", "--output", help="path of the output file", default="gcsort.fasta")
args = p.parse_args()
def gcsort(source_path, dest_path, reverse=False):
records = SeqIO.parse(source_path, "fasta")
records = sorted(records, key=gc_fraction, reverse=reverse)
SeqIO.write(records, dest_path, "fasta")
print("Wrote the sorted sequences to", dest_path)
gcsort(args.filename, args.output, reverse=args.reverse)