Python Forum

I am using Biopython to work with DNA sequences, and I am new to this library. I have a .fasta file containing DNA in the 4-letter code, and I want to convert it to a 2-letter binary code for purines and pyrimidines. I merge all the segments/records of the .fasta file into one sequence, full_seq, in the 4-letter alphabet. Then I have to convert it to the two-letter alphabet as new_seq. And here is the problem: the conversion takes hours to run. The sequence length is 119750280, so it is a very long sequence. Any ideas on how to make my program run faster?

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# merge all the records

full_seq=Seq("")

for seq_record in SeqIO.parse("OMOK01.fasta", "fasta"):
    full_seq+=seq_record.seq

# convert the 4-letters alphabet into binary alphabet

new_seq=Seq("")

for i in range(0,len(full_seq)):
    if (full_seq[i]=="A") or (full_seq[i]=="G"):
        new_seq+=Seq("-")
    else:
        new_seq+=Seq("+")

print("Binary sequence", repr(new_seq))

You can see if this helps. Your loop looks up full_seq[i] by index on every pass, and each new_seq+= builds a brand-new Seq by copying everything accumulated so far. That copying is not terrible for 10,000 records, but you have millions, so the total work grows roughly with the square of the length. Iterating over the sequence directly and collecting the results in a list avoids that. The other option is to break full_seq into smaller pieces and then combine the resulting lists.

chars = []
for rec in full_seq:  ## assumes full_seq is iterable
    chars.append("-" if rec.startswith(("A", "G")) else "+")
new_seq = Seq("".join(chars))
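
To make the "smaller pieces" idea concrete, here is a minimal sketch in plain Python. It assumes the sequence is an ordinary str (for example str(full_seq)), so it runs without Biopython. The helper name to_binary_chunked and the chunk size are made up for illustration. It uses the mapping from the question (A/G become "-", everything else becomes "+") and joins the pieces once at the end instead of using += for every character.

```python
def to_binary_chunked(sequence, chunk_size=1_000_000):
    """Map A/G to '-' and every other character to '+', chunk by chunk.

    `sequence` is assumed to be a plain str; the function name and the
    chunk_size default are hypothetical, chosen only for illustration.
    """
    pieces = []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        # Build each chunk's result in one pass, then store it.
        pieces.append("".join("-" if base in "AG" else "+" for base in chunk))
    # A single join at the end avoids repeated copying of the growing result.
    return "".join(pieces)


print(to_binary_chunked("AGCTTGA", chunk_size=3))
```

The chunk size only bounds how much intermediate data is built at once; the result should be the same for any positive value.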

I finally found that a very fast way to do it is to use something like this:

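parts = []
for seq_record in SeqIO.parse("OMOK01.fasta", "fasta"):
    new_str = str(seq_record.seq).replace("A", "+")
    new_str = new_str.replace("G", "+")
    new_str = new_str.replace("C", "-")
    new_str = new_str.replace("T", "-")
    parts.append(new_str)  # keep each converted record instead of overwriting it
new_seq = "".join(parts)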

You can also try str.translate() with a translation table, or a regular expression substitution:

table = {ord(k): ord(v) for k, v in {'A': '+', 'G': '+', 'C': '-', 'T': '-'}.items()}
new_str = new_str.translate(table)
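
The same kind of table can also be built with the built-in str.maketrans, which accepts two equal-length strings. A small sketch, assuming the input is a plain str of uppercase A, C, G and T; with this table any other character (such as N or lowercase letters) passes through unchanged. The names TABLE and to_binary are only illustrative.

```python
# Translation table: purines (A, G) -> '+', pyrimidines (C, T) -> '-'.
TABLE = str.maketrans("AGCT", "++--")


def to_binary(sequence):
    """Translate a plain str of A/C/G/T into the two-letter +/- code."""
    return sequence.translate(TABLE)


# Characters that are not in the table are left as they are.
print(to_binary("AGCTN"))
```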

import re
table = {'A': '+', 'G': '+', 'C': '-', 'T': '-'}
new_str = re.sub(r'[AGCT]', lambda m: table[m.group()], new_str)

pianistseb

woooee

pianistseb

Gribouillis