To make an algorithm work faster

pianistseb · (This post was last modified: Mar-28-2019, 06:58 PM by pianistseb.)

I am using Biopython for DNA sequences, and I am new to this Python library. I have a .fasta file that uses the 4-letter DNA alphabet, and I want to convert it to a 2-letter binary code for purines and pyrimidines. I merge all the segments/records of the .fasta file into full_seq, which uses the 4-letter alphabet, and then I have to convert it into the 2-letter sequence new_seq. And here is the problem: the conversion takes hours to run. The sequence's length is 119750280, so it's a very long sequence. Any ideas to make my program run faster?

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# merge all the records

full_seq=Seq("")

for seq_record in SeqIO.parse("OMOK01.fasta", "fasta"):
    full_seq+=seq_record.seq

# convert the 4-letters alphabet into binary alphabet

new_seq=Seq("")

for i in range(0,len(full_seq)):
    if (full_seq[i]=="A") or (full_seq[i]=="G"):
        new_seq+=Seq("-")
    else:
        new_seq+=Seq("+")

print("Binary sequence", repr(new_seq))

woooee · (This post was last modified: Mar-28-2019, 07:41 PM by woooee.)

You can see if this helps. Your code looks up full_seq[i] by index on every pass (twice whenever the first comparison fails), and with millions of records that per-element overhead adds up; iterating over the sequence directly avoids it. The other option is to break full_seq into smaller chunks and then combine the resulting lists.

binary = []
for rec in full_seq:  ## assumes full_seq is iterable
    binary.append("-" if rec.startswith(("A", "G")) else "+")
new_seq = "".join(binary)
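The second suggestion, splitting the sequence into chunks and combining the results, could look roughly like the sketch below. It works on a plain string (for example `str(full_seq)`); the function names and the `chunk_size` parameter are made up for illustration, and the mapping follows the original question (purines `A`/`G` become `-`, everything else `+`). Joining the pieces once at the end also avoids growing one sequence by repeated `+=`, which copies the whole result on each step for an immutable object such as `Seq`.

```python
def convert_chunk(chunk):
    """Map purines (A, G) to '-' and every other letter to '+'."""
    return "".join("-" if base in "AG" else "+" for base in chunk)


def convert_in_chunks(sequence, chunk_size=1_000_000):
    """Convert `sequence` piece by piece and join the pieces once at the end."""
    pieces = []
    for start in range(0, len(sequence), chunk_size):
        pieces.append(convert_chunk(sequence[start:start + chunk_size]))
    return "".join(pieces)


sample = "ACGTTGCA"  # made-up stand-in for str(full_seq)
print(convert_in_chunks(sample, chunk_size=3))
```

The chunk size only controls how much is converted per step; the result should be the same for any positive value.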

pianistseb · Apr-01-2019, 07:54 AM

I finally found that a very fast way to do it is to use something like this:

for seq_record in SeqIO.parse("OMOK01.fasta", "fasta"):
    # map purines (A, G) to "+" and pyrimidines (C, T) to "-"
    new_str = str(seq_record.seq).replace("A", "+")
    new_str = new_str.replace("G", "+")
    new_str = new_str.replace("C", "-")
    new_str = new_str.replace("T", "-")
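In the loop as posted, `new_str` is overwritten for each record, so the per-record results still have to be collected somewhere to get one merged sequence. A minimal sketch of that, assuming the records are available as plain strings: the helper names are hypothetical, and the list `records` stands in for the strings that `SeqIO.parse` would yield. Note that this mapping (`A`/`G` to `+`) has the opposite sign from the code in the first post.

```python
def purine_pyrimidine(dna):
    """Chained str.replace, as in the post: A/G -> '+', C/T -> '-'."""
    return (dna.replace("A", "+")
               .replace("G", "+")
               .replace("C", "-")
               .replace("T", "-"))


def convert_records(record_strings):
    """Convert each record's sequence string and join the results once."""
    return "".join(purine_pyrimidine(s) for s in record_strings)


# Hypothetical stand-in for
#   (str(rec.seq) for rec in SeqIO.parse("OMOK01.fasta", "fasta"))
records = ["ACGT", "GGCA"]
print(convert_records(records))
```

Any letter other than `A`, `G`, `C`, `T` (for example `N` or lowercase bases) passes through unchanged in this sketch.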

Gribouillis · (This post was last modified: Apr-01-2019, 08:47 AM by Gribouillis.)

You can also try

table = {ord(k): ord(v) for k, v in {'A': '+', 'G': '+', 'C': '-', 'T': '-'}.items()}
new_str = new_str.translate(table)

or

import re
table = {'A': '+', 'G': '+', 'C': '-', 'T': '-'}
new_str = re.sub(r'[AGCT]', lambda m: table[m.group()], new_str)
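To decide between the variants in this thread, one could check that they agree and time them on the same input. The sketch below does that with `timeit` on made-up sample data; the function names are hypothetical, `str.maketrans` is used as an equivalent way to build the translation table, and no timings are claimed here since they depend on the machine and Python version.

```python
import re
import timeit

MAPPING = {"A": "+", "G": "+", "C": "-", "T": "-"}
TABLE = str.maketrans(MAPPING)   # equivalent to the ord() dict comprehension
PATTERN = re.compile(r"[AGCT]")


def with_replace(dna):
    """One str.replace pass per letter."""
    for old, new in MAPPING.items():
        dna = dna.replace(old, new)
    return dna


def with_translate(dna):
    """Single pass using a translation table."""
    return dna.translate(TABLE)


def with_re_sub(dna):
    """Regex substitution with a lookup per match."""
    return PATTERN.sub(lambda m: MAPPING[m.group()], dna)


sample = "ACGTTGCA" * 10_000  # made-up test data, not a real genome

# All three variants should produce the same string.
assert with_replace(sample) == with_translate(sample) == with_re_sub(sample)

for func in (with_replace, with_translate, with_re_sub):
    seconds = timeit.timeit(lambda: func(sample), number=5)
    print(f"{func.__name__}: {seconds:.4f} s")
```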
