To make an algorithm work faster

pianistseb · (This post was last modified: Mar-28-2019, 06:58 PM by pianistseb.)

I am using Biopython to work with DNA sequences, and I am new to this library. I have a .fasta file containing DNA in the 4-letter alphabet, and I want to convert it to a 2-letter binary code for purines and pyrimidines. I merge all the segments/records of the .fasta file into a single full_seq in the 4-letter alphabet, then convert it into new_seq in the 2-letter alphabet. And here is the problem: the conversion takes hours to run. The sequence's length is 119750280, so it's a very long sequence. Any ideas to make my program run faster?

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# merge all the records

full_seq=Seq("")

for seq_record in SeqIO.parse("OMOK01.fasta", "fasta"):
    full_seq+=seq_record.seq

# convert the 4-letters alphabet into binary alphabet

new_seq=Seq("")

for i in range(0,len(full_seq)):
    if (full_seq[i]=="A") or (full_seq[i]=="G"):
        new_seq+=Seq("-")
    else:
        new_seq+=Seq("+")

print("Binary sequence", repr(new_seq))

woooee · (This post was last modified: Mar-28-2019, 07:41 PM by woooee.)

You can see if this helps. Your code has to find the offset in the list each time, so if the offset is 10,000, it has to start at the beginning of the list and move forward to record 10,000, and then do it all over again for 10,001. This is not terrible for 10,000 records, but you have millions, so it does have an effect. The other option is to break full_seq into smaller pieces and then combine the resulting lists.

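new_chars = []  # collect the symbols in a list and join once at the end
for rec in full_seq:  ## assumes full_seq is iterable
    new_chars.append("-" if rec.startswith(("A", "G")) else "+")
new_seq = "".join(new_chars)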
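The "smaller pieces" idea could look roughly like the sketch below. It is only an illustration on plain Python strings, not Biopython-specific code: the helper names to_binary and convert_in_chunks and the chunk_size default are made up for this example. It keeps the mapping from the first post (A and G become "-", everything else "+") and builds each piece with a single join instead of repeated +=, which may copy the accumulated data on every step.

```python
def to_binary(chunk):
    """Map purines (A, G) to "-" and every other symbol to "+"."""
    return "".join("-" if base in ("A", "G") else "+" for base in chunk)


def convert_in_chunks(sequence, chunk_size=1_000_000):
    """Convert a long string piece by piece, then combine the pieces.

    chunk_size is an arbitrary illustrative value, not a tuned one.
    """
    pieces = []
    for start in range(0, len(sequence), chunk_size):
        # Slice out one piece and convert it on its own.
        pieces.append(to_binary(sequence[start:start + chunk_size]))
    # Join once at the end instead of growing a string step by step.
    return "".join(pieces)
```

With a Biopython Seq you would presumably pass str(full_seq) so the function works on an ordinary string.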

pianistseb · Apr-01-2019, 07:54 AM

I finally found that a very fast way to do it is to use something like this:

for seq_record in SeqIO.parse("OMOK01.fasta", "fasta"):
    new_str = str(seq_record.seq).replace("A", "+")
    new_str = new_str.replace("G", "+")
    new_str = new_str.replace("C", "-")
    new_str = new_str.replace("T", "-")
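Note that in the loop above new_str is overwritten on every record, so only the last record's result survives once the loop ends. If the goal is one merged binary sequence, as in the first post, a sketch like the following might help. The function name records_to_binary is hypothetical, and it works on plain strings so that it does not depend on Biopython; the symbol mapping is the one used in this post (A, G to "+" and C, T to "-").

```python
def records_to_binary(sequences):
    """Convert each sequence string and merge the results into one string."""
    converted = []
    for text in sequences:
        # Same chain of replacements as in the post, applied per record.
        text = text.replace("A", "+").replace("G", "+")
        text = text.replace("C", "-").replace("T", "-")
        converted.append(text)
    # A single join avoids growing one huge string record by record.
    return "".join(converted)
```

Assuming the records come from SeqIO.parse as above, the call would be something like records_to_binary(str(r.seq) for r in SeqIO.parse("OMOK01.fasta", "fasta")).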

Gribouillis · (This post was last modified: Apr-01-2019, 08:47 AM by Gribouillis.)

You can also try

table = {ord(k): ord(v) for k, v in {'A': '+', 'G': '+', 'C': '-', 'T': '-'}.items()}
new_str = new_str.translate(table)

or

import re
table = {'A': '+', 'G': '+', 'C': '-', 'T': '-'}
new_str = re.sub(r'[AGCT]', lambda m: table[m.group()], new_str)
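A slightly shorter way to build the same kind of table is str.maketrans, which is a standard str method. The sketch below is just one possible variant; the names PURINE_PYRIMIDINE and to_plus_minus are invented for the example. One thing to keep in mind: translate leaves any character that is not in the table unchanged (for instance "N" or lowercase letters), whereas the loop in the first post turns everything that is not A or G into "+".

```python
# Build the translation table once: A, G -> "+" and C, T -> "-".
PURINE_PYRIMIDINE = str.maketrans("AGCT", "++--")


def to_plus_minus(text):
    """Translate a DNA string in one pass; unmapped characters stay as-is."""
    return text.translate(PURINE_PYRIMIDINE)
```

For example, to_plus_minus("AGCTN") gives "++--N", so ambiguous bases remain visible in the output.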
