Apr-14-2019, 07:42 AM
I am using Biopython to work with DNA sequences. I want to write an algorithm that searches for a long motif (about 1000 letters). Because an exact match of a 1000-letter motif is very unlikely, I also accept matches with less than 30% error. So I compare every 1000-letter window of the chromosome with the motif and count how many letters differ. The problem is that the algorithm is too slow: it needs about one day to run. I work in the binary purine-pyrimidine alphabet. Generally speaking, I think Python is too slow to deal with big strings or lists of a million or a billion letters in length. Why?
I am running it in a Linux terminal because I want the maximum speed. Is C really faster than Python?
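To make the comparison concrete, here is a minimal sketch of what I mean by the error of one window. The names `to_binary` and `error_fraction` and the toy sequences are only for illustration and do not appear in my script; the sketch also assumes the input contains only uppercase A, C, G and T.

```python
# Purines (A, G) -> "1", pyrimidines (C, T) -> "0".
PURINE_PYRIMIDINE = str.maketrans("AGCT", "1100")


def to_binary(dna):
    """Translate a DNA string into the binary purine/pyrimidine alphabet."""
    return dna.translate(PURINE_PYRIMIDINE)


def error_fraction(window, motif):
    """Fraction of positions at which window and motif differ."""
    if len(window) != len(motif):
        raise ValueError("window and motif must have the same length")
    mismatches = sum(a != b for a, b in zip(window, motif))
    return mismatches / len(motif)


# Toy example with 10 letters instead of 1000.
motif = to_binary("AGCTAGCTAG")    # "1100110011"
window = to_binary("AGCTAGTTCC")   # "1100110000"
print(error_fraction(window, motif))  # 0.2, so this window would be accepted
```

A window is accepted when this fraction is below 0.3.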
Here is my code:



from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Map purines (A, G) to "1" and pyrimidines (C, T) to "0"
for seq_record in SeqIO.parse("CM000665.fasta", "fasta"):
    new_str = str(seq_record.seq).replace("A", "1")
    new_str = new_str.replace("G", "1")
    new_str = new_str.replace("C", "0")
    new_str = new_str.replace("T", "0")

# Big Erdos Input
big_Erdos1 = list("...")  # the motif string (about 1000 characters of 0/1) is not shown here

# big_Erdos2 is the complement of big_Erdos1 (every 0 and 1 swapped)
big_Erdos2 = ""
for i in range(0, len(big_Erdos1)):
    if big_Erdos1[i] == "0":
        big_Erdos2 += "1"
    else:
        big_Erdos2 += "0"
big_Erdos2 = list(big_Erdos2)
chromosome = list(new_str)

# Search big Erdos errata
big_erdos_number = 0
file = open("results30.txt", "w")
file.write("Erdos block positions:")
# +1 so that the last full window of the chromosome is also checked
for i in range(0, len(chromosome) - len(big_Erdos1) + 1):
    S1 = 0  # mismatches against the motif
    S2 = 0  # mismatches against the complemented motif
    for j in range(0, len(big_Erdos1)):
        if chromosome[j + i] != big_Erdos1[j]:
            S1 += 1
    for j in range(0, len(big_Erdos2)):
        if chromosome[j + i] != big_Erdos2[j]:
            S2 += 1
    error1 = S1 / len(big_Erdos1)
    error2 = S2 / len(big_Erdos2)
    #print(error)
    if error1 < 0.3 or error2 < 0.3:
        big_erdos_number += 1
        print("erdos block position:", i)
        file.write("%d," % i)
    #print("progress:", i / (len(chromosome) - len(big_Erdos1)) * 100, "%")
file.write("\n")
file.write("Big erdos number:%d" % big_erdos_number)
print("Big erdos number:", big_erdos_number)
file.close()
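A possible faster variant, as a rough sketch and not a tested solution for a whole chromosome: the two inner loops can be replaced by one sliding dot product in NumPy. If 0/1 is encoded as -1/+1, each matching position contributes +1 and each mismatch -1, so the dot product of a window with the motif equals L - 2 * mismatches, where L is the motif length. Since big_Erdos2 is the complement of big_Erdos1, its mismatch count is simply L minus the first one and needs no second pass. The function names `mismatch_counts` and `find_blocks` are hypothetical, and the inputs are assumed to be strings containing only "0" and "1".

```python
import numpy as np


def _to_signed(bits):
    """Turn a string of '0'/'1' into an int32 array of -1/+1."""
    raw = np.frombuffer(bits.encode("ascii"), dtype=np.uint8)
    return (raw.astype(np.int32) - ord("0")) * 2 - 1


def mismatch_counts(chromosome_bits, motif_bits):
    """Number of differing positions for every window start."""
    x = _to_signed(chromosome_bits)
    m = _to_signed(motif_bits)
    length = len(m)
    # corr[i] = sum_j x[i + j] * m[j] = matches - mismatches = length - 2 * mismatches
    corr = np.correlate(x, m, mode="valid")
    return (length - corr) // 2


def find_blocks(chromosome_bits, motif_bits, max_error=0.3):
    """Window starts whose error against the motif or its complement is below max_error."""
    length = len(motif_bits)
    mism = mismatch_counts(chromosome_bits, motif_bits)
    # Mismatches against the complemented motif are length - mism.
    hits = (mism < max_error * length) | (length - mism < max_error * length)
    return np.flatnonzero(hits)


# Toy example: motif "1100" in a 10-character sequence.
print(find_blocks("1100110011", "1100"))  # [0 2 4 6]
```

`np.correlate` still does the direct sliding product, only in compiled code instead of a Python loop, and the int32 arrays for a full chromosome are large, so processing the sequence in overlapping chunks or switching to an FFT-based correlation may be necessary.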