Apr-14-2019, 07:42 AM
I am using biopython for dna sequences. I want to make an algorithm which searches a big motif (about 1000 letters). Because it is impossible to find this 1000-letters motif, I accept also motifs which have 30% error. So I combine every 1000 letters chromosome subsequence of the whole dna sequence, with this motif and I compute how many letters are different. The problem is that the algorithm is too slow. It needs about one day to run. I work in the binary purines-pyrimidines alphabet. Generally speaking I think that python is too slow to deal with big strings or lists about 1 million or billion letters lenght. Why?
I am running it in linux terminal, because I want the maximum speed. Is really C faster than python?
Here is my code:



Here is my code:
from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord for seq_record in SeqIO.parse("CM000665.fasta", "fasta"): new_str=str(seq_record.seq).replace("A","1"); new_str=new_str.replace("G","1"); new_str=new_str.replace("C","0"); new_str=new_str.replace("T","0"); # Big Erdos Input big_Erdos1=list("01101001101101001001101001001101001101101011001101000101101001001111001001101001101100001101101011001101010001101001101101001101001000101101001101101001001101101001101001110101000011101001001110001101101001001110010101001011101101001001101001101101001001100111000111010001101001100101001011101001101101001101001001101001100101101010010101101001101001001101010110101011000101001111001000110101101001101001001101001111001000101111001001101101001101001001101001101101001101001001101101000011101001100111000101101001011000101101101001001100111101001001001111001001110001101101001100101001001101101001001011101101001001100101101101001001001101101001110001101001101100011100011011000011100101101001001101001101101011001100110000111011001100001101110011000111000010101101101101010000111001101001101101001001001101001101101001100101001010101111000101011001001010101101011100101001001011100101110001101001001101100110001101011001101000101101001011001101001101001011101001100101011100101001101101001001011000101101011001101001110100001100101101011001101001101011001101000101101001110000101011001101001101101101001001001111000110001010110101110010110010100101011000111011") big_Erdos2="" for i in range(0,len(big_Erdos1)): if big_Erdos1[i]=="0": big_Erdos2+="1" else: big_Erdos2+="0" big_Erdos2=list(big_Erdos2) chromosome=list(new_str) # Search big Erdos erratta big_erdos_number=0 file = open("results30.txt", "w") file.write("Erdos block positions:") for i in range(0,len(chromosome)-len(big_Erdos1)): S1=0 S2=0 for j in range(0,len(big_Erdos1)): if chromosome[j+i]!=big_Erdos1[j]: S1+=1 for j in range(0,len(big_Erdos2)): if chromosome[j+i]!=big_Erdos2[j]: S2+=1 error1=S1/len(big_Erdos1) error2=S2/len(big_Erdos2) #print(error) if error1<0.3 or error2<0.3: big_erdos_number+=1 print("erdos block position:",i) file.write("%d,"%i) #print("progress:",i/(len(chromosome)-len(big_Erdos1))*100,"%") file.write("\n") file.write("Big erdos number:%d"%big_erdos_number) print("Big erdos number:",big_erdos_number) file.close()