How can I make a faster search algorithm - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: How can I make a faster search algorithm (/thread-17510.html)
How can I make a faster search algorithm - pianistseb - Apr-14-2019

I am using Biopython for DNA sequences. I want to write an algorithm that searches for a big motif (about 1000 letters). Because an exact match of this 1000-letter motif is impossible to find, I also accept matches with up to 30% error. So I compare every 1000-letter subsequence of the whole chromosome with this motif and count how many letters differ. The problem is that the algorithm is too slow: it needs about one day to run. I work in the binary purine-pyrimidine alphabet. Generally speaking, I think Python is too slow to deal with big strings or lists about a million or a billion letters long. Why? I am running it in a Linux terminal because I want maximum speed. Is C really faster than Python? Here is my code:

```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

for seq_record in SeqIO.parse("CM000665.fasta", "fasta"):
    new_str = str(seq_record.seq).replace("A", "1")
    new_str = new_str.replace("G", "1")
    new_str = new_str.replace("C", "0")
    new_str = new_str.replace("T", "0")

# Big Erdos Input
```
```python
big_Erdos1 = list("01101001101101001001101001001101001101101011001101000101101001001111001001101001101100001101101011001101010001101001101101001101001000101101001101101001001101101001101001110101000011101001001110001101101001001110010101001011101101001001101001101101001001100111000111010001101001100101001011101001101101001101001001101001100101101010010101101001101001001101010110101011000101001111001000110101101001101001001101001111001000101111001001101101001101001001101001101101001101001001101101000011101001100111000101101001011000101101101001001100111101001001001111001001110001101101001100101001001101101001001011101101001001100101101101001001001101101001110001101001101100011100011011000011100101101001001101001101101011001100110000111011001100001101110011000111000010101101101101010000111001101001101101001001001101001101101001100101001010101111000101011001001010101101011100101001001011100101110001101001001101100110001101011001101000101101001011001101001101001011101001100101011100101001101101001001011000101101011001101001110100001100101101011001101001101011001101000101101001110000101011001101001101101101001001001111000110001010110101110010110010100101011000111011")
big_Erdos2 = ""
for i in range(0, len(big_Erdos1)):
    if big_Erdos1[i] == "0":
        big_Erdos2 += "1"
    else:
        big_Erdos2 += "0"
big_Erdos2 = list(big_Erdos2)
chromosome = list(new_str)

# Search big Erdos errata
big_erdos_number = 0
file = open("results30.txt", "w")
file.write("Erdos block positions:")
for i in range(0, len(chromosome) - len(big_Erdos1)):
    S1 = 0
    S2 = 0
    for j in range(0, len(big_Erdos1)):
        if chromosome[j + i] != big_Erdos1[j]:
            S1 += 1
    for j in range(0, len(big_Erdos2)):
        if chromosome[j + i] != big_Erdos2[j]:
            S2 += 1
    error1 = S1 / len(big_Erdos1)
    error2 = S2 / len(big_Erdos2)
    # print(error)
    if error1 < 0.3 or error2 < 0.3:
        big_erdos_number += 1
        print("erdos block position:", i)
        file.write("%d," % i)
    # print("progress:", i / (len(chromosome) - len(big_Erdos1)) * 100, "%")
file.write("\n")
file.write("Big erdos number:%d" % big_erdos_number)
print("Big erdos number:", big_erdos_number)
file.close()
```

RE: How can I make a faster search algorithm - Larz60+ - Apr-14-2019

Python is not too slow; you need to fix the algorithm.
Where can I get the Bio package? I know how to get the CM000665.fasta file from NCBI.
Run a profiler on your code, see: https://docs.python.org/3/library/debug.html

RE: How can I make a faster search algorithm - pianistseb - Apr-14-2019

It is the biopython package: https://biopython.org/wiki/Download

RE: How can I make a faster search algorithm - scidam - Apr-14-2019

The only function you are using from the biopython package is a helper for decoding a FASTA-formatted file. Are you sure the algorithm you are trying to implement isn't already implemented in biopython? Raw loops are slow in Python (in comparison to those in C, for example). Nevertheless, it is fundamentally wrong to say that C is faster than Python: C and Python are both programming languages, and the main difference is in how a program written in each is executed. A C program is compiled into machine code that is then executed by the CPU; a Python program (under the standard execution model, i.e. the CPython interpreter) is translated to bytecode that is executed/interpreted by the CPython interpreter (CPython stands for Python implemented in C). In effect, one program is executed inside another. So, what you need to do:

1) Check whether biopython already has the algorithm you are trying to implement. If so, the algorithm is likely implemented in C (or Cython) and will execute much faster than it would if implemented with raw Python loops.
2) Try implementing your algorithm in Cython; in that case the changes to your code would be minimal. Cython syntax, especially for loops, is almost the same as native Python.

RE: How can I make a faster search algorithm - pianistseb - Apr-15-2019

If for loops aren't fast in Python, what is fast?
Biopython has some functions, but anyone who wants to build something new has to use for loops. I started programming in Python because I thought it would be faster than MATLAB and easier than C, but now I find it is the slowest of the languages. I can't say it isn't useful, but it's not for every job.

RE: How can I make a faster search algorithm - Larz60+ - Apr-15-2019

I'll take a look at speeding this up. First I need some info:
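Following up on the profiler suggestion earlier in the thread, here is a minimal sketch of profiling a hot loop with the standard-library cProfile module. The `count_mismatches` helper is a made-up stand-in for the inner comparison loop in the original post, not code from the thread:

```python
import cProfile
import io
import pstats

def count_mismatches(a, b):
    # Toy stand-in for the inner letter-by-letter comparison loop
    return sum(1 for x, y in zip(a, b) if x != y)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    count_mismatches("0110" * 250, "1010" * 250)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # the most expensive calls, sorted by cumulative time
```

Running this on the real script would show almost all time spent inside the two inner for loops, which is the usual signal that the work should move into C-backed code.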
RE: How can I make a faster search algorithm - scidam - Apr-15-2019

Look at the following code:

```python
# big_Erdos1 = list(".... long string ....")
import numpy as np

def one():
    big_Erdos2 = ""
    for i in range(0, len(big_Erdos1)):
        if big_Erdos1[i] == "0":
            big_Erdos2 += "1"
        else:
            big_Erdos2 += "0"
    return big_Erdos2

res = np.array(["0"] * len(big_Erdos1))
data = np.array(big_Erdos1)

def two():
    res[data == "0"] = "1"
    return res.tolist()
```

If you measure the execution time of one() and two(), you will find the second solution is about 5x faster than the first. This is because I used numpy and assigned values to the array without any loops, using vectorized notation instead. Numpy is implemented in C, and such operations execute much faster than they would if implemented as raw Python loops. So, you need to rewrite your data-processing algorithm using numpy's vectorized notation (avoiding loops), or, if you need to find a substring in another string, use regular expressions (the re module).

RE: How can I make a faster search algorithm - pianistseb - Apr-15-2019

Thank you very much! I started learning Python some months ago, but I never had problems with speed because I wasn't dealing with big data. Now that my job involves DNA sequences, the data structures are very big. In DNA science it is important to find the effects of certain motifs. Erdos motifs are new mathematical motifs. I switched to the binary alphabet because the first research on Erdos motifs was done in the binary alphabet; I want to generalize this research to the four-letter alphabet. The Erdos motif, which also appears in this program, is a mathematical pattern that has the minimum of repeats among its elements. I have found some effects of this model in a hypothetical DNA, and now I want to find the motif in natural DNA. This is the reason I allow errors: it is impossible to find this motif in DNA unless I allow errors. My goal is to search for this motif with errors and see whether the effects I found are valid. Thank you very much guys!
Your help is very important to me! Here is a site where you can download any FASTA file you want: https://www.ebi.ac.uk/genomes/eukaryota.html
Now I am learning numpy to make my code faster! I feel relieved, because with this package I will find a solution to my problem.

RE: How can I make a faster search algorithm - Larz60+ - Apr-15-2019

Quote: There is the site you can download any fasta file you want!

I saw where you could display as text, but couldn't find a download option. You can do the same with NCBI, and they do have FTP sites, but I couldn't find one for CM000665.fasta, again having only display. I know there's a way to download, as I've done it before.

RE: How can I make a faster search algorithm - paul18fr - Apr-15-2019

Hi,

Interesting topic (briefly read, I confess); let me make some remarks:
- if you're dealing with numbers (typically 0 and 1), then you can use numpy and vectorization to speed up your code
- it can be combined with slicing to extract and compare arrays/vectors
- have a look at numpy.array_equal to compare two arrays/vectors (never tested so far, but I imagine the size of the vector is not an issue)

If you cannot avoid loops, have a look at the Numba library (http://numba.pydata.org).

Finally, interesting benchmarking has been done by NASA comparing C, Python, and Julia, among other languages, and C is much faster than Python (search for "Basic Comparison of Python, Julia, Matlab, IDL and Java (2018 Edition)" plus NASA on Google); it's also interesting to note that vectorization is very powerful.

Paul

```python
import numpy as np

n = 1000000
vect1 = np.random.randint(2, size=n)
vect2 = np.copy(vect1)
identical = np.array_equiv(vect1, vect2)

# now I change 1 value to 2
chang = np.random.randint(n, size=1)
vect2[chang] = 2
identical2 = np.array_equiv(vect1, vect2)

extract_part = vect2[100:350]
```
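To push the vectorization advice above all the way to the original problem, the whole sliding-window mismatch count can be done with no Python inner loop at all. A minimal sketch, assuming the chromosome and motif have already been converted to 0/1 integer sequences; `mismatch_profile` is a hypothetical helper name, not code from the thread. It calls np.correlate twice, once counting window positions where both strings have a 1 and once where both have a 0, so the mismatch count per window falls out by subtraction:

```python
import numpy as np

def mismatch_profile(chromosome_bits, motif_bits):
    """Mismatch count of the motif against every window of the chromosome.

    Equivalent to the nested loops in the original post, but the O(N*L)
    work runs inside NumPy's C code instead of Python bytecode.
    """
    c = np.asarray(chromosome_bits, dtype=float)
    m = np.asarray(motif_bits, dtype=float)
    agree_ones = np.correlate(c, m, mode="valid")           # both symbols are 1
    agree_zeros = np.correlate(1 - c, 1 - m, mode="valid")  # both symbols are 0
    return (len(m) - agree_ones - agree_zeros).astype(int)

# Toy example: an 8-letter "chromosome" scanned with a 4-letter motif
chrom = [int(ch) for ch in "01101001"]
motif = [int(ch) for ch in "0110"]
errors = mismatch_profile(chrom, motif) / len(motif)
hits = np.flatnonzero(errors < 0.3)  # window positions matching with < 30% error
```

Calling the same function on the complemented motif gives the second error profile (the error2 of the original code). The asymptotic cost is unchanged, but the constant factor drops by orders of magnitude because the per-letter comparisons no longer go through the interpreter.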