How can I make a faster search algorithm - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: How can I make a faster search algorithm (/thread-17510.html)
RE: How can I make a faster search algorithm - pianistseb - Apr-15-2019

For example! Here is the bottom-arm chromosome of Arabidopsis: https://www.ebi.ac.uk/ena/data/view/AE005173 (you can reach this page if you click the AE005173 button in the eukaryotic chromosome list). To download the fasta file, just press "download fasta".

RE: How can I make a faster search algorithm - paul18fr - Apr-15-2019

Your topic is quite challenging; I'm not a specialist, but it piqued my interest. Please find hereafter another code (among others) to change "0" to "1" in your "big_Erdos1" list:

Code:
import numpy as np
import time

t0 = time.time()
big_Erdos1 = list("01101001101101001001101001001101001101101011001101000101101001001111001001101001101100001101101011001101010001101001101101001101001000101101001101101001001101101001101001110101000011101001001110001101101001001110010101001011101101001001101001101101001001100111000111010001101001100101001011101001101101001101001001101001100101101010010101101001101001001101010110101011000101001111001000110101101001101001001101001111001000101111001001101101001101001001101001101101001101001001101101000011101001100111000101101001011000101101101001001100111101001001001111001001110001101101001100101001001101101001001011101101001001100101101101001001001101101001110001101001101100011100011011000011100101101001001101001101101011001100110000111011001100001101110011000111000010101101101101010000111001101001101101001001001101001101101001100101001010101111000101011001001010101101011100101001001011100101110001101001001101100110001101011001101000101101001011001101001101001011101001100101011100101001101101001001011000101101011001101001110100001100101101011001101001101011001101000101101001110000101011001101001101101101001001001111000110001010110101110010110010100101011000111011")
big_Erdos1 = np.asarray(big_Erdos1, dtype=int)
ones_ = np.where(big_Erdos1 == 1)
zeros_ = np.where(big_Erdos1 == 0)
big_Erdos1[ones_] = 0
big_Erdos1[zeros_] = 1
t1 = time.time()
print("Duration step 1: ", t1 - t0)

But I do not understand the second part; in your first post you said it's difficult to look for thousands of letters. I played with the following code in order to look for a pattern of 100,000 numbers; I cannot avoid loops, but it takes about 18 seconds to find the one occurrence I created (it can probably be improved).

Code:
from numba import jit
import numpy as np
import time

n = 100000
vect3 = np.random.randint(2, size=2 * n)
vect4 = np.random.randint(2, size=n)
vect3[1506:1506 + n] = vect4

## initialization / no match by default
check = np.full(2 * n, False, dtype=bool)

## partially vectorized
t0 = time.time()
for i in range(n + 1):  # only n+1 alignments fit; windows past that are short
    check[i] = np.array_equiv(vect3[i:n + i], vect4)

sol = np.where(check)[0]  # we're looking for True occurrences
if sol.size == 0:
    print("No match")
else:
    print("There are %d match(es)" % sol.size)
t1 = time.time()
print("Duration method 1: ", t1 - t0)

## using numba
t0 = time.time()
check2 = np.full(2 * n, False, dtype=bool)

## function for Numba
# @jit(nopython=True, parallel=True)  ## numpy.array_equiv seems not to be supported :-(
@jit(parallel=True)
def function_check(n, check2, vect3, vect4):
    for i in range(n + 1):
        check2[i] = np.array_equiv(vect3[i:n + i], vect4)

## function call
function_check(n, check2, vect3, vect4)
sol = np.where(check2)[0]
if sol.size == 0:
    print("No match")
else:
    print("There are %d match(es)" % sol.size)
t1 = time.time()
print("Duration method 2: ", t1 - t0)

I do not master Numba well enough to avoid the error.

RE: How can I make a faster search algorithm - Larz60+ - Apr-15-2019

OK, I have all of the data. I would like to ask once more about the ultimate goal of the exercise. What is the output, and what is its structure? Knowing the goal can suggest a different way of solving the problem.
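The per-position loop over np.array_equiv in the code above can also be collapsed into one vectorized pass. This is only a sketch, not something posted in the thread, and it assumes NumPy >= 1.20 for np.lib.stride_tricks.sliding_window_view; the array names are made up for illustration:

Code:
import numpy as np

n = 1000
rng = np.random.default_rng(0)
hay = rng.integers(0, 2, size=20 * n)  # the long 0/1 sequence
pat = rng.integers(0, 2, size=n)       # the pattern to look for
hay[1234:1234 + n] = pat               # plant one exact occurrence

# Zero-copy (num_windows, n) view of every length-n window, then a single
# vectorized comparison instead of a Python-level loop per position.
windows = np.lib.stride_tricks.sliding_window_view(hay, n)
mismatches = (windows != pat).sum(axis=1)    # Hamming distance per window
positions = np.flatnonzero(mismatches == 0)  # exact matches
print(positions)  # includes the planted index 1234

A side benefit for the approximate-matching question later in the thread: the same mismatches array answers it directly, since np.flatnonzero(mismatches <= k) gives every position with at most k substitutions.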
RE: How can I make a faster search algorithm - pianistseb - Apr-15-2019

My goal is to detect the positions in a DNA or chromosome sequence that are similar to the Erdos sequence (similar meaning that I accept an error, like 10%, which I can change in the program) and write them to a text file. I want the positions because I will then make a graph of the neighborhood around them. I also want to compute how many times I detect this sequence, because that is important for my research. My problem is not the first part, where I change the 4-letter alphabet to binary, but the second part, where I search the DNA for this sequence. I don't say that it's impossible to write a fast program that does this in Python; I just said that my program is too slow for the job I want to do. So:

-Input: the Erdos pattern and a DNA sequence.
-Output: the positions of the Erdos pattern (with some error) in the DNA sequence, and the number of times we found it.

Note 1: paul18fr, I am not sure I understand your code for the second part, but I will check it. I am searching for a pattern of 1000 (or more) letters in a DNA sequence, and a DNA sequence is much longer: a whole genome can be about 3,000,000,000 letters. The DNA, chromosome, or genome sequence is included in the fasta file. To read a fasta file you need Biopython.

Note 2: I don't know how many letters would differ from the Erdos sequence, so I don't know what error to use. I must try different errors and genomes.

RE: How can I make a faster search algorithm - paul18fr - Apr-15-2019

In my code:
Some remarks:
Hope it helps. I noticed some mistakes in my previous code; please consider the following one, which looks for a pattern of 1000 components in a vector of 3 million lines (it took 62 sec on my old laptop):

Code:
import numpy as np
import time

n = 1000
vect3 = np.random.randint(2, size=3000 * n)
vect4 = np.random.randint(2, size=n)

## 4 occurrences are manually created
vect3[1:1 + n] = vect4
vect3[6596:6596 + n] = vect4
vect3[10023:10023 + n] = vect4
vect3[569872:569872 + n] = vect4

## initialization / no match by default
check = np.full(3000 * n, False, dtype=bool)

## partially vectorized
t0 = time.time()
for i in range(3000 * n - n + 1):
    check[i] = np.array_equiv(vect3[i:n + i], vect4)

sol = np.where(check)[0]  # True occurrences and their positions
if sol.size == 0:
    print("No match")
else:
    print("There are %d match(es)" % sol.size)
    print("Positions: %s" % sol)
t1 = time.time()
print("Duration method: ", t1 - t0)

RE: How can I make a faster search algorithm - Larz60+ - Apr-15-2019

This works, and it's fast! You can see my directory structure in the paths in the code; modify them for whatever structure you're using. I think it will return multiple matches as well. For control, I tried first with an exact match, then messed up one base and tried for a second (fuzzy) match. Hope this is helpful:

Code:
from fuzzysearch import find_near_matches
from pathlib import Path
import re
import os


class FuzzyFind:
    '''
    This is a very simplistic example, but can be used as a starting point
    '''
    def __init__(self):
        # Create directory anchor to find sequence files
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        # Get path to files:
        homepath = Path('.')
        seqpath = homepath / '..' / 'data' / 'fasta'
        CM000665 = seqpath / 'CM000665.fasta'
        with CM000665.open() as fp:
            bigseq = fp.read()

        ex_dna = 'GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA'
        # Test exact match first
        result = self.fuzzy_find(bigseq, ex_dna)
        for item in result:
            print(f'\nLooking for: {ex_dna}')
            print(f'item: {item}, type: {type(item)}')
            nums = re.findall(r'[\d\.]+', str(result))
            start = int(nums[0])
            end = int(nums[1])
            found_seq = bigseq[start:end]
            print(f'exact match start: {start}, end: {end}: {found_seq}')

        # Now mess up a base and try again
        ex_dna = 'AGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA'
        result = self.fuzzy_find(bigseq, ex_dna)
        for item in result:
            print(f'\nLooking for: {ex_dna}')
            print(f'item: {item}, type: {type(item)}')
            nums = re.findall(r'[\d\.]+', str(result))
            start = int(nums[0])
            end = int(nums[1])
            found_seq = bigseq[start:end]
            print(f'fuzzy match: {start}, end: {end}: {found_seq}')

    def fuzzy_find(self, main_seq, find_seq):
        matches = find_near_matches(find_seq, main_seq, max_l_dist=2)
        return matches


if __name__ == '__main__':
    FuzzyFind()

You will need to install fuzzysearch: pip install fuzzysearch

RE: How can I make a faster search algorithm - pianistseb - Apr-16-2019

Thank you very much Larz60+! I think you found the solution to my problem!

RE: How can I make a faster search algorithm - Larz60+ - Apr-16-2019

Glad to hear it

RE: How can I make a faster search algorithm - pianistseb - Apr-17-2019

The only problem with this function from the fuzzysearch library is that it works fast only for small errors. If I put max_substitutions=100 in my code, it takes approximately 1 minute to run; if I put max_substitutions=120, it takes about 5-10 minutes. But if I put max_substitutions=200, it gets stuck.

RE: How can I make a faster search algorithm - Larz60+ - Apr-18-2019

For 100 substitutions, considering the number of permutations, I think that's quite reasonable!
The others, not so good; but again, remember it's not a linear search: each added substitution increases the search time on a log scale.
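Since the thread's fuzzysearch calls limit the error to substitutions only (max_substitutions, no insertions or deletions), one alternative worth sketching, not something posted in the thread, is a plain sliding-window mismatch count with an early exit. Its cost is at most pattern-length comparisons per window no matter how large the substitution budget is, so it does not slow down as the budget grows; the function name is made up for illustration:

Code:
def find_with_substitutions(text, pattern, max_subs):
    """Return (start, mismatches) for every alignment of `pattern` in `text`
    with at most `max_subs` substituted characters (no insertions/deletions)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_subs:
                    break  # early exit: budget exceeded, abandon this window
        else:  # loop finished without break: window is within budget
            hits.append((start, mismatches))
    return hits

print(find_with_substitutions("AAACCCGGG", "CCC", 1))
# [(2, 1), (3, 0), (4, 1)]: one exact hit plus two one-substitution hits

For genome-scale inputs this pure-Python loop would itself need vectorizing or compiling, but it shows why substitution-only matching stays O(n*m) in the worst case instead of blowing up with the error budget the way a full edit-distance search can.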