How can I make a faster search algorithm - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: How can I make a faster search algorithm (/thread-17510.html)
RE: How can I make a faster search algorithm - pianistseb - Apr-15-2019

For example! Here is the bottom-arm chromosome of Arabidopsis: https://www.ebi.ac.uk/ena/data/view/AE005173 (you can reach this page if you click the AE005173 button in the eukaryotic chromosome list). To download the fasta file, just press "download fasta".

RE: How can I make a faster search algorithm - paul18fr - Apr-15-2019

Your topic is quite challenging; I'm not a specialist, but it piqued my interest. Please find hereafter another code (among others) to change "0" to "1" in your "big_Erdos1" list:

Code:
import numpy as np
import time

t0 = time.time()
big_Erdos1 = list("01101001101101001001101001001101001101101011001101000101101001001111001001101001101100001101101011001101010001101001101101001101001000101101001101101001001101101001101001110101000011101001001110001101101001001110010101001011101101001001101001101101001001100111000111010001101001100101001011101001101101001101001001101001100101101010010101101001101001001101010110101011000101001111001000110101101001101001001101001111001000101111001001101101001101001001101001101101001101001001101101000011101001100111000101101001011000101101101001001100111101001001001111001001110001101101001100101001001101101001001011101101001001100101101101001001001101101001110001101001101100011100011011000011100101101001001101001101101011001100110000111011001100001101110011000111000010101101101101010000111001101001101101001001001101001101101001100101001010101111000101011001001010101101011100101001001011100101110001101001001101100110001101011001101000101101001011001101001101001011101001100101011100101001101101001001011000101101011001101001110100001100101101011001101001101011001101000101101001110000101011001101001101101101001001001111000110001010110101110010110010100101011000111011")
big_Erdos1 = np.asarray(big_Erdos1, dtype=int)
ones_ = np.where(big_Erdos1 == 1)
zeros_ = np.where(big_Erdos1 == 0)
big_Erdos1[ones_] = 0
big_Erdos1[zeros_] = 1
t1 = time.time()
print("Duration step 1: ", t1 - t0)

But I do not understand the second part; in your first post you said it's difficult to look for thousands of letters. I played with the following code in order to look for a pattern of 100,000 numbers; I cannot avoid loops, but it takes about 18 seconds to find the one occurrence I created (it can probably be improved).

Code:
from numba import jit
import numpy as np
import time

n = 100000
vect3 = np.random.randint(2, size=2 * n)
vect4 = np.random.randint(2, size=n)
vect3[1506:1506 + n] = vect4

## initialization / no match by default
check = np.full(2 * n, False, dtype=bool)

## partially vectorized
t0 = time.time()
for i in range(n + 1):  # only n+1 alignments fit; windows past that are short
    check[i] = np.array_equiv(vect3[i:n + i], vect4)

sol = np.where(check)[0]  # we're looking for True occurrences
if sol.size == 0:
    print("No match")
else:
    print("There are %d match(es)" % sol.size)
t1 = time.time()
print("Duration method 1: ", t1 - t0)

## using numba
t0 = time.time()
check2 = np.full(2 * n, False, dtype=bool)

## function for Numba
# @jit(nopython=True, parallel=True)  ## numpy.array_equiv seems not to be supported :-(
@jit(parallel=True)
def function_check(n, check2, vect3, vect4):
    for i in range(n + 1):
        check2[i] = np.array_equiv(vect3[i:n + i], vect4)

## function call
function_check(n, check2, vect3, vect4)
sol = np.where(check2)[0]
if sol.size == 0:
    print("No match")
else:
    print("There are %d match(es)" % sol.size)
t1 = time.time()
print("Duration method 2: ", t1 - t0)

I do not master Numba well enough to avoid the error.

RE: How can I make a faster search algorithm - Larz60+ - Apr-15-2019

OK, I have all of the data. I would like to ask once more about the ultimate goal of the exercise. What is the output, and what is its structure? Knowing the goal can suggest a different way of solving the problem.
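The per-position loop over np.array_equiv in the code above can also be collapsed into one vectorized pass. This is only a sketch, not something posted in the thread, and it assumes NumPy >= 1.20 for np.lib.stride_tricks.sliding_window_view; the array names are made up for illustration:

Code:
import numpy as np

n = 1000
rng = np.random.default_rng(0)
hay = rng.integers(0, 2, size=20 * n)  # the long 0/1 sequence
pat = rng.integers(0, 2, size=n)       # the pattern to look for
hay[1234:1234 + n] = pat               # plant one exact occurrence

# Zero-copy (num_windows, n) view of every length-n window, then a single
# vectorized comparison instead of a Python-level loop per position.
windows = np.lib.stride_tricks.sliding_window_view(hay, n)
mismatches = (windows != pat).sum(axis=1)    # Hamming distance per window
positions = np.flatnonzero(mismatches == 0)  # exact matches
print(positions)  # includes the planted index 1234

A side benefit for the approximate-matching question later in the thread: the same mismatches array answers it directly, since np.flatnonzero(mismatches <= k) gives every position with at most k substitutions.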
RE: How can I make a faster search algorithm - pianistseb - Apr-15-2019

My goal is to detect the positions in a DNA or chromosome sequence that are similar to the Erdos sequence (similar meaning that I accept an error, like 10%, which I can change in the program) and write them to a text file. I want the positions because I will then make a graph of the neighborhood around them. I also want to compute how many times I detect this sequence, because that is important for my research. My problem is not the first part, where I change the 4-letter alphabet to binary, but the second part, where I search the DNA for this sequence. I don't say that it's impossible to write a fast program that does this in Python; I just said that my program is too slow for the job I want to do. So:

-Input: the Erdos pattern and a DNA sequence.
-Output: the positions of the Erdos pattern (with some error) in the DNA sequence, and the number of times we found it.

Note 1: paul18fr, I am not sure I understand your code for the second part, but I will check it. I am searching for a pattern of 1000 (or more) letters in a DNA sequence, and a DNA sequence is much longer: a whole genome can be about 3,000,000,000 letters. The DNA, chromosome, or genome sequence is included in the fasta file. To read a fasta file you need Biopython.

Note 2: I don't know how many letters would differ from the Erdos sequence, so I don't know what error to use. I must try different errors and genomes.

RE: How can I make a faster search algorithm - paul18fr - Apr-15-2019

In my code:
Some remarks:
Hope it helps. I noticed some mistakes in my previous code; please consider the following one, which looks for a pattern of 1000 components in a vector of 3 million lines (it took 62 sec on my old laptop):

Code:
import numpy as np
import time

n = 1000
vect3 = np.random.randint(2, size=3000 * n)
vect4 = np.random.randint(2, size=n)

## 4 occurrences are manually created
vect3[1:1 + n] = vect4
vect3[6596:6596 + n] = vect4
vect3[10023:10023 + n] = vect4
vect3[569872:569872 + n] = vect4

## initialization / no match by default
check = np.full(3000 * n, False, dtype=bool)

## partially vectorized
t0 = time.time()
for i in range(3000 * n - n + 1):
    check[i] = np.array_equiv(vect3[i:n + i], vect4)

sol = np.where(check)[0]  # True occurrences and their positions
if sol.size == 0:
    print("No match")
else:
    print("There are %d match(es)" % sol.size)
    print("Positions: %s" % sol)
t1 = time.time()
print("Duration method: ", t1 - t0)

RE: How can I make a faster search algorithm - Larz60+ - Apr-15-2019

This works, and it's fast! You can see my directory structure in the paths in the code; modify them for whatever structure you're using. I think it will return multiple matches as well. For control, I tried first with an exact match, then messed up one base and tried for a second (fuzzy) match. Hope this is helpful:

Code:
from fuzzysearch import find_near_matches
from pathlib import Path
import re
import os


class FuzzyFind:
    '''
    This is a very simplistic example, but can be used as a starting point
    '''
    def __init__(self):
        # Create directory anchor to find sequence files
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        # Get path to files:
        homepath = Path('.')
        seqpath = homepath / '..' / 'data' / 'fasta'
        CM000665 = seqpath / 'CM000665.fasta'
        with CM000665.open() as fp:
            bigseq = fp.read()

        ex_dna = 'GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA'
        # Test exact match first
        result = self.fuzzy_find(bigseq, ex_dna)
        for item in result:
            print(f'\nLooking for: {ex_dna}')
            print(f'item: {item}, type: {type(item)}')
            nums = re.findall(r'[\d\.]+', str(result))
            start = int(nums[0])
            end = int(nums[1])
            found_seq = bigseq[start:end]
            print(f'exact match start: {start}, end: {end}: {found_seq}')

        # Now mess up a base and try again
        ex_dna = 'AGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA'
        result = self.fuzzy_find(bigseq, ex_dna)
        for item in result:
            print(f'\nLooking for: {ex_dna}')
            print(f'item: {item}, type: {type(item)}')
            nums = re.findall(r'[\d\.]+', str(result))
            start = int(nums[0])
            end = int(nums[1])
            found_seq = bigseq[start:end]
            print(f'fuzzy match: {start}, end: {end}: {found_seq}')

    def fuzzy_find(self, main_seq, find_seq):
        matches = find_near_matches(find_seq, main_seq, max_l_dist=2)
        return matches


if __name__ == '__main__':
    FuzzyFind()

You will need to install fuzzysearch: pip install fuzzysearch

RE: How can I make a faster search algorithm - pianistseb - Apr-16-2019

Thank you very much Larz60+! I think you found the solution to my problem!

RE: How can I make a faster search algorithm - Larz60+ - Apr-16-2019

Glad to hear it

RE: How can I make a faster search algorithm - pianistseb - Apr-17-2019

The only problem with this function from the fuzzysearch library is that it works fast only for small errors. If I put max_substitutions=100 in my code, it takes approximately 1 minute to run; if I put max_substitutions=120, it takes about 5-10 minutes. But if I put max_substitutions=200, it gets stuck.

RE: How can I make a faster search algorithm - Larz60+ - Apr-18-2019

For 100 substitutions, considering the number of permutations, I think that's quite reasonable!
The others, not so good; but again, remember it's not a linear search: each added substitution increases the search time on a log scale.
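Since the thread's fuzzysearch calls limit the error to substitutions only (max_substitutions, no insertions or deletions), one alternative worth sketching, not something posted in the thread, is a plain sliding-window mismatch count with an early exit. Its cost is at most pattern-length comparisons per window no matter how large the substitution budget is, so it does not slow down as the budget grows; the function name is made up for illustration:

Code:
def find_with_substitutions(text, pattern, max_subs):
    """Return (start, mismatches) for every alignment of `pattern` in `text`
    with at most `max_subs` substituted characters (no insertions/deletions)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_subs:
                    break  # early exit: budget exceeded, abandon this window
        else:  # loop finished without break: window is within budget
            hits.append((start, mismatches))
    return hits

print(find_with_substitutions("AAACCCGGG", "CCC", 1))
# [(2, 1), (3, 0), (4, 1)]: one exact hit plus two one-substitution hits

For genome-scale inputs this pure-Python loop would itself need vectorizing or compiling, but it shows why substitution-only matching stays O(n*m) in the worst case instead of blowing up with the error budget the way a full edit-distance search can.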