Python Forum
How can I make a faster search algorithm
#11
For example, here is the bottom part of an Arabidopsis chromosome:
https://www.ebi.ac.uk/ena/data/view/AE005173
(you can reach this page by pressing the AE005173 button in the eukaryotic chromosome list). To download the FASTA file, just press "download fasta".
Reply
#12
Your topic is quite challenging; I'm not a specialist, but it piqued my interest.

Hereafter is another piece of code (among other possibilities) to swap the "0"s and "1"s in your "big_Erdos1" list:

import time
import numpy as np

t0 = time.time()
big_Erdos1 = list("01101001101101001001101001001101001101101011001101000101101001001111001001101001101100001101101011001101010001101001101101001101001000101101001101101001001101101001101001110101000011101001001110001101101001001110010101001011101101001001101001101101001001100111000111010001101001100101001011101001101101001101001001101001100101101010010101101001101001001101010110101011000101001111001000110101101001101001001101001111001000101111001001101101001101001001101001101101001101001001101101000011101001100111000101101001011000101101101001001100111101001001001111001001110001101101001100101001001101101001001011101101001001100101101101001001001101101001110001101001101100011100011011000011100101101001001101001101101011001100110000111011001100001101110011000111000010101101101101010000111001101001101101001001001101001101101001100101001010101111000101011001001010101101011100101001001011100101110001101001001101100110001101011001101000101101001011001101001101001011101001100101011100101001101101001001011000101101011001101001110100001100101101011001101001101011001101000101101001110000101011001101001101101101001001001111000110001010110101110010110010100101011000111011")
big_Erdos1 = np.asarray(big_Erdos1, dtype=int)   ## string characters -> integer array
ones_ = np.where(big_Erdos1 == 1)                ## indices of the 1s
zeros_ = np.where(big_Erdos1 == 0)               ## indices of the 0s
big_Erdos1[ones_] = 0                            ## swap 1 -> 0
big_Erdos1[zeros_] = 1                           ## swap 0 -> 1
t1 = time.time()
print("Duration step 1:", t1 - t0)
But I do not understand the second part; in your first post you said it's difficult to search for thousands of letters. I played with the following code to look for a pattern of 100,000 numbers; I cannot avoid loops, but it takes about 18 seconds to find the one occurrence I created (it can probably be improved).

from numba import jit
import numpy as np
import time

n = 100000
vect3 = np.random.randint(2, size=2*n)
vect4 = np.random.randint(2, size=n)
vect3[1506:1506+n] = vect4          ## plant one occurrence of the pattern

## initialization / no match by default
check = np.full(2*n, False, dtype=bool)

## partially vectorized
t0 = time.time()
for i in range(2*n):
    check[i] = np.array_equiv(vect3[i:n+i], vect4)

sol = np.where(check)               ## we're looking for the True occurrences
if sol[0].size == 0:
    print("No match")
else:
    l = np.size(sol)
    print("There are %d match(es)" % l)

t1 = time.time()
print("Duration method 1: ", t1-t0)


## using numba
t0 = time.time()
check2 = np.full(2*n, False, dtype=bool)

## function for Numba
# @jit(nopython=True, parallel=True)   ## numpy.array_equiv seems not to be supported :-(
@jit(parallel=True)
def function_check(n, check2, vect3, vect4):
    for i in range(2*n):
        check2[i] = np.array_equiv(vect3[i:n+i], vect4)

## function call
function_check(n, check2, vect3, vect4)

sol = np.where(check2)              ## positions of the True occurrences
if sol[0].size == 0:
    print("No match")
else:
    l = np.size(sol)
    print("There are %d match(es)" % l)

t1 = time.time()
print("Duration method 2: ", t1-t0)
I do not master Numba well enough to avoid the error.
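A possible workaround (just a sketch, not paul18fr's code): since np.array_equiv is not supported in Numba's nopython mode, the comparison can be written as an explicit inner loop with an early break, which Numba compiles without complaint:
import numpy as np
from numba import njit

@njit
def find_exact(text, pattern):
    # mark every offset where pattern matches text exactly
    n, m = text.size, pattern.size
    check = np.zeros(n - m + 1, dtype=np.bool_)
    for i in range(n - m + 1):
        ok = True
        for j in range(m):
            if text[i + j] != pattern[j]:
                ok = False
                break          # stop at the first mismatch
        check[i] = ok
    return check

# usage, reusing vect3 / vect4 from the snippet above:
# positions = np.nonzero(find_exact(vect3, vect4))[0]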
Reply
#13
OK, I have all of the data.
I would like to ask once more about the ultimate goal of the exercise.
What is the output, and what is its structure?
Knowing the goal may suggest a different way of solving the problem.
Reply
#14
My goal is to detect the positions in a DNA or chromosome sequence that are similar to the Erdos sequence (similar means I accept an error rate of about 10%, which I can change in the program) and write them to a text file. I want the positions because I will then make a graph of the neighborhood around each position. I also want to count how many times I detect this sequence, because that is important for my research.
My problem is not the first part, where I convert the 4-letter alphabet to binary, but the second part, where I search for this sequence in the DNA. I am not saying that it's impossible to write a fast program that does this in Python, just that my program is too slow for the job I want to do.
So:
- Input: the Erdos pattern and a DNA sequence.
- Output: the positions where the Erdos pattern occurs in the DNA sequence (within the error tolerance) and the number of times it was found.

Note 1: paul18fr, I am not sure I understand your code for the second part, but I will check it. I am searching for a pattern of 1000 (or more) letters in a DNA sequence, which is much longer; a whole genome can be about 3,000,000,000 letters. The DNA, chromosome, or genome sequence is contained in the FASTA file. To read a FASTA file you need Biopython.
Note 2: I don't know how many letters will differ from the Erdos sequence, so I don't know what error tolerance to use. I must try different error values and genomes.
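To make the requirement concrete, here is a rough sketch (not my actual program) of the kind of error-tolerant search I mean, on random 0/1 data: it counts the mismatches of every window against the pattern and keeps the positions below the error threshold. It assumes NumPy >= 1.20 for sliding_window_view, processes the windows in chunks to keep memory bounded, and leaves out the FASTA-to-binary conversion; the file name in the commented Biopython lines is only an example.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Reading a chromosome with Biopython would look roughly like this:
# from Bio import SeqIO
# record = next(SeqIO.parse("AE005173.fasta", "fasta"))
# letters = str(record.seq)

def fuzzy_positions(text, pattern, max_errors, chunk=100_000):
    """Start positions where pattern matches text with at most max_errors mismatches."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(0, n - m + 1, chunk):
        stop = min(start + chunk, n - m + 1)
        windows = sliding_window_view(text[start:stop + m - 1], m)   # one row per offset
        mismatches = (windows != pattern).sum(axis=1)                # Hamming distance per offset
        hits.extend((start + np.nonzero(mismatches <= max_errors)[0]).tolist())
    return hits

rng = np.random.default_rng(0)
dna = rng.integers(0, 2, size=3_000_000, dtype=np.int8)    # stand-in for the binary genome
erdos = rng.integers(0, 2, size=1_000, dtype=np.int8)      # stand-in for the Erdos pattern
dna[6596:6596 + 1_000] = erdos                             # plant one exact copy
noisy = erdos.copy()
noisy[:50] ^= 1                                            # 50 mismatches = 5% error
dna[569_872:569_872 + 1_000] = noisy                       # plant one noisy copy
positions = fuzzy_positions(dna, erdos, max_errors=100)    # 10% tolerance
print(len(positions), positions)                           # expected: 2 [6596, 569872]
This is still O(n·m) work, so a whole 3-billion-letter genome would need a smarter filter, but it shows the position-plus-count output I'm after.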
Reply
#15
In my code:
  • vect4 is the pattern
  • in the loop, I'm extracting parts of vect3 having the size of the pattern, and then I'm comparing it to vect4 (True or False)
  • at the same time the "check" vector (composed of booleans) is updated
  • At the end, I'm looking for "True" values in order to get both the position and the occurrence(s) of the pattern

Some remarks:
  1. I've not been able to avoid the loop
  2. this is true in Python but not only there: avoid dynamic (memory) allocation to speed up your code
  3. I guess we can adapt it for

hope it helps

I noticed a mistake in my previous code; please consider the following version, which looks for a pattern of 1000 components in a vector of 3 million elements (it took 62 seconds on my old laptop):
import numpy as np
import time

n = 1000
vect3 = np.random.randint(2, size=3000*n)
vect4 = np.random.randint(2, size=n)
## 4 occurrences are created manually
vect3[1:1+n] = vect4
vect3[6596:6596+n] = vect4
vect3[10023:10023+n] = vect4
vect3[569872:569872+n] = vect4

## initialization / no match by default
check = np.full(3000*n, False, dtype=bool)

## partially vectorized
t0 = time.time()
for i in range(3000*n):
    check[i] = np.array_equiv(vect3[i:n+i], vect4)

sol = np.where(check)               ## the True occurrences and their positions (np.where returns a tuple)
if sol[0].size == 0:
    print("No match")
else:
    l = np.size(sol)
    print("There are %d match(es)" % l)
    print("Positions: %s" % sol)

t1 = time.time()
print("Duration method: ", t1-t0)
Reply
#16
This works, and it's fast!
My directory structure looks like this:
Output:
Genetics
|_ data
   |_ fasta
      CM000665.fasta
|_ src
   This program
You can modify the code for whatever structure you're using.

I think it will return multiple matches as well.

As a control, I tried an exact match first, then messed up one base and tried a second (fuzzy) match.

Hope this is helpful:
from fuzzysearch import find_near_matches
from pathlib import Path
import re
import os


class FuzzyFind:
    '''
    This is a very simplistic example, but can be used as a starting point
    '''
    def __init__(self):
        # Create directory anchor to find Sequence files
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        # Get path to files:
        homepath = Path('.')
        seqpath = homepath / '..' / 'data' / 'fasta'
        CM000665 = seqpath / 'CM000665.fasta'
        with CM000665.open() as fp:            
            bigseq = fp.read()
        
        ex_dna = 'GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA'

        # Test exact match first
        result = self.fuzzy_find(bigseq, ex_dna)
        for item in result:
            print(f'\nLooking for: {ex_dna}')
            print(f'item: {item}, type: {type(item)}')
            # parse start/end out of this match's repr (each fuzzysearch Match)
            nums = re.findall(r'[\d\.]+', str(item))
            start = int(nums[0])
            end = int(nums[1])
            found_seq = bigseq[start:end]
            print(f'exact match start: {start}, end: {end}: {found_seq}')

        # Now mess up a base and try again
        ex_dna = 'AGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA'
        result = self.fuzzy_find(bigseq, ex_dna)
        for item in result:
            print(f'\nLooking for: {ex_dna}')
            print(f'item: {item}, type: {type(item)}')
            # parse start/end out of this match's repr (each fuzzysearch Match)
            nums = re.findall(r'[\d\.]+', str(item))
            start = int(nums[0])
            end = int(nums[1])
            found_seq = bigseq[start:end]
            print(f'fuzzy match: {start}, end: {end}: {found_seq}')

    def fuzzy_find(self, main_seq, find_seq):
        matches = find_near_matches(find_seq, main_seq, max_l_dist=2)
        return matches

if __name__ == '__main__':
    FuzzyFind()
results:
Output:
Looking for: GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA
item: Match(start=31203914, end=31203990, dist=1), type: <class 'fuzzysearch.common.Match'>
exact match start: 31203914, end: 31203990: GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAA CTCTCCCTCAACCTAGTTTAGAAA

Looking for: AGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA
item: Match(start=31203914, end=31203990, dist=2), type: <class 'fuzzysearch.common.Match'>
fuzzy match: 31203914, end: 31203990: GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAA CTCTCCCTCAACCTAGTTTAGAAA
You will need to install fuzzysearch:
pip install fuzzysearch
Reply
#17
Thank you very much, Larz60+! I think you found the solution to my problem!
Reply
#18
Glad to hear it
Reply
#19
The only problem with this function from the fuzzysearch library is that it is fast only for small error counts. If I set max_substitutions=100 in my code, it takes approximately 1 minute to run; with max_substitutions=120 it takes about 5-10 minutes. But with max_substitutions=200 it gets stuck.
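One standard trick for large error budgets (not something fuzzysearch does for you, just a sketch of the idea) is the pigeonhole filter: split the pattern into max_substitutions + 1 pieces, find exact occurrences of each piece with str.find, and verify only those candidate positions by counting mismatches. It helps most when the pieces stay reasonably long, i.e. when the error budget is small compared to the pattern length; the names below are made up for the example.
def count_mismatches(a, b):
    # Hamming distance between two equal-length strings
    return sum(x != y for x, y in zip(a, b))

def pigeonhole_search(text, pattern, max_subs):
    """Positions where pattern occurs in text with at most max_subs substitutions."""
    m = len(pattern)
    piece_len = m // (max_subs + 1)
    if piece_len == 0:
        raise ValueError("pattern too short for this many substitutions")
    hits = set()
    for p in range(max_subs + 1):
        lo = p * piece_len
        piece = pattern[lo:lo + piece_len]   # with <= max_subs errors, some piece matches exactly
        i = text.find(piece)
        while i != -1:
            start = i - lo                   # candidate start of the full pattern
            if 0 <= start <= len(text) - m:
                if count_mismatches(text[start:start + m], pattern) <= max_subs:
                    hits.add(start)
            i = text.find(piece, i + 1)
    return sorted(hits)

# tiny usage example
print(pigeonhole_search("AAGGTTCCAAGCTTCCA", "GGTTCTAA", max_subs=2))   # -> [2]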
Reply
#20
For 100 substitutions, considering the number of combinations involved, I think that's quite reasonable! The others are not so good, but again, remember it's not a linear search: each additional substitution increases the search time much faster than linearly.
Reply

