Posts: 21
Threads: 7
Joined: Nov 2018
Apr-15-2019, 12:19 PM
(This post was last modified: Apr-15-2019, 12:19 PM by pianistseb.)
For example! Here is the bottom part chromosome of Arabidopsis:
https://www.ebi.ac.uk/ena/data/view/AE005173
(you can reach this page if you press the right button AE005173 in the eucariotic chromosome list). To download the fasta file just press: "download fasta".
Posts: 299
Threads: 72
Joined: Apr 2019
Apr-15-2019, 05:42 PM
(This post was last modified: Apr-15-2019, 08:12 PM by paul18fr.)
your topic is quite challenging; I'm not a specialist but it piqued my interest
Find here after another code (among others ) to change "0" to "1" in your "big_Erdos1" list:
t0 = time.time()
big_Erdos1=list("01101001101101001001101001001101001101101011001101000101101001001111001001101001101100001101101011001101010001101001101101001101001000101101001101101001001101101001101001110101000011101001001110001101101001001110010101001011101101001001101001101101001001100111000111010001101001100101001011101001101101001101001001101001100101101010010101101001101001001101010110101011000101001111001000110101101001101001001101001111001000101111001001101101001101001001101001101101001101001001101101000011101001100111000101101001011000101101101001001100111101001001001111001001110001101101001100101001001101101001001011101101001001100101101101001001001101101001110001101001101100011100011011000011100101101001001101001101101011001100110000111011001100001101110011000111000010101101101101010000111001101001101101001001001101001101101001100101001010101111000101011001001010101101011100101001001011100101110001101001001101100110001101011001101000101101001011001101001101001011101001100101011100101001101101001001011000101101011001101001110100001100101101011001101001101011001101000101101001110000101011001101001101101101001001001111000110001010110101110010110010100101011000111011")
big_Erdos1 = np.asarray(big_Erdos1, dtype = int);
ones_ = np.where(big_Erdos1 == 1);
zeros_ = np.where(big_Erdos1 == 0);
big_Erdos1[ones_] = 0;
big_Erdos1[zeros_] = 1;
t1 = time.time()
print("Duration step 1 : ", t1-t0) But I do not understand the second part; in your first post you said it's difficult to look for thousands of letters ; I played with the following code in order to look for a pattern of 100 000 numbers; I cannot avoid loops but it takes about 18 seconds to find 1 occurence I've created (it can probably improved).
P
from numba import jit
import numpy as np
import time, os, re
n = 100000;
vect3 = np.random.randint(2,size = 2*n);
vect4 = np.random.randint(2,size = n);
vect3[1506:1506+n] = vect4;
## initialization / No match by default
check = np.full(2*n, False, dtype = bool);
## partially vectorized
t0 = time.time()
for i in range(2*n):
check[i] = np.array_equiv(vect3[i:n+i],vect4);
sol = np.where(check == True); ## we're lokking for True occurences
if (sol == []):
print("No match")
else:
l = np.size(sol);
print("There are %d matche(s)" %l)
t1 = time.time()
print("Duration method 1: ", t1-t0)
## using numba
t0 = time.time()
check2 = np.full(2*n, False, dtype = bool);
## function for Numba
#@jit(nopython=True, parallel = True) ## numpy.array_equiv seems not to be supproted :-(
@jit(parallel = True)
def function_check(n, check2, vect3, vect4):
for i in range(2*n):
check2[i] = np.array_equiv(vect3[i:n+i],vect4);
pass;
## function call
function_check(n, check2, vect3, vect4);
sol = np.where(check2 == True);
if (sol == []):
print("No match")
else:
l = np.size(sol);
print("There are %d matche(s)" %l)
t1 = time.time()
print("Duration method 2: ", t1-t0) I do not master Numba enough to avoid the error
Posts: 12,021
Threads: 484
Joined: Sep 2016
Ok have all of the data.
I would like to request once more the ultimate goal of the exercise.
What is the output, and it's structure?
Knowing the goal can produce a different way at solving the problem.
Posts: 21
Threads: 7
Joined: Nov 2018
Apr-15-2019, 07:36 PM
(This post was last modified: Apr-15-2019, 07:36 PM by pianistseb.)
My goal is to detect the positions that are similar (similar means that I accept an error like 10% that I can change in the program) to Erdos sequence in DNA or chromosome and put it in a text file. I want the position because, then I will make a graph in a neighborhood near these positions. I want also to compute how many times I detect this sequence because it's important for my research.
My problem is not the first part, where I change the 4-letters alphabet to binary, but the second part where I am searching to find in DNA this sequence. I don't say that it's impossible to run fast a program that does that in python, I just said that my program is too slow for the job I want to do.
So:
-Input: Erdos pattern and dna dequence.
-Output: The position of erdos pattern with an error in the dna sequence and the number of times that we found it.
Note 1: paul18fr, I am not sure if I understand your code for the second part, but I will check it. I am searching a 1000 (or more) letters pattern in a DNA sequence. A DNA sequence is longer. A whole genome can be about 3000000000 letters. The DNA, chromosome or genome sequence is included in the fasta file. To read a fasta file you need biopython.
Note 2: I don't know how many letters would be different from Erdos sequence. So I don't know the error that I have to put. I must try for different errors and genomes.
Posts: 299
Threads: 72
Joined: Apr 2019
Apr-15-2019, 08:14 PM
(This post was last modified: Apr-15-2019, 08:14 PM by paul18fr.)
In my code:
- vect4 is the pattern
- in the loop, I'm extracting parts of vect3 having the size of the pattern, and then I'm comparing it to vect4 (True or False)
- at the same time the "check" vector (composed of booleans) is updated
- At the end, I'm looking for "True" values in order to get both the position and the occurence(s) of the pattern
Some remarks:
- I've not been able to avoid the loop
- it's true in Python but not only: avoid (memory) dynamic allocation to speed up your code
- I guess we can adapt it for
hope it helps
I noticed some mistake in my prevous code; please consider the following one that is looking for a pattern of 1000 components in a vector of 3 million lines (it took 62 sec on my old laptop)
import numpy as np
import time, os, re
n = 1000;
vect3 = np.random.randint(2,size = 3000*n);
vect4 = np.random.randint(2,size = n);
## 4 occurences are manually created
vect3[1:1+n] = vect4;
vect3[6596:6596+n] = vect4;
vect3[10023:10023+n] = vect4;
vect3[569872:569872+n] = vect4;
## initialization / No match by default
check = np.full(3000*n, False, dtype = bool);
## partially vectorized
t0 = time.time()
for i in range(3000*n):
check[i] = np.array_equiv(vect3[i:n+i],vect4);
sol = np.where(check == True); ## we're lokking for True occurences and its positions (see Tuple)
if (sol == []):
print("No match")
else:
l = np.size(sol);
print("There are %d matche(s)" %l)
print("Positions: %s" %sol)
t1 = time.time()
print("Duration method: ", t1-t0)
Posts: 12,021
Threads: 484
Joined: Sep 2016
Apr-15-2019, 08:34 PM
(This post was last modified: Apr-15-2019, 08:34 PM by Larz60+.)
This works, and it's fast!
my directory structure is like:
Output: Genetics
|_ data
|_ fasta
CM000665.fasta
|_ src
This program
You can modify code for whatever structure you're using
I thing it will return multiple matches as well.
For control, I tried first with an exact match, then messed up one base and tried for second (fuzzy) match
Hope this is helpful:
from fuzzysearch import find_near_matches
from pathlib import Path
import re
import os
class FuzzyFind:
'''
This is a very simplistic example, but can be used as a starting point
'''
def __init__(self):
# Create directory anchor to find Sequence files
os.chdir(os.path.abspath(os.path.dirname(__file__)))
# Get path to files:
homepath = Path('.')
seqpath = homepath / '..' / 'data' / 'fasta'
CM000665 = seqpath / 'CM000665.fasta'
with CM000665.open() as fp:
bigseq = fp.read()
ex_dna = 'GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA'
# Test exact match first
result = self.fuzzy_find(bigseq, ex_dna)
for item in result:
print(f'\nLooking for: {ex_dna}')
print(f'item: {item}, type: {type(item)}')
nums = re.findall(r'[\d\.]+', str(result))
start = int(nums[0])
end = int(nums[1])
found_seq = bigseq[start:end]
print(f'exact match start: {start}, end: {end}: {found_seq}')
# Now mess up a base and try again
ex_dna = 'AGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA'
result = self.fuzzy_find(bigseq, ex_dna)
for item in result:
print(f'\nLooking for: {ex_dna}')
print(f'item: {item}, type: {type(item)}')
nums = re.findall(r'[\d\.]+', str(result))
start = int(nums[0])
end = int(nums[1])
found_seq = bigseq[start:end]
print(f'fuzzy match: {start}, end: {end}: {found_seq}')
def fuzzy_find(self, main_seq, find_seq):
matches = find_near_matches(find_seq, main_seq, max_l_dist=2)
return matches
if __name__ == '__main__':
FuzzyFind() results:
Output: Looking for: GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA
item: Match(start=31203914, end=31203990, dist=1), type: <class 'fuzzysearch.common.Match'>
exact match start: 31203914, end: 31203990: GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAA
CTCTCCCTCAACCTAGTTTAGAAA
Looking for: AGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAACTCTCCCTCAACCTAGTTTAGAAA
item: Match(start=31203914, end=31203990, dist=2), type: <class 'fuzzysearch.common.Match'>
fuzzy match: 31203914, end: 31203990: GGACTTTTCATTTAAGCTCCAAGCCCCAAATCTGGGGGGCTAGTTTAGAAA
CTCTCCCTCAACCTAGTTTAGAAA
you will need to import fuzzysearch:
pip install fuzzysearch
Posts: 21
Threads: 7
Joined: Nov 2018
Thank you very much Larz60+ ! I think you found the solution to my problem!
Posts: 12,021
Threads: 484
Joined: Sep 2016
Posts: 21
Threads: 7
Joined: Nov 2018
The only problem with this function from fuzzysearch library is that it works fast only for small errors. If I put in my code max_substitutions=100, it takes approximately 1 minute to run, if I put max_substitutions=120 it takes about 5-10minutes. But if I put max_substitutions=200, it stacks.
Posts: 12,021
Threads: 484
Joined: Sep 2016
Apr-18-2019, 05:48 PM
(This post was last modified: Apr-18-2019, 05:48 PM by Larz60+.)
for 100 substitutions, considering the number of permutations, I think that's quite reasonable! the others not so good, but again, remember it's not a linear search, adding 1 substitution increases search time on a log scale.
|