Python Forum

Full Version: How can I make a faster search algorithm
I am using Biopython for DNA sequences. I want to write an algorithm that searches for a big motif (about 1000 letters). Because it is practically impossible to find this 1000-letter motif exactly, I also accept matches with up to 30% error. So I compare every 1000-letter chromosome subsequence of the whole DNA sequence with this motif and count how many letters differ. The problem is that the algorithm is too slow: it needs about one day to run. I work in the binary purines/pyrimidines alphabet. Generally speaking, I think Python is too slow to deal with big strings or lists of about a million or a billion letters in length. Why? I am running it in a Linux terminal because I want maximum speed. Is C really faster than Python?

Here is my code:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

for seq_record in SeqIO.parse("CM000665.fasta", "fasta"):
    new_str = str(seq_record.seq).replace("A", "1")
    new_str = new_str.replace("G", "1")
    new_str = new_str.replace("C", "0")
    new_str = new_str.replace("T", "0")
    
# Big Erdos Input

big_Erdos1=list("01101001101101001001101001001101001101101011001101000101101001001111001001101001101100001101101011001101010001101001101101001101001000101101001101101001001101101001101001110101000011101001001110001101101001001110010101001011101101001001101001101101001001100111000111010001101001100101001011101001101101001101001001101001100101101010010101101001101001001101010110101011000101001111001000110101101001101001001101001111001000101111001001101101001101001001101001101101001101001001101101000011101001100111000101101001011000101101101001001100111101001001001111001001110001101101001100101001001101101001001011101101001001100101101101001001001101101001110001101001101100011100011011000011100101101001001101001101101011001100110000111011001100001101110011000111000010101101101101010000111001101001101101001001001101001101101001100101001010101111000101011001001010101101011100101001001011100101110001101001001101100110001101011001101000101101001011001101001101001011101001100101011100101001101101001001011000101101011001101001110100001100101101011001101001101011001101000101101001110000101011001101001101101101001001001111000110001010110101110010110010100101011000111011")
big_Erdos2=""
for i in range(0,len(big_Erdos1)):
    if big_Erdos1[i]=="0":
        big_Erdos2+="1"
    else:
        big_Erdos2+="0"
big_Erdos2=list(big_Erdos2)
chromosome=list(new_str)

# Search big Erdos errata

big_erdos_number=0

file = open("results30.txt", "w") 
file.write("Erdos block positions:") 

for i in range(0, len(chromosome) - len(big_Erdos1) + 1):  # +1 so the last offset is checked
    S1=0
    S2=0
    for j in range(0,len(big_Erdos1)):
        if chromosome[j+i]!=big_Erdos1[j]:
            S1+=1
    for j in range(0,len(big_Erdos2)):
        if chromosome[j+i]!=big_Erdos2[j]:
            S2+=1
    error1=S1/len(big_Erdos1)
    error2=S2/len(big_Erdos2)
    #print(error)
    if error1<0.3 or error2<0.3:
        big_erdos_number+=1
        print("erdos block position:",i)
        file.write("%d,"%i) 
    #print("progress:",i/(len(chromosome)-len(big_Erdos1))*100,"%")

file.write("\n") 
file.write("Big erdos number:%d"%big_erdos_number) 
print("Big erdos number:",big_erdos_number)

file.close() 
Python is not too slow.
You need to fix the algorithm.
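For instance (a sketch with toy 0/1 arrays, not the poster's full pipeline): in a 0/1 alphabet the mismatch count at offset i is sum(window) + sum(motif) - 2*dot(window, motif), and both terms can be computed for all offsets at once with numpy's correlate/convolve, entirely in C.

```python
import numpy as np

# Toy 0/1 data; in the real problem these come from the FASTA translation.
chrom = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0], dtype=np.int64)
motif = np.array([0, 1, 1, 0], dtype=np.int64)
m = motif.size

# dot(window_i, motif) for every valid offset i, computed in C by numpy
corr = np.correlate(chrom, motif, mode="valid")
# sum of every length-m window of chrom
win_sum = np.convolve(chrom, np.ones(m, dtype=np.int64), mode="valid")

# For 0/1 values: per-element mismatch is x + y - 2*x*y, summed per window
mismatches = win_sum + motif.sum() - 2 * corr
hits = np.flatnonzero(mismatches / m < 0.3)  # offsets with < 30% error
```

This replaces the O(n*m) pure-Python inner loops with a handful of numpy calls over the whole sequence.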
where can I get the Bio package?
I know how to get the CM000665.fasta file from NCBI
Run a profiler on your code; see https://docs.python.org/3/library/debug.html
That would be the biopython package:

https://biopython.org/wiki/Download
The only function you are using from the biopython package is a helper for decoding fasta-formatted files. Are you sure the algorithm you are trying to implement isn't already implemented in biopython? Raw Python loops are slow (in comparison to loops in C, for example).
Nevertheless, it is misleading to say flatly that C is faster than Python. C and Python are both programming languages; the main difference is in how a program written in each of them is executed. A C program is compiled into machine code that is executed directly by the CPU. A Python program (assuming the standard execution model, the CPython interpreter) is translated to bytecode that is then executed/interpreted by the CPython interpreter (CPython stands for Python implemented in C). In effect, your program runs inside another program.
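To see that bytecode layer concretely, the standard-library dis module prints the instructions CPython interprets for a function:

```python
import dis

def add(a, b):
    return a + b

# Show the bytecode instructions the CPython interpreter
# executes when add() is called.
dis.dis(add)
```

Every one of those instructions goes through the interpreter's dispatch loop, which is where the per-iteration overhead of raw Python loops comes from.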
So, what you need to do...
1) Check whether biopython already has the algorithm you are trying to implement. If so, it is likely implemented in C (or Cython) and will execute much faster than it would if implemented with raw Python loops.
2) You can try to implement your algorithm using Cython. In that case the changes to your code would be minimal; Cython syntax, especially for loops, is almost the same as native Python.
If for loops aren't fast in Python, what is fast? Biopython has some functions, but anyone who wants to build something new has to use for loops. I started programming in Python because I thought it would be faster than MATLAB and easier than C, but now I find it is the slowest of these languages. I can't say it isn't useful, but it's not for every job.
I'll take a look at speeding this up.
First I need some info:
  • You have a huge bottleneck at the beginning (lines 13 through 19)
    Curious: why change from DNA to 0 and 1? Are you trying to count base 'G'?
  • Could you please explain what the goal is here? I'm having difficulty understanding.
  • Do you know the ftp site for downloading the fasta file (CM000665.fasta)? Something like ftp://ftp.ncbi.nih.gov/snp/organisms/hum.../rs_fasta/
Look at the following code:

# big_Erdos1=list(".... long string....
import numpy as np
def one():
    big_Erdos2=""
    for i in range(0,len(big_Erdos1)):
        if big_Erdos1[i]=="0":
            big_Erdos2+="1"
        else:
            big_Erdos2+="0"
    return big_Erdos2


res = np.array(["0"]*len(big_Erdos1))
data = np.array(big_Erdos1)  
def two():
    res[ data == "0"] = "1"
    return res.tolist()
If you measure the execution time of one() and two(), you get results like this:

Output:
> %timeit -n 1000 one()
258 µs ± 5.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> %timeit -n 1000 two()
49.3 µs ± 2.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The second solution is about 5x faster than the first. This is because I used numpy and assigned values to the array without any loops, using vectorized notation instead. Numpy is implemented in C, and such operations execute much faster than they would if implemented as raw Python loops.

So, you need to rewrite your data-processing algorithm using numpy's vectorized notation (avoiding loops), or, if you need to find a substring in another string, using regular expressions (the re module).
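A concrete sketch of that vectorized approach on toy data (assumes numpy >= 1.20 for sliding_window_view; note the boolean comparison below allocates an (n - m + 1, m) array, so a genome-sized chromosome should be processed in overlapping chunks):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy chromosome and motif, already translated to the 0/1 alphabet
chrom = np.frombuffer(b"0110100110", dtype="S1")
motif = np.frombuffer(b"0110", dtype="S1")
m = motif.size

# Zero-copy view of every length-m window, shape (n - m + 1, m)
windows = sliding_window_view(chrom, m)

# Vectorized Hamming distance for all offsets at once.
# NB: this comparison materializes (n - m + 1) * m booleans,
# so slice chrom into overlapping chunks for very large inputs.
mismatches = (windows != motif).sum(axis=1)
hits = np.flatnonzero(mismatches / m < 0.3)  # offsets with < 30% error
```

The entire double loop of the original post collapses into one broadcasted comparison plus a row sum, both executed in C.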
Thank you very much! I started learning Python some months ago, but I had no problems with speed because I didn't deal with big data. Now that my job involves DNA sequences, the data structures are very big. In DNA science it is important to find the effects of certain motifs. Erdos motifs are relatively new mathematical motifs. I changed to the binary alphabet because the first research on Erdos motifs was done in the binary alphabet; I want to generalize this research to the four-letter alphabet. The Erdos motif, which also appears in this program, is a mathematical pattern that has the minimum repetition among its elements. I have found some effects of this model in a hypothetical DNA, and now I want to find this motif in natural DNA. This is why I allow errors: it's impossible to find this motif in DNA unless errors are permitted. My goal is to search for this motif with errors and see whether the effects I have found are valid. Thank you very much, guys! Your help is very important to me!
Here is the site where you can download any fasta file you want!
https://www.ebi.ac.uk/genomes/eukaryota.html
Now I am learning numpy to make my code faster! I feel very relieved, because using this package I will find a solution to my problem.
Quote: Here is the site where you can download any fasta file you want!
I saw where you could display it as text, but couldn't find a download option.
You can do the same with NCBI, and they do have ftp sites, but I couldn't find one for CM000665.fasta; again I got only a display. I know there's a way to download, as I've done it before.
Hi

Interesting topic (briefly read, I confess); let me make a few remarks:
- if you're dealing with numbers (0 and 1 typically), then you can use numpy and vectorization to speed up your code
- it can be combined with slicing to get and compare arrays/vectors
- have a look at numpy.array_equal to compare two arrays/vectors (never tested so far, but I imagine the size of the vectors is not an issue)

If you cannot avoid loops, have a look at the Numba library (http://numba.pydata.org)

Finally, some interesting benchmarking has been done by NASA to compare C, Python, Julia and other languages, and C is much faster than Python
(search for "Basic Comparison of Python, Julia, Matlab, IDL and Java (2018 Edition)" plus NASA on Google); it's also interesting to note that vectorization is very powerful.

Paul

import numpy as np

n = 1000000
vect1 = np.random.randint(2, size=n)
vect2 = np.copy(vect1)
identical = np.array_equiv(vect1, vect2)   # True

# now I change 1 value to 2
chang = np.random.randint(n, size=1)
vect2[chang] = 2
identical2 = np.array_equiv(vect1, vect2)  # False

extract_part = vect2[100:350]              # slicing out a sub-array
