Posts: 2
Threads: 1
Joined: Aug 2018
Aug-29-2018, 10:46 AM
(This post was last modified: Aug-29-2018, 11:41 AM by Destry23000.)
Hi,
I'm working with DNA and am trying to run a simple program to clean up the sequences I get from online databases. I need to convert all lowercase letters to uppercase, ignore anything that isn't an A, T, C, or G, and store the result in a new global variable. Unfortunately, the code I currently have (below) does work, but it is EXTREMELY slow. Does anyone know a better/faster/more efficient way of doing this? I currently have it running over the human genome, which is 6.6 billion characters long, and it has been going for 7 days and counting.
def CleanText(Text):
    Gen = ""
    global Genome
    Genome = Gen
    for i in Text:
        if i == "a":
            Gen += "A"
        elif i == "t":
            Gen += "T"
        elif i == "c":
            Gen += "C"
        elif i == "g":
            Gen += "G"
    Genome = Gen
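The main cost in the function above is the repeated Gen += ...: strings are immutable, so each += copies the growing result, which makes the loop roughly quadratic in the input length. A minimal sketch of the usual fix, using a hypothetical clean_text replacement that mirrors the behaviour of CleanText (it assumes the bases to keep arrive in lowercase, as in that code), collects the kept characters and joins them once at the end:

# Sketch only: same behaviour as CleanText above (keeps lowercase a/t/c/g,
# emits uppercase), but joins once instead of growing a string with +=.
_UPPER = {"a": "A", "t": "T", "c": "C", "g": "G"}

def clean_text(text):
    return "".join(_UPPER[ch] for ch in text if ch in _UPPER)

The replies below take this further with chunked file reading, regular expressions, and bytes.translate().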
Posts: 4,781
Threads: 76
Joined: Jan 2018
Aug-29-2018, 12:21 PM
(This post was last modified: Aug-29-2018, 12:35 PM by Gribouillis.)
I suggest converting the file externally with a C program, for speed:
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

/* Keep only the bases a/c/g/t (either case) and write them out in uppercase. */
int main()
{
    FILE *fp1, *fp2;
    int ch;  /* int, not char: fgetc() returns an int so EOF can be detected */

    fp1 = fopen("source.txt", "r");
    if (fp1 == NULL)
    {
        puts("Cannot open source.txt");
        exit(1);
    }
    fp2 = fopen("target.txt", "w");
    if (fp2 == NULL)
    {
        puts("Cannot open target.txt");
        fclose(fp1);
        exit(1);
    }
    while ((ch = fgetc(fp1)) != EOF)
    {
        switch (ch) {
        case 'a':
        case 'c':
        case 'g':
        case 't':
        case 'A':
        case 'C':
        case 'G':
        case 'T':
            ch = toupper(ch);
            fputc(ch, fp2);
        }
    }
    fclose(fp1);
    fclose(fp2);
    printf("File successfully converted.\n");
    return 0;
}

Edit: Actually, using fgetc() is much slower than using fread(). This can be improved a lot, but on my computer it takes only 0.1 second for 10 MB.
Posts: 8,151
Threads: 160
Joined: Sep 2016
Aug-29-2018, 12:55 PM
(This post was last modified: Aug-29-2018, 12:55 PM by buran.)
Where do you get the genome (i.e. the Text variable that you pass as an argument)? Can you share a sample file, or even the full one? Is everything lowercase in the input?
Anyway, something like:
from string import ascii_lowercase
from random import choice
import time

test_input = ''.join(choice(ascii_lowercase) for _ in range(1000000))
print('Length of test input: {}'.format(len(test_input)))
print('First 100 chars of test input: {}'.format(test_input[:100]))

def clean_genome(genome):
    return ''.join(ch for ch in genome if ch in 'acgt').upper()

start = time.time()
genome = clean_genome(test_input)
print('Elapsed time: {:.3f}'.format(time.time() - start))
print('Length of clean genome: {}'.format(len(genome)))
print('First 100 chars of clean genome: {}'.format(genome[:100]))
Output:
Length of test input: 1000000
First 100 chars of test input: vxzjjyuhsiigegteehbagvwxgjklewxdouakcgtikgvufkngnfindcnwxqhnvnastbknrvygiznrcszbltamwwkmbyxsglrlgczi
Elapsed time: 0.055
Length of clean genome: 154155
First 100 chars of clean genome: GGTAGGACGTGGCATGCTAGGCTGATCTACTATATGCCTGCCTCAAGACGGATATGTCAGGTATCATACCGGAAGTTTATCGCCTGTGCTAATAACCCTT
As you can see, for a string of length 1,000,000 it takes about 0.055 seconds to clean it, i.e. for 10 billion chars it should take some 10 minutes to process. Note that it may be better to iterate over the input (file) and clean it, instead of reading everything into memory. I decided to run my code with a test input of 10 bln. chars: it consumed all of my 12 GB RAM plus, so far, 7 GB of swap, and it is still going before the function is even called.
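A minimal sketch of that chunked idea (the file names here are just placeholders), applying the same clean_genome() to fixed-size blocks so the whole genome never has to sit in memory at once:

# Sketch only: clean an assumed input file block by block instead of
# reading the entire sequence into memory first.
def clean_genome(genome):
    return ''.join(ch for ch in genome if ch in 'acgt').upper()

def clean_file(src='genome.txt', dst='clean.txt', chunk_size=1 << 20):
    with open(src) as infile, open(dst, 'w') as outfile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break
            outfile.write(clean_genome(chunk))

Because the filter works one character at a time, block boundaries need no special handling.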
Posts: 8,151
Threads: 160
Joined: Sep 2016
To elaborate further, it may be slow not because of the calculations, but because you keep the whole string in memory and don't have sufficient resources to handle that. I'm running a new test with a generator that yields one char at a time.
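Such a generator might look like the following sketch (the file name is an assumption): it reads fixed-size blocks under the hood and yields them one character at a time.

# Sketch only: lazily yield single characters from an assumed file,
# reading it in 4096-character blocks internally.
def chars(path, block_size=4096):
    with open(path) as f:
        while True:
            block = f.read(block_size)
            if not block:
                return
            yield from block

Something like ''.join(ch.upper() for ch in chars('genome.txt') if ch in 'acgt') would then clean the sequence without ever holding the raw file in memory, although the cleaned result is still built as one string.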
Posts: 4,781
Threads: 76
Joined: Jan 2018
Here is a Python version which is only two times slower than the C version (with the same result):
import re

pat = re.compile(r'[^aAcCtTgG]+')

def main():
    with open('source.txt', 'r') as infile:
        with open('target2.txt', 'w') as outfile:
            while True:
                b = infile.read(4096)
                if b:
                    outfile.write(pat.sub('', b).upper())
                else:
                    return

if __name__ == '__main__':
    main()
Posts: 2
Threads: 1
Joined: Aug 2018
Aug-29-2018, 02:46 PM
(This post was last modified: Aug-29-2018, 02:46 PM by Destry23000.)
How do you iterate over input from a file without loading it fully into memory first? Sorry, I'm clearly very new at this.
Never mind, I sorted it out myself. It just took a bit of common sense.
Posts: 4,781
Threads: 76
Joined: Jan 2018
(Aug-29-2018, 02:38 PM)Destry23000 Wrote: How do you iterate over input from a file without loading it fully into memory first? Sorry, I'm clearly very new at this.
See my example above with the while True loop and the call to infile.read(4096), which reads at most 4096 characters from the file.
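An equivalent, slightly more compact way to write that kind of loop (a sketch with the same assumed file names) uses iter() with a sentinel, which keeps calling infile.read(4096) until it returns an empty string:

# Sketch only: same regex cleaning as above, but the chunk loop is written
# with iter()'s two-argument (callable, sentinel) form.
import re

pat = re.compile(r'[^aAcCtTgG]+')

def main():
    with open('source.txt') as infile, open('target2.txt', 'w') as outfile:
        for block in iter(lambda: infile.read(4096), ''):
            outfile.write(pat.sub('', block).upper())

if __name__ == '__main__':
    main()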
Posts: 4,781
Threads: 76
Joined: Jan 2018
Here is a new version (for Python 3) using bytes.translate() instead of the re module. It is almost as fast as the above C code on a 10 MB file:
x = bytearray(b' ' * 256)
for u, v in zip(b'actg', b'ACTG'):
    x[u] = x[v] = v
table = bytes.maketrans(bytes(range(256)), x)

def main():
    with open('source.txt', 'rb') as infile, open('target2.txt', 'wb') as outfile:
        while True:
            b = infile.read(4096)
            if b:
                outfile.write(b.translate(table).replace(b' ', b''))
            else:
                return

if __name__ == '__main__':
    main()
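For what it's worth, bytes.translate() also accepts a second argument of bytes to delete, so the space placeholder and the extra replace() pass can be avoided. A sketch of that variant, with the same assumed file names and a hypothetical target3.txt output:

# Sketch only: lowercase acgt map to uppercase; every byte that is not one
# of acgtACGT goes into the delete set and is dropped by translate().
keep = b'acgtACGT'
table = bytes.maketrans(b'acgt', b'ACGT')
delete = bytes(b for b in range(256) if b not in keep)

def main():
    with open('source.txt', 'rb') as infile, open('target3.txt', 'wb') as outfile:
        while True:
            block = infile.read(4096)
            if not block:
                return
            outfile.write(block.translate(table, delete))

if __name__ == '__main__':
    main()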