Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Extremely Long String Help
#1
Hi,
I'm working with DNA and am trying to run a simple program to clean up the sequences I get from online databases. I need to convert all lower case letters to uppercase ones and ignore anything that isn't an A, T, C, or G. I want this new string in a new global variable. Unfortunately the code I currently have (below), which does work, is EXTREMELY slow. Does anyone know a better/faster/more efficient means of doing this? I currently have it running over the human genome which is 6.6 billion characters long and it has been going for 7 days and counting.

def CleanText(Text):
Gen = ""
global Genome
Genome = Gen
for i in Text:
if i == "a":
Gen += "A"
elif i == "t":
Gen += "T"
elif i == "c":
Gen += "C"
elif i == "g":
Gen += "G"
Genome = Gen

Sorry, I forgot to format it correctly.

def CleanText(Text):
Gen = ""
global Genome
Genome = Gen
for i in Text:
if i == "a":
Gen += "A"
elif i == "t":
Gen += "T"
elif i == "c":
Gen += "C"
elif i == "g":
Gen += "G"
Genome = Gen 
Reply
#2
I suggest to convert the file externally by a C program for speed
#include<stdio.h>
#include<stdlib.h>
#include<ctype.h>
/* This is a C program */
int main()
{
      FILE *fp1, *fp2;
      char ch;
      fp1 = fopen("source.txt", "r");
      if (fp1 == NULL)
      {
            puts("Cannot open source.txt");
            exit(1);
      }
      fp2 = fopen("target.txt", "w");
      if (fp2 == NULL)
      {
            puts("Cannot open target.txt");
            fclose(fp1);
            exit(1);
      }
      while((ch=fgetc(fp1))!=EOF)
      {
          switch(ch){
              case 'a':
              case 'c':
              case 'g':
              case 't':
              case 'A':
              case 'C':
              case 'G':
              case 'T':
                ch = toupper(ch);
                fputc(ch,fp2);
          }
      }
      printf("File successfully converted.\n");
      return 0;
}
Edit: Actually, using fgetc() is much slower than using fread(). This can be improved a lot but on my computer it takes only 0.1 second for 10 MB..
Reply
#3
Where do you get the genome (i.e. Text variable that you pass as argument)? Can you share sample file or even the full one? Is everything lowercase in the input?
anyway, something like
from string import ascii_lowercase
from random import choice
from timeit import timeit
import time



test_input = ''.join(choice(ascii_lowercase) for _ in range(1000000))
print('Length of genome: {}'.format(len(test_input)))
print('First 100 chars: {}'.format(test_input[:100]))
def clean_genome(genome):
    return ''.join(ch for ch in genome if ch in 'acgt').upper()

start = time.time()
genome = clean_genome(test_input)
print('Elapsed time: {:.3f}'.format(time.time()-start))
print('Length of genome: {}'.format(len(genome)))
print('First 100 chars: {}'.format(genome[:100]))
Output:
Length of test input: 1000000 First 100 chars of test input: vxzjjyuhsiigegteehbagvwxgjklewxdouakcgtikgvufkngnfindcnwxqhnvnastbknrvygiznrcszbltamwwkmbyxsglrlgczi Elapsed time: 0.055 Length of clean genome: 154155 First 100 chars of clean genome: GGTAGGACGTGGCATGCTAGGCTGATCTACTATATGCCTGCCTCAAGACGGATATGTCAGGTATCATACCGGAAGTTTATCGCCTGTGCTAATAACCCTT
as you can see for a string with length 1 000 000 it takes about 0.055 seconds to clean it. i.e. for 10 billion chars it should take some 10 minutes to process Note that it may be better to iterate over the input (file) and clean, instead of reading everything in memory. I decided to run my code with test input of 10 bln. chars. It consumed all of my 12 GB RAM + up to now 7GB of swap memory and continue before even call the function.
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#4
To elaborate further, it may be slow, not due to calculations, but because you keep the whole string in memory and insufficient resources to handle this. I'm running new test with generator that yields one char at a time.
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#5
Here is a python version which is only two times slower than the C version (with the same result)
import re

pat = re.compile(r'[^aAcCtTgG]+')

def main():
    with open('source.txt', 'r') as infile:
        with open('target2.txt', 'w') as outfile:
            while True:
                b = infile.read(4096)
                if b:
                    outfile.write(pat.sub('', b).upper())
                else:
                    return
                
if __name__ == '__main__':
    main()
Reply
#6
How do you iterate over an input from a file without fully importing it to memory first? Sorry, I'm clearly very new at this.

Nevermind I sorted it myself. It just took a bit of common sense
Reply
#7
(Aug-29-2018, 02:38 PM)Destry23000 Wrote: How do you iterate over an input from a file without fully importing it to memory first? Sorry, I'm clearly very new at this.
See my example above with the while True loop and the call to infile.read(4096) which reads at most 4096 characters in the file.
Reply
#8
Here is a new version (for python 3) using bytes.translate() instead of the re module. It is almost as fast as the above C code on a 10 MB file
x = bytearray(b' ' * 256)
for u, v in zip(b'actg', b'ACTG'):
     x[u] = x[v] = v
table = bytes.maketrans(bytes(range(256)), x)

def main():
    with open('source.txt', 'rb') as infile, open('target2.txt', 'wb') as outfile:
        while True:
            b = infile.read(4096)
            if b:
                outfile.write(b.translate(table).replace(b' ',b''))
            else:
                return
                
if __name__ == '__main__':
    main()
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Isolate a word from a long string nicocorico 2 1,546 Feb-25-2022, 01:12 PM
Last Post: nicocorico
  Split a long string into other strings with no delimiters/characters krewlaz 4 2,802 Nov-15-2019, 02:48 PM
Last Post: ichabod801
  How to create a long string by concatenating two lists nikhilkumar 11 7,559 Jul-12-2017, 05:36 PM
Last Post: wavic

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020