Python Forum
Extracting a portion of a text document
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Extracting a portion of a text document
#1
I am working in a small project where I need to extract a nucleotide portion from a bacterial genome. The bacterial genome has 30000000 characters and need to extract from nucleotide 94442 to 95255. I have no programming experience but I am learning. I used the following code to perform the extraction


>>> first = open(r"C:\Users\cepo\Desktop\Python\AvinosumDSM180.txt","r")
>>> first.seek(94443)
94443
>>> sep = first.read(95255-94443)
>>> print(sep)

This code seemed to work and I got the following result from it
CCGACTGCCATGTCCTCGG
GCGTTTGCCCGCGACCCATCTGCTGCTGCATCGCAACGGCGCCGCGCCCTGGTTCATCCTGGTTCCTGAA
ACCGATCTGGCCAACCTCCTGGATCTGCCGGCCGCGCACCGTGATGCCGTCCTAGCCGACTGCACGCGCG
TTTCGGATGCACTGGGCACGCTGGGTTATCCCAAGATCAACGTCGCCTGGATCGGTAATCTGGTGCCACA
GCTCCACATCCATGTCATCGGGCGTCGTCCCGGCGATGCCTGTTGGCCGCGACCGGTGTGGGGGCATCTG
CCGGCAGAGCGGGACTATGCCGAGCACGAAATCACGGCGCTCCGCGCGGCGGTCCTGGATTGAGAGCGCC
GGCTCCATCGTCCACTGACCTGTTCAGACGCAACGGAGGAACCGCGCGTTCTGACCGGCCATCACCCCAG
CTCGCCATCGAGATAGAACCAGCGCCCGTGCTCGCGCACGAAGCGACTGCGCTCCTGGAGGCGCTGGGCA
CGGCCCTGGAGCTTGGAGCGGGCCACGAACGTCACCCAGCCCTCCTGGTCCGTTGCGCCTCCGGCTTCGG
TGCTCAGGATCTTGAGACCGAGCCAGCGAAGTCCCGGCTCCAGGGTCAGCGTGGCCGGACGGGTTGTCGG
ATGCCAGGTGGCGAGCAGATAGTCAGCCTGCCCGGTGGCAAAGGCGCTGTAGCGCGAGCGCATCAGGGCC
TCGGCTGTCGGTGCGATGGTACGGGCGGACAGATGAGGACCGCAGCAGTCGTCGAAAGGGCGGCCGGAGC
CGCAGAGACAG

The problem is that 95255-94443 is equal to 812 characters so I should
have gotten 812 characters extraction and instead, I got 800 only. I
am at a complete loss as to why is python discarding 12 characters, which
I need to be able to find the protein this DNA sequence encodes for.

Please advice.
Reply
#2
It's not clear from what you've shown, but I'm guessing your genome file has multiple lines. If so, for each new line there would be a special character read ('\n') to indicate the end of the line. Your output does show 13 lines that would use 12 new line characters.
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
Reply
#3
where can this sequence be downloaded?
Reply
#4
Thank you ichabod801 and Larz60+ for taking the time to answer me. The genome does not have a /n to indicate the end of the line. The genome can be found here https://www.ncbi.nlm.nih.gov/nuccore/NC_...port=fasta. What I did is o copy only the genomic sequence into a notepad and gave it a name. The text file I used to run the code is here http://s000.tinyupload.com/?file_id=1106...4885590464
I hope is not an easy fix and thanks again for helping me.
Reply
#5
It has new line characters ('\n'). I just copied it to a text file, read the text into Python, and printed out the repr() of the text. It has '\n' in it several times.
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
Reply
#6
I wrote the following code which extracts a slice, but it doesn't match yours.
Since each line has a linefeed, that has to be removes from that starting and ending index correct?
I haven't done that here yet, let me know how you'd like to handle that.
import os
from pathlib import Path


class SliceGenome:
    def __init__(self):
        # Make sure starting directory is source directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

        self.homepath = Path('.')
        self.datapath = self.homepath / '../data'
        self.datapath.mkdir(exist_ok=True)

        self.sequencedir = self.datapath / 'sequences'
        self.sequencedir.mkdir(exist_ok=True)

        self.sequence_slicedir = self.datapath / 'sequence_slicedir'
        self.sequence_slicedir.mkdir(exist_ok=True)

        self.sequence = None

    def read_genome(self, filename):
        '''
        Initialization created a data directory one level above src code directory,
        with a sequences directory under that. Place all sequence files in this directory
        '''
        # want 8 bit ascii encoding, and skip 1st line
        with filename.open(encoding='ascii') as fp:
            self.sequence = fp.read()

    def get_savefilename(self, filename, start_idx, end_idx):
        return self.sequence_slicedir / f"{filename.name.split('.')[0]}{end_idx}-{start_idx}.txt"

    def get_slice(self, fn, start_idx, end_idx):
        filename = self.sequencedir / fn
        self.read_genome(filename)

        #account for zero base
        start_idx = start_idx -1
        end_idx = end_idx - 1

        idx1 = start_idx + (end_idx  - start_idx)
        slice = self.sequence[start_idx:idx1]
        savefilename = self.get_savefilename(filename, start_idx, end_idx)
        with savefilename.open('w') as fp:
            fp.write(slice)
        print(f'length: {len(slice)}, slice: \n{slice}')

def main():
    # change id for different sequence, and start and end indexes
    # slice will have name of sequence file, + indexes and found in ../data/sequence_slicedir
    sequence_filename = 'AvinosumDSM180.txt'
    start_idx = 94443
    end_idx = 95255
    sg = SliceGenome()
    sg.get_slice(sequence_filename, start_idx, end_idx)

if __name__ == '__main__':
    main()
Current results (which don't match yours):
Output:
length: 812, slice: CTGATCGCTCTTGGCGCCGACGGCTAATACCCATCCTCGCCACTATCCTGGTAGCCGT CGTTGTGTTCGTCTGGACGCTGAATCTGACCGACTGGCTGATGCGCCCACCGCCAGCCTCCACGGTCGAG TATCTGCCACATACCGACGGCGCCGAGAGGCTGGTTCCGCCACCCGTCAGCGAAGCCCTGAGCCTGGAAC GCTTCCAGGCCGCCCGGCGCGCGTGCGATGGTCCTTGTGTCACCGACTTCGGCACACCGCTGGGTCGCGC CAACGGCGTCGAGGCCCGCTCCAACTGTGCCTCGCTCTGCGTGCGGCTCGAATCGAGCTTTGTCGATCCC GACTCTGGCCGGATCTGGATCGCCCGCTCGGGCGAGCACCCCGAGCCGCTGGAATATTCCGGTCTGGCCT ATCAGTGCGTCGAGTATGCGCGGCGCTGGTGGATCCAGACGCTCGGCCTGACCTTCGGCGACGTGCCGAC GGCCGCCGACATCCTGCGCCTCACCGAGGGCCGGCGTCTGTCCGATCAGGCGGTCATTCCGCTCGGTCGG TCGCTCAACGGACACGCCCGCCGCGCCCCCGAACGCGGTGATCTGGTGATCTATGCCGCCGATCCCAACG ACCCGGAGTGGCGCGCCGGGCATGTCGCCGTTGTGGTCGACACCGATCTCGAACAGGGCTGGGTCGCACT GGCCGAGCAGAACTACGACAACCGTCCCTGGAGCGATCCGGAGTACCATGCCAGGCGTATCCGAATCGTG CGTATAGGCGAACGCTATAGCCTGCTCGACGTCGCCCAAGATC
95255-9443 = 812 (I sliced 812 characters)
Note that I use f-string which requires python 3.6 or newer. If you can't handle that let me know
This code was edited at 11:19 P.M. EST
Reply
#7
In this post, I have allowed both counting newline, and eliminating newline
Note this code can be reused for different sequences.
It could be easily modified to get the sequence from fasta
The results are in ../data/sequence_slicedir/
import os
from pathlib import Path


class SliceGenome:
    def __init__(self):
        # Make sure starting directory is source directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

        self.homepath = Path('.')
        self.datapath = self.homepath / '../data'
        self.datapath.mkdir(exist_ok=True)

        self.sequencedir = self.datapath / 'sequences'
        self.sequencedir.mkdir(exist_ok=True)

        self.sequence_slicedir = self.datapath / 'sequence_slicedir'
        self.sequence_slicedir.mkdir(exist_ok=True)

        self.sequence = None

    def read_genome(self, filename, remove_lf=True):
        '''
        Initialization created a data directory one level above src code directory,
        with a sequences directory under that. Place all sequence files in this directory
        '''
        # want 8 bit ascii encoding, and skip 1st line
        with filename.open(encoding='ascii') as fp:
            self.sequence = ''
            if remove_lf:
                for line in fp:
                    self.sequence = self.sequence + line.strip()
            else:
                self.sequence = fp.read()

    def get_savefilename(self, filename, start_idx, end_idx):
        return self.sequence_slicedir / f"{filename.name.split('.')[0]}{end_idx}-{start_idx}.txt"

    def get_slice(self, fn, start_idx, end_idx, remove_lf=True):
        filename = self.sequencedir / fn
        self.read_genome(filename, remove_lf)

        #account for zero base
        start_idx = start_idx -1
        end_idx = end_idx - 1

        idx1 = start_idx + (end_idx  - start_idx)
        slice = self.sequence[start_idx:idx1]
        savefilename = self.get_savefilename(filename, start_idx, end_idx)
        with savefilename.open('w') as fp:
            fp.write(slice)
        print(f'\nlength: {len(slice)}, slice: \n{slice}')

def main():
    # change id for different sequence, and start and end indexes
    # slice will have name of sequence file, + indexes and found in ../data/sequence_slicedir
    sequence_filename = 'AvinosumDSM180.txt'
    start_idx = 94443
    end_idx = 95255
    sg = SliceGenome()
    # without LF removal
    sg.get_slice(sequence_filename, start_idx, end_idx, remove_lf=False)
    # with lf removal
    sg.get_slice(sequence_filename, start_idx, end_idx, remove_lf=True)


if __name__ == '__main__':
    main()
with newline
Output:
length: 812, slice: CTGATCGCTCTTGGCGCCGACGGCTAATACCCATCCTCGCCACTATCCTGGTAGCCGT CGTTGTGTTCGTCTGGACGCTGAATCTGACCGACTGGCTGATGCGCCCACCGCCAGCCTCCACGGTCGAG TATCTGCCACATACCGACGGCGCCGAGAGGCTGGTTCCGCCACCCGTCAGCGAAGCCCTGAGCCTGGAAC GCTTCCAGGCCGCCCGGCGCGCGTGCGATGGTCCTTGTGTCACCGACTTCGGCACACCGCTGGGTCGCGC CAACGGCGTCGAGGCCCGCTCCAACTGTGCCTCGCTCTGCGTGCGGCTCGAATCGAGCTTTGTCGATCCC GACTCTGGCCGGATCTGGATCGCCCGCTCGGGCGAGCACCCCGAGCCGCTGGAATATTCCGGTCTGGCCT ATCAGTGCGTCGAGTATGCGCGGCGCTGGTGGATCCAGACGCTCGGCCTGACCTTCGGCGACGTGCCGAC GGCCGCCGACATCCTGCGCCTCACCGAGGGCCGGCGTCTGTCCGATCAGGCGGTCATTCCGCTCGGTCGG TCGCTCAACGGACACGCCCGCCGCGCCCCCGAACGCGGTGATCTGGTGATCTATGCCGCCGATCCCAACG ACCCGGAGTGGCGCGCCGGGCATGTCGCCGTTGTGGTCGACACCGATCTCGAACAGGGCTGGGTCGCACT GGCCGAGCAGAACTACGACAACCGTCCCTGGAGCGATCCGGAGTACCATGCCAGGCGTATCCGAATCGTG CGTATAGGCGAACGCTATAGCCTGCTCGACGTCGCCCAAGATC
without newline
Output:
length: 812, slice: TGCGCGGTGATGTAAGGGTGCATACCGCGAGTCCGATCGCCGCCGCGTGGCTGCTCGCCGTCGGTCTGGTGGCCCACGCCGAAGAGCCGCCGACCGTCGCTCTGACGGTTCCGGCGGCCGCGCTGCTCCCTGACGGCGCACTCGGTGAGAGCATCGTCCGTGGTCGGCGCTATCTGTCGGATACGCCGGCTCAGTTGCCCGACTTCGTTGGCAATGGACTGGCCTGCCGACACTGCCATCCCGGCCGAGACGGGGAGGTCGGCACCGAAGCCAATGCGGCCCCCTTCGTCGGCGTCGTCGGACGCTTTCCGCAGTACAGCGCCCGACATGGCCGCCTCATCACGCTCGAACAGCGCATCGGCGATTGTTTCGAGCGCAGTCTCAACGGTCGAGCGCTCGCGCTCGATCACCCCGCCCTGATCGACATGCTGGCCTACATGAGCTGGCTGTCGCAGGGCGTGCCCGTGGGCGCTGTCGTAGCGGGACATGGCATCCCGACGCTGACGCTGGAGCGCGAACCGGATGGGGTGCATGGGGAGGCGCTCTACCAGGCCAGGTGTCTGGCCTGTCATGGAGCCGACGGGAGCGGCACGCTGGACGCCGATGGACGCTATCTTTTCCCGCCTCTGTGGGGGCCGCGTTCGTTCAACACCGGCGCGGGGATGAACCGTCAGGCCACGGCCGCCGGGTTCATCAAGCACAAGATGCCGTTAGGCGCCGATGACTCGCTCAGCGATGAAGAGGCGTGGGACGTGGCCGGTTTCGTGCTCACGCATCCGCGTCCACTGTTCCAGGAGCCGACGGGTGACTGA
This can be easily broken into 70 character lines.

I need to get some sleep so won't see any comments for several hours.
Reply
#8
Thank you Ichabod801 it seems pretty obvious I am a total beginner because I still can't see the /n page breaks, but understanding they have to do with the error is good enough. Thanks for taking the time I will that as a point of reference to understand what went wrong.

Hey there Larz60+ nothing more but admiration. I understand that I don't know much, but you wrapped 68 lines of code for a project you have nothing to do with like it was nothing. Thank you for giving me the code that you used to match the 812 nucleotide output. To be honest, this is pretty much like Chinese to me. Still it deals with the issue I was working to solve an this would help me tremendously in trying to understand python programming. It's going to take me a while to understand it but I have a template to study, hopefully I wont need any more help. I really thank you for taking the time I hope you had good sleep yesterday.
Reply
#9
I know that you said you weren't a programmer, and so I thought I'd crank this out for you.
Note that the second slice in post seven is the correct data from the file position that you
gave in post one. It eliminates the line feeds.

With a bit of modification, you can use it to search for sequences.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  How to remove footer from PDF when extracting to text jh67 3 4,849 Dec-13-2022, 06:52 AM
Last Post: DPaul
  Extracting Specific Lines from text file based on content. jokerfmj 8 2,857 Mar-28-2022, 03:38 PM
Last Post: snippsat
  How to delete portion of file already processed? Mark17 13 2,635 Jan-22-2022, 09:24 AM
Last Post: Pedroski55
  Extracting all text from a video jehoshua 2 2,147 Nov-14-2021, 09:54 PM
Last Post: jehoshua
  Extracting the text between each "i class" knight2000 4 2,275 May-26-2021, 09:55 AM
Last Post: knight2000
  Extracting data based on specific patterns in a text file K11 1 2,179 Aug-28-2020, 09:00 AM
Last Post: Gribouillis
  code not writing to projNameVal portion of code. umkc1 1 1,641 Feb-05-2020, 10:05 PM
Last Post: Larz60+
  Extracting Text Evil_Patrick 6 2,817 Nov-13-2019, 08:51 AM
Last Post: buran
  How to transfer Text from one Word Document to anouther konsular 11 4,326 Oct-09-2019, 07:00 PM
Last Post: buran
  Help Understanding Portion of Code caroline_d_124 3 2,673 Jan-15-2019, 12:12 AM
Last Post: caroline_d_124

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020