Extracting a portion of a text document

Extracting a portion of a text document - alarcon032002 - Jan-16-2019

I am working on a small project in which I need to extract a stretch of nucleotides from a bacterial genome. The genome has 30000000 characters, and I need to extract nucleotides 94442 to 95255. I have no programming experience, but I am learning. I used the following code to perform the extraction:

>>> first = open(r"C:\Users\cepo\Desktop\Python\AvinosumDSM180.txt","r")
>>> first.seek(94443)
94443
>>> sep = first.read(95255-94443)
>>> print(sep)

This code seemed to work, and I got the following result from it:
CCGACTGCCATGTCCTCGG
GCGTTTGCCCGCGACCCATCTGCTGCTGCATCGCAACGGCGCCGCGCCCTGGTTCATCCTGGTTCCTGAA
ACCGATCTGGCCAACCTCCTGGATCTGCCGGCCGCGCACCGTGATGCCGTCCTAGCCGACTGCACGCGCG
TTTCGGATGCACTGGGCACGCTGGGTTATCCCAAGATCAACGTCGCCTGGATCGGTAATCTGGTGCCACA
GCTCCACATCCATGTCATCGGGCGTCGTCCCGGCGATGCCTGTTGGCCGCGACCGGTGTGGGGGCATCTG
CCGGCAGAGCGGGACTATGCCGAGCACGAAATCACGGCGCTCCGCGCGGCGGTCCTGGATTGAGAGCGCC
GGCTCCATCGTCCACTGACCTGTTCAGACGCAACGGAGGAACCGCGCGTTCTGACCGGCCATCACCCCAG
CTCGCCATCGAGATAGAACCAGCGCCCGTGCTCGCGCACGAAGCGACTGCGCTCCTGGAGGCGCTGGGCA
CGGCCCTGGAGCTTGGAGCGGGCCACGAACGTCACCCAGCCCTCCTGGTCCGTTGCGCCTCCGGCTTCGG
TGCTCAGGATCTTGAGACCGAGCCAGCGAAGTCCCGGCTCCAGGGTCAGCGTGGCCGGACGGGTTGTCGG
ATGCCAGGTGGCGAGCAGATAGTCAGCCTGCCCGGTGGCAAAGGCGCTGTAGCGCGAGCGCATCAGGGCC
TCGGCTGTCGGTGCGATGGTACGGGCGGACAGATGAGGACCGCAGCAGTCGTCGAAAGGGCGGCCGGAGC
CGCAGAGACAG

The problem is that 95255-94443 is equal to 812, so I should have gotten
an 812-character extraction, but I got only 800 characters. I am at a
complete loss as to why Python is discarding 12 characters, which I need
in order to find the protein this DNA sequence encodes.

Please advise.

RE: Extracting a portion of a text document - ichabod801 - Jan-16-2019

It's not clear from what you've shown, but I'm guessing your genome file has multiple lines. If so, each line ends with a special newline character ('\n'), and read() counts it like any other character. Your output shows 13 lines, which would use 12 newline characters, and that matches the 12 characters you are missing.
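To make the counting concrete, here is a minimal sketch. The three 10-character lines are a made-up stand-in for the genome file, not real data; io.StringIO just lets the example run without a file on disk.

```python
import io

# Hypothetical stand-in for the genome file: three 10-character lines.
fake_genome = io.StringIO("AAAAAAAAAA\nCCCCCCCCCC\nGGGGGGGGGG\n")

chunk = fake_genome.read(25)        # ask for 25 characters
newlines = chunk.count("\n")        # line breaks that were counted among them
bases = chunk.replace("\n", "")     # what is left once they are removed

print(len(chunk), newlines, len(bases))   # 25 2 23
```

The read returns the 25 characters asked for, but two of them are line breaks, so only 23 are bases. The same thing would happen with 812 characters spread over 13 lines.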

RE: Extracting a portion of a text document - Larz60+ - Jan-16-2019

Where can this sequence be downloaded?

RE: Extracting a portion of a text document - alarcon032002 - Jan-16-2019

Thank you, ichabod801 and Larz60+, for taking the time to answer me. The genome does not have a \n to indicate the end of the line. The genome can be found here https://www.ncbi.nlm.nih.gov/nuccore/NC_013851.1/?report=fasta. What I did was copy only the genomic sequence into Notepad and give the file a name. The text file I used to run the code is here http://s000.tinyupload.com/?file_id=11067149734885590464
I hope it is not an easy fix, and thanks again for helping me.

RE: Extracting a portion of a text document - ichabod801 - Jan-17-2019

It has new line characters ('\n'). I just copied it to a text file, read the text into Python, and printed out the repr() of the text. It has '\n' in it several times.
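A minimal sketch of that check, for anyone who wants to repeat it. The two-line sample and the temporary file are assumptions made so the example runs on its own; with the real file you would read AvinosumDSM180.txt instead.

```python
import tempfile
from pathlib import Path

# Hypothetical two-line sample standing in for the copied genome text.
sample = "CCGACTGCCATGTCCTCGG\nGCGTTTGCCCGCGACC\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(sample, encoding="ascii")
    text = path.read_text(encoding="ascii")

# print() shows a line break as an actual break; repr() spells it out as \n.
print(repr(text))
print(text.count("\n"))   # number of newline characters in the text
```

In a text editor the newline is invisible (it only shows up as the start of a new line), which is why it is easy to miss; repr() makes it visible.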

RE: Extracting a portion of a text document - Larz60+ - Jan-17-2019

I wrote the following code, which extracts a slice, but the result doesn't match yours.
Since each line ends with a linefeed, the linefeeds have to be removed before the starting and ending index are applied, correct?
I haven't done that here yet; let me know how you'd like to handle that.

import os
from pathlib import Path


class SliceGenome:
    def __init__(self):
        # Make sure starting directory is source directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

        self.homepath = Path('.')
        self.datapath = self.homepath / '../data'
        self.datapath.mkdir(exist_ok=True)

        self.sequencedir = self.datapath / 'sequences'
        self.sequencedir.mkdir(exist_ok=True)

        self.sequence_slicedir = self.datapath / 'sequence_slicedir'
        self.sequence_slicedir.mkdir(exist_ok=True)

        self.sequence = None

    def read_genome(self, filename):
        '''
        Initialization created a data directory one level above src code directory,
        with a sequences directory under that. Place all sequence files in this directory
        '''
        # read the whole file as ASCII text (line feeds included, no lines skipped)
        with filename.open(encoding='ascii') as fp:
            self.sequence = fp.read()

    def get_savefilename(self, filename, start_idx, end_idx):
        return self.sequence_slicedir / f"{filename.name.split('.')[0]}_{start_idx}-{end_idx}.txt"

    def get_slice(self, fn, start_idx, end_idx):
        filename = self.sequencedir / fn
        self.read_genome(filename)

        # convert the 1-based positions to Python's 0-based indexes
        start_idx = start_idx - 1
        end_idx = end_idx - 1

        # the end index is exclusive, so this yields end_idx - start_idx characters
        genome_slice = self.sequence[start_idx:end_idx]
        savefilename = self.get_savefilename(filename, start_idx, end_idx)
        with savefilename.open('w') as fp:
            fp.write(genome_slice)
        print(f'length: {len(genome_slice)}, slice: \n{genome_slice}')

def main():
    # change id for different sequence, and start and end indexes
    # slice will have name of sequence file, + indexes and found in ../data/sequence_slicedir
    sequence_filename = 'AvinosumDSM180.txt'
    start_idx = 94443
    end_idx = 95255
    sg = SliceGenome()
    sg.get_slice(sequence_filename, start_idx, end_idx)

if __name__ == '__main__':
    main()

Current results (which don't match yours):

Output:
length: 812, slice:
CTGATCGCTCTTGGCGCCGACGGCTAATACCCATCCTCGCCACTATCCTGGTAGCCGT
CGTTGTGTTCGTCTGGACGCTGAATCTGACCGACTGGCTGATGCGCCCACCGCCAGCCTCCACGGTCGAG
TATCTGCCACATACCGACGGCGCCGAGAGGCTGGTTCCGCCACCCGTCAGCGAAGCCCTGAGCCTGGAAC
GCTTCCAGGCCGCCCGGCGCGCGTGCGATGGTCCTTGTGTCACCGACTTCGGCACACCGCTGGGTCGCGC
CAACGGCGTCGAGGCCCGCTCCAACTGTGCCTCGCTCTGCGTGCGGCTCGAATCGAGCTTTGTCGATCCC
GACTCTGGCCGGATCTGGATCGCCCGCTCGGGCGAGCACCCCGAGCCGCTGGAATATTCCGGTCTGGCCT
ATCAGTGCGTCGAGTATGCGCGGCGCTGGTGGATCCAGACGCTCGGCCTGACCTTCGGCGACGTGCCGAC
GGCCGCCGACATCCTGCGCCTCACCGAGGGCCGGCGTCTGTCCGATCAGGCGGTCATTCCGCTCGGTCGG
TCGCTCAACGGACACGCCCGCCGCGCCCCCGAACGCGGTGATCTGGTGATCTATGCCGCCGATCCCAACG
ACCCGGAGTGGCGCGCCGGGCATGTCGCCGTTGTGGTCGACACCGATCTCGAACAGGGCTGGGTCGCACT
GGCCGAGCAGAACTACGACAACCGTCCCTGGAGCGATCCGGAGTACCATGCCAGGCGTATCCGAATCGTG
CGTATAGGCGAACGCTATAGCCTGCTCGACGTCGCCCAAGATC

95255-94443 = 812 (I sliced 812 characters)
Note that I use f-strings, which require Python 3.6 or newer. If you can't use that version, let me know.
This code was edited at 11:19 P.M. EST
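A side note on the index arithmetic: if the two positions are meant as 1-based and inclusive on both ends (an assumption; the thread does not say which convention applies), then the range 94443..95255 covers 813 bases rather than 812. A minimal sketch, with a hypothetical extract helper and a made-up toy sequence:

```python
def extract(sequence, start, end):
    """Return bases start..end, assuming both are 1-based and inclusive."""
    # start - 1 converts to a 0-based index; Python's end index is exclusive,
    # so using `end` unchanged keeps the last base in the result.
    return sequence[start - 1:end]


toy = "ACGTACGTAC"            # hypothetical 10-base sequence
print(extract(toy, 3, 6))     # GTAC (bases 3, 4, 5 and 6)
print(95255 - 94443 + 1)      # 813 bases in an inclusive 94443..95255 range
```

If the end position is meant to be exclusive instead, subtracting 1 from both ends, as the class above does, gives the 812 characters shown.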

RE: Extracting a portion of a text document - Larz60+ - Jan-17-2019

In this post, I have allowed for both counting the newlines and eliminating them.
Note that this code can be reused for different sequences.
It could easily be modified to get the sequence from a FASTA file.
The results are in ../data/sequence_slicedir/

import os
from pathlib import Path


class SliceGenome:
    def __init__(self):
        # Make sure starting directory is source directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

        self.homepath = Path('.')
        self.datapath = self.homepath / '../data'
        self.datapath.mkdir(exist_ok=True)

        self.sequencedir = self.datapath / 'sequences'
        self.sequencedir.mkdir(exist_ok=True)

        self.sequence_slicedir = self.datapath / 'sequence_slicedir'
        self.sequence_slicedir.mkdir(exist_ok=True)

        self.sequence = None

    def read_genome(self, filename, remove_lf=True):
        '''
        Initialization created a data directory one level above src code directory,
        with a sequences directory under that. Place all sequence files in this directory
        '''
        # read as ASCII text; optionally drop the line feeds (no lines are skipped)
        with filename.open(encoding='ascii') as fp:
            if remove_lf:
                # join the lines after stripping their trailing newline
                self.sequence = ''.join(line.strip() for line in fp)
            else:
                self.sequence = fp.read()

    def get_savefilename(self, filename, start_idx, end_idx):
        return self.sequence_slicedir / f"{filename.name.split('.')[0]}_{start_idx}-{end_idx}.txt"

    def get_slice(self, fn, start_idx, end_idx, remove_lf=True):
        filename = self.sequencedir / fn
        self.read_genome(filename, remove_lf)

        # convert the 1-based positions to Python's 0-based indexes
        start_idx = start_idx - 1
        end_idx = end_idx - 1

        # the end index is exclusive, so this yields end_idx - start_idx characters
        genome_slice = self.sequence[start_idx:end_idx]
        savefilename = self.get_savefilename(filename, start_idx, end_idx)
        with savefilename.open('w') as fp:
            fp.write(genome_slice)
        print(f'\nlength: {len(genome_slice)}, slice: \n{genome_slice}')

def main():
    # change id for different sequence, and start and end indexes
    # slice will have name of sequence file, + indexes and found in ../data/sequence_slicedir
    sequence_filename = 'AvinosumDSM180.txt'
    start_idx = 94443
    end_idx = 95255
    sg = SliceGenome()
    # without LF removal
    sg.get_slice(sequence_filename, start_idx, end_idx, remove_lf=False)
    # with lf removal
    sg.get_slice(sequence_filename, start_idx, end_idx, remove_lf=True)


if __name__ == '__main__':
    main()

With newlines counted:

Output:
length: 812, slice:
CTGATCGCTCTTGGCGCCGACGGCTAATACCCATCCTCGCCACTATCCTGGTAGCCGT
CGTTGTGTTCGTCTGGACGCTGAATCTGACCGACTGGCTGATGCGCCCACCGCCAGCCTCCACGGTCGAG
TATCTGCCACATACCGACGGCGCCGAGAGGCTGGTTCCGCCACCCGTCAGCGAAGCCCTGAGCCTGGAAC
GCTTCCAGGCCGCCCGGCGCGCGTGCGATGGTCCTTGTGTCACCGACTTCGGCACACCGCTGGGTCGCGC
CAACGGCGTCGAGGCCCGCTCCAACTGTGCCTCGCTCTGCGTGCGGCTCGAATCGAGCTTTGTCGATCCC
GACTCTGGCCGGATCTGGATCGCCCGCTCGGGCGAGCACCCCGAGCCGCTGGAATATTCCGGTCTGGCCT
ATCAGTGCGTCGAGTATGCGCGGCGCTGGTGGATCCAGACGCTCGGCCTGACCTTCGGCGACGTGCCGAC
GGCCGCCGACATCCTGCGCCTCACCGAGGGCCGGCGTCTGTCCGATCAGGCGGTCATTCCGCTCGGTCGG
TCGCTCAACGGACACGCCCGCCGCGCCCCCGAACGCGGTGATCTGGTGATCTATGCCGCCGATCCCAACG
ACCCGGAGTGGCGCGCCGGGCATGTCGCCGTTGTGGTCGACACCGATCTCGAACAGGGCTGGGTCGCACT
GGCCGAGCAGAACTACGACAACCGTCCCTGGAGCGATCCGGAGTACCATGCCAGGCGTATCCGAATCGTG
CGTATAGGCGAACGCTATAGCCTGCTCGACGTCGCCCAAGATC

With newlines removed:

Output:
length: 812, slice:
TGCGCGGTGATGTAAGGGTGCATACCGCGAGTCCGATCGCCGCCGCGTGGCTGCTCGCCGTCGGTCTGGTGGCCCACGCCGAAGAGCCGCCGACCGTCGCTCTGACGGTTCCGGCGGCCGCGCTGCTCCCTGACGGCGCACTCGGTGAGAGCATCGTCCGTGGTCGGCGCTATCTGTCGGATACGCCGGCTCAGTTGCCCGACTTCGTTGGCAATGGACTGGCCTGCCGACACTGCCATCCCGGCCGAGACGGGGAGGTCGGCACCGAAGCCAATGCGGCCCCCTTCGTCGGCGTCGTCGGACGCTTTCCGCAGTACAGCGCCCGACATGGCCGCCTCATCACGCTCGAACAGCGCATCGGCGATTGTTTCGAGCGCAGTCTCAACGGTCGAGCGCTCGCGCTCGATCACCCCGCCCTGATCGACATGCTGGCCTACATGAGCTGGCTGTCGCAGGGCGTGCCCGTGGGCGCTGTCGTAGCGGGACATGGCATCCCGACGCTGACGCTGGAGCGCGAACCGGATGGGGTGCATGGGGAGGCGCTCTACCAGGCCAGGTGTCTGGCCTGTCATGGAGCCGACGGGAGCGGCACGCTGGACGCCGATGGACGCTATCTTTTCCCGCCTCTGTGGGGGCCGCGTTCGTTCAACACCGGCGCGGGGATGAACCGTCAGGCCACGGCCGCCGGGTTCATCAAGCACAAGATGCCGTTAGGCGCCGATGACTCGCTCAGCGATGAAGAGGCGTGGGACGTGGCCGGTTTCGTGCTCACGCATCCGCGTCCACTGTTCCAGGAGCCGACGGGTGACTGA

This can easily be broken into 70-character lines.
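A minimal sketch of that re-wrapping, using plain slicing. The wrap_sequence name and the toy sequence are made up for illustration; the width of 70 matches the line length in the output above.

```python
def wrap_sequence(sequence, width=70):
    """Break a newline-free sequence into lines of at most `width` characters."""
    pieces = (sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(pieces)


toy = "ACGT" * 50                      # hypothetical 200-base sequence
wrapped = wrap_sequence(toy)
print(wrapped)
print([len(line) for line in wrapped.split("\n")])   # [70, 70, 60]
```

Only the last line can be shorter than the width, and removing the newlines again gives back the original sequence.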

I need to get some sleep so won't see any comments for several hours.

RE: Extracting a portion of a text document - alarcon032002 - Jan-17-2019

Thank you, ichabod801. It seems pretty obvious I am a total beginner, because I still can't see the \n line breaks, but understanding that they have to do with the error is good enough. Thanks for taking the time; I will use that as a point of reference to understand what went wrong.

Hey there, Larz60+, nothing but admiration. I understand that I don't know much, but you wrote 68 lines of code for a project you have nothing to do with as if it were nothing. Thank you for giving me the code that you used to match the 812-nucleotide output. To be honest, this is pretty much like Chinese to me. Still, it deals with the issue I was working to solve, and it will help me tremendously in trying to understand Python programming. It's going to take me a while to understand it, but I have a template to study, and hopefully I won't need any more help. I really thank you for taking the time, and I hope you slept well last night.

RE: Extracting a portion of a text document - Larz60+ - Jan-17-2019

I know that you said you weren't a programmer, and so I thought I'd crank this out for you.
Note that the second slice in post seven is the correct data from the file position that you
gave in post one. It eliminates the line feeds.

With a bit of modification, you can use it to search for sequences.
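As a rough idea of what such a search could look like, here is a minimal sketch built on str.find. The find_all helper and the toy sequence are made up for illustration and are not part of the class above. It assumes the line feeds have already been removed; otherwise a match that spans a line break would be missed.

```python
def find_all(sequence, motif):
    """Return the 1-based start position of every match, overlapping ones included."""
    positions = []
    i = sequence.find(motif)
    while i != -1:
        positions.append(i + 1)          # report 1-based positions
        i = sequence.find(motif, i + 1)  # resume one character further on
    return positions


toy = "ATGCATGCATG"                      # hypothetical sequence
print(find_all(toy, "ATG"))              # [1, 5, 9]
```

An empty list means the motif does not occur. The search is case-sensitive, so the motif should use the same case as the sequence (upper case in this file).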