Extracting a portion of a text document

alarcon032002 · (This post was last modified: Jan-16-2019, 08:40 PM by alarcon032002.)

I am working on a small project where I need to extract a nucleotide portion from a bacterial genome. The bacterial genome has 30000000 characters, and I need to extract from nucleotide 94442 to 95255. I have no programming experience, but I am learning. I used the following code to perform the extraction:

>>> first = open(r"C:\Users\cepo\Desktop\Python\AvinosumDSM180.txt","r")
>>> first.seek(94443)
94443
>>> sep = first.read(95255-94443)
>>> print(sep)

This code seemed to work, and I got the following result from it:
CCGACTGCCATGTCCTCGG
GCGTTTGCCCGCGACCCATCTGCTGCTGCATCGCAACGGCGCCGCGCCCTGGTTCATCCTGGTTCCTGAA
ACCGATCTGGCCAACCTCCTGGATCTGCCGGCCGCGCACCGTGATGCCGTCCTAGCCGACTGCACGCGCG
TTTCGGATGCACTGGGCACGCTGGGTTATCCCAAGATCAACGTCGCCTGGATCGGTAATCTGGTGCCACA
GCTCCACATCCATGTCATCGGGCGTCGTCCCGGCGATGCCTGTTGGCCGCGACCGGTGTGGGGGCATCTG
CCGGCAGAGCGGGACTATGCCGAGCACGAAATCACGGCGCTCCGCGCGGCGGTCCTGGATTGAGAGCGCC
GGCTCCATCGTCCACTGACCTGTTCAGACGCAACGGAGGAACCGCGCGTTCTGACCGGCCATCACCCCAG
CTCGCCATCGAGATAGAACCAGCGCCCGTGCTCGCGCACGAAGCGACTGCGCTCCTGGAGGCGCTGGGCA
CGGCCCTGGAGCTTGGAGCGGGCCACGAACGTCACCCAGCCCTCCTGGTCCGTTGCGCCTCCGGCTTCGG
TGCTCAGGATCTTGAGACCGAGCCAGCGAAGTCCCGGCTCCAGGGTCAGCGTGGCCGGACGGGTTGTCGG
ATGCCAGGTGGCGAGCAGATAGTCAGCCTGCCCGGTGGCAAAGGCGCTGTAGCGCGAGCGCATCAGGGCC
TCGGCTGTCGGTGCGATGGTACGGGCGGACAGATGAGGACCGCAGCAGTCGTCGAAAGGGCGGCCGGAGC
CGCAGAGACAG

The problem is that 95255-94443 equals 812, so I should have gotten an
812-character extraction, but I got only 800. I am at a complete loss as
to why Python is discarding 12 characters, which I need in order to find
the protein this DNA sequence encodes.

Please advise.

***ichabod801*** · Jan-16-2019, 08:46 PM

It's not clear from what you've shown, but I'm guessing your genome file has multiple lines. If so, for each new line there would be a special character read ('\n') to indicate the end of the line. Your output does show 13 lines that would use 12 new line characters.
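The counting described above can be checked with a small sketch. It does not use the real genome file: `fake_genome` is a made-up stand-in with 70 bases per line, and `read_chunk` is a hypothetical helper that mimics the `seek` and `read` calls from the first post.

```python
import io

# Made-up stand-in for the genome file: 20 lines of 70 bases each,
# every line ending in a newline character.
fake_genome = "\n".join("ACGT" * 17 + "AC" for _ in range(20)) + "\n"


def read_chunk(text, start, count):
    """Mimic first.seek(start) followed by first.read(count)."""
    handle = io.StringIO(text)
    handle.seek(start)
    return handle.read(count)


chunk = read_chunk(fake_genome, 19, 150)
newlines = chunk.count("\n")
bases = len(chunk) - newlines

# read() returned 150 characters, but some of them are newlines,
# so fewer than 150 bases end up on screen.
print(f"characters: {len(chunk)}, newlines: {newlines}, bases: {bases}")
```

The same effect would explain 800 visible bases plus 12 newline characters adding up to the 812 characters that were requested.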

**Larz60+** · Jan-16-2019, 10:31 PM

where can this sequence be downloaded?

alarcon032002 · Jan-16-2019, 10:48 PM

Thank you ichabod801 and Larz60+ for taking the time to answer me. The genome does not have a \n to indicate the end of the line. The genome can be found here: https://www.ncbi.nlm.nih.gov/nuccore/NC_...port=fasta. What I did was copy only the genomic sequence into Notepad and save it under a name. The text file I used to run the code is here: http://s000.tinyupload.com/?file_id=1106...4885590464
I hope it is an easy fix, and thanks again for helping me.

***ichabod801*** · Jan-17-2019, 03:50 AM

It has new line characters ('\n'). I just copied it to a text file, read the text into Python, and printed out the repr() of the text. It has '\n' in it several times.
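A minimal sketch of the repr() check, using a short made-up string (`sample`) rather than the actual file contents. Printing the text directly hides the newlines, while repr() shows them as a visible backslash followed by n.

```python
# Short made-up sample standing in for text read from the genome file.
sample = "ACGTACGT\nTTGACCGA\n"

# print() renders each newline as a line break, so it is invisible.
print(sample)

# repr() shows the escape sequence itself.
visible = repr(sample)
print(visible)

# Counting them is another quick check.
newline_count = sample.count("\n")
print(f"newline characters: {newline_count}")
```

An editor such as Notepad behaves like print() here: it shows the line break but not the character that causes it.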

**Larz60+** · (This post was last modified: Jan-17-2019, 04:20 AM by Larz60+.)

I wrote the following code, which extracts a slice, but the result doesn't match yours.
Since each line ends with a linefeed, those have to be removed before applying the starting and ending indexes, correct?
I haven't done that here yet; let me know how you'd like to handle that.

import os
from pathlib import Path


class SliceGenome:
    def __init__(self):
        # Make sure starting directory is source directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

        self.homepath = Path('.')
        self.datapath = self.homepath / '../data'
        self.datapath.mkdir(exist_ok=True)

        self.sequencedir = self.datapath / 'sequences'
        self.sequencedir.mkdir(exist_ok=True)

        self.sequence_slicedir = self.datapath / 'sequence_slicedir'
        self.sequence_slicedir.mkdir(exist_ok=True)

        self.sequence = None

    def read_genome(self, filename):
        '''
        Initialization created a data directory one level above src code directory,
        with a sequences directory under that. Place all sequence files in this directory
        '''
        # read the whole file as ASCII text (newline characters are kept)
        with filename.open(encoding='ascii') as fp:
            self.sequence = fp.read()

    def get_savefilename(self, filename, start_idx, end_idx):
        return self.sequence_slicedir / f"{filename.name.split('.')[0]}{end_idx}-{start_idx}.txt"

    def get_slice(self, fn, start_idx, end_idx):
        filename = self.sequencedir / fn
        self.read_genome(filename)

        #account for zero base
        start_idx = start_idx -1
        end_idx = end_idx - 1

        idx1 = start_idx + (end_idx  - start_idx)
        slice = self.sequence[start_idx:idx1]
        savefilename = self.get_savefilename(filename, start_idx, end_idx)
        with savefilename.open('w') as fp:
            fp.write(slice)
        print(f'length: {len(slice)}, slice: \n{slice}')

def main():
    # change id for different sequence, and start and end indexes
    # slice will have name of sequence file, + indexes and found in ../data/sequence_slicedir
    sequence_filename = 'AvinosumDSM180.txt'
    start_idx = 94443
    end_idx = 95255
    sg = SliceGenome()
    sg.get_slice(sequence_filename, start_idx, end_idx)

if __name__ == '__main__':
    main()

Current results (which don't match yours):

Output:
length: 812, slice:
CTGATCGCTCTTGGCGCCGACGGCTAATACCCATCCTCGCCACTATCCTGGTAGCCGT
CGTTGTGTTCGTCTGGACGCTGAATCTGACCGACTGGCTGATGCGCCCACCGCCAGCCTCCACGGTCGAG
TATCTGCCACATACCGACGGCGCCGAGAGGCTGGTTCCGCCACCCGTCAGCGAAGCCCTGAGCCTGGAAC
GCTTCCAGGCCGCCCGGCGCGCGTGCGATGGTCCTTGTGTCACCGACTTCGGCACACCGCTGGGTCGCGC
CAACGGCGTCGAGGCCCGCTCCAACTGTGCCTCGCTCTGCGTGCGGCTCGAATCGAGCTTTGTCGATCCC
GACTCTGGCCGGATCTGGATCGCCCGCTCGGGCGAGCACCCCGAGCCGCTGGAATATTCCGGTCTGGCCT
ATCAGTGCGTCGAGTATGCGCGGCGCTGGTGGATCCAGACGCTCGGCCTGACCTTCGGCGACGTGCCGAC
GGCCGCCGACATCCTGCGCCTCACCGAGGGCCGGCGTCTGTCCGATCAGGCGGTCATTCCGCTCGGTCGG
TCGCTCAACGGACACGCCCGCCGCGCCCCCGAACGCGGTGATCTGGTGATCTATGCCGCCGATCCCAACG
ACCCGGAGTGGCGCGCCGGGCATGTCGCCGTTGTGGTCGACACCGATCTCGAACAGGGCTGGGTCGCACT
GGCCGAGCAGAACTACGACAACCGTCCCTGGAGCGATCCGGAGTACCATGCCAGGCGTATCCGAATCGTG
CGTATAGGCGAACGCTATAGCCTGCTCGACGTCGCCCAAGATC

95255-94443 = 812 (I sliced 812 characters)
Note that I use an f-string, which requires Python 3.6 or newer. If you can't use that, let me know.
This code was edited at 11:19 P.M. EST

**Larz60+** · Jan-17-2019, 04:57 AM

In this post, the code supports both counting the newlines and removing them.
Note that this code can be reused for different sequences.
It could easily be modified to get the sequence from a FASTA file.
The results are written to ../data/sequence_slicedir/

import os
from pathlib import Path


class SliceGenome:
    def __init__(self):
        # Make sure starting directory is source directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

        self.homepath = Path('.')
        self.datapath = self.homepath / '../data'
        self.datapath.mkdir(exist_ok=True)

        self.sequencedir = self.datapath / 'sequences'
        self.sequencedir.mkdir(exist_ok=True)

        self.sequence_slicedir = self.datapath / 'sequence_slicedir'
        self.sequence_slicedir.mkdir(exist_ok=True)

        self.sequence = None

    def read_genome(self, filename, remove_lf=True):
        '''
        Initialization created a data directory one level above src code directory,
        with a sequences directory under that. Place all sequence files in this directory
        '''
        # read the file as ASCII text, optionally dropping the newline at the end of each line
        with filename.open(encoding='ascii') as fp:
            if remove_lf:
                # join the lines without their surrounding whitespace/newlines
                self.sequence = ''.join(line.strip() for line in fp)
            else:
                self.sequence = fp.read()

    def get_savefilename(self, filename, start_idx, end_idx):
        return self.sequence_slicedir / f"{filename.name.split('.')[0]}{end_idx}-{start_idx}.txt"

    def get_slice(self, fn, start_idx, end_idx, remove_lf=True):
        filename = self.sequencedir / fn
        self.read_genome(filename, remove_lf)

        #account for zero base
        start_idx = start_idx -1
        end_idx = end_idx - 1

        idx1 = start_idx + (end_idx  - start_idx)
        slice = self.sequence[start_idx:idx1]
        savefilename = self.get_savefilename(filename, start_idx, end_idx)
        with savefilename.open('w') as fp:
            fp.write(slice)
        print(f'\nlength: {len(slice)}, slice: \n{slice}')

def main():
    # change id for different sequence, and start and end indexes
    # slice will have name of sequence file, + indexes and found in ../data/sequence_slicedir
    sequence_filename = 'AvinosumDSM180.txt'
    start_idx = 94443
    end_idx = 95255
    sg = SliceGenome()
    # without LF removal
    sg.get_slice(sequence_filename, start_idx, end_idx, remove_lf=False)
    # with lf removal
    sg.get_slice(sequence_filename, start_idx, end_idx, remove_lf=True)


if __name__ == '__main__':
    main()

With newlines (remove_lf=False):

Output:
length: 812, slice:
CTGATCGCTCTTGGCGCCGACGGCTAATACCCATCCTCGCCACTATCCTGGTAGCCGT
CGTTGTGTTCGTCTGGACGCTGAATCTGACCGACTGGCTGATGCGCCCACCGCCAGCCTCCACGGTCGAG
TATCTGCCACATACCGACGGCGCCGAGAGGCTGGTTCCGCCACCCGTCAGCGAAGCCCTGAGCCTGGAAC
GCTTCCAGGCCGCCCGGCGCGCGTGCGATGGTCCTTGTGTCACCGACTTCGGCACACCGCTGGGTCGCGC
CAACGGCGTCGAGGCCCGCTCCAACTGTGCCTCGCTCTGCGTGCGGCTCGAATCGAGCTTTGTCGATCCC
GACTCTGGCCGGATCTGGATCGCCCGCTCGGGCGAGCACCCCGAGCCGCTGGAATATTCCGGTCTGGCCT
ATCAGTGCGTCGAGTATGCGCGGCGCTGGTGGATCCAGACGCTCGGCCTGACCTTCGGCGACGTGCCGAC
GGCCGCCGACATCCTGCGCCTCACCGAGGGCCGGCGTCTGTCCGATCAGGCGGTCATTCCGCTCGGTCGG
TCGCTCAACGGACACGCCCGCCGCGCCCCCGAACGCGGTGATCTGGTGATCTATGCCGCCGATCCCAACG
ACCCGGAGTGGCGCGCCGGGCATGTCGCCGTTGTGGTCGACACCGATCTCGAACAGGGCTGGGTCGCACT
GGCCGAGCAGAACTACGACAACCGTCCCTGGAGCGATCCGGAGTACCATGCCAGGCGTATCCGAATCGTG
CGTATAGGCGAACGCTATAGCCTGCTCGACGTCGCCCAAGATC

Without newlines (remove_lf=True):

Output:
length: 812, slice:
TGCGCGGTGATGTAAGGGTGCATACCGCGAGTCCGATCGCCGCCGCGTGGCTGCTCGCCGTCGGTCTGGTGGCCCACGCCGAAGAGCCGCCGACCGTCGCTCTGACGGTTCCGGCGGCCGCGCTGCTCCCTGACGGCGCACTCGGTGAGAGCATCGTCCGTGGTCGGCGCTATCTGTCGGATACGCCGGCTCAGTTGCCCGACTTCGTTGGCAATGGACTGGCCTGCCGACACTGCCATCCCGGCCGAGACGGGGAGGTCGGCACCGAAGCCAATGCGGCCCCCTTCGTCGGCGTCGTCGGACGCTTTCCGCAGTACAGCGCCCGACATGGCCGCCTCATCACGCTCGAACAGCGCATCGGCGATTGTTTCGAGCGCAGTCTCAACGGTCGAGCGCTCGCGCTCGATCACCCCGCCCTGATCGACATGCTGGCCTACATGAGCTGGCTGTCGCAGGGCGTGCCCGTGGGCGCTGTCGTAGCGGGACATGGCATCCCGACGCTGACGCTGGAGCGCGAACCGGATGGGGTGCATGGGGAGGCGCTCTACCAGGCCAGGTGTCTGGCCTGTCATGGAGCCGACGGGAGCGGCACGCTGGACGCCGATGGACGCTATCTTTTCCCGCCTCTGTGGGGGCCGCGTTCGTTCAACACCGGCGCGGGGATGAACCGTCAGGCCACGGCCGCCGGGTTCATCAAGCACAAGATGCCGTTAGGCGCCGATGACTCGCTCAGCGATGAAGAGGCGTGGGACGTGGCCGGTTTCGTGCTCACGCATCCGCGTCCACTGTTCCAGGAGCCGACGGGTGACTGA

This can be easily broken into 70 character lines.
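One possible way to do that line breaking is sketched here. `wrap_sequence` is a hypothetical helper that is not part of the class above, and `example` is a made-up sequence used only for illustration.

```python
def wrap_sequence(sequence, width=70):
    """Split a sequence string into pieces of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Made-up 200-base sequence for illustration.
example = "ACGT" * 50

lines = wrap_sequence(example)

# Each piece is printed on its own line; the last one may be shorter.
print("\n".join(lines))
```

The standard library's `textwrap.wrap` could probably be used as well, but plain slicing avoids any whitespace handling and is enough for a string that contains only bases.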

I need to get some sleep so won't see any comments for several hours.

alarcon032002 · (This post was last modified: Jan-17-2019, 08:40 PM by alarcon032002.)

Thank you Ichabod801. It seems pretty obvious that I am a total beginner, because I still can't see the \n line breaks, but understanding that they have to do with the error is good enough. Thanks for taking the time; I will use that as a point of reference to understand what went wrong.

Hey there Larz60+, nothing but admiration. I understand that I don't know much, but you wrote 68 lines of code for a project you have nothing to do with like it was nothing. Thank you for giving me the code that you used to match the 812 nucleotide output. To be honest, this is pretty much like Chinese to me. Still, it deals with the issue I was working to solve, and this will help me tremendously in trying to understand Python programming. It's going to take me a while to understand it, but I have a template to study, so hopefully I won't need any more help. I really thank you for taking the time, and I hope you had a good sleep yesterday.

**Larz60+** · Jan-17-2019, 10:35 PM

I know that you said you weren't a programmer, and so I thought I'd crank this out for you.
Note that the second slice in post seven is the correct data from the file position that you
gave in post one. It eliminates the line feeds.

With a bit of modification, you can use it to search for sequences.
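As a rough idea of what such a search might look like, here is a sketch that works on a sequence string with the newlines already removed. `find_all` is a hypothetical helper, not a method of the class above, and `genome` is a short made-up string.

```python
def find_all(sequence, motif):
    """Return the 1-based start position of every match, overlaps included."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based
        start = sequence.find(motif, start + 1)
    return positions


# Short made-up sequence for illustration.
genome = "ATGCGATGATGA"

hits = find_all(genome, "ATGA")
print(hits)
```

Positions are reported 1-based here on the assumption that they should line up with nucleotide numbering like that used in the first post.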
