Python Forum
Extracting a portion of a text document
#7
This version handles both cases: newlines can be counted as characters, or stripped out before slicing.
Note that this code can be reused for different sequences.
It could easily be modified to read the sequence from a FASTA file.
The results are written to ../data/sequence_slicedir/
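For FASTA input, the reading step might look something like the sketch below. It assumes a single-record file whose header lines start with '>', and the function name read_fasta_sequence is hypothetical (it is not part of the script that follows).

```python
from pathlib import Path


def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one string.

    Assumption: header lines start with '>' and everything else is sequence.
    """
    parts = []
    with Path(path).open(encoding='ascii') as fp:
        for line in fp:
            line = line.strip()
            # skip header lines and blank lines, keep only sequence text
            if not line or line.startswith('>'):
                continue
            parts.append(line)
    return ''.join(parts)
```

The returned string could then be assigned to self.sequence in place of the plain-text read.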
import os
from pathlib import Path


class SliceGenome:
    def __init__(self):
        # Make sure starting directory is source directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

        self.homepath = Path('.')
        self.datapath = self.homepath / '../data'
        self.datapath.mkdir(exist_ok=True)

        self.sequencedir = self.datapath / 'sequences'
        self.sequencedir.mkdir(exist_ok=True)

        self.sequence_slicedir = self.datapath / 'sequence_slicedir'
        self.sequence_slicedir.mkdir(exist_ok=True)

        self.sequence = None

    def read_genome(self, filename, remove_lf=True):
        '''
        Read filename into self.sequence.

        __init__ creates a data directory one level above the source code directory,
        with a sequences directory under it. Place all sequence files there.
        '''
        # ASCII encoding; every line is read, no header line is skipped
        with filename.open(encoding='ascii') as fp:
            if remove_lf:
                # join the lines, dropping line ends and surrounding whitespace
                self.sequence = ''.join(line.strip() for line in fp)
            else:
                # keep the file as is, so each newline counts as one character
                self.sequence = fp.read()

    def get_savefilename(self, filename, start_idx, end_idx):
        # file name: sequence name, then start and end index in that order
        return self.sequence_slicedir / f"{filename.name.split('.')[0]}_{start_idx}-{end_idx}.txt"

    def get_slice(self, fn, start_idx, end_idx, remove_lf=True):
        filename = self.sequencedir / fn
        self.read_genome(filename, remove_lf)

        # convert the 1-based positions to 0-based string indexes;
        # the slice stops just before end_idx, so its length is end_idx - start_idx
        idx0 = start_idx - 1
        idx1 = end_idx - 1
        seq_slice = self.sequence[idx0:idx1]

        # name the output file after the positions the caller asked for
        savefilename = self.get_savefilename(filename, start_idx, end_idx)
        with savefilename.open('w') as fp:
            fp.write(seq_slice)
        print(f'\nlength: {len(seq_slice)}, slice: \n{seq_slice}')

def main():
    # change the file name and the start and end indexes for a different sequence
    # the slice is named after the sequence file plus the indexes, and is saved in ../data/sequence_slicedir
    sequence_filename = 'AvinosumDSM180.txt'
    start_idx = 94443
    end_idx = 95255
    sg = SliceGenome()
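    # without LF removal (newlines count as characters)
    sg.get_slice(sequence_filename, start_idx, end_idx, remove_lf=False)
    # with LF removal; this saves to the same file name, so it replaces the slice written above
    sg.get_slice(sequence_filename, start_idx, end_idx, remove_lf=True)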


if __name__ == '__main__':
    main()
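The two calls in main() use the same indexes but return different stretches of the genome, because every newline before the slice counts as one position when remove_lf=False. A toy check of that arithmetic, using a made-up 12-base sequence rather than the real file:

```python
# Toy data (hypothetical): 12 bases stored as three 4-character lines
raw = 'ACGT\nTTGA\nCCAG\n'
joined = ''.join(raw.split())      # 'ACGTTTGACCAG'

start_idx, end_idx = 3, 9          # 1-based positions, as passed to get_slice

# Same arithmetic as get_slice: both indexes are reduced by one
with_lf = raw[start_idx - 1:end_idx - 1]
without_lf = joined[start_idx - 1:end_idx - 1]

print(repr(with_lf))       # 'GT\nTTG'
print(repr(without_lf))    # 'GTTTGA'
```

Both slices have end_idx - start_idx = 6 characters, but the first one spends a character on the newline. To include position end_idx itself, the slice would be joined[start_idx - 1:end_idx].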
With newlines kept (remove_lf=False):
Output:
length: 812, slice:
CTGATCGCTCTTGGCGCCGACGGCTAATACCCATCCTCGCCACTATCCTGGTAGCCGT
CGTTGTGTTCGTCTGGACGCTGAATCTGACCGACTGGCTGATGCGCCCACCGCCAGCCTCCACGGTCGAG
TATCTGCCACATACCGACGGCGCCGAGAGGCTGGTTCCGCCACCCGTCAGCGAAGCCCTGAGCCTGGAAC
GCTTCCAGGCCGCCCGGCGCGCGTGCGATGGTCCTTGTGTCACCGACTTCGGCACACCGCTGGGTCGCGC
CAACGGCGTCGAGGCCCGCTCCAACTGTGCCTCGCTCTGCGTGCGGCTCGAATCGAGCTTTGTCGATCCC
GACTCTGGCCGGATCTGGATCGCCCGCTCGGGCGAGCACCCCGAGCCGCTGGAATATTCCGGTCTGGCCT
ATCAGTGCGTCGAGTATGCGCGGCGCTGGTGGATCCAGACGCTCGGCCTGACCTTCGGCGACGTGCCGAC
GGCCGCCGACATCCTGCGCCTCACCGAGGGCCGGCGTCTGTCCGATCAGGCGGTCATTCCGCTCGGTCGG
TCGCTCAACGGACACGCCCGCCGCGCCCCCGAACGCGGTGATCTGGTGATCTATGCCGCCGATCCCAACG
ACCCGGAGTGGCGCGCCGGGCATGTCGCCGTTGTGGTCGACACCGATCTCGAACAGGGCTGGGTCGCACT
GGCCGAGCAGAACTACGACAACCGTCCCTGGAGCGATCCGGAGTACCATGCCAGGCGTATCCGAATCGTG
CGTATAGGCGAACGCTATAGCCTGCTCGACGTCGCCCAAGATC
With newlines removed (remove_lf=True):
Output:
length: 812, slice: TGCGCGGTGATGTAAGGGTGCATACCGCGAGTCCGATCGCCGCCGCGTGGCTGCTCGCCGTCGGTCTGGTGGCCCACGCCGAAGAGCCGCCGACCGTCGCTCTGACGGTTCCGGCGGCCGCGCTGCTCCCTGACGGCGCACTCGGTGAGAGCATCGTCCGTGGTCGGCGCTATCTGTCGGATACGCCGGCTCAGTTGCCCGACTTCGTTGGCAATGGACTGGCCTGCCGACACTGCCATCCCGGCCGAGACGGGGAGGTCGGCACCGAAGCCAATGCGGCCCCCTTCGTCGGCGTCGTCGGACGCTTTCCGCAGTACAGCGCCCGACATGGCCGCCTCATCACGCTCGAACAGCGCATCGGCGATTGTTTCGAGCGCAGTCTCAACGGTCGAGCGCTCGCGCTCGATCACCCCGCCCTGATCGACATGCTGGCCTACATGAGCTGGCTGTCGCAGGGCGTGCCCGTGGGCGCTGTCGTAGCGGGACATGGCATCCCGACGCTGACGCTGGAGCGCGAACCGGATGGGGTGCATGGGGAGGCGCTCTACCAGGCCAGGTGTCTGGCCTGTCATGGAGCCGACGGGAGCGGCACGCTGGACGCCGATGGACGCTATCTTTTCCCGCCTCTGTGGGGGCCGCGTTCGTTCAACACCGGCGCGGGGATGAACCGTCAGGCCACGGCCGCCGGGTTCATCAAGCACAAGATGCCGTTAGGCGCCGATGACTCGCTCAGCGATGAAGAGGCGTGGGACGTGGCCGGTTTCGTGCTCACGCATCCGCGTCCACTGTTCCAGGAGCCGACGGGTGACTGA
The slice without newlines can easily be broken into 70-character lines.
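One way to do that is plain string slicing. This is only a sketch, and the helper name wrap_sequence is made up for illustration:

```python
def wrap_sequence(seq, width=70):
    """Split seq into lines of at most `width` characters."""
    return '\n'.join(seq[i:i + width] for i in range(0, len(seq), width))


# toy example: 160 characters become lines of 70, 70 and 20
wrapped = wrap_sequence('ACGT' * 40)
print(wrapped)
```

In get_slice, the wrapped text could be written to the output file instead of the raw slice.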

I need to get some sleep, so I won't see any comments for several hours.

Messages In This Thread
RE: Extracting a portion of a text document - by Larz60+ - Jan-17-2019, 04:57 AM
