Python Forum
Help with python code to search string in one file & replace with line in other file
#1
I have a fairly complicated task: I need to take specific strings in the header lines of one DNA sequence FASTA file and replace each entire line with the corresponding header line from another file that contains that string (the other file just has more information that I need). For simplicity, I'll call the first file 'file1' and the second one, with more info, 'file2'.

Example of how file1 looks is below. You will see that the header lines (those that start with '>') in general have varying pieces of information and are differently formatted. The header lines that I would like to target for replacement with headers from another file are indicated in bold. You will notice that there are some lines that are similar to the ones I've bolded but that I'm not targeting, i.e., those that have an '_A_' or '_B_' right before the ending digit (in some cases ending digit and '_rc').

Quote:
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_35_rc
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC

>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>Alydus_pilosus_comp17655_c0_seq1_44
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>Boisea_trivittata_comp12490_c0_seq1_0
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT

>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA

Example of how file2 looks is below. You will see that the header lines (those that start with '>') have more information, but include very similar information from file1 that can help with matching (see bolded text for areas that match between the two files). You will notice that this file does not include any of the file1 headers that are already formatted this way (i.e., those that start with 'uce' or 'ENSOFAS' after the '>'). You might also note that multiple file1 headers match a single file2 header, which is perfectly fine!

Quote:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
TTCTACACAAACTGCTTTGCACTGAGCACCATTAAAATCATCTGTTGACCTTGCAAGTTCTTCAAAATTTACATCAACGCTAATATTCATTTTCCGAGAATGTATTTGCATAATTCGAGCACGGGCATCTTCATTTGGATGAGGAAATTCAATTTTTCTGTCTAGCCTGCCTGATCGGAGAAGGGCTGGATCTAATATATCAACTCTGTTAGTTGCTGCAATG
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
AGCCTCTTGAATTAAATGCATGAGACGTGCACTTTGCAAACCAAAAGCATTATTGACCAAATGTGGAATGTTTTGTCTAGAACAGAGGCTTGCGATGTGCTCAAGGGAATCACAAGCCCTGGGGGCATAACAGCTAGTTGTTGAAAGAACACATACAATGTTATTAGAACCTAAAGTGTTTATTTGTGTTTCCATTCCATGAATATCAGTTCGGAGTTCGTCTCCACTGGAAATCGGTTCCACTATGATTGGCTGA
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
GTAGATTATTCTCTAACTGTCTATGGGTTTCGGAGACGAGGCTCTGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCTAGAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTGGAGCCTTTTTTCCTAGCACACTGAGTTTTTCTT

Expected output: I want file1 to be modified so that the header lines that match between file1 and file2 are replaced with the corresponding file2 headers. The order of the headers is not the same between the files. I also do not want to alter the sequence lines in file1 (i.e., these should be ignored in the search and replace). Below is an example of what I would expect file1 to look like after processing, with the modified headers bolded:

Quote:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA

The Python script I have been working on is below. Currently, it modifies the targeted file1 headers, but what it does is delete the '>' (because I called line.strip('>') on it in order to get the taxon/seq IDs as the key) and the last underscore and anything beyond it. It doesn't replace the file1 header with the corresponding file2 header yet. If the code doesn't make sense, just know that I'm not very knowledgeable with Python.

#!/usr/bin/env python

import sys
import re

original_fn = sys.argv[1]
company_fn = sys.argv[2]

pattern = '(uce.+$|ENSOFAS.+$|[AB]_[0-9]+$)'

map = {}

with open(original_fn, "r") as original_fh:
    for line in original_fh:
        if line.startswith('>'):
            try:
                 (k, v) = line.strip().rsplit(':',1)
                 # remove trailing space from key
                 #k = k[:-1]
                 map[k] = v
                 #print k
                 #print v
                 #print map[k]
            except ValueError as err:
                 k = line.strip()
                 map[k] = None

with open(company_fn, "r") as company_fh:
    for line in company_fh:
        if line.startswith('>') and not re.search(pattern, line.strip()):
            try:
                line=line.strip('>')
                (v, k) = line.strip().rsplit('_',1)
                # remove trailing character from key
                #k = k[:-1]
                #print k
                #print v
            except ValueError as err:
                k = line.strip()
            if v not in map:
                sys.stdout.write("%s\n" % (v))
            else:
                sys.stdout.write("%s |%s\n" % (v, map[k]))
        else:
            sys.stdout.write("%s" % (line))
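To see why the script above trims headers but never substitutes them, it can help to run its two string operations on one header from each file. This is only an illustrative sketch: it assumes the first file read (the rsplit(':', 1) loop) is file2, and the names lookup, stem and tail are not part of the original script.

```python
# Example headers copied from file1 and file2 in this post.
file1_header = ">Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc\n"
file2_header = (
    ">OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,"
    "probes-locus:OFAS009268-RA-EXON07,probes-probe:,"
    "probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1\n"
)

# First loop of the script: the dictionary key is everything BEFORE the last ':'.
k, v = file2_header.strip().rsplit(":", 1)
lookup = {k: v}

# Second loop: strip('>') drops the leading '>', rsplit('_', 1) cuts at the last '_'.
stem, tail = file1_header.strip(">").strip().rsplit("_", 1)

print(stem)            # Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0
print(tail)            # rc
print(stem in lookup)  # False
```

Under that assumption the dictionary is keyed by the long header prefix rather than by the source name, and the text taken from file1 still carries the "_0", so the membership test cannot succeed and the script only ever prints the trimmed header.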
Reply
#2
Is this your original post?: https://www.biostars.org/p/277903/
Can you provide a link to the original sequence file so I can play with it?
Reply
#3
(Dec-15-2017, 05:18 PM)Larz60+ Wrote: Is this your original post?: https://www.biostars.org/p/277903/
Can you provide a link to the original sequence file so I can play with it?

Yes, that is the original post, but I never made more progress there. I'm expecting data back soon, and this file needs to be properly formatted so I can process the data we receive.

I can provide a link (here), but I cannot provide the complete sequence files since a good portion of the data is unpublished. What I can provide are the example sequences I have given in the post, plus a few additional ones for both file1 and file2. These files are fully representative of the complete sequence files.
Reply
#4
That's enough. Let me play a while, and I'll be back (someone else may respond before that).
It shouldn't take long (I suspect the regex may need tweaking).
Reply
#5
Questions:
  • The only thing you want to replace are the headers, not the data, correct?
  • Also, the header text does not match exactly between files:
    Output:
    Original: Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1[b]_0_rc[/b]
    Replacement: Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
  • How much is necessary to create a unique match? (I am assuming anything after the '.' is not part of the match.)

Almost there; give me another hour or so.
Reply
#6
OK, check this out and get back to me. I think it's what you are looking for. It replaces everything from the match up to the next '>' record.
It looks for the files in a directory named data, which is a sub-directory of wherever the code is. You may want to change this.
You can run it from the command line with a command that looks like:
python WhateverYouCallIt.py -i File1.txt -b File2.txt -o Fileout.txt > data/results.txt
code:
# Replace header in bodyfile with header in header file, writing output to outputfile Larz60+
#
from pathlib import Path
import argparse

class SwapHeaders:
    def __init__(self, origfile=None, headerfile=None, outfile=None):
        self.home = Path('.')
        self.data = self.home / 'data'
        self.original_file = self.data / origfile
        self.header_file = self.data / headerfile
        self.out_file = self.data / outfile

        with self.header_file.open() as fh:
            self.new_data = fh.readlines()

        self.make_new_file()

    def get_orig_rec(self):
        with self.original_file.open() as forig:
            for line in forig:
                yield line

    def get_match(self, match_this, fo):
        found = False
        for line in self.new_data:
            if line.startswith('>'):
                if found:
                    break
                if match_this in line:
                    found = True
            if found:
                fo.write(line)

    def make_new_file(self):
        with self.out_file.open('w') as fo:
            skip = False
            for line in self.get_orig_rec():
                if line.startswith('>'):
                    if skip:
                        skip = False
                    match = line[1:]
                    x = match.rfind('.')
                    if x:
                        match = match[:x]
                    skip = self.get_match(match, fo)
                if skip:
                    continue
                fo.write(line)


def debug_main():
    SwapHeaders(origfile='File1.txt', headerfile='File2.txt', outfile='Fileout.txt')

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--ifile",
                        dest='original_filename',
                        help="Filename where headers are to be replaced",
                        action="store")

    parser.add_argument("-b", "--bfile",
                        dest='replace_original_filename',
                        help="Filename containing body",
                        action="store")

    parser.add_argument("-o", "--ofile",
                        dest='out_filename',
                        help="Output filename",
                        action="store")

    args = parser.parse_args()
    original_filename = args.original_filename

    replace_original_filename = args.replace_original_filename

    out_filename = args.out_filename

    SwapHeaders(origfile=original_filename, headerfile=replace_original_filename, outfile=out_filename)

if __name__ == '__main__':
    main()
    # debug_main()
partial results:
Output:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
TTCTACACAAACTGCTTTGCACTGAGCACCATTAAAATCATCTGTTGACCTTGCAAGTTCTTCAAAATTTACATCAACGCTAATATTCATTTTCCGAGAATGTATTTGCATAATTCGAGCACGGGCATCTTCATTTGGATGAGGAAATTCAATTTTTCTGTCTAGCCTGCCTGATCGGAGAAGGGCTGGATCTAATATATCAACTCTGTTAGTTGCTGCAATG
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
TTCTACACAAACTGCTTTGCACTGAGCACCATTAAAATCATCTGTTGACCTTGCAAGTTCTTCAAAATTTACATCAACGCTAATATTCATTTTCCGAGAATGTATTTGCATAATTCGAGCACGGGCATCTTCATTTGGATGAGGAAATTCAATTTTTCTGTCTAGCCTGCCTGATCGGAGAAGGGCTGGATCTAATATATCAACTCTGTTAGTTGCTGCAATG
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_35_rc
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
Reply
#7
Curious about your moniker.
Are (were) you a forth programmer?
Reply
#8
(Dec-15-2017, 10:43 PM)Larz60+ Wrote: Questions:
  • The only thing you want to replace are the headers, not the data, correct?
  • Also, the header text does not match exactly between files:
    Output:
    Original: Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1[b]_0_rc[/b]
    Replacement: Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
  • How much is necessary to create a unique match? (I am assuming anything after the '.' is not part of the match.)

Almost there; give me another hour or so.

Yes, I only want to replace the headers, not the data.
The original header not matching between files was due to copying the examples from this site into the files (it retained the bolding HTML code, whoops).
Anything after the '.' is not necessary for a unique match, but it can be used, except for the trailing _[digit]_rc or _[digit].
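The matching rule described here can be sketched with a dictionary lookup. This is a minimal sketch, not the code posted in this thread: the names SUFFIX, SKIP, build_lookup and swap_headers are made up for illustration, it assumes probes-source is the last field of every file2 header (as in the examples), and the sequences in the demo are shortened placeholders.

```python
import re

# Trailing probe offset on file1 headers: "_<digits>", optionally followed by "_rc".
SUFFIX = re.compile(r"_\d+(?:_rc)?$")
# Headers to leave alone: uce/ENSOFAS probes and the "_A_<n>" / "_B_<n>" variants.
SKIP = re.compile(r"^>(?:uce|ENSOFAS)|_[AB]_\d+(?:_rc)?$")


def build_lookup(file2_lines):
    """Map each probes-source value in file2 to its full header line."""
    lookup = {}
    for line in file2_lines:
        text = line.rstrip("\n")
        if text.startswith(">") and "probes-source:" in text:
            source = text.rsplit("probes-source:", 1)[1]
            lookup[source] = text
    return lookup


def swap_headers(file1_lines, lookup):
    """Yield file1 lines with matching headers replaced; sequence lines pass through."""
    for line in file1_lines:
        text = line.rstrip("\n")
        if text.startswith(">") and not SKIP.search(text):
            key = SUFFIX.sub("", text[1:])   # drop '>' and the trailing offset
            text = lookup.get(key, text)     # keep the old header if nothing matches
        yield text + "\n"


# Tiny demo with one header from each file (sequences shortened for illustration).
file2 = [
    ">OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,"
    "probes-locus:OFAS016134-RA-EXON02,probes-probe:,"
    "probes-source:Anasa_tristis_comp3229_c0_seq1\n",
    "AGCC\n",
]
file1 = [
    ">Anasa_tristis_comp3229_c0_seq1_136_rc\n",
    "TCAG\n",
    ">Anasa_tristis_comp8051_c0_seq1_A_0\n",
    "ATCC\n",
]
result = list(swap_headers(file1, build_lookup(file2)))
print("".join(result), end="")
```

Because the lookup is built once, each file1 header costs a single dictionary access instead of a scan through file2, and several file1 headers can map to the same file2 header without special handling.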

You state: Ok Check this out and get back. I think it's what you are looking for. It replaces everything from the match up to the next '>' record.

If I'm reading correctly, that is also replacing the sequence data, not just the header.

The output is close to what I'm wanting, but it seems to miss a few headers that it should be replacing, e.g.:

Quote:
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_35_rc
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC

(Dec-16-2017, 02:39 AM)Larz60+ Wrote: curious about your moniker.
Are (were) you a forth programmer?

No, that's my last name
Reply
#9
This is it:
code:
# Replace header in bodyfile with header in header file, writing output to outputfile Larz60+
#
from pathlib import Path
import argparse

class SwapHeaders:
    def __init__(self, origfile=None, headerfile=None, outfile=None):
        self.home = Path('.')
        self.data = self.home / 'data'
        self.original_file = self.data / origfile
        self.header_file = self.data / headerfile
        self.out_file = self.data / outfile

        with self.header_file.open() as fh:
            self.new_data = fh.readlines()

        self.make_new_file()

    def get_orig_rec(self):
        with self.original_file.open() as forig:
            for line in forig:
                yield line

    def get_match(self, match_this, fo):
        found = False
        for line in self.new_data:
            if line.startswith('>'):
                if match_this in line:
                    found = True
            if found:
                fo.write(line)
                return True
        return False

    def make_new_file(self):
        with self.out_file.open('w') as fo:
            skip = False
            for line in self.get_orig_rec():
                if line.startswith('>'):
                    match = line[1:]
                    x = match.rfind('.')
                    if x:
                        match = match[:x]
                    skip = self.get_match(match, fo)
                if skip:
                    skip = False
                    continue
                fo.write(line)


def debug_main():
    SwapHeaders(origfile='File1.txt', headerfile='File2.txt', outfile='Fileout.txt')

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--ifile",
                        dest='original_filename',
                        help="Filename where headers are to be replaced",
                        action="store")

    parser.add_argument("-b", "--bfile",
                        dest='replace_original_filename',
                        help="Filename containing body",
                        action="store")

    parser.add_argument("-o", "--ofile",
                        dest='out_filename',
                        help="Output filename",
                        action="store")

    args = parser.parse_args()
    original_filename = args.original_filename

    replace_original_filename = args.replace_original_filename

    out_filename = args.out_filename

    SwapHeaders(origfile=original_filename, headerfile=replace_original_filename, outfile=out_filename)

if __name__ == '__main__':
    # main()
    debug_main()
Partial results
Output:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>Alydus_pilosus_comp17655_c0_seq1_44
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>Boisea_trivittata_comp12490_c0_seq1_0
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA
There are items in file 2 that are not in file 1, so replacement can't be made.
I'm done, you can make any required changes.
Reply
#10
There's still a bug. You're right, some are not getting replaced; the key seems to be when there are two in a row on the input side. I will fix that!
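Another possible explanation, judging only from the sample output in this thread: the headers that stay unreplaced are the ones without a '.', and str.rfind returns -1 when nothing is found, which counts as true in an if test. The sketch below reproduces just the trimming step from make_new_file; the helper name trim_like_posted is made up for illustration.

```python
def trim_like_posted(match):
    """Mimic the trimming logic in make_new_file from the posted script."""
    x = match.rfind('.')
    if x:                 # -1 ("not found") is truthy, so this branch still runs
        match = match[:x]
    return match


with_dot = "Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc\n"
without_dot = "Anasa_tristis_comp3229_c0_seq1_136_rc\n"
file2_header = ">OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1\n"

print(trim_like_posted(with_dot))      # Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991
print(trim_like_posted(without_dot))   # Anasa_tristis_comp3229_c0_seq1_136_rc
print(trim_like_posted(without_dot) in file2_header)  # False
```

With no '.', only the trailing newline is cut off, so the "_136_rc" suffix is still part of the search text and the substring test against file2 fails. Comparing against -1 explicitly and stripping the trailing "_[digit]" or "_[digit]_rc" before searching would avoid this.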
Reply

