Help with python code to search string in one file & replace with line in other file - mforthman - Dec-15-2017
I have a fairly complicated task that, in short, requires me to take specific strings in the header lines of one DNA sequence FASTA file and replace each entire line with the corresponding header line from another file that contains that string (this other file just has more information that I need). For simplicity, I'll call the first file 'file1' and the second one, with more info, 'file2'.
Example of how file1 looks is below. You will see that the header lines (those that start with '>') in general have varying pieces of information and are differently formatted. The header lines that I would like to target for replacement with headers from another file are indicated in bold. You will notice that there are some lines that are similar to the ones I've bolded but that I'm not targeting, i.e., those that have an '_A_' or '_B_' right before the ending digit (in some cases ending digit and '_rc').
Quote:
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_35_rc
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>Alydus_pilosus_comp17655_c0_seq1_44
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>Boisea_trivittata_comp12490_c0_seq1_0
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA
Example of how file2 looks is below. You will see that the header lines (those that start with '>') have more information, but include very similar information from file1 that can help with matching (see bolded text for areas that match between the two files). You will notice that this file does not include any of the file1 headers that already have this format (i.e., those that start with 'uce' or 'ENSOFAS' after the '>'). You might also note that multiple file1 headers match a single file2 header, which is perfectly fine!
Quote:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
TTCTACACAAACTGCTTTGCACTGAGCACCATTAAAATCATCTGTTGACCTTGCAAGTTCTTCAAAATTTACATCAACGCTAATATTCATTTTCCGAGAATGTATTTGCATAATTCGAGCACGGGCATCTTCATTTGGATGAGGAAATTCAATTTTTCTGTCTAGCCTGCCTGATCGGAGAAGGGCTGGATCTAATATATCAACTCTGTTAGTTGCTGCAATG
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
AGCCTCTTGAATTAAATGCATGAGACGTGCACTTTGCAAACCAAAAGCATTATTGACCAAATGTGGAATGTTTTGTCTAGAACAGAGGCTTGCGATGTGCTCAAGGGAATCACAAGCCCTGGGGGCATAACAGCTAGTTGTTGAAAGAACACATACAATGTTATTAGAACCTAAAGTGTTTATTTGTGTTTCCATTCCATGAATATCAGTTCGGAGTTCGTCTCCACTGGAAATCGGTTCCACTATGATTGGCTGA
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
GTAGATTATTCTCTAACTGTCTATGGGTTTCGGAGACGAGGCTCTGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCTAGAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTGGAGCCTTTTTTCCTAGCACACTGAGTTTTTCTT
Expected output: I want file1 to be modified so that the header lines that match between file1 and file2 are replaced with the corresponding file2 headers. The order of the headers is not the same between the files. I also do not want to alter the sequence lines in file1 (i.e., these should be ignored in search and replace). Below is an example of what I would expect file1 to look like after processing, with the modified headers bolded:
Quote:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA
The Python script I have been working on is below. Currently, it modifies the targeted file1 headers, but all it does is delete the '>' (because I stripped it with line.strip('>') in order to get the taxon/seq IDs as the key) and the last underscore and anything beyond it. It doesn't replace the file1 header with the corresponding file2 header yet. If the code doesn't make sense, just know that I'm not very knowledgeable about Python.
#!/usr/bin/env python
import sys
import re

original_fn = sys.argv[1]
company_fn = sys.argv[2]
pattern = '(uce.+$|ENSOFAS.+$|[AB]_[0-9]+$)'
map = {}
with open(original_fn, "r") as original_fh:
    for line in original_fh:
        if line.startswith('>'):
            try:
                (k, v) = line.strip().rsplit(':',1)
                # remove trailing space from key
                #k = k[:-1]
                map[k] = v
                #print k
                #print v
                #print map[k]
            except ValueError as err:
                k = line.strip()
                map[k] = None
with open(company_fn, "r") as company_fh:
    for line in company_fh:
        if line.startswith('>') and not re.search(pattern, line.strip()):
            try:
                line=line.strip('>')
                (v, k) = line.strip().rsplit('_',1)
                # remove trailing character from key
                #k = k[:-1]
                #print k
                #print v
            except ValueError as err:
                k = line.strip()
            if v not in map:
                sys.stdout.write("%s\n" % (v))
            else:
                sys.stdout.write("%s |%s\n" % (v, map[k]))
        else:
            sys.stdout.write("%s" % (line))
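A minimal sketch of the lookup step the script above is still missing, assuming (as in the file2 examples) that 'probes-source:' is the last field of every file2 header. The name 'build_lookup' is hypothetical and not part of the posted script; it maps each source ID to its complete header line, so a file1 header can later be swapped by a plain dictionary lookup.

```python
def build_lookup(file2_lines):
    """Map each probes-source ID to the full file2 header line."""
    lookup = {}
    for line in file2_lines:
        if not line.startswith('>'):
            continue  # sequence line: never used for matching
        header = line.rstrip('\n')
        # Split on the LAST 'probes-source:'; sep is '' when the field is absent.
        _, sep, source = header.rpartition('probes-source:')
        if sep:
            # Assumption: probes-source is the final field, so 'source'
            # is exactly the ID. Keep the first header seen for an ID.
            lookup.setdefault(source.strip(), header)
    return lookup


if __name__ == '__main__':
    sample = [
        ">OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,"
        "probes-locus:OFAS016134-RA-EXON02,probes-probe:,"
        "probes-source:Anasa_tristis_comp3229_c0_seq1\n",
        "AGCCTCTTGAATTAAATGCATGAGACG\n",
    ]
    for source_id, header in build_lookup(sample).items():
        print(source_id, '->', header)
```

Keying on the source ID rather than on the header prefix is the design choice here: it is the only piece of text the two files share.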
RE: Help with python code to search string in one file & replace with line in other file - Larz60+ - Dec-15-2017
Is this your original post?: https://www.biostars.org/p/277903/
Can you provide a link to the original sequence file so I can play with it?
RE: Help with python code to search string in one file & replace with line in other file - mforthman - Dec-15-2017
(Dec-15-2017, 05:18 PM)Larz60+ Wrote: Is this your original post?: https://www.biostars.org/p/277903/
Can you provide a link to the original sequence file so I can play with it?
Yes, that is my original post, but I never made more progress there. I'm expecting data back soon and this file needs to be properly formatted so I can process the data we receive.
I can provide a link (here), but I cannot provide the complete sequence files since a good portion of the data is unpublished. What I can provide are the example sequences I have given in the post and a few additional ones for both file1 and file2. These files are fully representative of the complete sequence files.
RE: Help with python code to search string in one file & replace with line in other file - Larz60+ - Dec-15-2017
That's enough. Let me play a while, and I'll be back (someone else may respond before that).
Shouldn't take long (I suspect the regex may need tweaking)
RE: Help with python code to search string in one file & replace with line in other file - Larz60+ - Dec-15-2017
Questions:
Almost there give me another hour or so.
RE: Help with python code to search string in one file & replace with line in other file - Larz60+ - Dec-16-2017
Ok Check this out and get back. I think it's what you are looking for. It replaces everything from the match up to the next '>' record.
It expects the files to be in a directory named data, which is a sub-directory of the directory you run the code from. You may want to change this.
You can run it from the command line with a command that looks like:
python WhateverYouCallIt.py -i File1.txt -b File2.txt -o Fileout.txt > data/results.txt
code:
# Replace header in bodyfile with header in header file, writing output to outputfile Larz60+
#
from pathlib import Path
import argparse


class SwapHeaders:
    def __init__(self, origfile=None, headerfile=None, outfile=None):
        self.home = Path('.')
        self.data = self.home / 'data'
        self.original_file = self.data / origfile
        self.header_file = self.data / headerfile
        self.out_file = self.data / outfile
        with self.header_file.open() as fh:
            self.new_data = fh.readlines()
        self.make_new_file()

    def get_orig_rec(self):
        with self.original_file.open() as forig:
            for line in forig:
                yield line

    def get_match(self, match_this, fo):
        found = False
        for line in self.new_data:
            if line.startswith('>'):
                if found:
                    break
                if match_this in line:
                    found = True
            if found:
                fo.write(line)

    def make_new_file(self):
        with self.out_file.open('w') as fo:
            skip = False
            for line in self.get_orig_rec():
                if line.startswith('>'):
                    if skip:
                        skip = False
                    match = line[1:]
                    x = match.rfind('.')
                    if x:
                        match = match[:x]
                    skip = self.get_match(match, fo)
                    if skip:
                        continue
                fo.write(line)


def debug_main():
    SwapHeaders(origfile='File1.txt', headerfile='File2.txt', outfile='Fileout.txt')


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--ifile",
                        dest='original_filename',
                        help="Filename where headers are to be replaced",
                        action="store")
    parser.add_argument("-b", "--bfile",
                        dest='replace_original_filename',
                        help="Filename containing body",
                        action="store")
    parser.add_argument("-o", "--ofile",
                        dest='out_filename',
                        help="Output filename",
                        action="store")
    args = parser.parse_args()
    original_filename = args.original_filename
    replace_original_filename = args.replace_original_filename
    out_filename = args.out_filename
    SwapHeaders(origfile=original_filename, headerfile=replace_original_filename, outfile=out_filename)


if __name__ == '__main__':
    main()
    # debug_main()
partial results:
Output:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
TTCTACACAAACTGCTTTGCACTGAGCACCATTAAAATCATCTGTTGACCTTGCAAGTTCTTCAAAATTTACATCAACGCTAATATTCATTTTCCGAGAATGTATTTGCATAATTCGAGCACGGGCATCTTCATTTGGATGAGGAAATTCAATTTTTCTGTCTAGCCTGCCTGATCGGAGAAGGGCTGGATCTAATATATCAACTCTGTTAGTTGCTGCAATG
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
TTCTACACAAACTGCTTTGCACTGAGCACCATTAAAATCATCTGTTGACCTTGCAAGTTCTTCAAAATTTACATCAACGCTAATATTCATTTTCCGAGAATGTATTTGCATAATTCGAGCACGGGCATCTTCATTTGGATGAGGAAATTCAATTTTTCTGTCTAGCCTGCCTGATCGGAGAAGGGCTGGATCTAATATATCAACTCTGTTAGTTGCTGCAATG
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_35_rc
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
RE: Help with python code to search string in one file & replace with line in other file - Larz60+ - Dec-16-2017
curious about your moniker.
Are (were) you a forth programmer?
RE: Help with python code to search string in one file & replace with line in other file - mforthman - Dec-16-2017
(Dec-15-2017, 10:43 PM)Larz60+ Wrote: Questions:
Almost there give me another hour or so.
Yes, I only want to replace the headers, not the data.
The original header not matching between files was due to copying the examples from this site into the files (it retained the bolding HTML code, whoops).
Anything after the . is not necessary for a unique match, but it can be used, except that last _[digit]_rc or _[digit].
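The matching rule described here can be sketched as follows. This is only an illustration under stated assumptions: targeted file1 headers end in '_<digits>' or '_<digits>_rc', the '_A_'/'_B_' variants and the 'uce'/'ENSOFAS' headers are left alone, and 'lookup' is a dictionary from source ID to full file2 header line. The names 'header_key' and 'replace_headers' are hypothetical.

```python
import re

# Trailing probe offset such as '_0', '_35_rc' or '_136_rc'.
SUFFIX = re.compile(r'_[0-9]+(?:_rc)?$')
# Variants that must NOT be replaced: '_A_<digits>' / '_B_<digits>'.
SKIP = re.compile(r'_[AB]_[0-9]+(?:_rc)?$')


def header_key(header):
    """Return the lookup key for a file1 header, or None if it is not targeted."""
    name = header.strip().lstrip('>')
    if name.startswith(('uce', 'ENSOFAS')) or SKIP.search(name):
        return None
    key, n_removed = SUFFIX.subn('', name)
    return key if n_removed else None


def replace_headers(file1_lines, lookup):
    """Swap matching header lines; sequence lines pass through untouched."""
    out = []
    for line in file1_lines:
        key = header_key(line) if line.startswith('>') else None
        if key in lookup:
            out.append(lookup[key] + '\n')
        else:
            out.append(line)
    return out


if __name__ == '__main__':
    demo_lookup = {'Anasa_tristis_comp3229_c0_seq1': '>NEW_HEADER (placeholder)'}
    demo_file1 = ['>Anasa_tristis_comp3229_c0_seq1_136_rc\n', 'TCAGCCAATC\n',
                  '>Anasa_tristis_comp8051_c0_seq1_A_0\n', 'ATCCTCCTGA\n']
    print(''.join(replace_headers(demo_file1, demo_lookup)), end='')
```

Because the key is the whole ID with only the trailing offset removed, an exact dictionary lookup replaces the substring test, which avoids one ID accidentally matching inside a longer one.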
You state: Ok Check this out and get back. I think it's what you are looking for. It replaces everything from the match up to the next '>' record.
If I'm reading correctly, that is also replacing the sequence data, not just the header.
The output is close to what I'm wanting, but it seems to miss a few headers that it should be replacing, e.g.:
Quote:
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_35_rc
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
(Dec-16-2017, 02:39 AM)Larz60+ Wrote: curious about your moniker.
Are (were) you a forth programmer?
No, that's my last name
RE: Help with python code to search string in one file & replace with line in other file - Larz60+ - Dec-16-2017
This is it:
code:
# Replace header in bodyfile with header in header file, writing output to outputfile Larz60+
#
from pathlib import Path
import argparse


class SwapHeaders:
    def __init__(self, origfile=None, headerfile=None, outfile=None):
        self.home = Path('.')
        self.data = self.home / 'data'
        self.original_file = self.data / origfile
        self.header_file = self.data / headerfile
        self.out_file = self.data / outfile
        with self.header_file.open() as fh:
            self.new_data = fh.readlines()
        self.make_new_file()

    def get_orig_rec(self):
        with self.original_file.open() as forig:
            for line in forig:
                yield line

    def get_match(self, match_this, fo):
        found = False
        for line in self.new_data:
            if line.startswith('>'):
                if match_this in line:
                    found = True
            if found:
                fo.write(line)
                return True
        return False

    def make_new_file(self):
        with self.out_file.open('w') as fo:
            skip = False
            for line in self.get_orig_rec():
                if line.startswith('>'):
                    match = line[1:]
                    x = match.rfind('.')
                    if x:
                        match = match[:x]
                    skip = self.get_match(match, fo)
                    if skip:
                        skip = False
                        continue
                fo.write(line)


def debug_main():
    SwapHeaders(origfile='File1.txt', headerfile='File2.txt', outfile='Fileout.txt')


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--ifile",
                        dest='original_filename',
                        help="Filename where headers are to be replaced",
                        action="store")
    parser.add_argument("-b", "--bfile",
                        dest='replace_original_filename',
                        help="Filename containing body",
                        action="store")
    parser.add_argument("-o", "--ofile",
                        dest='out_filename',
                        help="Output filename",
                        action="store")
    args = parser.parse_args()
    original_filename = args.original_filename
    replace_original_filename = args.replace_original_filename
    out_filename = args.out_filename
    SwapHeaders(origfile=original_filename, headerfile=replace_original_filename, outfile=out_filename)


if __name__ == '__main__':
    # main()
    debug_main()
Partial results
Output:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>Alydus_pilosus_comp17655_c0_seq1_44
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>Boisea_trivittata_comp12490_c0_seq1_0
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA
There are items in file 2 that are not in file 1, so replacement can't be made.
I'm done, you can make any required changes.
RE: Help with python code to search string in one file & replace with line in other file - Larz60+ - Dec-16-2017
There's still a bug. You're right, some are not getting replaced; the key seems to be when there are two in a row on the input side. I will fix that!
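One possible contributor to the unreplaced headers, judging only from the code as posted and not confirmed by either poster: 'str.rfind' returns -1 when no '.' is present, and -1 is truthy, so the 'if x:' branch runs anyway and merely drops the trailing newline. The sketch below reproduces that trimming step in a hypothetical helper, 'trim_as_posted', to show what key each kind of header ends up with.

```python
def trim_as_posted(match):
    """Reproduce the key-trimming step from make_new_file in the posted code."""
    x = match.rfind('.')
    if x:  # -1 (no '.' found) is truthy, so this branch runs either way
        match = match[:x]
    return match


# A header containing a '.' is cut at the last '.', which happens to leave
# a string that still occurs inside the file2 header.
with_dot = "Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc\n"
# A header without a '.' only loses its newline; the '_136_rc' suffix stays.
without_dot = "Anasa_tristis_comp3229_c0_seq1_136_rc\n"

file2_source = "probes-source:Anasa_tristis_comp3229_c0_seq1"

print(repr(trim_as_posted(with_dot)))
print(repr(trim_as_posted(without_dot)))
# The untrimmed suffix is why the substring test against file2 fails:
print(trim_as_posted(without_dot) in file2_source)
```

Testing 'x != -1' would make the intent explicit, but the trailing '_<digits>' or '_<digits>_rc' would still need to be removed before headers without a '.' could match.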