Help with python code to search string in one file & replace with line in other file

mforthman · Dec-15-2017, 03:19 PM

I have a fairly complicated task that, in short, requires me to take specific strings in the header lines of one DNA sequence fasta file and replace the entire line with corresponding header lines from another file that possesses that string (this other file just has more information that I need). For simplicity, I'll just call the fist referenced file 'file1' and the second one with more info 'file2'.

Example of how file1 looks is below. You will see that the header lines (those that start with '>') in general have varying pieces of information and are differently formatted. The header lines that I would like to target for replacement with headers from another file are indicated in bold. You will notice that there are some lines that are similar to the ones I've bolded but that I'm not targeting, i.e., those that have an '_A_' or '_B_' right before the ending digit (in some cases ending digit and '_rc').

Quote:>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_35_rc
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>Alydus_pilosus_comp17655_c0_seq1_44
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>Boisea_trivittata_comp12490_c0_seq1_0
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA

Example of how file2 looks is below. You will see that the header lines (those that start with '>') have more information, but include very similar information from file1 that can help with matching (see bolded text for areas that match between the two files). You will notice that this file does not include any file1 headers that are similarly formatted (i.e., those that start with 'uce' or 'ENSOFAS' after the '>'. You might also note that multiple file1 headers match to a single file2 header, which is perfectly fine!

Quote:>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
TTCTACACAAACTGCTTTGCACTGAGCACCATTAAAATCATCTGTTGACCTTGCAAGTTCTTCAAAATTTACATCAACGCTAATATTCATTTTCCGAGAATGTATTTGCATAATTCGAGCACGGGCATCTTCATTTGGATGAGGAAATTCAATTTTTCTGTCTAGCCTGCCTGATCGGAGAAGGGCTGGATCTAATATATCAACTCTGTTAGTTGCTGCAATG
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
AGCCTCTTGAATTAAATGCATGAGACGTGCACTTTGCAAACCAAAAGCATTATTGACCAAATGTGGAATGTTTTGTCTAGAACAGAGGCTTGCGATGTGCTCAAGGGAATCACAAGCCCTGGGGGCATAACAGCTAGTTGTTGAAAGAACACATACAATGTTATTAGAACCTAAAGTGTTTATTTGTGTTTCCATTCCATGAATATCAGTTCGGAGTTCGTCTCCACTGGAAATCGGTTCCACTATGATTGGCTGA
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
GTAGATTATTCTCTAACTGTCTATGGGTTTCGGAGACGAGGCTCTGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCTAGAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTGGAGCCTTTTTTCCTAGCACACTGAGTTTTTCTT

Expected output: I want file1 to be modified so that the header lines that match between file1 and file2 are replaced with the corresponding file2 headers. The order of the headers between files are not the same. I also do not want to alter the sequence lines in file1 (i.e., these should be ignored in search and replace). Below is an example of what I would expect file1 to look like after processing, with the modified headers bolded:

Quote:>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA

The python script I have been working on is below. Currently, this will modify the targeted file1 headers, but what it does is delete the '>' (because I line.striped it in order to get the taxon/seq IDs as the key) and the last underscore and anything beyond it. It doesn't replace the file1 header with the corresponding file2 header yet. If the code doesn't make sense, just know that I'm not very knowledgable with Python.

#!/usr/bin/env python

import sys
import re

original_fn = sys.argv[1]
company_fn = sys.argv[2]

pattern = '(uce.+$|ENSOFAS.+$|[AB]_[0-9]+$)'

map = {}

with open(original_fn, "r") as original_fh:
    for line in original_fh:
        if line.startswith('>'):
            try:
                 (k, v) = line.strip().rsplit(':',1)
                 # remove trailing space from key
                 #k = k[:-1]
                 map[k] = v
                 #print k
                 #print v
                 #print map[k]
            except ValueError as err:
                 k = line.strip()
                 map[k] = None

with open(company_fn, "r") as company_fh:
    for line in company_fh:
        if line.startswith('>') and not re.search(pattern, line.strip()):
            try:
                line=line.strip('>')
                (v, k) = line.strip().rsplit('_',1)
                # remove trailing character from key
                #k = k[:-1]
                #print k
                #print v
            except ValueError as err:
                k = line.strip()
            if v not in map:
                sys.stdout.write("%s\n" % (v))
            else:
                sys.stdout.write("%s |%s\n" % (v, map[k]))
        else:
            sys.stdout.write("%s" % (line))

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Matching string from a file	tester_V	5	482	Mar-05-2024, 05:46 AM Last Post: Danishhafeez
	Python openyxl not updating Excel file	MrBean12	1	370	Mar-03-2024, 12:16 AM Last Post: MrBean12
	Python logging RotatingFileHandler writes to random file after the first log rotation	rawatg	0	445	Feb-15-2024, 11:15 AM Last Post: rawatg
	Unable to understand the meaning of the line of code.	jahuja73	0	328	Jan-23-2024, 05:09 AM Last Post: jahuja73
	connect sql by python using txt. file	dawid294	2	476	Jan-12-2024, 08:54 PM Last Post: deanhystad
	Writing a Linear Search algorithm - malformed string representation	Drone4four	10	1,030	Jan-10-2024, 08:39 AM Last Post: gulshan212
	file open "file not found error"	shanoger	8	1,203	Dec-14-2023, 08:03 AM Last Post: shanoger
	python Read each xlsx file and write it into csv with pipe delimiter	mg24	4	1,534	Nov-09-2023, 10:56 AM Last Post: mg24
	Search Excel File with a list of values	huzzug	4	1,291	Nov-03-2023, 05:35 PM Last Post: huzzug
	Replace a text/word in docx file using Python	Devan	4	3,599	Oct-17-2023, 06:03 PM Last Post: Devan

Help with python code to search string in one file & replace with line in other file

User Panel Messages

Announcements