Python Forum
Help with python code to search string in one file & replace with line in other file
I have a fairly complicated task that, in short, requires me to take specific strings in the header lines of one DNA sequence FASTA file and replace the entire line with the corresponding header line from another file that contains that string (this other file just has more information that I need). For simplicity, I'll call the first file 'file1' and the second one, which has more info, 'file2'.

Example of how file1 looks is below. You will see that the header lines (those that start with '>') in general have varying pieces of information and are differently formatted. The header lines that I would like to target for replacement with headers from another file are indicated in bold. You will notice that there are some lines that are similar to the ones I've bolded but that I'm not targeting, i.e., those that have an '_A_' or '_B_' right before the ending digit (in some cases ending digit and '_rc').

>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_35_rc
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC

>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>Alydus_pilosus_comp17655_c0_seq1_44
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>Boisea_trivittata_comp12490_c0_seq1_0
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT

>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA

Example of how file2 looks is below. You will see that the header lines (those that start with '>') have more information, but include very similar information from file1 that can help with matching (see bolded text for areas that match between the two files). You will notice that this file does not include any file1 headers that are similarly formatted (i.e., those that start with 'uce' or 'ENSOFAS' after the '>'). You might also note that multiple file1 headers match a single file2 header, which is perfectly fine!

>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
TTCTACACAAACTGCTTTGCACTGAGCACCATTAAAATCATCTGTTGACCTTGCAAGTTCTTCAAAATTTACATCAACGCTAATATTCATTTTCCGAGAATGTATTTGCATAATTCGAGCACGGGCATCTTCATTTGGATGAGGAAATTCAATTTTTCTGTCTAGCCTGCCTGATCGGAGAAGGGCTGGATCTAATATATCAACTCTGTTAGTTGCTGCAATG
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
AGCCTCTTGAATTAAATGCATGAGACGTGCACTTTGCAAACCAAAAGCATTATTGACCAAATGTGGAATGTTTTGTCTAGAACAGAGGCTTGCGATGTGCTCAAGGGAATCACAAGCCCTGGGGGCATAACAGCTAGTTGTTGAAAGAACACATACAATGTTATTAGAACCTAAAGTGTTTATTTGTGTTTCCATTCCATGAATATCAGTTCGGAGTTCGTCTCCACTGGAAATCGGTTCCACTATGATTGGCTGA
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
GTAGATTATTCTCTAACTGTCTATGGGTTTCGGAGACGAGGCTCTGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCTAGAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTGGAGCCTTTTTTCCTAGCACACTGAGTTTTTCTT
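Because several file1 headers can point at one file2 header, a dictionary keyed on the `probes-source` value is one natural lookup structure. A minimal sketch, assuming `probes-source:` is always the last field in a file2 header (as in the example above); `load_source_headers` is a hypothetical name and the sequence line is shortened for illustration.

```python
def load_source_headers(lines):
    """Map each probes-source value to the full header line that carries it."""
    headers = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">") and "probes-source:" in line:
            # Assumption: everything after the final "probes-source:" is the ID.
            source = line.rsplit("probes-source:", 1)[1].strip()
            headers[source] = line
    return headers


file2_lines = [
    ">OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,"
    "probes-locus:OFAS016134-RA-EXON02,probes-probe:,"
    "probes-source:Anasa_tristis_comp3229_c0_seq1\n",
    "AGCCTCTTGAATTAAATGCATGAGACG\n",  # shortened sequence line
]
headers = load_source_headers(file2_lines)
print(sorted(headers))
# -> ['Anasa_tristis_comp3229_c0_seq1']
```

Sequence lines never start with '>', so they are skipped and only header lines end up in the dictionary.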

Expected output: I want file1 to be modified so that the header lines that match between file1 and file2 are replaced with the corresponding file2 headers. The order of the headers is not the same between the two files. I also do not want to alter the sequence lines in file1 (i.e., these should be ignored in search and replace). Below is an example of what I would expect file1 to look like after processing, with the modified headers bolded:

>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA

The Python script I have been working on is below. Currently, it modifies the targeted file1 headers, but all it does is delete the '>' (because I removed it with line.strip('>') in order to get the taxon/seq IDs as the key) and the last underscore and anything after it. It doesn't yet replace the file1 header with the corresponding file2 header. If the code doesn't make sense, just know that I'm not very knowledgeable about Python.

#!/usr/bin/env python

import sys
import re

original_fn = sys.argv[1]
company_fn = sys.argv[2]

pattern = '(uce.+$|ENSOFAS.+$|[AB]_[0-9]+$)'

map = {}

with open(original_fn, "r") as original_fh:
    for line in original_fh:
        if line.startswith('>'):
            try:
                (k, v) = line.strip().rsplit(':',1)
                # remove trailing space from key
                #k = k[:-1]
                map[k] = v
                #print k
                #print v
                #print map[k]
            except ValueError as err:
                k = line.strip()
                map[k] = None

with open(company_fn, "r") as company_fh:
    for line in company_fh:
        if line.startswith('>') and not re.search(pattern, line.strip()):
            try:
                line=line.strip('>')
                (v, k) = line.strip().rsplit('_',1)
                # remove trailing character from key
                #k = k[:-1]
                #print k
                #print v
            except ValueError as err:
                k = line.strip()
            if v not in map:
                sys.stdout.write("%s\n" % (v))
            else:
                sys.stdout.write("%s |%s\n" % (v, map[k]))
        else:
            sys.stdout.write("%s" % (line))
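One possible way to get the expected output is sketched below. It is not a fix of the script above but a separate minimal version, under the assumptions stated in the comments: the header rules are inferred from the examples in this post, file1 is given as the first argument and file2 as the second, and targeted headers with no match in file2 are left unchanged. The script name `replace_headers.py` and all function names are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: swap matching file1 headers for the richer file2 headers."""
import re
import sys

# Assumed rules, inferred from the examples: skip uce/ENSOFAS probes and
# _A_/_B_ variants; targeted headers end in _<digits> with an optional _rc.
SKIP = re.compile(r"^>(uce|ENSOFAS)|_[AB]_[0-9]+(_rc)?$")
SUFFIX = re.compile(r"_[0-9]+(_rc)?$")


def source_id(header):
    """Return the taxon/sequence ID of a targeted file1 header, else None."""
    header = header.strip()
    if not header.startswith(">") or SKIP.search(header):
        return None
    match = SUFFIX.search(header)
    return header[1:match.start()] if match else None


def load_source_headers(lines):
    """Map each file2 probes-source value to its full header line."""
    headers = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">") and "probes-source:" in line:
            # Assumption: probes-source is the last field of a file2 header.
            headers[line.rsplit("probes-source:", 1)[1].strip()] = line
    return headers


def replace_headers(file1_lines, headers):
    """Yield file1 lines, swapping in the file2 header where the ID matches."""
    for line in file1_lines:
        key = source_id(line)  # None for sequence lines and skipped headers
        if key in headers:
            yield headers[key] + "\n"
        else:
            yield line  # sequence lines and unmatched headers pass through


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.stderr.write("usage: replace_headers.py file1 file2 > output\n")
    else:
        with open(sys.argv[2]) as file2_fh:
            lookup = load_source_headers(file2_fh)
        with open(sys.argv[1]) as file1_fh:
            sys.stdout.writelines(replace_headers(file1_fh, lookup))
```

Reading file2 into a dictionary first means the header order in the two files does not matter, and file1 is streamed line by line so its sequence lines are written out untouched.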

Messages In This Thread
Help with python code to search string in one file & replace with line in other file - by mforthman - Dec-15-2017, 03:19 PM
