Dec-15-2017, 03:19 PM
I have a fairly complicated task that, in short, requires me to take specific strings in the header lines of one DNA sequence fasta file and replace the entire line with corresponding header lines from another file that possesses that string (this other file just has more information that I need). For simplicity, I'll just call the fist referenced file 'file1' and the second one with more info 'file2'.
Example of how file1 looks is below. You will see that the header lines (those that start with '>') in general have varying pieces of information and are differently formatted. The header lines that I would like to target for replacement with headers from another file are indicated in bold. You will notice that there are some lines that are similar to the ones I've bolded but that I'm not targeting, i.e., those that have an '_A_' or '_B_' right before the ending digit (in some cases ending digit and '_rc').
Example of how file2 looks is below. You will see that the header lines (those that start with '>') have more information, but include very similar information from file1 that can help with matching (see bolded text for areas that match between the two files). You will notice that this file does not include any file1 headers that are similarly formatted (i.e., those that start with 'uce' or 'ENSOFAS' after the '>'. You might also note that multiple file1 headers match to a single file2 header, which is perfectly fine!
Expected output: I want file1 to be modified so that the header lines that match between file1 and file2 are replaced with the corresponding file2 headers. The order of the headers between files are not the same. I also do not want to alter the sequence lines in file1 (i.e., these should be ignored in search and replace). Below is an example of what I would expect file1 to look like after processing, with the modified headers bolded:
The python script I have been working on is below. Currently, this will modify the targeted file1 headers, but what it does is delete the '>' (because I line.striped it in order to get the taxon/seq IDs as the key) and the last underscore and anything beyond it. It doesn't replace the file1 header with the corresponding file2 header yet. If the code doesn't make sense, just know that I'm not very knowledgable with Python.
Example of how file1 looks is below. You will see that the header lines (those that start with '>') in general have varying pieces of information and are differently formatted. The header lines that I would like to target for replacement with headers from another file are indicated in bold. You will notice that there are some lines that are similar to the ones I've bolded but that I'm not targeting, i.e., those that have an '_A_' or '_B_' right before the ending digit (in some cases ending digit and '_rc').
Quote:>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_0_rc
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_35_rc
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>Anasa_tristis_comp3229_c0_seq1_136_rc
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>Alydus_pilosus_comp17655_c0_seq1_44
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>Boisea_trivittata_comp12490_c0_seq1_0
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA
Example of how file2 looks is below. You will see that the header lines (those that start with '>') have more information, but include very similar information from file1 that can help with matching (see bolded text for areas that match between the two files). You will notice that this file does not include any file1 headers that are similarly formatted (i.e., those that start with 'uce' or 'ENSOFAS' after the '>'. You might also note that multiple file1 headers match to a single file2 header, which is perfectly fine!
Quote:>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
TTCTACACAAACTGCTTTGCACTGAGCACCATTAAAATCATCTGTTGACCTTGCAAGTTCTTCAAAATTTACATCAACGCTAATATTCATTTTCCGAGAATGTATTTGCATAATTCGAGCACGGGCATCTTCATTTGGATGAGGAAATTCAATTTTTCTGTCTAGCCTGCCTGATCGGAGAAGGGCTGGATCTAATATATCAACTCTGTTAGTTGCTGCAATG
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
AGCCTCTTGAATTAAATGCATGAGACGTGCACTTTGCAAACCAAAAGCATTATTGACCAAATGTGGAATGTTTTGTCTAGAACAGAGGCTTGCGATGTGCTCAAGGGAATCACAAGCCCTGGGGGCATAACAGCTAGTTGTTGAAAGAACACATACAATGTTATTAGAACCTAAAGTGTTTATTTGTGTTTCCATTCCATGAATATCAGTTCGGAGTTCGTCTCCACTGGAAATCGGTTCCACTATGATTGGCTGA
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
GTAGATTATTCTCTAACTGTCTATGGGTTTCGGAGACGAGGCTCTGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCTAGAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTGGAGCCTTTTTTCCTAGCACACTGAGTTTTTCTT
Expected output: I want file1 to be modified so that the header lines that match between file1 and file2 are replaced with the corresponding file2 headers. The order of the headers between files are not the same. I also do not want to alter the sequence lines in file1 (i.e., these should be ignored in search and replace). Below is an example of what I would expect file1 to look like after processing, with the modified headers bolded:
Quote:>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
GCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATTTTAATGGTGCTCAGTGCAAAGCAGTTTGTGTAGAA
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
AAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAATACATTCTCGGAAAATGAATATTAGCGTTGATGTAAATTTTGAAGAACTTGCAAGGTCAACAGATGATT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
TCAGCCAATCATAGTGGAACCGATTTCCAGTGGAGACGAACTCCGAACTGATATTCATGGAATGGAAACACAAATAAACACTTTAGGTTCTAATAACATTGTATGTGTTCTTTCAACAAC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-3225_p8 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:8,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410195,probes-global-end:410315,probes-local-start:40,probes-local-end:160
TGCTCTCGACCATGCCAACAAGGCTAATGCTGAAGCTCAGAAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGC
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp17655_c0_seq1
TGAATCTTGGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACGATACATATGAACCCTACAAGGTAACTTTTTGCCCTCATTGAGAAGACACAGCAGCATTTGAGCCTT
>OFAS000562-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000562-RA-EXON01,probes-probe:,probes-source:Boisea_trivittata_comp12490_c0_seq1
ATGTTTCGAAGATTATACTTTAACTGTCTATGTGTTTCGGAGACAAGGCTCTGAATATTAGGGTGTTGATCACCGAATGTTAGGATGAGTATTGTTGTAGCGACAATGCATATAAACCCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_A_38
GGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAATACCTAAAGCAGTTAAAGGAACTGTCCAAGCTTTGATG
>ENSOFAS011540_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:1,probes-source:Anoplocnemis_curvipes_contig7292
TGGGTATTTCGAGGGATCACTATCATAAAAGAAGGAAGACTGGAGGGAAAAGGAAACCCATCAGGAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAA
>ENSOFAS011540_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS011540,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig7292
GAAGAAGAGGAAGTATGAGTTAGGTCGGCCAGCAGCTAATACTAAGCTTGGTGTAAAAAGAGTTCATCTTGTCAGGACCAGGGGTGGAAATACAAAGTTTAGAGCTCTTCGATTGGATTA
The python script I have been working on is below. Currently, this will modify the targeted file1 headers, but what it does is delete the '>' (because I line.striped it in order to get the taxon/seq IDs as the key) and the last underscore and anything beyond it. It doesn't replace the file1 header with the corresponding file2 header yet. If the code doesn't make sense, just know that I'm not very knowledgable with Python.
#!/usr/bin/env python import sys import re original_fn = sys.argv[1] company_fn = sys.argv[2] pattern = '(uce.+$|ENSOFAS.+$|[AB]_[0-9]+$)' map = {} with open(original_fn, "r") as original_fh: for line in original_fh: if line.startswith('>'): try: (k, v) = line.strip().rsplit(':',1) # remove trailing space from key #k = k[:-1] map[k] = v #print k #print v #print map[k] except ValueError as err: k = line.strip() map[k] = None with open(company_fn, "r") as company_fh: for line in company_fh: if line.startswith('>') and not re.search(pattern, line.strip()): try: line=line.strip('>') (v, k) = line.strip().rsplit('_',1) # remove trailing character from key #k = k[:-1] #print k #print v except ValueError as err: k = line.strip() if v not in map: sys.stdout.write("%s\n" % (v)) else: sys.stdout.write("%s |%s\n" % (v, map[k])) else: sys.stdout.write("%s" % (line))