Python Forum
How to link two python scripts
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
How to link two python scripts
#1
Hi everyone !

This is my first post here. I am originally a biologist but I started doing bioinformatics. I have very basic knowledge in programming, in perl, python and shell.

I found online a small python script which prints a list of all contigs in a multi-fasta file, with their length.

#!/usr/bin/python
from Bio import SeqIO
import sys
cmdargs = str(sys.argv)
for seq_record in SeqIO.parse(str(sys.argv[1]), "fasta"):
 output_line = '%s\t%i' % \
(seq_record.id, len(seq_record))
 print(output_line)
I used this script to store the result in a file (that I named TestTailleContigs1.txt for now), and I wrote a script that uses this file and the original multi-fasta file to cut off all the contigs with a length < 1000 and store the result in a new file.

#!/usr/bin/python
from __future__ import division

length_file = open("TestTailleContigs1.txt")
contig_file = open("2013_1056H.contigs.fa")
output_file = open("2013_1056H.contigs_filtered.fa","w")

Contigs_over_1000 = 0
Contigs = 0
FirstContigToTrim = 0

for line in length_file:
	column = line.split("\t")
	contigsize = int(column[1])
	if contigsize > 1000:
		Contigs_over_1000 += 1
		Contigs += 1
	else :
		Contigs += 1
		if FirstContigToTrim == 0:
			FirstContigToTrim = column[0]

print("The first contig to filter is " + str(FirstContigToTrim))
print("The number of contigs in the original file is " + str(Contigs))
print("The number of contigs remaining after filtering is " + str(Contigs_over_1000))

testcontig = 0
for line in contig_file:
	present = line.count(str(FirstContigToTrim))
	testcontig = testcontig + present
	if testcontig == 0:
		output_file.write(line)


length_file.close
contig_file.close
output_file.close
My question is:
- How can I combine these two scripts into one ? I actually don't need the intermediate list of lengths, I only need to use it as an input for my second script, and as I will have quite a lot of files to process, I would quite like to avoid the accumulation of intermediate files and having to run two scripts instead of just one.

I would really like to understand how it works, more than just having the solution, as I think this is something very basic that I should know how to do.

Thanks in advance for your time and help. If there is anything I should explain differently in order for you to help me, please let me know. Smile
Reply
#2
I took a look at: https://gist.github.com/mfoll/04d751165416a4466001
which is where I assume you got the first script.
He states that the code is very slow, and it's not well formatted to include as an import file.
You need it to be a class , or at least a function.
so, a couple of questions.
  • the fasta format is pretty simple, but the header is free form. I guessing that the Bio.SeqIo.Parse is reading in sequences one by one, and then printing the output.
  • what does the header look like?
  • what do you want the output to look like?
  • could you show a sample of the fasta data?
If you give me this, I can show how to create a script that will be fast, and can be imported into your program very easily.
The sequence doesn't have to be something that your working on, so long as the header format is the same.
Reply
#3
Hi Larz60+

Thanks for your help with this!

I actually found the first script on this blog: https://bioexpressblog.wordpress.com/201...asta-file/

It doesn't say there that it's slow, so I had no idea. I am working with bacterial DNA though, which are much smaller, so it's fast enough when I run my files with it. The format issue is worth a fix though (I was traying on my own to solve my problem and the format of the first script output was giving me a hard time.
  • I don't really know how the Bio.SeqIo.Parse works. I can try and have a look. I stopped at "it does what I wanted it to do"
  • The header I am working with at the moment are like this:
    >NODE_1_length_340169_cov_104.531

    You'll notice that the length is actually mentionned in the header, but I still want to check the length, as I will also work with data differently formatted, such as this:
    >C16BOV0002_c17

    Basically, headers can be very different from one project to another. The only rule is it starts with this ">" and goes to a new line before writing the sequence.

  • Not sure if you mean at the end of the 'length' script, or at the end of my series of scripts. In the end, I want a new multi-fasta with only the large sequences (size of more than 1000 basepairs), and a summary file stating how many contigs are in the original fasta and how many contigs are in the filtered fasta. I might try to execute the script for a list of files, and in this case, only one summary file for the whole list of files will be enough instead of one per file, but I am not there in my journey yet. Was that what you were asking?
  • Here is an example of data, with only two fasta in the multi-fasta, one over the limit and one under the limit (not sure how is the best way for me to transfer them to you, so I put them in a spoiler:


And there is something else I still need to add in my second script (or in the new script if we end up re-write one), which is sorting the multi-fasta file in decreasing size order. The files I am working with at the moment are already ordered, so my second script works, but I am aware that the way I designed my script only works for an ordered multi-fasta file.

I hope I gave enough information!

Thanks again for answering me!
Reply
#4
is there anything here: ftp://ftp.ncbi.nih.gov/genomes/
that I can use as a sample?
Not sure that the line terminators in your sample will be the same as an actual fasta file.
Reply
#5
Hi,

I don't know why but I couldn't open your link.

Anyway, you can download my example file from here: https://we.tl/mFsjFAKaMh
I saved it in the same format than for my original fasta files, and I double-cheched the end of line characters to make sure they are the same than in my original file.
Reply
#6
sorry, I gave you an ftp url

I was looking for something larger, 2 items is not a good test.
Nevertheless, take a look at the following and see if it does what you'd like
save the file as SplitFasta.py
#!/usr/bin/python
import sys


class SplitFasta:
    def __init__(self):
        self.lines_out = []
        self.header = None
        self.save_seq = ''

    def save_this(self, fo):
        fo.write('{}\n'.format(self.header))
        fo.write('{}\n'.format(self.save_seq))

    def split_file(self, in_filename, out_filename, min_seq_len):

        self.header = None
        seqlen = 0
        self.lines_out = []
        with open(in_filename, 'r') as fin, open(out_filename, 'w') as fo:
            firstline = True
            for line in fin:
                line = line.strip()
                if line.startswith('>'):
                    if firstline:
                        firstline = False
                        self.header = line
                    else:
                        if seqlen > min_seq_len:
                            self.save_this(fo)
                        self.save_seq = line
                        self.lines_out.append(self.header)
                        self.header = line
                        seqlen = 0
                else:
                    self.save_seq = '{}{}\n'.format(self.save_seq, line)
                    seqlen += len(line)
            if self.header is not None:
                if seqlen > min_seq_len:
                    self.save_this(fo)

    def show_seq_out(self):
        print('The following sequences are in the output file:')
        plist = "\n".join(self.lines_out)
        print(plist)

if __name__ == '__main__':
    sf = SplitFasta()
    numargs = len(sys.argv)
    print(f'numargs: {numargs}')
    if numargs > 1:
        infile = sys.argv[1]
        outfile = sys.argv[2]
        minlen = sys.argv[3]
    else:
        infile = 'Example.fa'
        outfile = 'NewExample.fa'
        minlen = 1000
    sf.split_file(infile, outfile, int(minlen))
    sf.show_seq_out()
The bottom part 'if __name__ (etc.)' is for testing only, and woun be used if the module is imported into another program

you can import this into any program like:
import SplitFasta
and then use this way
just once at start of script, instantiate the class with:
sf = SplitFasta.SplitFasta()
Then to call, use:
sf.split_file(your_in_filename, your_out_filename, your_min_seq_len)
if you want to show what was written, call:
sf.show_seq_out()
of if you just want the list of output headers:
my_header_list = sf.lines_out
If this is what you are looking for, let me know, I'll be going to bed soon, but be back in about 4 - 5 hours.

Here's a sample run (from command line):
Output:
λ python SplitFasta.py Example.fa NewExample.fa 1000 numargs: 4 The following sequences are in the output file: >NODE_30_length_1090_cov_54656.2
and the output file (NewExample.fa:
Output:
>NODE_30_length_1090_cov_54656.2 TTATGGAATATCTATTAGAGCAAAAAAGAGATTTTACGCAATTAAAATTTAGCGATATAC AGCAAATGAAATCAGCTTATAGCATAAGAATTTATAATATGCTACTTTGTGAATTAAAAC AAAACAGACAAAATCTTAAAATAAATCTTTCAGTATTGCAAAATCTTTTAGAAGTTCCGA AAAATTATGAAGAAAGATGGGCTGATTTTAATCGTTTTGTATTAAAACAAGCAGAAAAAG ATATAAATAGCAAATCTAATTTAGTTTTATTAGATATTAAAACTTATAAAACAGGGCGTA AAATAACAGACTTAGAGTTTATTTTTGATTATAAAAATAACGATAAGCGTATCGCACAGG AAAAACTAAAAGAAGAAAATTTATTTAAAAAACTCAAAGAAATATTAAGTTCTTACATAG GCAAATCAATTTATGATGATAGATTTGGCGAAATGATTATAAGTCATTACGAACATAATG AAGAAAATAAAAAGATTTTAATTATCGCCCAGAGAAAAAGCGATGATAAATTTGTTTGCT TTGGTGTTAAAAACTTCAAAGATATTAAAAGTTTAGAAAAGCTAAAAGATAAAGCAGAAG AGTTGTTTTATTTAGATAAACAAAGAGTTTTAAAAGCAAAAGAAGCTCAAAAATATAGAA ATCTTTTTAATTGATTGTATTTTAAAAATTATAAAAATAAAAGAGATATTAAAAGGCTTG ATTGATAAAAATAATTCTTAAGCTCTAATATCTATGCTTTTTTGTGTAGAATTTAAAGAA AGAATTTTATTAAATTCCCCTGTATTATCATCGCTAAATTTCATACCAAAAAGAATTTCT AGCTCATCGCTTGTGCCAAATTTATTTTCCAGTAGCTTTTTTAAAAGCTCATTCATTTTA TTATCATCTTTATAGGTTTCGCTTTTACTTTCTGCTTGTATAGGTTTAAAAGGCTTTTTT TTGTCTTCTTCTGAAGTTTCTTTGTTATTTGTATTTTTTAAAGGATTGCTATAATCTACA CCTTTTGCCTTTTCTGCTTCTTCTAGTGATTTTACAAACCCATCGTGTCTTTGTTTAAAA TCAAGATATT
Reply
#7
I didn't see thos until now:
Quote:And there is something else I still need to add in my second script (or in the new script if we end up re-write one), which is sorting the multi-fasta file in decreasing size order. The files I am working with at the moment are already ordered, so my second script works, but I am aware that the way I designed my script only works for an ordered multi-fasta file.

That changes the way this has to be approached.
I'll take a look after my nap!
Reply
#8
Hi !

Thank you. That will give me some understanding work to do, as I would really have liked to understand in order to improve my python skills. Before fully understanding your script (as a new learner in python and quite fresh in programming, it takes me a while to understand someone else's scripts), I tried to run it and I get an error message.

This is what I have as output:
Output:
$ python SplitFasta.py Example.fa ExampleBis.fa 1000 File "SplitFasta.py", line 51 print(f'numargs: {numargs}') ^ SyntaxError: invalid syntax
Do you know where it comes from?

Thanks for your help! And don't worry, it's not urgent at all. I myself will soon be off for the week-end, and I can't test things from my personal computer, only from my work station. I didn't think I would have an answer before a few days to be honest, so take your time!
Reply
#9
That's because f-string only works on python 3.6 and above.
change that line to:
print('numargs: {}'.format(numargs))
I will also need a much larger file to test sort.
Here's that web site I gave you again (not the ftp site): https://www.ncbi.nlm.nih.gov/genome/browse/#!/overview/
Please find a file that fits the bill, and send me the URL.

Thanks
Reply
#10
Hi !

Thank you for your answer regarding the error message. I will try the line you suggested.

Regarding a test file, let's use that one:
https://www.ncbi.nlm.nih.gov/Traces/wgs/...igs&page=1
Download the contig fasta (on the left side) The extension is .fsa, where mine are .fa but it's a fasta format, just different extensions.

Edit: just realised the link is not forwarding to the page I was on.
Here is the ftp of the file to download, if that works for you.
ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/A....fsa_nt.gz
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Trying to us python.exe from our network to run scripts cubangt 3 817 Aug-17-2023, 07:53 PM
Last Post: deanhystad
  Link scripts from a different folder Extra 3 1,379 May-11-2022, 08:34 PM
Last Post: snippsat
  How do I link the virtual environment of that project to the 3.9.2 version of python? Bryant11 1 1,332 Feb-26-2022, 11:15 AM
Last Post: Larz60+
  I can't open a link with Selenium in Python jao 0 1,369 Jan-30-2022, 04:21 AM
Last Post: jao
  Parsing link from html tags with Python Melcu54 0 1,575 Jun-14-2021, 09:25 AM
Last Post: Melcu54
  How to link Sublime Text 3 Build system to Python 3.9 Using Windows 10 Fanman001 2 4,532 Mar-04-2021, 03:09 PM
Last Post: martpogs
  Running python scripts from github etc pacmyc 7 3,609 Mar-03-2021, 10:26 PM
Last Post: pacmyc
  How to skip LinkedIn signup link using python script? Mangesh121 0 1,766 Aug-26-2020, 01:22 PM
Last Post: Mangesh121
  Reading SQL scripts from excel file and run it using python saravanatn 2 2,460 Aug-23-2020, 04:49 PM
Last Post: saravanatn
  No Scripts File present after python installation ag2207 5 4,767 Jul-30-2020, 11:11 AM
Last Post: buran

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020