How to link two python scripts - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: How to link two python scripts (/thread-7818.html) |
How to link two python scripts - berthenet - Jan-26-2018 Hi everyone ! This is my first post here. I am originally a biologist but I started doing bioinformatics. I have very basic knowledge in programming, in perl, python and shell. I found online a small python script which prints a list of all contigs in a multi-fasta file, with their length. #!/usr/bin/python from Bio import SeqIO import sys cmdargs = str(sys.argv) for seq_record in SeqIO.parse(str(sys.argv[1]), "fasta"): output_line = '%s\t%i' % \ (seq_record.id, len(seq_record)) print(output_line)I used this script to store the result in a file (that I named TestTailleContigs1.txt for now), and I wrote a script that uses this file and the original multi-fasta file to cut off all the contigs with a length < 1000 and store the result in a new file. #!/usr/bin/python from __future__ import division length_file = open("TestTailleContigs1.txt") contig_file = open("2013_1056H.contigs.fa") output_file = open("2013_1056H.contigs_filtered.fa","w") Contigs_over_1000 = 0 Contigs = 0 FirstContigToTrim = 0 for line in length_file: column = line.split("\t") contigsize = int(column[1]) if contigsize > 1000: Contigs_over_1000 += 1 Contigs += 1 else : Contigs += 1 if FirstContigToTrim == 0: FirstContigToTrim = column[0] print("The first contig to filter is " + str(FirstContigToTrim)) print("The number of contigs in the original file is " + str(Contigs)) print("The number of contigs remaining after filtering is " + str(Contigs_over_1000)) testcontig = 0 for line in contig_file: present = line.count(str(FirstContigToTrim)) testcontig = testcontig + present if testcontig == 0: output_file.write(line) length_file.close contig_file.close output_file.closeMy question is: - How can I combine these two scripts into one ? I actually don't need the intermediate list of lengths, I only need to use it as an input for my second script, and as I will have quite a lot of files to process, I would quite like to avoid the accumulation of intermediate files and having to run two scripts instead of just one. I would really like to understand how it works, more than just having the solution, as I think this is something very basic that I should know how to do. Thanks in advance for your time and help. If there is anything I should explain differently in order for you to help me, please let me know. RE: How to link two python scripts - Larz60+ - Jan-26-2018 I took a look at: https://gist.github.com/mfoll/04d751165416a4466001 which is where I assume you got the first script. He states that the code is very slow, and it's not well formatted to include as an import file. You need it to be a class , or at least a function. so, a couple of questions.
The sequence doesn't have to be something that your working on, so long as the header format is the same. RE: How to link two python scripts - berthenet - Jan-26-2018 Hi Larz60+ Thanks for your help with this! I actually found the first script on this blog: https://bioexpressblog.wordpress.com/2014/04/15/calculate-length-of-all-sequences-in-an-multi-fasta-file/ It doesn't say there that it's slow, so I had no idea. I am working with bacterial DNA though, which are much smaller, so it's fast enough when I run my files with it. The format issue is worth a fix though (I was traying on my own to solve my problem and the format of the first script output was giving me a hard time.
And there is something else I still need to add in my second script (or in the new script if we end up re-write one), which is sorting the multi-fasta file in decreasing size order. The files I am working with at the moment are already ordered, so my second script works, but I am aware that the way I designed my script only works for an ordered multi-fasta file. I hope I gave enough information! Thanks again for answering me! RE: How to link two python scripts - Larz60+ - Jan-26-2018 is there anything here: ftp://ftp.ncbi.nih.gov/genomes/ that I can use as a sample? Not sure that the line terminators in your sample will be the same as an actual fasta file. RE: How to link two python scripts - berthenet - Jan-26-2018 Hi, I don't know why but I couldn't open your link. Anyway, you can download my example file from here: https://we.tl/mFsjFAKaMh I saved it in the same format than for my original fasta files, and I double-cheched the end of line characters to make sure they are the same than in my original file. RE: How to link two python scripts - Larz60+ - Jan-26-2018 sorry, I gave you an ftp url I was looking for something larger, 2 items is not a good test. Nevertheless, take a look at the following and see if it does what you'd like save the file as SplitFasta.py #!/usr/bin/python import sys class SplitFasta: def __init__(self): self.lines_out = [] self.header = None self.save_seq = '' def save_this(self, fo): fo.write('{}\n'.format(self.header)) fo.write('{}\n'.format(self.save_seq)) def split_file(self, in_filename, out_filename, min_seq_len): self.header = None seqlen = 0 self.lines_out = [] with open(in_filename, 'r') as fin, open(out_filename, 'w') as fo: firstline = True for line in fin: line = line.strip() if line.startswith('>'): if firstline: firstline = False self.header = line else: if seqlen > min_seq_len: self.save_this(fo) self.save_seq = line self.lines_out.append(self.header) self.header = line seqlen = 0 else: self.save_seq = '{}{}\n'.format(self.save_seq, line) seqlen += len(line) if self.header is not None: if seqlen > min_seq_len: self.save_this(fo) def show_seq_out(self): print('The following sequences are in the output file:') plist = "\n".join(self.lines_out) print(plist) if __name__ == '__main__': sf = SplitFasta() numargs = len(sys.argv) print(f'numargs: {numargs}') if numargs > 1: infile = sys.argv[1] outfile = sys.argv[2] minlen = sys.argv[3] else: infile = 'Example.fa' outfile = 'NewExample.fa' minlen = 1000 sf.split_file(infile, outfile, int(minlen)) sf.show_seq_out()The bottom part 'if __name__ (etc.)' is for testing only, and woun be used if the module is imported into another program you can import this into any program like: import SplitFastaand then use this way just once at start of script, instantiate the class with: sf = SplitFasta.SplitFasta()Then to call, use: sf.split_file(your_in_filename, your_out_filename, your_min_seq_len)if you want to show what was written, call: sf.show_seq_out()of if you just want the list of output headers: my_header_list = sf.lines_outIf this is what you are looking for, let me know, I'll be going to bed soon, but be back in about 4 - 5 hours. Here's a sample run (from command line): and the output file (NewExample.fa:
RE: How to link two python scripts - Larz60+ - Jan-26-2018 I didn't see thos until now: Quote:And there is something else I still need to add in my second script (or in the new script if we end up re-write one), which is sorting the multi-fasta file in decreasing size order. The files I am working with at the moment are already ordered, so my second script works, but I am aware that the way I designed my script only works for an ordered multi-fasta file. That changes the way this has to be approached. I'll take a look after my nap! RE: How to link two python scripts - berthenet - Jan-26-2018 Hi ! Thank you. That will give me some understanding work to do, as I would really have liked to understand in order to improve my python skills. Before fully understanding your script (as a new learner in python and quite fresh in programming, it takes me a while to understand someone else's scripts), I tried to run it and I get an error message. This is what I have as output: Do you know where it comes from?Thanks for your help! And don't worry, it's not urgent at all. I myself will soon be off for the week-end, and I can't test things from my personal computer, only from my work station. I didn't think I would have an answer before a few days to be honest, so take your time! RE: How to link two python scripts - Larz60+ - Jan-26-2018 That's because f-string only works on python 3.6 and above. change that line to: print('numargs: {}'.format(numargs))I will also need a much larger file to test sort. Here's that web site I gave you again (not the ftp site): https://www.ncbi.nlm.nih.gov/genome/browse/#!/overview/ Please find a file that fits the bill, and send me the URL. Thanks RE: How to link two python scripts - berthenet - Jan-29-2018 Hi ! Thank you for your answer regarding the error message. I will try the line you suggested. Regarding a test file, let's use that one: https://www.ncbi.nlm.nih.gov/Traces/wgs/?display=contigs&page=1 Download the contig fasta (on the left side) The extension is .fsa, where mine are .fa but it's a fasta format, just different extensions. Edit: just realised the link is not forwarding to the page I was on. Here is the ftp of the file to download, if that works for you. ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/AI/NZ/AINZ01/AINZ01.1.fsa_nt.gz |