Jan-30-2018, 09:45 PM
Jan-30-2018, 11:58 PM
Here is the final program. It Sorts and Selects in one module.
It is written so that you can run it from the command line using:
It is written so that you can run it from the command line using:
python SelsectAndSortFasta infilename outfilename min_seq_sizeOr, you can import it into another program and call from within similar to:
# Add following import at top of your program import SelectAndSortFasta # In your initialization routine, add: # If a class: self.sasf = SelectAndSortFasta.SelectAndSortFasta() # If just a function: sasf = SelectAndSortFasta.SelectAndSortFasta() # Then when you want to run a file sasf.sort_fasta_file(infile, outfile, minlen)Here's the code:
from operator import itemgetter import sys class SelectAndSortFasta: """ Sort in decreasing size order """ def __init__(self): self.headers = [] self.infile = None self.outfile = None self.minlen = 0 def ExtractHeader(self): """ Reads all headers, saves starting file position, header, and size in self.headers :return: None """ seqlen = 0 this_header = [] firstseq = True f = open(self.infile, 'r') f.seek(0, 2) file_size = f.tell() f.seek(0) while True: fptr = f.tell() if fptr == file_size: break line = None line = f.readline() line = line.strip() # skip empty lines if len(line) > 0: # print(f'line len: {len(line)}') if line.startswith('>'): if firstseq: firstseq = False else: this_header.append(seqlen) self.headers.append(this_header) seqlen = 0 this_header = [] this_header.append(fptr) this_header.append(line) else: seqlen += len(line) this_header.append(seqlen) self.headers.append(this_header) f.close() # This is the sort routine, sorts on column 2 (size) of self.headers, in reverse order self.headers.sort(key=itemgetter(2), reverse=True) self.show_header() def show_header(self): """ display self.headers list :return: None """ for item in self.headers: print(f'File ptr: {item[0]}, Size: {item[2]}, Header: {item[1]}') def write_outfile(self): """ Reads self.headers (which is sorted in reverse size order, seeks to that record in the inout file (from file pointer which i in column 0 of self.headers) and writes the output file :return: """ with open(self.infile) as f, open(self.outfile, 'w') as fo: for item in self.headers: # Ignore entry if too small if item[2] < minlen: continue f.seek(item[0], 0) fo.write(f.readline()) while True: buf = f.readline() if buf.startswith('>'): break fo.write(buf) def sort_fasta_file(self, infile, outfile, minlen): """ Launch pad for sort and select program :param infile: Name of input fasta file :param outfile: Name of output fasta file :param minlen: Minimun size of output record. (smaller inut records are ignored) :return: None """ self.infile = infile self.outfile = outfile self.minlen = minlen self.ExtractHeader() self.write_outfile() if __name__ == '__main__': # Test routine srtf = SelectAndSortFasta() numargs = len(sys.argv) if numargs > 1: infile = sys.argv[1] outfile = sys.argv[2] minlen = sys.argv[3] else: infile = 'data/fasta/AINZ01/AINZ01.1.fsa_nt' outfile = 'data/fasta/AINZ01/AINZ01sorted.1.fsa_nt' minlen = 1000 srtf.sort_fasta_file(infile, outfile, minlen)You may want to supress the printout on line 57 (self.show_header())
Feb-01-2018, 10:26 AM
Thanks !
I get a similar error than before:
In the meantime, can you help me correcting the print syntax so that I can check if the script is doing what I wanted?
Thanks again!
I get a similar error than before:
Error: File "SelectAndSortFasta.py", line 65
print(f'File ptr: {item[0]}, Size: {item[2]}, Header: {item[1]}')
^
SyntaxError: invalid syntax
I am assuming it's again because of my version of python. I'll ask the admin from the cluster I work on if they can install python 3 instead of the 2.6 I am working with at the moment, because it looks like I am going to get lots of these errors.In the meantime, can you help me correcting the print syntax so that I can check if the script is doing what I wanted?
Thanks again!
Feb-01-2018, 01:02 PM
I'm sorry, I've got in the habit of using f-string because I really like it.
It was introduced in python 3.6
change to:
Python 2.6 is so old!
maintenance for 2.7 and below is scheduled to stop in less than 2 years.
F-string won't work with anything less than 3.6, but 3.7 is coming out soon, maybe the admin can wait for that,
It's going to have some fantastic async additions.
It was introduced in python 3.6
change to:
print('File ptr: {}, Size: {}, Header: {}'.format(item[0], item[2], item[1]))or install python 3.6.4
Python 2.6 is so old!
maintenance for 2.7 and below is scheduled to stop in less than 2 years.
F-string won't work with anything less than 3.6, but 3.7 is coming out soon, maybe the admin can wait for that,
It's going to have some fantastic async additions.
Feb-01-2018, 01:16 PM
I contacter the admin, to see if I can get access to a more recent version of python, but I don't know how fast they are to respond.
I fixed this line now, so it's running.
However my output file is created, but is completely empty. I have this on the screen:
I am not sure what goes wrong, as I don't have an error showing.
I fixed this line now, so it's running.
However my output file is created, but is completely empty. I have this on the screen:
Output:$ python SelectAndSortFasta.py 2013_1056H.contigs.fa 2013_1056HTestOutput 1000
File ptr: 0, Size: 550068, Header: >NODE_1_length_550068_cov_488.448
File ptr: 559270, Size: 190085, Header: >NODE_2_length_190085_cov_532.242
File ptr: 752558, Size: 153922, Header: >NODE_3_length_153922_cov_529.57
File ptr: 909079, Size: 114954, Header: >NODE_4_length_114954_cov_494.41
File ptr: 1025982, Size: 106564, Header: >NODE_5_length_106564_cov_477.255
File ptr: 1134357, Size: 103116, Header: >NODE_6_length_103116_cov_541.733
File ptr: 1239226, Size: 102698, Header: >NODE_7_length_102698_cov_564.458
I didn't copy and paste the whole thing, but it goes till the end of my original fasta, all the contigs including those less than 1000bp.I am not sure what goes wrong, as I don't have an error showing.
Feb-01-2018, 01:26 PM
How are you running it?
I just ran it and it worked fine
Try to modify the input and output in the else part
of the code at the bottom:
I just ran it and it worked fine
Try to modify the input and output in the else part
of the code at the bottom:
if __name__ == '__main__': sf = SplitFasta() numargs = len(sys.argv) if numargs > 1: infile = sys.argv[1] outfile = sys.argv[2] minlen = sys.argv[3] else: infile = '2013_1056H.contigs.fa' outfile = '2013_1056HTestOutput.fa' minlen = 1000 sf.split_file(infile, outfile, int(minlen)) sf.show_seq_out()
Feb-01-2018, 01:28 PM
And run like:
python SelectAndSortFasta.pywithout arguments. Perhaps there's an issue with command line arguments
Feb-01-2018, 01:36 PM
I'm going to have to call it quits soon. Getting pretty sleepy.
I'll wait a few minutes to see if you have success.
I'll wait a few minutes to see if you have success.
Feb-01-2018, 01:39 PM
Okay I've done the modifications in the infile and outfile part, and run it without arguments. It does work now.
So I think it is safe to assume the issue comes from the command line arguments.
So I think it is safe to assume the issue comes from the command line arguments.
Feb-01-2018, 01:47 PM
It worked like a charm with new file