Posts: 12,030
Threads: 485
Joined: Sep 2016
Actually a few more hours, forgot I had a doctor's preventative maintenance appointment, and just got back.
Posts: 12,030
Threads: 485
Joined: Sep 2016
Jan-30-2018, 11:58 PM
(This post was last modified: Jan-30-2018, 11:59 PM by Larz60+.)
Here is the final program. It Sorts and Selects in one module.
It is written so that you can run it from the command line using:
python SelsectAndSortFasta infilename outfilename min_seq_size Or, you can import it into another program and call from within similar to:
# Add following import at top of your program
import SelectAndSortFasta
# In your initialization routine, add:
# If a class:
self.sasf = SelectAndSortFasta.SelectAndSortFasta()
# If just a function:
sasf = SelectAndSortFasta.SelectAndSortFasta()
# Then when you want to run a file
sasf.sort_fasta_file(infile, outfile, minlen) Here's the code:
from operator import itemgetter
import sys
class SelectAndSortFasta:
"""
Sort in decreasing size order
"""
def __init__(self):
self.headers = []
self.infile = None
self.outfile = None
self.minlen = 0
def ExtractHeader(self):
"""
Reads all headers, saves starting file position, header, and size in self.headers
:return: None
"""
seqlen = 0
this_header = []
firstseq = True
f = open(self.infile, 'r')
f.seek(0, 2)
file_size = f.tell()
f.seek(0)
while True:
fptr = f.tell()
if fptr == file_size:
break
line = None
line = f.readline()
line = line.strip()
# skip empty lines
if len(line) > 0:
# print(f'line len: {len(line)}')
if line.startswith('>'):
if firstseq:
firstseq = False
else:
this_header.append(seqlen)
self.headers.append(this_header)
seqlen = 0
this_header = []
this_header.append(fptr)
this_header.append(line)
else:
seqlen += len(line)
this_header.append(seqlen)
self.headers.append(this_header)
f.close()
# This is the sort routine, sorts on column 2 (size) of self.headers, in reverse order
self.headers.sort(key=itemgetter(2), reverse=True)
self.show_header()
def show_header(self):
"""
display self.headers list
:return: None
"""
for item in self.headers:
print(f'File ptr: {item[0]}, Size: {item[2]}, Header: {item[1]}')
def write_outfile(self):
"""
Reads self.headers (which is sorted in reverse size order,
seeks to that record in the inout file (from file pointer which i in column 0 of self.headers)
and writes the output file
:return:
"""
with open(self.infile) as f, open(self.outfile, 'w') as fo:
for item in self.headers:
# Ignore entry if too small
if item[2] < minlen:
continue
f.seek(item[0], 0)
fo.write(f.readline())
while True:
buf = f.readline()
if buf.startswith('>'):
break
fo.write(buf)
def sort_fasta_file(self, infile, outfile, minlen):
"""
Launch pad for sort and select program
:param infile: Name of input fasta file
:param outfile: Name of output fasta file
:param minlen: Minimun size of output record. (smaller inut records are ignored)
:return: None
"""
self.infile = infile
self.outfile = outfile
self.minlen = minlen
self.ExtractHeader()
self.write_outfile()
if __name__ == '__main__':
# Test routine
srtf = SelectAndSortFasta()
numargs = len(sys.argv)
if numargs > 1:
infile = sys.argv[1]
outfile = sys.argv[2]
minlen = sys.argv[3]
else:
infile = 'data/fasta/AINZ01/AINZ01.1.fsa_nt'
outfile = 'data/fasta/AINZ01/AINZ01sorted.1.fsa_nt'
minlen = 1000
srtf.sort_fasta_file(infile, outfile, minlen) You may want to supress the printout on line 57 (self.show_header())
Posts: 13
Threads: 1
Joined: Jan 2018
Feb-01-2018, 10:26 AM
(This post was last modified: Feb-01-2018, 10:27 AM by berthenet.)
Thanks !
I get a similar error than before:
Error: File "SelectAndSortFasta.py", line 65
print(f'File ptr: {item[0]}, Size: {item[2]}, Header: {item[1]}')
^
SyntaxError: invalid syntax
I am assuming it's again because of my version of python. I'll ask the admin from the cluster I work on if they can install python 3 instead of the 2.6 I am working with at the moment, because it looks like I am going to get lots of these errors.
In the meantime, can you help me correcting the print syntax so that I can check if the script is doing what I wanted?
Thanks again!
Posts: 12,030
Threads: 485
Joined: Sep 2016
Feb-01-2018, 01:02 PM
(This post was last modified: Feb-01-2018, 01:02 PM by Larz60+.)
I'm sorry, I've got in the habit of using f-string because I really like it.
It was introduced in python 3.6
change to:
print('File ptr: {}, Size: {}, Header: {}'.format(item[0], item[2], item[1])) or install python 3.6.4
Python 2.6 is so old!
maintenance for 2.7 and below is scheduled to stop in less than 2 years.
F-string won't work with anything less than 3.6, but 3.7 is coming out soon, maybe the admin can wait for that,
It's going to have some fantastic async additions.
Posts: 13
Threads: 1
Joined: Jan 2018
I contacter the admin, to see if I can get access to a more recent version of python, but I don't know how fast they are to respond.
I fixed this line now, so it's running.
However my output file is created, but is completely empty. I have this on the screen:
Output: $ python SelectAndSortFasta.py 2013_1056H.contigs.fa 2013_1056HTestOutput 1000
File ptr: 0, Size: 550068, Header: >NODE_1_length_550068_cov_488.448
File ptr: 559270, Size: 190085, Header: >NODE_2_length_190085_cov_532.242
File ptr: 752558, Size: 153922, Header: >NODE_3_length_153922_cov_529.57
File ptr: 909079, Size: 114954, Header: >NODE_4_length_114954_cov_494.41
File ptr: 1025982, Size: 106564, Header: >NODE_5_length_106564_cov_477.255
File ptr: 1134357, Size: 103116, Header: >NODE_6_length_103116_cov_541.733
File ptr: 1239226, Size: 102698, Header: >NODE_7_length_102698_cov_564.458
I didn't copy and paste the whole thing, but it goes till the end of my original fasta, all the contigs including those less than 1000bp.
I am not sure what goes wrong, as I don't have an error showing.
Posts: 12,030
Threads: 485
Joined: Sep 2016
Feb-01-2018, 01:26 PM
(This post was last modified: Feb-01-2018, 01:26 PM by Larz60+.)
How are you running it?
I just ran it and it worked fine
Try to modify the input and output in the else part
of the code at the bottom:
if __name__ == '__main__':
sf = SplitFasta()
numargs = len(sys.argv)
if numargs > 1:
infile = sys.argv[1]
outfile = sys.argv[2]
minlen = sys.argv[3]
else:
infile = '2013_1056H.contigs.fa'
outfile = '2013_1056HTestOutput.fa'
minlen = 1000
sf.split_file(infile, outfile, int(minlen))
sf.show_seq_out()
Posts: 12,030
Threads: 485
Joined: Sep 2016
And run like:
python SelectAndSortFasta.py without arguments. Perhaps there's an issue with command line arguments
Posts: 12,030
Threads: 485
Joined: Sep 2016
I'm going to have to call it quits soon. Getting pretty sleepy.
I'll wait a few minutes to see if you have success.
Posts: 13
Threads: 1
Joined: Jan 2018
Okay I've done the modifications in the infile and outfile part, and run it without arguments. It does work now.
So I think it is safe to assume the issue comes from the command line arguments.
Posts: 12,030
Threads: 485
Joined: Sep 2016
It worked like a charm with new file
|