Jan-26-2018, 11:49 AM
Hi Larz60+
Thanks for your help with this!
I actually found the first script on this blog: https://bioexpressblog.wordpress.com/201...asta-file/
It doesn't say there that it's slow, so I had no idea. I am working with bacterial DNA though, and bacterial genomes are much smaller, so it's fast enough when I run my files with it. The format issue is worth a fix though (I was trying to solve my problem on my own, and the format of the first script's output was giving me a hard time).
- I don't really know how Bio.SeqIO.parse works. I can try and have a look. I stopped at "it does what I wanted it to do".
- The headers I am working with at the moment look like this:
>NODE_1_length_340169_cov_104.531
You'll notice that the length is actually mentioned in the header, but I still want to check the length, as I will also work with differently formatted data, such as this:
>C16BOV0002_c17
Basically, headers can be very different from one project to another. The only rule is that a header starts with ">" and ends at the line break, with the sequence starting on the next line.
- Not sure if you mean at the end of the 'length' script, or at the end of my series of scripts. In the end, I want a new multi-fasta with only the large sequences (longer than 1000 base pairs), and a summary file stating how many contigs are in the original fasta and how many are in the filtered fasta. I might try to run the script on a list of files, in which case one summary file for the whole list will be enough instead of one per file, but I am not there in my journey yet. Was that what you were asking?
- Here is an example of data, with only two sequences in the multi-fasta, one over the limit and one under the limit (I'm not sure what the best way to transfer them to you is, so I put them in a spoiler).
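To make the goal in the points above concrete, here is a minimal sketch of the filtering step in plain Python, written without Biopython so that it runs on its own. The function names (read_fasta, filter_contigs, summary_line), the tab-separated summary layout and the demo records are assumptions made for illustration; they are not taken from either script discussed in this thread. The sketch measures each sequence directly instead of trusting the header, since headers differ from one project to another.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 1000  # threshold from the post: keep sequences longer than 1000 bp


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; a record starts at each line beginning with '>'."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length: int = MIN_LENGTH):
    """Return (kept_records, total_count); length is measured, not read from the header."""
    kept = []
    total = 0
    for header, seq in records:
        total += 1
        if len(seq) > min_length:
            kept.append((header, seq))
    return kept, total


def summary_line(name: str, total: int, kept_count: int) -> str:
    """One tab-separated summary row: file name, contigs before, contigs after."""
    return f"{name}\t{total}\t{kept_count}"


if __name__ == "__main__":
    # Made-up demo data: one sequence over the limit, one under it.
    demo = [">contig_long", "A" * 600, "C" * 600, ">contig_short", "ACGT" * 10]
    kept, total = filter_contigs(read_fasta(demo))
    print(summary_line("demo.fasta", total, len(kept)))
```

For a real file, the same functions could be fed an open file handle, for example read_fasta(open(path)). Running over a list of files would then only require collecting one summary_line per file into a single summary file.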
And there is something else I still need to add in my second script (or in the new script, if we end up rewriting it), which is sorting the multi-fasta file in decreasing size order. The files I am working with at the moment are already ordered, so my second script works, but I am aware that the way I designed my script only works for an ordered multi-fasta file.
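The sorting step could look roughly like the sketch below. It assumes the records are already in memory as (header, sequence) pairs; the names sort_by_length and format_fasta and the 60-character line width are hypothetical choices for illustration, not something taken from the existing scripts.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def sort_by_length(records: List[Record]) -> List[Record]:
    """Return records ordered from the longest to the shortest sequence."""
    # sorted() is stable, so sequences of equal length keep their input order.
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)


def format_fasta(records: List[Record], width: int = 60) -> str:
    """Render records as multi-fasta text, wrapping sequences at `width` characters."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Made-up records in arbitrary order.
    records = [("short", "ACGT"), ("long", "A" * 130), ("mid", "C" * 61)]
    print(format_fasta(sort_by_length(records)), end="")
```

Sorting in memory like this is fine for bacterial assemblies; with an explicit sort, the filtering no longer depends on the input file already being ordered.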
I hope I gave enough information!
Thanks again for answering me!