Python Forum
Bioinformatics homework
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Bioinformatics homework
#1
I have to write a python program that given a large 50 MB DNA sequence and a smaller one, of around 15 characters, returns a list of all sequences of 15 characters ordered by how close they are to the one given as well as where they are in the larger one.


My current approach is to first get all the subsequences:

def get_subsequences_of_size(size, data):
    sequences = {}
    i = 0
    while(i+size <= len(data)):
        sequence = data[i:i+size]
        if sequence not in sequences:
            sequences[sequence] = data.count(sequence)
        i += 1
    return sequences
and then pack them in a list of dictionaries according to what the problem asked (I forgot to get the position):

def find_similar_sequences(seq, data):
    similar_sequences = {}
    sequences = get_subsequences_of_size(len(seq), data)
    for sequence in sequences.keys():
        diffs, muts = calculate_similarity(seq,sequence)
        if diffs not in similar_sequences:
            similar_sequences[diffs] = [{"Sequence": sequence, "Mutations": muts}]
        else:
            similar_sequences[diffs].append({"Sequence": sequence, "Mutations": muts})
        #similar_sequences[sequence] = {"Similarity": (len(sequence)-diffs), "Differences": diffs, "Mutatations": muts}
    return similar_sequences
The problem is that this code is VERY SLOW. What kind of approach should I take to speed it up?
Reply
#2
Dont you need to explain a little more about what 'close' means? As far as I understand, DNA consists of CGTA or some 4 letters like that. How can the given 15 character string be 'close' to a 15 character sequence in the large DNA file? Maybe some examples (link to data?) would be a good idea to show us what the result should be.

I would think that any approach will need to compare the given 15 characters with each sliding window of 15 characters in the large file, but that is just an uneducated opinion as I am not a maths guy. I think this kind of thing is going to be fairly slow because you have to test each 15-character window whilst moving the window along by one character each time. I am sure someone else probably has a better idea than me how to do this.
Reply
#3
If speed is the issue, then you likely have a lot of data, as there's nothing that's obviously poorly written. 

Do you have some sample data?  If you could share a single file that's runnable, so we can play around with it, that'd be helpful.

That said, you could use a for loop instead of a while loop, but that shouldn't impact performance that much.  And you checking if each sequence has already been seen shouldn't be an issue, since it's a dict and key lookup is fast.  The issue is likely in part of the code you didn't share.

All that said, get_subsequences_of_size() probably just returns a dict with 0 for all the values, as list.count doesn't really do anything for slices...
>>> items = list(range(20))
>>> items
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> look_for = items[4:6]
>>> look_for
[4, 5]
>>> items.count(look_for)
0
Reply
#4
You might find something useful here: https://www.biostars.org/p/16581/
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020