Python Forum
Bioinformatics homework
#1
I have to write a Python program that, given a large DNA sequence (about 50 MB) and a short one of around 15 characters, returns a list of all 15-character subsequences of the large one, ordered by how close they are to the short one, along with where they occur in the large one.


My current approach is to first get all the subsequences:

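def get_subsequences_of_size(size, data):
    # Map each distinct subsequence of length `size` to its occurrence count.
    sequences = {}
    for i in range(len(data) - size + 1):
        sequence = data[i:i+size]
        if sequence not in sequences:
            # str.count rescans all of data and counts non-overlapping matches only
            sequences[sequence] = data.count(sequence)
    return sequences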
and then group them by number of differences, as lists of dictionaries, according to what the problem asked (I forgot to get the position):

def find_similar_sequences(seq, data):
    # Group every distinct subsequence by its number of differences from seq.
    similar_sequences = {}
    sequences = get_subsequences_of_size(len(seq), data)
    for sequence in sequences:
        # calculate_similarity is not shown here; it returns (differences, mutations)
        diffs, muts = calculate_similarity(seq, sequence)
        similar_sequences.setdefault(diffs, []).append({"Sequence": sequence, "Mutations": muts})
        #similar_sequences[sequence] = {"Similarity": (len(sequence)-diffs), "Differences": diffs, "Mutations": muts}
    return similar_sequences
The problem is that this code is VERY SLOW. What kind of approach should I take to speed it up?
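
One direction that might help, as a rough sketch rather than a definitive answer: most of the time probably goes into `data.count(sequence)`, which rescans the whole 50 MB string for every distinct subsequence. Recording each window's start position during a single pass avoids that and also gives the missing positions. The names `count_mismatches`, `index_windows` and `rank_by_similarity` below are made up for illustration, and "closeness" is assumed to mean the number of mismatching positions (Hamming distance), since `calculate_similarity` is not shown.

```python
from collections import defaultdict


def count_mismatches(query, candidate):
    """Return (number of differences, list of (index, query_base, candidate_base)).

    Assumes both strings have the same length (hypothetical stand-in for
    the calculate_similarity helper, which is not shown in the post).
    """
    muts = [(i, q, c) for i, (q, c) in enumerate(zip(query, candidate)) if q != c]
    return len(muts), muts


def index_windows(data, size):
    """Map each distinct window of `size` characters to all its start positions.

    Single pass over `data`; overlapping occurrences are recorded too.
    """
    positions = defaultdict(list)
    for i in range(len(data) - size + 1):
        positions[data[i:i + size]].append(i)
    return positions


def rank_by_similarity(query, data):
    """Return a list of dicts, closest windows first (ties broken alphabetically)."""
    ranked = []
    for window, starts in index_windows(data, len(query)).items():
        diffs, muts = count_mismatches(query, window)
        ranked.append({
            "Sequence": window,
            "Differences": diffs,
            "Mutations": muts,
            "Positions": starts,
        })
    ranked.sort(key=lambda entry: (entry["Differences"], entry["Sequence"]))
    return ranked


if __name__ == "__main__":
    for entry in rank_by_similarity("ACG", "ACGTACGA"):
        print(entry["Differences"], entry["Sequence"], entry["Positions"])
```

The number of occurrences is then just `len(entry["Positions"])`. Note that with 15-character windows over 50 MB the dictionary can hold tens of millions of keys, so memory may become the next bottleneck; keeping only the best few matches while scanning would be one way around that.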


Messages In This Thread
Bioinformatics homework - by licopenus - Sep-17-2017, 09:01 PM
RE: Bioinformatics homework - by ocpaul20 - Sep-19-2017, 07:28 AM
RE: Bioinformatics homework - by nilamo - Sep-27-2017, 09:44 PM
RE: Bioinformatics homework - by Larz60+ - Sep-27-2017, 11:39 PM
