Sep-17-2017, 09:01 PM
I have to write a Python program that, given a large (~50 MB) DNA sequence and a smaller one of around 15 characters, returns a list of all 15-character subsequences ordered by how close they are to the given one, along with their positions in the larger sequence.
My current approach is to first get all the subsequences:
```python
def get_subsequences_of_size(size, data):
    sequences = {}
    i = 0
    while i + size <= len(data):
        sequence = data[i:i+size]
        if sequence not in sequences:
            sequences[sequence] = data.count(sequence)
        i += 1
    return sequences
```
and then pack them into a list of dictionaries according to what the problem asked (I forgot to get the position):
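For reference, here is the same sliding-window scan as a self-contained sketch, run on a toy input (the toy string is my own, not from the assignment):

```python
def get_subsequences_of_size(size, data):
    # Slide a window of `size` characters over `data`; for each distinct
    # window, store how often it occurs in the whole string.
    sequences = {}
    i = 0
    while i + size <= len(data):
        sequence = data[i:i+size]
        if sequence not in sequences:
            # str.count rescans all of `data` each time, which is where
            # most of the running time goes on a 50 MB input.
            sequences[sequence] = data.count(sequence)
        i += 1
    return sequences

print(get_subsequences_of_size(3, "ACGTACG"))
# → {'ACG': 2, 'CGT': 1, 'GTA': 1, 'TAC': 1}
```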
```python
def find_similar_sequences(seq, data):
    similar_sequences = {}
    sequences = get_subsequences_of_size(len(seq), data)
    for sequence in sequences.keys():
        diffs, muts = calculate_similarity(seq, sequence)
        if diffs not in similar_sequences:
            similar_sequences[diffs] = [{"Sequence": sequence, "Mutations": muts}]
        else:
            similar_sequences[diffs].append({"Sequence": sequence, "Mutations": muts})
        #similar_sequences[sequence] = {"Similarity": (len(sequence)-diffs), "Differences": diffs, "Mutations": muts}
    return similar_sequences
```
The problem is that this code is VERY SLOW. What kind of approach should I take to speed it up?
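One cost I can see in my own code is that `data.count()` rescans the whole 50 MB string once per distinct window, which makes the scan effectively quadratic. Below is a single-pass variant I am considering, using `collections.Counter` to collect all windows and their counts in one sweep. It assumes a plain Hamming distance as the similarity measure, standing in for `calculate_similarity` (not shown above), so it is a sketch rather than a drop-in replacement:

```python
from collections import Counter

def hamming(a, b):
    # Number of mismatched positions between two equal-length strings;
    # a placeholder for the real calculate_similarity.
    return sum(x != y for x, y in zip(a, b))

def find_similar_sequences_fast(seq, data):
    size = len(seq)
    # One pass over `data` collects every window and its occurrence count:
    # O(len(data) * size) total, instead of an O(len(data)) str.count
    # call per distinct window.
    counts = Counter(data[i:i+size] for i in range(len(data) - size + 1))
    by_distance = {}
    for window, count in counts.items():
        by_distance.setdefault(hamming(seq, window), []).append(
            {"Sequence": window, "Count": count})
    return by_distance
```

On the toy input `find_similar_sequences_fast("ACG", "ACGTACG")`, distance 0 holds the exact match `"ACG"` (count 2) and distance 3 holds the three fully mismatched windows.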