Help understanding Bioinformatics question?

a_real_phoenix · (This post was last modified: Jun-21-2019, 05:35 PM by a_real_phoenix.)

Hi there. I have some Bioinformatics homework to do which involves using Python, and honestly I'm having a hard time wrapping my head around the question. I was hoping someone here might be able to explain things to me :)

You have received a series of files (four files: A, B, C and D) listing the frequencies of occurrence of sequence k-mers for a number of bacterial species. The result for each species is contained in one file, which has no header and has lines like this:

ACTCGTATAGTCGA 347

Where the first part is the kmer and the second part the count.
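A minimal sketch of how lines in that format might be read in Python. The function name `parse_kmer_lines` is made up for illustration, and skipping blank or malformed lines is an assumption, not something the assignment states.

```python
def parse_kmer_lines(lines):
    """Turn lines such as 'ACTCGTATAGTCGA 347' into a {kmer: count} dict."""
    counts = {}
    for line in lines:
        fields = line.split()  # splits on any whitespace, drops the newline
        if len(fields) != 2:
            continue  # assumption: ignore blank or malformed lines
        kmer, count = fields
        counts[kmer] = int(count)
    return counts


print(parse_kmer_lines(["ACTCGTATAGTCGA 347\n"]))
```

Because an open file yields its lines one at a time, the same function should also work when passed a file handle directly.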

Using n most frequent k-mers (where n is a user defined number when the program is run), calculate the Jaccard distance between each pair of organisms and hence identify the two most similar organisms.
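Picking the n most frequent k-mers could look roughly like the sketch below. `top_kmers` is a hypothetical helper, the counts are invented, and breaking ties alphabetically is an assumption (the assignment does not say what to do when counts are equal).

```python
def top_kmers(counts, n):
    """Return the set of the n k-mers with the highest counts."""
    # Sort by count, highest first; ties are broken alphabetically (assumption).
    ranked = sorted(counts, key=lambda kmer: (-counts[kmer], kmer))
    return set(ranked[:n])


# Invented counts, only to show the call.
example_counts = {"AAA": 5, "CCC": 9, "GGG": 1}
print(top_kmers(example_counts, 2))
```

Only the k-mers themselves are kept, since the distance is defined on the k-mer types and not on their counts.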

If A is the list of k-mer types (not the counts, this list will have n items) in species one and B the list in species 2 then Jaccard distance is defined as:

(All unique kmers - kmers in both A and B) / All unique kmers

This should be a number between 0 and 1. The closer to 0, the more similar the organisms.

E.g. if A and B have the kmers shown below (couldn't figure out how to add a table to my post :/) then the total number of unique kmers is 8, and the Jaccard distance is (8-2)/8 = 0.75
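One way to read the definition is in terms of Python sets: "all unique kmers" is the union of the two lists, and "kmers in both A and B" is their intersection. The sketch below uses placeholder k-mer names (K1 to K8), because the original table is not available; they are chosen only so that there are 8 unique k-mers with 2 shared, matching the numbers in the example. Returning 0.0 for two empty inputs is an assumption.

```python
def jaccard_distance(a, b):
    """(all unique kmers - kmers in both) / all unique kmers."""
    a, b = set(a), set(b)
    all_unique = a | b  # union: every k-mer seen in either species
    shared = a & b      # intersection: k-mers seen in both species
    if not all_unique:
        return 0.0      # assumption: two empty lists count as identical
    return (len(all_unique) - len(shared)) / len(all_unique)


# Placeholder k-mers: 5 per species, 2 shared, so 8 unique in total.
species_one = {"K1", "K2", "K3", "K4", "K5"}
species_two = {"K4", "K5", "K6", "K7", "K8"}
print(jaccard_distance(species_one, species_two))  # 0.75
```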

The program should prompt the user for a directory in which the data files are to be found. Each data file name is the species name. The program should also prompt for the value of n to be used.
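Putting the pieces together, the whole program might be sketched as follows. Everything here is an interpretation rather than the expected solution: the function names are hypothetical, the file name (including any extension) is used as the species name, every regular file in the directory is treated as a data file, and ties in count are broken alphabetically. With four files there are six pairs to compare.

```python
from itertools import combinations
from pathlib import Path


def jaccard_distance(a, b):
    """(all unique kmers - kmers in both) / all unique kmers."""
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0


def top_kmers_from_file(path, n):
    """Read 'KMER COUNT' lines and return the n most frequent k-mers as a set."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 2:  # assumption: skip anything else
                counts[fields[0]] = int(fields[1])
    ranked = sorted(counts, key=lambda kmer: (-counts[kmer], kmer))
    return set(ranked[:n])


def most_similar_pair(kmer_sets):
    """Return the two species names with the smallest Jaccard distance."""
    pairs = combinations(sorted(kmer_sets), 2)  # every unordered pair once
    return min(pairs, key=lambda p: jaccard_distance(kmer_sets[p[0]], kmer_sets[p[1]]))


def main():
    directory = Path(input("Directory containing the k-mer files: "))
    n = int(input("Number of most frequent k-mers to use (n): "))
    # Assumption: each regular file in the directory is one species.
    kmer_sets = {p.name: top_kmers_from_file(p, n)
                 for p in directory.iterdir() if p.is_file()}
    first, second = most_similar_pair(kmer_sets)
    distance = jaccard_distance(kmer_sets[first], kmer_sets[second])
    print(f"Most similar: {first} and {second} (distance {distance:.3f})")


# main()  # uncomment to run interactively
```

Note that `most_similar_pair` raises a `ValueError` if fewer than two species are supplied, since there is then no pair to compare.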

Thanks a ton for any help understanding what's happening here, I really appreciate it <3
