Python Forum
Help understanding Bioinformatics question?
#1
Hi there. I have some Bioinformatics homework to do which involves using Python, and honestly I'm having a hard time wrapping my head around the question. I was hoping someone here might be able to explain things to me :)

You have received a series of files (four files: A, B, C and D) listing the frequencies of occurrence of sequence k-mers for a number of bacterial species. The result for each species is contained in one file, which has no header and has lines like this:

ACTCGTATAGTCGA 347

Where the first part is the kmer and the second part the count.
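For anyone following along, here is a minimal sketch of how lines in that format might be read, assuming the k-mer and the count are separated by whitespace and each k-mer appears once per file. The function name `parse_kmer_counts` is made up for illustration and is not part of the assignment.

```python
def parse_kmer_counts(lines):
    """Turn lines like 'ACTCGTATAGTCGA 347' into a {kmer: count} dict.

    Assumes whitespace-separated fields and no header, as described above.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines, e.g. a trailing newline
        kmer, count = line.split()
        counts[kmer] = int(count)
    return counts


example = parse_kmer_counts(["ACTCGTATAGTCGA 347"])
print(example)
```

An open file object can be passed in directly, since iterating over it yields one line at a time.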

Using the n most frequent k-mers (where n is a user-defined number supplied when the program is run), calculate the Jaccard distance between each pair of organisms and hence identify the two most similar organisms.
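Picking the n most frequent k-mers could look something like the sketch below. It assumes the counts for one species are already in a dict mapping k-mer to count; `top_n_kmers` and the sample counts are hypothetical. The assignment does not say how ties at the cutoff should be handled, so this sketch simply keeps whichever tied k-mers `heapq.nlargest` returns.

```python
import heapq


def top_n_kmers(counts, n):
    """Return the set of the n k-mers with the highest counts.

    Only the k-mer strings are kept, not the counts, because the
    Jaccard distance compares k-mer types.
    """
    return set(heapq.nlargest(n, counts, key=counts.get))


# Made-up counts, purely for illustration.
sample = {"AAA": 5, "CCC": 9, "GGG": 1, "TTT": 7}
print(top_n_kmers(sample, 2))
```

If n is larger than the number of k-mers in the file, all of them are returned.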

If A is the list of k-mer types (not the counts; this list will have n items) in species 1 and B the list in species 2, then the Jaccard distance is defined as:

(All unique kmers - kmers in both A and B) / All unique kmers

This should be a number between 0 and 1. The closer to 0, the more similar the organisms.

E.g. if A and B have the kmers shown below (couldn't figure out how to add a table to my post :/) then the total number of unique kmers is 8, and the Jaccard distance is (8-2)/8 = 0.75
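The definition maps quite directly onto Python sets. The sketch below is one way to write it; since the table from the assignment did not make it into the post, the two k-mer sets are invented so that they have 8 unique k-mers in total and 2 in common, matching the numbers in the example.

```python
def jaccard_distance(a, b):
    """(all unique k-mers - k-mers in both) / all unique k-mers."""
    a, b = set(a), set(b)
    union = a | b    # all unique k-mers across both species
    shared = a & b   # k-mers present in both A and B
    if not union:
        return 0.0   # assumption: two empty lists count as identical
    return (len(union) - len(shared)) / len(union)


# Hypothetical k-mers: 5 per species, 2 shared, so 8 unique in total.
species_a = {"AAA", "AAC", "AAG", "AAT", "ACA"}
species_b = {"AAA", "AAC", "CCC", "CCG", "CCT"}
print(jaccard_distance(species_a, species_b))  # 0.75
```

Identical sets give 0.0 and sets with nothing in common give 1.0, which fits the "closer to 0 means more similar" reading.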



The program should prompt the user for a directory in which the data files are to be found. Each data file name is the species name. The program should also prompt for the value of n to be used.
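Putting the pieces together, the overall flow might look like the sketch below. It is only an outline under a few assumptions: every regular file in the directory is a data file, the file name is used as the species name as the assignment states, and fields are whitespace-separated. All function names here are made up, and `min` will raise `ValueError` if the directory holds fewer than two files.

```python
from itertools import combinations
from pathlib import Path


def top_kmers(path, n):
    """Return the set of the n most frequent k-mers in one 'kmer count' file."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():
                kmer, count = line.split()
                counts[kmer] = int(count)
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:n])


def jaccard_distance(a, b):
    """(all unique k-mers - shared k-mers) / all unique k-mers."""
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0


def most_similar_pair(directory, n):
    """Return (distance, species_1, species_2) for the closest pair."""
    top = {
        p.name: top_kmers(p, n)
        for p in Path(directory).iterdir()
        if p.is_file()
    }
    # Compare every unordered pair of species and keep the smallest distance.
    return min(
        (jaccard_distance(top[x], top[y]), x, y)
        for x, y in combinations(sorted(top), 2)
    )


def main():
    """Prompt for the directory and n, then report the closest pair."""
    directory = input("Directory containing the k-mer files: ")
    n = int(input("Number of most frequent k-mers to use (n): "))
    distance, first, second = most_similar_pair(directory, n)
    print(f"Most similar: {first} and {second} (Jaccard distance {distance:.3f})")
```

Calling `main()` starts the two prompts. With four files there are six pairs to compare, so nothing more elaborate than a loop over `combinations` should be needed.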



Thanks a ton for any help understanding what's happening here, I really appreciate it <3


Messages In This Thread
Help understanding Bioinformatics question? - by a_real_phoenix - Jun-21-2019, 05:35 PM
