Jun-21-2019, 05:35 PM
(This post was last modified: Jun-21-2019, 05:35 PM by a_real_phoenix.)
Hi there. I have some Bioinformatics homework to do which involves using Python, and honestly I'm having a hard time wrapping my head around the question. I was hoping someone here might be able to explain things to me :)
You have received a series of files (four files: A, B, C and D) listing the frequencies of occurrence of sequence k-mers for a number of bacterial species. The result for each species is contained in one file, which has no header and has lines like this:
ACTCGTATAGTCGA 347
Where the first part is the kmer and the second part the count.
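A minimal sketch of reading one of these files, assuming the k-mer and its count are separated by whitespace (a space or a tab) and that each k-mer appears only once per file. The function names `parse_kmer_line` and `read_kmer_counts` are made up for illustration; they are not part of the assignment.

```python
def parse_kmer_line(line):
    """Split one line such as 'ACTCGTATAGTCGA 347' into (kmer, count)."""
    kmer, count = line.split()  # assumption: exactly two whitespace-separated fields
    return kmer, int(count)


def read_kmer_counts(path):
    """Read a headerless k-mer file into a dict mapping kmer -> count."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():  # skip blank lines, e.g. a trailing newline
                kmer, count = parse_kmer_line(line)
                counts[kmer] = count
    return counts
```

A dict is used here so that the count for any k-mer can be looked up directly when ranking them later.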
Using the n most frequent k-mers (where n is a number the user supplies when the program is run), calculate the Jaccard distance between each pair of organisms and hence identify the two most similar organisms.
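Picking the n most frequent k-mers could look something like the sketch below. It assumes the counts for one species are already in a dict mapping k-mer to count; `top_n_kmers` is a hypothetical helper name. The assignment does not say how to handle ties at the cutoff, so here they are simply broken by the order in which the k-mers were read.

```python
def top_n_kmers(counts, n):
    """Return the set of the n k-mers with the highest counts.

    counts: dict mapping kmer -> count
    If the dict holds fewer than n k-mers, all of them are returned.
    """
    # Sort the k-mers by their count, largest first, then keep the first n.
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:n])


# Made-up counts, purely for illustration
example = {"AAA": 5, "CCC": 9, "GGG": 1, "TTT": 7}
print(sorted(top_n_kmers(example, 2)))  # ['CCC', 'TTT']
```

Returning a set (rather than a list) drops the counts and the ordering, which matches the note that only the k-mer types matter for the distance.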
If A is the list of k-mer types (not the counts; this list will have n items) in species 1 and B is the list in species 2, then the Jaccard distance is defined as:
(All unique kmers - kmers in both A and B) / All unique kmers
This should be a number between 0 and 1. The closer to 0, the more similar the organisms.
E.g. if A and B have the kmers shown below (couldn't figure out how to add a table to my post :/), then the total number of unique kmers is 8, and the Jaccard distance is (8-2)/8 = 0.75
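In Python terms, "all unique kmers" is the union of the two sets and "kmers in both A and B" is their intersection, so the definition can be sketched as below. The example table did not make it into the post, so the k-mers here are made up; they are chosen only to reproduce the same arithmetic (8 unique k-mers, 2 shared). Treating two empty sets as distance 0 is an assumption, since the assignment does not cover that case.

```python
def jaccard_distance(a, b):
    """Jaccard distance: (unique k-mers - shared k-mers) / unique k-mers."""
    a, b = set(a), set(b)
    union = a | b    # every k-mer seen in either species
    shared = a & b   # k-mers seen in both species
    if not union:
        return 0.0   # assumption: two empty sets count as identical
    return (len(union) - len(shared)) / len(union)


# Made-up k-mers: 5 per species, 2 in common, so 8 unique in total
species_1 = {"AAAA", "AAAC", "AAAG", "AAAT", "AACA"}
species_2 = {"AAAT", "AACA", "CCCC", "CCCG", "CCCT"}
print(jaccard_distance(species_1, species_2))  # 0.75
```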
The program should prompt the user for a directory in which the data files are to be found. Each data file name is the species name. The program should also prompt for the value of n to be used.
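Putting the pieces together, one possible shape for the whole program is sketched below. This is only one way to read the assignment, under several assumptions: every regular file in the directory is a data file, the file name is used as the species name exactly as it appears (including any extension), and fields are whitespace-separated. All function names are hypothetical.

```python
import os
from itertools import combinations


def load_top_kmers(path, n):
    """Read one k-mer file and return the set of its n most frequent k-mers."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                kmer, count = line.split()
                counts[kmer] = int(count)
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:n])


def jaccard_distance(a, b):
    """(unique k-mers - shared k-mers) / unique k-mers."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return (len(union) - len(a & b)) / len(union)


def pairwise_distances(directory, n):
    """Return {(species_x, species_y): distance} for every pair of files."""
    top = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # assumption: every file is a data file
            top[name] = load_top_kmers(path, n)
    return {
        (x, y): jaccard_distance(top[x], top[y])
        for x, y in combinations(top, 2)
    }


def main():
    """Prompt for the directory and n, then report every pair and the closest one."""
    directory = input("Directory containing the k-mer files: ")
    n = int(input("Number of most frequent k-mers to use (n): "))
    distances = pairwise_distances(directory, n)
    if not distances:
        print("Need at least two data files to compare.")
        return
    for (x, y), d in sorted(distances.items(), key=lambda item: item[1]):
        print(f"{x} vs {y}: {d:.3f}")
    best = min(distances, key=distances.get)  # smallest distance = most similar
    print(f"Most similar: {best[0]} and {best[1]}")
```

Calling `main()` runs the interactive version. With four files A, B, C and D, `combinations` yields the six pairs A-B, A-C, A-D, B-C, B-D and C-D, and the pair with the smallest distance is reported as the most similar.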
Thanks a ton for any help understanding what's happening here, I really appreciate it <3