Help understanding Bioinformatics question?

a_real_phoenix · (This post was last modified: Jun-21-2019, 05:35 PM by a_real_phoenix.)

Hi there. I have some Bioinformatics homework to do which involves using Python, and honestly I'm having a hard time wrapping my head around the question. I was hoping someone here might be able to explain things to me :)

You have received a series of files (four files: A, B, C and D) listing the frequencies of occurrence of sequence k-mers for a number of bacterial species. The result for each species is contained in one file which has no header and has lines like this:

ACTCGTATAGTCGA 347

Where the first part is the kmer and the second part the count.
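A minimal sketch of reading that format, assuming the two fields are separated by whitespace and there is one k-mer per line. The function name `parse_kmer_lines` and the second k-mer in the example are made up for illustration.

```python
def parse_kmer_lines(lines):
    """Turn lines such as 'ACTCGTATAGTCGA 347' into a {kmer: count} dict."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        kmer, count = line.split()
        counts[kmer] = int(count)
    return counts


# The second entry is a made-up k-mer, only here to show more than one line.
example = parse_kmer_lines(["ACTCGTATAGTCGA 347", "", "GGGTTTAAACCCGG 12"])
print(example)
```

The same function works on an open file, since iterating over a file object yields its lines.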

Using the n most frequent k-mers (where n is a user-defined number given when the program is run), calculate the Jaccard distance between each pair of organisms and hence identify the two most similar organisms.

If A is the list of k-mer types (not the counts, this list will have n items) in species one and B the list in species 2 then Jaccard distance is defined as:

(All unique kmers – kmers in both A and B) / All unique kmers

This should be a number between 0 and 1. The closer to 0, the more similar the organisms.

E.g. if A and B have the kmers shown below (couldn't figure out how to add a table to my post :/) then the total number of unique kmers is 8, and the Jaccard distance is (8-2)/8 = 0.75
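The definition above can be sketched with Python sets. The k-mers here are made up (the original table is not available); they are chosen so that there are 8 unique k-mers and 2 shared ones, matching the numbers in the example. Treating two empty lists as identical is an assumption, not something the homework states.

```python
def jaccard_distance(kmers_a, kmers_b):
    """(all unique k-mers - k-mers in both) / all unique k-mers."""
    a, b = set(kmers_a), set(kmers_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty lists count as identical
    return (len(union) - len(a & b)) / len(union)


# Made-up k-mers: 5 in each list, 2 shared, so 8 unique in total.
species_a = ["AAA", "AAC", "AAG", "AAT", "ACA"]
species_b = ["AAT", "ACA", "ACC", "ACG", "ACT"]
print(jaccard_distance(species_a, species_b))  # (8 - 2) / 8 = 0.75
```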

The program should prompt the user for a directory in which the data files are to be found. Each data file name is the species name. The program should also prompt for the value of n to be used.

Thanks a ton for any help understanding what's happening here, I really appreciate it <3

***micseydel*** · Jun-22-2019, 04:34 AM

First off, this isn't a Python question, so you might have a hard time getting attention. That aside, do you have any specific questions? I'm not sure what you would expect from us other than to re-word things semi-randomly.

**scidam** · Jun-22-2019, 10:29 AM

You need to divide your problem into a set of smaller ones. The underlying math isn't very hard: the Jaccard coefficient (similarity) is the ratio of the measure of the intersection of two sets to the measure of their union. The Jaccard distance is then (1 - Jaccard coefficient).

So, you need to implement a function that traverses the specified directory and returns
data loaded from two files. This function could be implemented as a generator.
The generator will yield a new pair of data sets until all possible combinations
have been traversed (s(s-1)/2 pairs, where s is the number of files in the folder).

Below is a sketch of the solution; it is completely untested but might be helpful...

def traverse_dir(path='.'):
    # some code goes here, probably you'll need to use os.listdir or os.walk
    yield df1, df2, filenames  # df1, df2 assumed to be pandas dataframes; each dataframe has two columns
    
   
    
def get_jaccard(df1, df2, n=3):
    """Return Jaccard distance between two dfs of specified form
    
    Parameters
    ==========
        
        :param df1: Pandas data frame
        :param df2: Pandas data frame
        :param n: an integer, the number of most frequent ... to use
    
    
    Notes
    =====
    
    df1, df2 assumed to have the following form:
    
    df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1:[4, 8]})
    df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC'], 1:[3, 4, 8]})
    
    Pandas assumed to be imported as pd.
    """
    
    # NumPy is assumed to be imported as np (the pd.np alias was removed in pandas 2.0).
    d1 = df1.sort_values(by=1, ascending=False)[:n]
    d2 = df2.sort_values(by=1, ascending=False)[:n]
    common = np.intersect1d(d1.iloc[:, 0].values, d2.iloc[:, 0].values)
    # Counts of the shared k-mers, looked up in the same order for both frames
    a = d1.set_index(0).loc[common, 1].values
    b = d2.set_index(0).loc[common, 1].values
    comm_measure = np.vstack([a, b]).min(axis=0).sum()
    all_unique = ... # write something here... 
    
    return  (1 - comm_measure / all_unique)


requested_path = input('Enter path:')
n = int(input('Enter n:'))  # input() returns a string, so convert it

# and something like this... 
for a, b, filenames in traverse_dir(requested_path):
    print("Processing files: {}, result={} ".format(filenames, get_jaccard(a, b, n=n)))
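The `traverse_dir` placeholder above could be filled in along these lines. This is only a sketch under a few assumptions: every regular file in the directory is a data file, fields are whitespace-separated, and plain dicts are yielded instead of pandas dataframes to keep the example dependency-free. The names `read_counts` and `iter_file_pairs` are made up for illustration.

```python
import itertools
import os


def read_counts(path):
    """Read 'KMER COUNT' lines into a {kmer: count} dict."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():
                kmer, count = line.split()
                counts[kmer] = int(count)
    return counts


def iter_file_pairs(directory="."):
    """Yield (counts1, counts2, (name1, name2)) for every pair of files."""
    names = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    # combinations() produces each unordered pair once: s(s-1)/2 pairs
    for name1, name2 in itertools.combinations(names, 2):
        yield (
            read_counts(os.path.join(directory, name1)),
            read_counts(os.path.join(directory, name2)),
            (name1, name2),
        )
```

Each file is re-read for every pair it appears in, which is fine for four small files; for larger inputs it would make sense to load every file once and pair up the loaded data instead.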

a_real_phoenix · (This post was last modified: Jun-28-2019, 12:10 AM by scidam.)

Hi there, thanks a ton for the amazing reply and sorry for the late response! I've gone and done a lot of my own code, and yours has been a useful reference point for me, although I won't deny some of it is lost on me xD

I've done most of the smaller tasks now, but I'm struggling to get this Jaccard distance. I was hoping you might have a look at what I've done and maybe guide me? :)

Here's my code:

import pandas as pd
import numpy as np

requested_path_A = input('Enter file path for species A:')
requested_path_B = input('Enter file path for species B:')
requested_path_C = input('Enter file path for species C:')
requested_path_D = input('Enter file path for species D:')

This outputs:

Output:
Enter file path for species A:H:\Bioinformatics_Resit\Species_A.txt
Enter file path for species B:H:\Bioinformatics_Resit\Species_B.txt
Enter file path for species C:H:\Bioinformatics_Resit\Species_C.txt
Enter file path for species D:H:\Bioinformatics_Resit\Species_D.txt

n = int(input('Enter your chosen value for n:'))

Which outputs:

Output:
Enter your chosen value for n:7

# Raw strings (r'...') keep the regex backslash from being read as an escape sequence
df1 = pd.read_csv(requested_path_A, sep=r'\s+', names=["K-mers A", "Count A"])
df2 = pd.read_csv(requested_path_B, sep=r'\s+', names=["K-mers B", "Count B"])
df3 = pd.read_csv(requested_path_C, sep=r'\s+', names=["K-mers C", "Count C"])
df4 = pd.read_csv(requested_path_D, sep=r'\s+', names=["K-mers D", "Count D"])

df1=df1.nlargest(n, ['Count A'])
df2=df2.nlargest(n, ['Count B'])
df3=df3.nlargest(n, ['Count C'])
df4=df4.nlargest(n, ['Count D'])
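One possible way to continue from here, following the homework's set-based definition (k-mer types only, not counts). This is a sketch on plain dicts with made-up counts rather than on the dataframes above; with the dataframes, something like `set(df1["K-mers A"])` would give the corresponding set of k-mers. The function names are hypothetical.

```python
import itertools


def top_n_kmers(counts, n):
    """Return the set of the n k-mers with the highest counts."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:n])


def jaccard_distance(a, b):
    """(all unique k-mers - shared k-mers) / all unique k-mers, for two sets."""
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0


def most_similar_pair(species_counts, n):
    """Return the closest pair of species and all pairwise distances."""
    top = {name: top_n_kmers(c, n) for name, c in species_counts.items()}
    distances = {
        (s1, s2): jaccard_distance(top[s1], top[s2])
        for s1, s2 in itertools.combinations(sorted(top), 2)
    }
    best = min(distances, key=distances.get)
    return best, distances


# Made-up counts for three toy "species"
toy = {
    "A": {"AAA": 9, "AAC": 8, "AAG": 1},
    "B": {"AAA": 7, "AAC": 6, "AAT": 5},
    "C": {"CCC": 9, "CCG": 8, "AAA": 1},
}
best, distances = most_similar_pair(toy, n=2)
print(best, distances[best])
```

With n=2 the top k-mers of A and B are identical, so that pair comes out with distance 0. Ties at the n-th position are broken arbitrarily here, which is an assumption the homework text does not address.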

**scidam** · Jun-28-2019, 04:15 AM

OK, I'll help you with the Jaccard distance. In general, Jaccard distance is 1 - Jaccard similarity, where Jaccard similarity is measure(intersection of two sets) / measure(union of two sets). So we need to define how to compute these measures. The definition may differ from case to case; it always depends on the specifics of the problem. Usually, the number of items (the cardinality) is used as the measure of a finite set. Here, however, we have multisets: each set consists of items that occur in it multiple times, e.g. the fragment 'ACCT' occurs 3 times, etc. It seems natural to define the measure of such a set as the total number of items, counting repetitions. We also need to define the intersection of two multisets. Suppose we have a multiset A = {'X': 4, 'Y': 2, 'Z': 4} ('X' is included in A 4 times, etc.) and B = {'X': 2, 'Y': 3, 'D': 7}. What would the intersection of these be? Normally, it is a new multiset: {'X': min(4, 2), 'Y': min(2, 3)}.
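The multiset arithmetic described above can be tried out with `collections.Counter` from the standard library, whose `&` and `|` operators take the element-wise minimum and maximum of the counts. This is just an illustration on the A and B from the paragraph above, not the pandas approach.

```python
from collections import Counter

A = Counter({"X": 4, "Y": 2, "Z": 4})
B = Counter({"X": 2, "Y": 3, "D": 7})

intersection = A & B  # element-wise minimum of the counts
union = A | B         # element-wise maximum of the counts

# Measure of a multiset = total number of items, counting repetitions
similarity = sum(intersection.values()) / sum(union.values())
distance = 1 - similarity
print(intersection, union, distance)
```

Here the intersection has measure 2 + 2 = 4 and the union has measure 4 + 3 + 4 + 7 = 18, so the distance is 1 - 4/18.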

Below is an almost-working implementation of the Jaccard distance between two multisets represented as Pandas dataframes:

def get_jaccard(df1, df2, n=3):
    """Return Jaccard distance between two dfs of specified form
    
    Parameters
    ==========
        
        :param df1: Pandas data frame
        :param df2: Pandas data frame
        :param n: an integer, the number of most frequent ... to use
    
    
    Notes
    =====
    
    df1, df2 assumed to have the following form:
    
    df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1:[4, 8]})
    df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC', 'AGT'], 1:[3, 4, 8, 8]})
    
    Expected value for df1 and df2:
        get_jaccard(df1, df2, n=3) should return:
        1 - 4 / (8 + 8 + 8 + 4)
    
    Explanation
    -----------
       df2 3 most frequent features are ['CCTTGGA', 'ACC', 'AGT']
       df1 3 most frequent features are ['AACCTTGG', 'CCTTGGA']
       common features: [CCTTGGA]
       
       Jaccard = measure(intersection)/measure(union)
       
       Let "measure = the number of fragments"
       Ok, 'CCTTGGA' count in df2 = 4, 
       'CCTTGGA' count in df1 = 8,
       
       Measure of intersection: min(4, 8) = 4
       If we had several common fragments, we would
       compute their sum, e.g. min(a1, b1) + min(a2, b2) etc.
       Here we have only one: 'CCTTGGA';
       
       measure of union: count of only df1 features + count of only df2 features +
                         max(counts of common_features)
       
       # Note: we consider only 3 most frequent features!
       count of only df1 features: 4
       count of only df2 features: 8 (ACC) + 8 (AGT)
       max(counts of common_features): max(4, 8)
       
       So, we got:
         Jaccard similarity = 4 / (8 + 8 + 8 + 4)
         
         and, finally:
         
         Jaccard dist. = 1 - Jaccard similarity

    Pandas assumed to be imported as pd.
    """
    
    # NumPy is assumed to be imported as np (the pd.np alias was removed in pandas 2.0).
    d1 = df1.sort_values(by=1, ascending=False)[:n]
    d2 = df2.sort_values(by=1, ascending=False)[:n]
    common = np.intersect1d(d1.iloc[:, 0].values, d2.iloc[:, 0].values)
    d1_only_features = set(d1.iloc[:, 0].values) - set(common)
    d2_only_features = set(d2.iloc[:, 0].values) - set(common)
    # Counts of the shared k-mers, looked up in the same order for both frames
    a = d1.set_index(0).loc[common, 1].values
    b = d2.set_index(0).loc[common, 1].values
    comm_measure = np.vstack([a, b]).min(axis=0).sum()
    comm_measure_max = np.vstack([a, b]).max(axis=0).sum()
    d2_only_measure = ...  # fill these lines
    d1_only_measure = ...  # fill these lines
    total = d1_only_measure + d2_only_measure + comm_measure_max
    return 1 - comm_measure / total

However, this forum is an educational resource, so you need to complete the code by yourself.
