Help understanding Bioinformatics question?

**scidam** · Jun-28-2019, 04:15 AM

OK, I'll help you with the Jaccard distance. In general, Jaccard distance is 1 - Jaccard similarity, where Jaccard similarity is measure(intersection of two sets) / measure(union of two sets). So we need to define how to compute these measures. The definition can differ from case to case; it depends on the specifics of the problem. Usually, the number of items (the cardinality) is used as the measure of a finite set. Here, however, we have multisets, i.e. each item can occur several times, e.g. the fragment 'ACCT' occurs 3 times, etc. It seems natural to define the measure of a multiset as the total number of items, counting repetitions. We also need to define the intersection of two multisets. Suppose we have a multiset A = {'X': 4, 'Y': 2, 'Z': 4} ('X' occurs in A 4 times, etc.) and B = {'X': 2, 'Y': 3, 'D': 7}. What is the intersection of these multisets? Normally, it is a new multiset: {'X': min(4, 2), 'Y': min(2, 3)}, i.e. {'X': 2, 'Y': 2}. The union is defined the same way with max instead of min, and it also keeps the items that occur in only one of the multisets.
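To make the definitions above concrete, here is a minimal sketch using `collections.Counter` from the standard library, whose `&` and `|` operators take the per-item minimum and maximum of the counts. The function name `multiset_jaccard_distance` and the choice to return 0.0 for two empty multisets are my own assumptions for illustration, not part of the original task.

```python
from collections import Counter


def multiset_jaccard_distance(a: Counter, b: Counter) -> float:
    """Jaccard distance between two multisets stored as Counters."""
    intersection = a & b   # per-item minimum of the counts
    union = a | b          # per-item maximum of the counts
    total = sum(union.values())
    if total == 0:
        return 0.0  # assumption: two empty multisets are treated as identical
    return 1 - sum(intersection.values()) / total


# The multisets A and B from the example above
A = Counter({'X': 4, 'Y': 2, 'Z': 4})
B = Counter({'X': 2, 'Y': 3, 'D': 7})

print(A & B)                            # intersection: X -> 2, Y -> 2
print(multiset_jaccard_distance(A, B))  # 1 - 4/18
```

Here the measure of the intersection is 2 + 2 = 4 and the measure of the union is 4 + 3 + 4 + 7 = 18.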

Below is an almost-working implementation of the Jaccard distance between two multisets represented as Pandas data frames:

def get_jaccard(df1, df2, n=3):
    """Return Jaccard distance between two dfs of specified form
    
    Parameters
    ==========
        
        :param df1: Pandas data frame
        :param df2: Pandas data frame
        :param n: an integer, the number of most frequent ... to use
    
    
    Notes
    =====
    
    df1, df2 assumed to have the following form:
    
    df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1:[4, 8]})
    df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC', 'AGT'], 1:[3, 4, 8, 8]})
    
    Expected value for df1 and df2:
        get_jaccard(df1, df2, n=3) should return:
        1 - 4 / (8 + 8 + 8 + 4)
    
    Explanation
    -----------
       df2 3 most frequent features are ['CCTTGGA', 'ACC', 'AGT']
       df1 3 most frequent features are ['AACCTTGG', 'CCTTGGA']
       common features: [CCTTGGA]
       
       Jaccard = measure(intersection)/measure(union)
       
       Let "measure = the number of fragments"
       Ok, 'CCTTGGA' count in df2 = 4, 
       'CCTTGGA' count in df1 = 8,
       
       Measure of intersection: min(4, 8) = 4
       If we had several common fragments, we would
       sum over them, e.g. min(a1, b1) + min(a2, b2) etc.
       Here we have only one: 'CCTTGGA'.
       
       Measure of union: count of df1-only features + count of df2-only features +
                         max(counts of common_features)
       
       # Note: we consider only 3 most frequent features!
       count of only df1 features: 4
       count of only df2 features: 8 (ACC) + 8 (AGT)
       max(counts of common_features): max(4, 8)
       
       So, we got:
         Jaccard similarity = 4 / (8 + 8 + 8 + 4)
         
         and, finally:
         
         Jaccard dist. = 1 - Jaccard similarity

    Pandas is assumed to be imported as pd and NumPy as np.
    """
    
    # Keep only the n most frequent features of each data frame
    d1 = df1.sort_values(by=1, ascending=False)[:n]
    d2 = df2.sort_values(by=1, ascending=False)[:n]
    common = np.intersect1d(d1.iloc[:, 0].values, d2.iloc[:, 0].values)
    d1_only_features = set(d1.iloc[:, 0].values) - set(common)
    d2_only_features = set(d2.iloc[:, 0].values) - set(common)
    # Counts of the common features, sorted by feature name so that a and b line up
    a = d1.loc[d1.iloc[:, 0].isin(common)].sort_values(by=0).iloc[:, 1].values
    b = d2.loc[d2.iloc[:, 0].isin(common)].sort_values(by=0).iloc[:, 1].values
    comm_measure = np.vstack([a, b]).min(axis=0).sum()      # measure of the intersection
    comm_measure_max = np.vstack([a, b]).max(axis=0).sum()  # common part of the union
    d2_only_measure = # fill these lines
    d1_only_measure = # fill these lines
    total = d1_only_measure + d2_only_measure + comm_measure_max
    return 1 - comm_measure / total

However, this forum is an educational resource, so you need to complete the code yourself.
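Once you have filled in the missing lines, you can check your result against the expected value from the docstring. The sketch below recomputes that value independently with plain `Counter` objects instead of data frames. It is only a cross-check, not the pandas solution; the helper `top_n` is a made-up name, and note that `most_common` breaks ties by insertion order, which may differ from how `sort_values` orders equal counts (it does not matter for this data).

```python
from collections import Counter


def top_n(counts: dict, n: int) -> Counter:
    """Keep only the n most frequent items of a multiset."""
    return Counter(dict(Counter(counts).most_common(n)))


# Same data as df1 and df2 in the docstring example, as plain dicts
c1 = top_n({'AACCTTGG': 4, 'CCTTGGA': 8}, 3)
c2 = top_n({'AACCTTGG': 3, 'CCTTGGA': 4, 'ACC': 8, 'AGT': 8}, 3)

intersection = sum((c1 & c2).values())  # min(8, 4) = 4
union = sum((c1 | c2).values())         # 4 + 8 + 8 + 8 = 28
expected = 1 - intersection / union
print(expected)                         # 1 - 4/28
```

If your completed get_jaccard(df1, df2, n=3) returns a different number for the docstring data, check the "only" measures first.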
