Jun-28-2019, 04:15 AM
OK, I'll help you with the Jaccard distance. In general, Jaccard distance is 1 - Jaccard similarity, where Jaccard similarity is measure(intersection of two sets) / measure(union of two sets). So we need to define how to compute these measures. The definition of the measure can differ from case to case; it depends on the specifics of the problem. Usually, the number of items (the cardinality) is used as the measure of a finite set. Here, as it happens, we have multisets, i.e. each set consists of items that can be present in it multiple times, e.g. the 'ACCT' fragment occurs 3 times, etc. It seems natural to define the total number of items (including repetitions) as the measure of such a set.
Additionally, we need to define the intersection of two multisets. Suppose we have a multiset
A = {'X': 4, 'Y': 2, 'Z': 4}
('X' is included in A 4 times, etc.) and B = {'X': 2, 'Y': 3, 'D': 7}. What would the intersection of these sets be? Normally, it is a new multiset that keeps each common item with the smaller of its two counts:
{'X': min(4, 2), 'Y': min(2, 3)} = {'X': 2, 'Y': 2}
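To make the definitions above concrete, here is a minimal sketch that uses Python's `collections.Counter` as the multiset type. It assumes the union keeps the larger count of each item (the counterpart of taking the minimum for the intersection); the helper name `measure` is only for illustration.

```python
from collections import Counter

# The two multisets from the text: item -> number of occurrences.
A = Counter({'X': 4, 'Y': 2, 'Z': 4})
B = Counter({'X': 2, 'Y': 3, 'D': 7})

# Counter implements multiset operations directly:
# "&" keeps the minimum count of each common item,
# "|" keeps the maximum count of every item seen in either operand.
intersection = A & B
union = A | B


def measure(multiset):
    """Total number of items, repetitions included."""
    return sum(multiset.values())


similarity = measure(intersection) / measure(union)  # (2 + 2) / (4 + 3 + 4 + 7)
distance = 1 - similarity
print(intersection, union, distance)
```

With these inputs the intersection has measure 4 and the union has measure 18, so the distance is 1 - 4/18.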
I posted below an almost working implementation of the Jaccard distance between two multisets represented as Pandas dataframes:
import numpy as np
import pandas as pd


def get_jaccard(df1, df2, n=3):
    """Return Jaccard distance between two dfs of specified form

    Parameters
    ==========

        :param df1: Pandas data frame
        :param df2: Pandas data frame
        :param n: an integer, the number of most frequent ... to use

    Notes
    =====

        df1, df2 assumed to have the following form:

            df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1: [4, 8]})
            df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC', 'AGT'], 1: [3, 4, 8, 8]})

        Expected value for df1 and df2:
            get_jaccard(df1, df2, n=3) should return: 1 - 4 / (8 + 8 + 8 + 4)

    Explanation
    -----------

        df2 3 most frequent features are ['CCTTGGA', 'ACC', 'AGT']
        df1 3 most frequent features are ['AACCTTGG', 'CCTTGGA']
        common features: ['CCTTGGA']

        Jaccard = measure(intersection) / measure(union)
        Let "measure = the number of fragments"

        OK, 'CCTTGGA' count in df2 = 4, 'CCTTGGA' count in df1 = 8.
        Measure of intersection: min(4, 8) = 4
        If we had several common fragments, we would compute their sum,
        e.g. min(a1, b1) + min(a2, b2) etc. Here we have only one: 'CCTTGGA'.

        Measure of union: count of only df1 features
                          + count of only df2 features
                          + max(counts of common features)
        # Note: we consider only the 3 most frequent features!
            count of only df1 features: 4
            count of only df2 features: 8 (ACC) + 8 (AGT)
            max(counts of common features): max(4, 8)

        So, we got: Jaccard similarity = 4 / (8 + 8 + 8 + 4)
        and, finally: Jaccard dist. = 1 - Jaccard similarity

        Feature names in column 0 are assumed to be unique within each frame.
    """
    d1 = df1.sort_values(by=1, ascending=False)[:n]
    d2 = df2.sort_values(by=1, ascending=False)[:n]
    # pd.np has been removed from recent pandas versions, so NumPy is used directly
    common = np.intersect1d(d1.iloc[:, 0].values, d2.iloc[:, 0].values)
    d1_only_features = set(d1.iloc[:, 0].values) - set(common)
    d2_only_features = set(d2.iloc[:, 0].values) - set(common)
    # Look up the counts of the common features by name, in the same order for
    # both frames, so that a[i] and b[i] always refer to the same feature
    a = d1.set_index(0).loc[common, 1].values
    b = d2.set_index(0).loc[common, 1].values
    comm_measure = np.vstack([a, b]).min(axis=0).sum()
    comm_measure_max = np.vstack([a, b]).max(axis=0).sum()
    d2_only_measure =  # fill these lines
    d1_only_measure =  # fill these lines
    total = d1_only_measure + d2_only_measure + comm_measure_max
    return 1 - comm_measure / total

However, this forum is an educational resource, so you need to complete the code yourself.
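Once you have a completed version, it helps to have something to check it against. The following is only a sketch of one possible approach, not the function above: it aligns the top-n counts of both frames on the feature name and treats a missing feature as a count of 0, so the intersection is an element-wise minimum and the union an element-wise maximum. The name `jaccard_distance_top_n` is made up for this example, and it assumes column 0 holds unique feature names and column 1 their counts.

```python
import numpy as np
import pandas as pd


def jaccard_distance_top_n(df1, df2, n=3):
    """Jaccard distance between the n most frequent features of two frames.

    Assumption: column 0 holds unique feature names, column 1 their counts,
    and at least one of the frames has a positive count.
    """
    # Keep the n most frequent features of each frame as a name -> count Series.
    s1 = df1.sort_values(by=1, ascending=False)[:n].set_index(0)[1]
    s2 = df2.sort_values(by=1, ascending=False)[:n].set_index(0)[1]

    # Align both Series on all feature names; a missing feature counts as 0.
    names = s1.index.union(s2.index)
    c1 = s1.reindex(names, fill_value=0).to_numpy()
    c2 = s2.reindex(names, fill_value=0).to_numpy()

    # Multiset intersection = element-wise min, union = element-wise max.
    intersection = np.minimum(c1, c2).sum()
    union = np.maximum(c1, c2).sum()
    return 1 - intersection / union


# The example frames from the docstring.
df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1: [4, 8]})
df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC', 'AGT'], 1: [3, 4, 8, 8]})
result = jaccard_distance_top_n(df1, df2, n=3)
print(result)
```

For the docstring example this gives 1 - 4 / 28, the same value as 1 - 4 / (8 + 8 + 8 + 4), so your own completed `get_jaccard` should agree with it on these inputs.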