Help understanding Bioinformatics question?

#1 a_real_phoenix, Jun-21-2019, 05:35 PM

Hi there. I have some Bioinformatics homework to do which involves using Python, and honestly I'm having a hard time wrapping my head around the question. I was hoping someone here might be able to explain things to me :)

You have received a series of files (four files: A, B, C and D) listing the frequencies of occurrence of sequence k-mers for a number of bacterial species. The result for each species is contained in one file which has no header and has lines like this:

ACTCGTATAGTCGA 347

Where the first part is the k-mer and the second part the count.

Using the n most frequent k-mers (where n is a user-defined number given when the program is run), calculate the Jaccard distance between each pair of organisms and hence identify the two most similar organisms.

If A is the list of k-mer types (not the counts; this list will have n items) in species 1 and B the list in species 2, then the Jaccard distance is defined as:

(All unique k-mers - k-mers in both A and B) / All unique k-mers

This should be a number between 0 and 1. The closer to 0, the more similar the organisms. E.g. if A and B have the k-mers shown below (couldn't figure out how to add a table to my post :/) then the total number of unique k-mers is 8, and the Jaccard distance is (8-2)/8 = 0.75

The program should prompt the user for a directory in which the data files are to be found. Each data file name is the species name. The program should also prompt for the value of n to be used.
Thanks a ton for any help understanding what's happening here, I really appreciate it <3

#2 micseydel, Jun-22-2019, 04:34 AM

First off, this isn't a Python question, so you might have a hard time getting attention. That aside, do you have any specific questions? I'm not sure what you would expect from us other than to re-word things semi-randomly.

#3 scidam, Jun-22-2019, 10:29 AM

You need to divide your problem into a set of small ones. The underlying math isn't that hard: the Jaccard coefficient (similarity) is the ratio of the measure of the intersection of two sets to the measure of their union. The Jaccard distance is then (1 - Jaccard coefficient).

So, you need to implement a function that traverses the specified directory and returns data loaded from two files. This function could be implemented as a generator. The generator yields a new pair of data sets until all possible combinations have been traversed (s(s-1)/2 of them, where s is the number of files in the folder).

Below is a sketch of the solution; completely untested, but it might be helpful...

```python
import numpy as np
import pandas as pd


def traverse_dir(path='.'):
    # some code goes here, probably you'll need to use os.walk
    # df1, df2 assumed to be pandas dataframes; each dataframe has two columns
    yield df1, df2, filenames


def get_jaccard(df1, df2, n=3):
    """Return Jaccard distance between two dfs of specified form

    Parameters
    ==========

        :param df1: Pandas data frame
        :param df2: Pandas data frame
        :param n: an integer, the number of most frequent ... to use

    Notes
    =====

    df1, df2 assumed to have the following form:

    df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1: [4, 8]})
    df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC'], 1: [3, 4, 8]})

    Pandas and NumPy assumed to be imported as pd and np.
    """
    d1 = df1.sort_values(by=1, ascending=False).head(n)
    d2 = df2.sort_values(by=1, ascending=False).head(n)
    common = np.intersect1d(d1.iloc[:, 0].values, d2.iloc[:, 0].values)
    # Counts of the shared k-mers, sorted by k-mer so that a and b line up
    a = d1[d1.iloc[:, 0].isin(common)].sort_values(by=0).iloc[:, 1].values
    b = d2[d2.iloc[:, 0].isin(common)].sort_values(by=0).iloc[:, 1].values
    comm_measure = np.vstack([a, b]).min(axis=0).sum()
    all_unique = ...  # write something here...
    return 1 - comm_measure / all_unique


requested_path = input('Enter path:')
n = int(input('Enter n:'))  # input() returns a string, so convert it

# and something like this...
for a, b, filenames in traverse_dir(requested_path):
    print("Processing files: {}, result={}".format(filenames, get_jaccard(a, b, n=n)))
```

#4 a_real_phoenix, Jun-27-2019, 05:41 PM

Hi there, thanks a ton for the amazing reply and sorry for the late response! I've gone and done a lot of my own code, and yours has been a useful reference point for me, although I won't deny some of it is lost on me xD I've done most of the smaller tasks now, but I'm struggling to get this Jaccard distance. I was hoping you might have a look at what I've done and maybe guide me?
:) Here's my code:

```python
import pandas as pd
import numpy as np

requested_path_A = input('Enter file path for species A:')
requested_path_B = input('Enter file path for species B:')
requested_path_C = input('Enter file path for species C:')
requested_path_D = input('Enter file path for species D:')
```

This outputs:

```
Enter file path for species A:H:\Bioinformatics_Resit\Species_A.txt
Enter file path for species B:H:\Bioinformatics_Resit\Species_B.txt
Enter file path for species C:H:\Bioinformatics_Resit\Species_C.txt
Enter file path for species D:H:\Bioinformatics_Resit\Species_D.txt
```

```python
n = int(input('Enter your chosen value for n:'))
```

Which outputs:

```
Enter your chosen value for n:7
```

```python
# Raw string for the regex separator avoids an invalid-escape warning
df1 = pd.read_csv(requested_path_A, sep=r'\s+', names=["K-mers A", "Count A"])
df2 = pd.read_csv(requested_path_B, sep=r'\s+', names=["K-mers B", "Count B"])
df3 = pd.read_csv(requested_path_C, sep=r'\s+', names=["K-mers C", "Count C"])
df4 = pd.read_csv(requested_path_D, sep=r'\s+', names=["K-mers D", "Count D"])

# Keep only the n most frequent k-mers of each species
df1 = df1.nlargest(n, ['Count A'])
df2 = df2.nlargest(n, ['Count B'])
df3 = df3.nlargest(n, ['Count C'])
df4 = df4.nlargest(n, ['Count D'])
```

scidam wrote Jun-28-2019, 12:10 AM: Please post all code, output and errors (in their entirety) between their respective tags. I did it for you this time.

#5 scidam, Jun-28-2019, 04:15 AM

OK, I'll help you with the Jaccard distance. In general, the Jaccard distance is 1 - Jaccard similarity, where Jaccard similarity is measure(intersection of two sets) / measure(union of two sets). So we need to define how to compute these measures. The definition of the measure can differ from case to case; it depends on the specifics of the problem. Usually, the number of items (the cardinality) is used as the measure of a finite set. Here, as it happens, we have multisets, i.e.
each set consists of items that occur in it multiple times, e.g. the 'ACCT' fragment occurs 3 times, etc. It seems natural to define the total number of items (including repetitions) as the measure of the set. Additionally, we need to define the intersection of two multisets. Suppose we have a multiset `A = {'X': 4, 'Y': 2, 'Z': 4}` ('X' is included in A 4 times, etc.) and `B = {'X': 2, 'Y': 3, 'D': 7}`. What would the intersection of these sets be? Normally, it is a new multiset: `{'X': min(4, 2), 'Y': min(2, 3)}`, i.e. `{'X': 2, 'Y': 2}`.

I posted below an almost working implementation of the Jaccard distance of two multisets represented as Pandas dataframes:

```python
import numpy as np
import pandas as pd


def get_jaccard(df1, df2, n=3):
    """Return Jaccard distance between two dfs of specified form

    Parameters
    ==========

        :param df1: Pandas data frame
        :param df2: Pandas data frame
        :param n: an integer, the number of most frequent ... to use

    Notes
    =====

    df1, df2 assumed to have the following form:

    df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1: [4, 8]})
    df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC', 'AGT'], 1: [3, 4, 8, 8]})

    Expected value for df1 and df2:
        get_jaccard(df1, df2, n=3) should return: 1 - 4 / (8 + 8 + 8 + 4)

    Explanation
    -----------

    df2 3 most frequent features are ['CCTTGGA', 'ACC', 'AGT']
    df1 3 most frequent features are ['AACCTTGG', 'CCTTGGA']
    common features: ['CCTTGGA']

    Jaccard = measure(intersection) / measure(union)
    Let "measure = the number of fragments"

    'CCTTGGA' count in df2 = 4, 'CCTTGGA' count in df1 = 8,
    Measure of intersection: min(4, 8) = 4
    If we had several common fragments, we would compute their sum,
    e.g. min(a1, b1) + min(a2, b2) etc. Here we have only one: 'CCTTGGA'.

    Measure of union: count of only df1 features
                      + count of only df2 features
                      + max(counts of common features)
    # Note: we consider only the 3 most frequent features!

    count of only df1 features: 4 (AACCTTGG)
    count of only df2 features: 8 (ACC) + 8 (AGT)
    max(counts of common features): max(4, 8) = 8

    So, we get: Jaccard similarity = 4 / (8 + 8 + 8 + 4)
    and, finally: Jaccard dist. = 1 - Jaccard similarity

    Pandas and NumPy assumed to be imported as pd and np.
    """
    d1 = df1.sort_values(by=1, ascending=False).head(n)
    d2 = df2.sort_values(by=1, ascending=False).head(n)
    common = np.intersect1d(d1.iloc[:, 0].values, d2.iloc[:, 0].values)
    d1_only_features = set(d1.iloc[:, 0].values) - set(common)
    d2_only_features = set(d2.iloc[:, 0].values) - set(common)
    # Counts of the shared k-mers, sorted by k-mer so that a and b line up
    a = d1.loc[d1.iloc[:, 0].isin(common)].sort_values(by=0).iloc[:, 1].values
    b = d2.loc[d2.iloc[:, 0].isin(common)].sort_values(by=0).iloc[:, 1].values
    comm_measure = np.vstack([a, b]).min(axis=0).sum()
    comm_measure_max = np.vstack([a, b]).max(axis=0).sum()
    d2_only_measure = ...  # fill these lines
    d1_only_measure = ...  # fill these lines
    total = d1_only_measure + d2_only_measure + comm_measure_max
    return 1 - comm_measure / total
```

However, this forum is an educational resource, so you need to complete the code yourself.
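For checking results by hand, here is a minimal pandas-free sketch of the two definitions discussed in this thread: the set-based distance from the assignment, (unique - shared) / unique, and the count-weighted multiset variant from the docstring above. It is not a completion of `get_jaccard`. All function names (`top_counts`, `jaccard_distance`, `multiset_jaccard_distance`, `most_similar_pair`) and the sample k-mers are made up for illustration, and each species is assumed to be already loaded as a plain `{kmer: count}` dictionary.

```python
from collections import Counter
from itertools import combinations


def top_counts(counts, n):
    """Keep only the n most frequent k-mers of a {kmer: count} mapping."""
    return dict(Counter(counts).most_common(n))


def jaccard_distance(a, b):
    """Set-based distance from the assignment: (unique - shared) / unique."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return (len(union) - len(a & b)) / len(union)


def multiset_jaccard_distance(c1, c2):
    """Count-weighted variant: 1 - sum(min counts) / sum(max counts)."""
    kmers = set(c1) | set(c2)
    shared = sum(min(c1.get(k, 0), c2.get(k, 0)) for k in kmers)
    total = sum(max(c1.get(k, 0), c2.get(k, 0)) for k in kmers)
    return 1 - shared / total if total else 0.0


def most_similar_pair(species, n):
    """Return (distance, name1, name2) for the closest pair of species.

    `species` maps a species name to its {kmer: count} dictionary and
    must contain at least two entries.
    """
    tops = {name: set(top_counts(c, n)) for name, c in species.items()}
    return min(
        (jaccard_distance(tops[x], tops[y]), x, y)
        for x, y in combinations(sorted(tops), 2)
    )


if __name__ == "__main__":
    # Hypothetical k-mers: 8 unique in total, 2 shared, as in the assignment
    a = {"AAA", "AAC", "AAG", "AAT", "ACA"}
    b = {"AAA", "AAC", "CCC", "CCG", "CCT"}
    print(jaccard_distance(a, b))  # 0.75

    # The example from the docstring above, restricted to the top 3 k-mers
    df1 = {"AACCTTGG": 4, "CCTTGGA": 8}
    df2 = {"AACCTTGG": 3, "CCTTGGA": 4, "ACC": 8, "AGT": 8}
    print(multiset_jaccard_distance(top_counts(df1, 3), top_counts(df2, 3)))
```

The two definitions generally give different numbers for the same data, so it is worth checking which one the assignment asks for. The wording in the first post (k-mer types, not counts) points to the set-based one.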
