Could anyone help me get the jaccard distance between my dataframes please? :)

a_real_phoenix · (This post was last modified: Jun-27-2019, 06:01 PM by a_real_phoenix.)

Hi there, I'm doing a piece of homework and I'm almost done, but the last part has me pretty stumped. I have four files to work with, and I have to get the top n values from each file, n being a user-defined number, and then calculate the Jaccard distance between each pair of files. So far I've done everything except the Jaccard distance.

The Jaccard distance is defined as (the number of unique values across the pair of files - the number of values that occur in both files) / the number of unique values across the pair of files.

So, for example, if the two files had ten values between them, but two of those values occur in both files, there would be 8 unique values and the distance would be (8-2)/8 = 0.75.
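
The definition above maps fairly directly onto Python sets: the unique values across both files are the union, and the values that occur in both are the intersection. The following is only a minimal sketch; the function name jaccard_distance and the example k-mers are made up for illustration, and returning 0.0 for two empty inputs is just a convention picked here.

```python
def jaccard_distance(values_a, values_b):
    """Jaccard distance between two collections, each treated as a set."""
    set_a, set_b = set(values_a), set(values_b)
    union = set_a | set_b      # unique values across the pair
    shared = set_a & set_b     # values that occur in both
    if not union:
        return 0.0             # assumed convention for two empty inputs
    return (len(union) - len(shared)) / len(union)


# Hypothetical k-mers: five per file, two of them shared, so 8 unique values.
file_a = ["AAA", "AAC", "AAG", "AAT", "ACA"]
file_b = ["AAA", "AAC", "CCC", "CCG", "CCT"]

print(jaccard_distance(file_a, file_b))  # (8 - 2) / 8 = 0.75
```

This reproduces the (8-2)/8 example: ten values in total, two shared, eight unique.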

I have made four separate dataframes out of the four files, each one only showing n top values. This is where I don't really know how to proceed. I have tried merging the dataframes into one big dataframe with only the relevant columns (the K-mers A-D columns) but that didn't work out for me. I don't know how to retrieve the number of unique or reoccurring values, especially when the dataframes are separated, so I'm quite stuck. I'd really appreciate any help :)

Here's my code:

import pandas as pd
import numpy as np

requested_path_A = input('Enter file path for species A:')
requested_path_B = input('Enter file path for species B:')
requested_path_C = input('Enter file path for species C:')
requested_path_D = input('Enter file path for species D:')

Outputs:

Enter file path for species A:H:\Bioinformatics_Resit\Species_A.txt
Enter file path for species B:H:\Bioinformatics_Resit\Species_B.txt
Enter file path for species C:H:\Bioinformatics_Resit\Species_C.txt
Enter file path for species D:H:\Bioinformatics_Resit\Species_D.txt

n = int(input('Enter your chosen value for n:'))

Outputs:

Enter your chosen value for n:7

# Raw strings (r'\s+') stop Python from treating \s as an invalid escape sequence.
df1 = pd.read_csv(requested_path_A, sep=r'\s+', names=["K-mers A", "Count A"])
df2 = pd.read_csv(requested_path_B, sep=r'\s+', names=["K-mers B", "Count B"])
df3 = pd.read_csv(requested_path_C, sep=r'\s+', names=["K-mers C", "Count C"])
df4 = pd.read_csv(requested_path_D, sep=r'\s+', names=["K-mers D", "Count D"])

# Keep only the n rows with the highest counts in each dataframe.
df1 = df1.nlargest(n, ['Count A'])
df2 = df2.nlargest(n, ['Count B'])
df3 = df3.nlargest(n, ['Count C'])
df4 = df4.nlargest(n, ['Count D'])
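
One possible way to go from the four top-n dataframes to the pairwise distances, without merging them, is to turn each K-mers column into a set and loop over every pair with itertools.combinations. This is a sketch under assumptions, not a definitive solution: the small dataframes below are hypothetical stand-ins for df1 to df4 after nlargest, and the names frames, kmer_sets and distances are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stand-ins for the top-n dataframes (n = 3 here).
df1 = pd.DataFrame({"K-mers A": ["AAA", "AAC", "AAG"], "Count A": [9, 7, 5]})
df2 = pd.DataFrame({"K-mers B": ["AAA", "AAC", "CCC"], "Count B": [8, 6, 4]})
df3 = pd.DataFrame({"K-mers C": ["GGG", "GGT", "GGA"], "Count C": [9, 3, 2]})
df4 = pd.DataFrame({"K-mers D": ["AAA", "GGG", "TTT"], "Count D": [7, 6, 1]})

frames = {"A": df1, "B": df2, "C": df3, "D": df4}

# One set of k-mers per species, taken from that species' K-mers column.
kmer_sets = {name: set(df[f"K-mers {name}"]) for name, df in frames.items()}

# Jaccard distance for each of the six unordered pairs of species.
distances = {}
for (name_x, set_x), (name_y, set_y) in combinations(kmer_sets.items(), 2):
    union = set_x | set_y     # unique values across the pair
    shared = set_x & set_y    # values that occur in both
    distances[(name_x, name_y)] = (
        (len(union) - len(shared)) / len(union) if union else 0.0
    )

for pair, distance in distances.items():
    print(pair, distance)
```

Converting to sets sidesteps the merge entirely, since only the k-mer values matter for the distance and not the counts or the row order.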

Edit: I forgot how forums work and didn't realise my comment on a previous thread would bump it up :/
