Python Forum
Could anyone help me get the jaccard distance between my dataframes please? :)
Hi there, I'm doing a piece of homework and I'm almost done, but the last part has me pretty stumped. I have four files to work with. I have to get the top n values from each file, where n is a user-defined number, and then calculate the Jaccard distance between each pair of files. So far I've done everything except the Jaccard distance.

The Jaccard distance is defined as (the number of unique values across the pair of files - the number of values that occur in both files) / the number of unique values across the pair of files.

So, for example, if the two files held ten values in total and two of them occurred in both files, there would be eight unique values, and the distance would be (8 - 2) / 8 = 0.75.
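
The definition above maps directly onto Python sets: the unique values across both files are the union, and the values that occur in both are the intersection. The following is a minimal sketch, not necessarily the form the homework expects. The function name `jaccard_distance` and the sample k-mers are made up for illustration, and returning 0.0 for two empty inputs is an assumption.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b          # unique values across both inputs
    if not union:
        return 0.0                 # assumption: two empty sets count as identical
    shared = set_a & set_b         # values that occur in both inputs
    return (len(union) - len(shared)) / len(union)


# Hypothetical data: ten values in total, two of them shared.
file_1 = ["AAA", "AAC", "AAG", "AAT", "ACA"]
file_2 = ["AAA", "AAC", "CCC", "CCG", "CCT"]
print(jaccard_distance(file_1, file_2))  # (8 - 2) / 8 = 0.75
```

Converting to sets first also discards any duplicates within a single file, which is what "unique values" requires.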

I have made four separate dataframes out of the four files, each one showing only the top n values. This is where I don't really know how to proceed. I have tried merging the dataframes into one big dataframe with only the relevant columns (the K-mers A-D columns), but that didn't work out for me. I don't know how to retrieve the number of unique or recurring values, especially when the dataframes are separate, so I'm quite stuck. I'd really appreciate any help :)

Here's my code:

import pandas as pd
import numpy as np

requested_path_A = input('Enter file path for species A:')
requested_path_B = input('Enter file path for species B:')
requested_path_C = input('Enter file path for species C:')
requested_path_D = input('Enter file path for species D:')
Outputs:

Enter file path for species A:H:\Bioinformatics_Resit\Species_A.txt
Enter file path for species B:H:\Bioinformatics_Resit\Species_B.txt
Enter file path for species C:H:\Bioinformatics_Resit\Species_C.txt
Enter file path for species D:H:\Bioinformatics_Resit\Species_D.txt
n = int(input('Enter your chosen value for n:'))
Outputs:

 Enter your chosen value for n:7 
# Each file has two whitespace-separated columns: a k-mer and its count.
# The raw string r'\s+' avoids Python's invalid-escape warning for '\s'.
df1 = pd.read_csv(requested_path_A, sep=r'\s+', names=["K-mers A", "Count A"])
df2 = pd.read_csv(requested_path_B, sep=r'\s+', names=["K-mers B", "Count B"])
df3 = pd.read_csv(requested_path_C, sep=r'\s+', names=["K-mers C", "Count C"])
df4 = pd.read_csv(requested_path_D, sep=r'\s+', names=["K-mers D", "Count D"])

# Keep only the n rows with the highest counts in each dataframe.
df1 = df1.nlargest(n, 'Count A')
df2 = df2.nlargest(n, 'Count B')
df3 = df3.nlargest(n, 'Count C')
df4 = df4.nlargest(n, 'Count D')
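
One possible way to go from the four top-n dataframes to pairwise distances, without merging them, is to turn each k-mer column into a set and compare the sets two at a time. This is a sketch under assumptions: the small dataframes below are made-up stand-ins for the real top-n results, and `jaccard_distance`, `kmers`, and `distances` are illustrative names rather than anything from the assignment.

```python
from itertools import combinations

import pandas as pd


def jaccard_distance(a, b):
    """(|union| - |intersection|) / |union| for two collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return (len(union) - len(set_a & set_b)) / len(union)


# Hypothetical stand-ins for df1..df4 after nlargest(n, ...).
df1 = pd.DataFrame({"K-mers A": ["AAA", "AAC", "AAG"], "Count A": [9, 7, 5]})
df2 = pd.DataFrame({"K-mers B": ["AAA", "CCC", "AAG"], "Count B": [8, 6, 4]})
df3 = pd.DataFrame({"K-mers C": ["GGG", "CCC", "TTT"], "Count C": [7, 5, 3]})
df4 = pd.DataFrame({"K-mers D": ["AAA", "AAC", "AAG"], "Count D": [6, 4, 2]})

# One set of k-mers per species, keyed by a short label.
kmers = {
    "A": set(df1["K-mers A"]),
    "B": set(df2["K-mers B"]),
    "C": set(df3["K-mers C"]),
    "D": set(df4["K-mers D"]),
}

# combinations(..., 2) yields each unordered pair once: AB, AC, AD, BC, BD, CD.
distances = {
    (x, y): jaccard_distance(kmers[x], kmers[y])
    for x, y in combinations(kmers, 2)
}

for (x, y), d in distances.items():
    print(f"{x}-{y}: {d:.3f}")
```

Working with sets sidesteps the differing column names ("K-mers A", "K-mers B", ...), which is one reason a direct merge of the four dataframes can be awkward.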
Edit: I forgot how forums work and didn't realise my comment on a previous thread would bump it up :/