 Could anyone help me get the jaccard distance between my dataframes please? :)
Hi there, I'm doing a piece of homework and I'm almost done, but the last part has me pretty stumped. I have four files to work with, and I have to get n top values from each file, n being a user defined number, and then calculate the jaccard distance between each pair of files. So far I've done everything except the jaccard distance.

The Jaccard distance is defined as (the number of unique values across the pair of files - the number of values that occur in both files) / the number of unique values across the pair of files.

So, for example, if the two files held ten values in total and two of them occurred in both files, there would be 8 unique values, giving (8-2)/8 = 0.75.
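
A minimal sketch of that definition using plain Python sets, assuming each file's values can be put into a list. The function name `jaccard_distance` and the sample k-mers are made up for illustration, and treating two empty inputs as distance 0 is an assumption rather than part of the definition above.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b          # unique values across both collections
    shared = a & b         # values that occur in both
    if not union:
        return 0.0         # assumption: two empty collections count as identical
    return (len(union) - len(shared)) / len(union)


# Made-up values: ten in total, two shared, so eight unique
file_1 = ["AAA", "AAC", "AAG", "AAT", "ACA"]
file_2 = ["AAA", "AAC", "CCC", "GGG", "TTT"]

print(jaccard_distance(file_1, file_2))  # (8 - 2) / 8 = 0.75
```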

I have made four separate dataframes out of the four files, each one only showing n top values. This is where I don't really know how to proceed. I have tried merging the dataframes into one big dataframe with only the relevant columns (the K-mers A-D columns) but that didn't work out for me. I don't know how to retrieve the number of unique or reoccurring values, especially when the dataframes are separated, so I'm quite stuck. I'd really appreciate any help :)

Here's my code:

import pandas as pd
import numpy as np

requested_path_A = input('Enter file path for species A:')
requested_path_B = input('Enter file path for species B:')
requested_path_C = input('Enter file path for species C:')
requested_path_D = input('Enter file path for species D:')
# Example input:
#   Enter file path for species A:H:\Bioinformatics_Resit\Species_A.txt
#   Enter file path for species B:H:\Bioinformatics_Resit\Species_B.txt
#   Enter file path for species C:H:\Bioinformatics_Resit\Species_C.txt
#   Enter file path for species D:H:\Bioinformatics_Resit\Species_D.txt

n = int(input('Enter your chosen value for n:'))
# Example input:
#   Enter your chosen value for n:7

# Read each whitespace-separated file into two named columns
# (raw string so that \s is passed to the regex engine unchanged)
df1 = pd.read_csv(requested_path_A, sep=r'\s+', names=["K-mers A", "Count A"])
df2 = pd.read_csv(requested_path_B, sep=r'\s+', names=["K-mers B", "Count B"])
df3 = pd.read_csv(requested_path_C, sep=r'\s+', names=["K-mers C", "Count C"])
df4 = pd.read_csv(requested_path_D, sep=r'\s+', names=["K-mers D", "Count D"])

# Keep only the n rows with the largest counts in each dataframe
df1 = df1.nlargest(n, 'Count A')
df2 = df2.nlargest(n, 'Count B')
df3 = df3.nlargest(n, 'Count C')
df4 = df4.nlargest(n, 'Count D')
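
One possible way to go from here without merging anything: turn each dataframe's k-mer column into a set and compare the sets pair by pair. This is only a sketch. The small dataframes below are made-up stand-ins for df1 to df4 after `nlargest`, and `jaccard_distance`, `top`, `kmer_sets` and `distances` are hypothetical names.

```python
from itertools import combinations

import pandas as pd


def jaccard_distance(a, b):
    """(unique values - shared values) / unique values, on sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return (len(union) - len(a & b)) / len(union)


# Made-up stand-ins for df1..df4 after nlargest (n = 3 here)
top = {
    "A": pd.DataFrame({"K-mers A": ["AAA", "AAC", "AAG"], "Count A": [9, 7, 5]}),
    "B": pd.DataFrame({"K-mers B": ["AAA", "AAC", "CCC"], "Count B": [8, 6, 4]}),
    "C": pd.DataFrame({"K-mers C": ["AAA", "GGG", "TTT"], "Count C": [7, 5, 3]}),
    "D": pd.DataFrame({"K-mers D": ["CGC", "GCG", "TAT"], "Count D": [6, 4, 2]}),
}

# One set of k-mers per species, taken from the matching "K-mers X" column
kmer_sets = {name: set(df[f"K-mers {name}"]) for name, df in top.items()}

# Every unordered pair of species: A-B, A-C, A-D, B-C, B-D, C-D
distances = {
    (x, y): jaccard_distance(kmer_sets[x], kmer_sets[y])
    for x, y in combinations(kmer_sets, 2)
}

for (x, y), d in distances.items():
    print(f"{x} vs {y}: {d:.3f}")
```

With the real data, the `top` dictionary would hold df1 to df4 instead of the toy frames. Working with sets avoids the merge entirely, since only the k-mer columns matter for the distance.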
Edit: I forgot how forums work and didn't realise my comment on a previous thread would bump it up :/
