Dec-15-2021, 03:18 PM
Hi,
I am trying to parallelise a function that loops over many millions of elements. It calculates the Pearson correlation of every column pair. As the dataframe is huge, I have added if/elif logic to keep only the column pairs whose correlation is above a certain threshold. This makes the problem tractable, but I would also like to run it in parallel. The function and the code to run it in parallel are below. A few points:
- The parallel implementation is slower than running without it!
- I am running on a subset of 1000 columns to test this first, before running on every combination of columns.
- I am using a 2018 MacBook Pro with 12 cores. Core usage doesn't increase when running!
Any help, fix or improvement to my code is much appreciated.
import pandas as pd
import numpy as np
from joblib import Parallel, delayed
from tqdm import tqdm
from itertools import combinations


def pearsons(combination_list, counts):
    # Pearson correlation of one column pair; keep it only if it is >= 0.8
    correlations = {}
    corr = np.corrcoef(counts.loc[:, combination_list[0]], counts.loc[:, combination_list[1]])[1, 0]
    if corr < 0.8:
        pass
    elif corr >= 0.8:
        correlations[combination_list[0] + '_' + combination_list[1]] = corr
    return correlations


cols = counts_subset.iloc[:, 0:999].columns
# combinations() returns an iterator, so build a list to get a length for tqdm
cols_list = list(combinations(cols, 2))
input_ = tqdm(cols_list, total=len(cols_list))
Parallel(n_jobs=12)(delayed(pearsons)(i, counts_subset.iloc[:, 0:999]) for i in input_)
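For comparison, here is a minimal sketch of the same thresholded pairwise correlation computed in one vectorised call instead of one task per pair. It is not a drop-in fix, only an illustration under assumptions: the helper name `high_correlations` and the small `demo` frame are hypothetical, and the columns are assumed to be numeric with string names. With one tiny `np.corrcoef` call per task, the cost of dispatching each job and passing the dataframe to the workers probably outweighs the calculation itself, which would explain the slowdown.

```python
import numpy as np
import pandas as pd


def high_correlations(counts, threshold=0.8):
    """Return {"colA_colB": r} for column pairs with Pearson r >= threshold."""
    names = counts.columns.to_numpy()
    # np.corrcoef treats rows as variables, so transpose to correlate columns.
    corr = np.corrcoef(counts.to_numpy(dtype=float).T)
    # Upper triangle without the diagonal: each unordered pair appears once.
    rows, cols = np.triu_indices(len(names), k=1)
    values = corr[rows, cols]
    keep = values >= threshold
    return {
        f"{names[i]}_{names[j]}": float(v)
        for i, j, v in zip(rows[keep], cols[keep], values[keep])
    }


# Hypothetical demo data: "b" is a noisy copy of "a", "c" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
result = high_correlations(demo)
```

The full correlation matrix holds one float per ordered column pair, so on the complete dataframe it may not fit in memory. In that case the same idea could be applied to blocks of columns at a time, and those blocks would be a more reasonable unit of work to hand to `Parallel` than single pairs.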