Python Forum

Hi,

I am trying to parallelise a function that will loop over many millions of elements. It calculates the Pearson correlation of every column pair. As the dataframe is huge, I have added a threshold check so that only column pairs with a correlation at or above a certain threshold are kept. This keeps the problem tractable, since only the relevant column pairs are stored, but I would also like to run it in parallel. The function and the code to run it in parallel are below. A few points:

- The parallel implementation is slower than running without it!
- I am running on a subset of 1000 columns to test this first, before running on every combination of columns.
- I am using a 2018 MacBook Pro with 12 cores. Core usage doesn't increase when running!

Any help, fix, or improvement to my code is much appreciated.

import pandas as pd
import numpy as np
from joblib import Parallel, delayed
from tqdm import tqdm
from itertools import combinations

def pearsons(combination_list, counts):
    # combination_list is a pair of column names; counts is the dataframe.
    correlations = {}
    corr = np.corrcoef(counts.loc[:, combination_list[0]], counts.loc[:, combination_list[1]])[1, 0]
    # Keep the pair only if its correlation reaches the threshold.
    if corr >= 0.8:
        correlations[combination_list[0] + '_' + combination_list[1]] = corr
    return correlations

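# Test on the first 1000 columns (the end of an iloc slice is exclusive).
subset = counts_subset.iloc[:, 0:1000]
# combinations (imported above) returns an iterator, so build a list to give it a len().
cols_list = list(combinations(subset.columns, 2))

input_ = tqdm(cols_list, total=len(cols_list))
# Parallel returns one dict per column pair, in input order.
results = Parallel(n_jobs=12)(delayed(pearsons)(i, subset) for i in input_)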

amjass12
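
A single np.corrcoef call on two columns does very little work, so the per-task overhead of dispatching each pair to a worker likely outweighs the calculation itself. As a rough sketch of an alternative (not a tested fix for the data above), the whole correlation matrix can be computed in one vectorised call and then filtered by the same 0.8 threshold. The function name correlated_pairs and the small demo dataframe are made up for illustration, and the sketch assumes the correlation matrix fits in memory.

```python
import numpy as np
import pandas as pd


def correlated_pairs(counts, threshold=0.8):
    """Return {"colA_colB": r} for every column pair with Pearson r >= threshold.

    Hypothetical helper for illustration; assumes the full
    (n_columns x n_columns) correlation matrix fits in memory.
    """
    values = counts.to_numpy(dtype=float)
    # rowvar=False treats each column as a variable, so one call gives
    # the correlation of every column with every other column.
    corr = np.corrcoef(values, rowvar=False)
    # k=1 selects the strict upper triangle: each pair appears once
    # and the diagonal (self-correlation) is skipped.
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    pair_corr = corr[rows, cols]
    # NaN compares as False, so pairs with an undefined correlation are dropped.
    keep = pair_corr >= threshold
    names = counts.columns
    return {
        f"{names[i]}_{names[j]}": float(r)
        for i, j, r in zip(rows[keep], cols[keep], pair_corr[keep])
    }


# Made-up demo data: "b" is a linear function of "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": rng.normal(size=200)})
result = correlated_pairs(demo)
print(result)
```

With the dataframe from the post, this would be called as correlated_pairs(counts_subset.iloc[:, 0:1000]). For the full set of columns the matrix may be too large to hold at once, in which case the same idea could be applied to blocks of columns, and those blocks would be a more natural unit of work for joblib than individual pairs.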