
 Code improvement: groupby and operation on conditionals
#1
Hello. I have data in the following format. I am looking to groupby the location, and then find the maximum value for each compound where the corresponding flag is 'U'.

Quote:
location  compound1  compound1_flag  compound2  compound2_flag  compound3  compound3_flag
A         0.36       NaN             0.4        U               0.2        U
B         5          U               2.2        NaN             8          U
B         9.8        J               3.2        U               3300       NaN
B         7.2        J               1800       NaN             280        NaN
C         9          J               3200       NaN             700        NaN
C         4.5        NaN             6.1        NaN             1000       NaN
C         7.8        NaN             0.16       U               0.47       J
D         0.89       NaN             0.30       U               6.1        NaN
D         14         NaN             0.16       U               0.59       NaN
E         2          U               5.6        NaN             200        NaN

I have a halfway-working solution, but I think it could be more readable, and hopefully there's a better way to get the result into tabular format.

The following prints the desired data, but I'm guessing there's a clearer method than nested for loops for finding the values associated with 'U' flags.
analytes = ['compound1', 'compound2', 'compound3']
flags = [f'{x}_flag' for x in analytes]
nondetect_dict = dict(zip(flags, analytes))  # flag column -> value column
by_location = data.groupby('location')
for key, group in by_location:
    for x in flags:
        # rows in this group whose flag is 'U', then the max of the matching value column
        mask = group[x].str.contains('U', regex=False, na=False)
        print(key, nondetect_dict[x], group.loc[mask, nondetect_dict[x]].max())
I tried to convert this into a dictionary, with a list comprehension replacing the second for loop, so that I could then turn it into a dataframe, but this throws a syntax error:
max_nondetect = {}
for key, group in by_location:
    max_nondetect[key] = [nondetect_dict[x], by_location.get_group(key).loc[data[x].str.contains('U', regex=False, na=False), nondetect_dict[x]].max() for x in flags]
pd.DataFrame.from_dict(max_nondetect, orient='index')
Any thoughts on a more efficient approach? Thank you in advance.
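(Aside, for anyone hitting the same error: the SyntaxError most likely comes from the bare `a, b` pair inside the list comprehension; a comprehension element that is a pair must be wrapped in parentheses as a tuple. A minimal standalone illustration with stand-in data:)

```python
flags = ['compound1_flag', 'compound2_flag']
nondetect_dict = {f: f.replace('_flag', '') for f in flags}

# pairs = [nondetect_dict[x], x for x in flags]    # SyntaxError: bare pair
pairs = [(nondetect_dict[x], x) for x in flags]    # valid: each element is a tuple
```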
Quote
#2
I think the main problem is the way the data is structured. I assume it is in a file.

I would:

- read the data from the file into a four-column structure (compound, location, value, flag)
- use pd.DataFrame to get the data needed

import pandas as pd

with open('max_value_compound.txt', 'r') as f:
    data = []
    headers = next(f)  # skip the header row
    for row in f:
        location, *rest = row.split()
        # rest holds (value, flag) pairs for compounds 1..3
        for i, compound in zip(range(0, len(rest), 2), range(1, 4)):
            data.append([compound, location, *rest[i:i+2]])

df = pd.DataFrame(data, columns=['compound', 'location', 'value', 'flag'])
df['value'] = pd.to_numeric(df['value'])  # values arrive as strings; compare as numbers, not lexicographically

# both give the same output/result
df[df['flag'] == 'U'].pivot_table(index=['location', 'compound'], values=['value'], aggfunc=max)
df[df['flag'] == 'U'].groupby(['location', 'compound']).agg({'value': 'max'})


Output:
                  value
location compound
A        2          0.4
         3          0.2
B        1            5
         2          3.2
         3            8
C        2         0.16
D        2         0.30
E        1            2
If location is not needed then one can skip 'location':

df[df['flag'] == 'U'].pivot_table(index=['compound'], values=['value'], aggfunc=max) 
df[df['flag'] == 'U'].groupby(['compound']).agg({'value': 'max'})  
which will give:

Output:
         value
compound
1            5
2          3.2
3            8
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Quote
#3
Thank you! This is fantastic. Yes, changing the way the data is structured makes it much better to interact with.

If I can ask a followup question: the data is in an existing dataframe. Is there a good way to get it into that four-column structure starting from a dataframe? From some initial digging, it looks like one approach may be to create a MultiIndex of all the compounds and flags and then stack the data. Does that sound like a good method to investigate?
Quote
#4
I've tried to replicate this reformatting from a dataframe instead of a text file, but have run into a few problems.

First, to create the dataframe from this example:
import pandas as pd
from io import StringIO

TESTDATA = StringIO("""location compound1 compound1_flag compound2 compound2_flag compound3 compound3_flag
A 0.36 NaN 0.4 U 0.2 U
B 5 U 2.2 NaN 8 U
B 9.8 J 3.2 U 3300 NaN
B 7.2 J 1800 NaN 280 NaN
C 9 J 3200 NaN 700 NaN
C 4.5 NaN 6.1 NaN 1000 NaN
C 7.8 NaN 0.16 U 0.47 J
D 0.89 NaN 0.30 U 6.1 NaN
D 14 NaN 0.16 U 0.59 NaN
E 2 U 5.6 NaN 200 NaN""")

df1 = pd.read_csv(TESTDATA, sep=" ")
Then I created a MultiIndex so I could refer to all compounds and all flags at once.
I got errors when I tried to build a MultiIndex over only some of the columns, and none of the pandas MultiIndex examples cover this usage, so I had to include 'location' in the labels as well.

cols = list(df1.columns)
labels = ['Location', 'Compound', 'Flag', 'Compound', 'Flag', 'Compound', 'Flag']
tuples = list(zip(labels, cols))  # pair each top-level label with the original column name
df1.columns = pd.MultiIndex.from_tuples(tuples)


When I try to reuse the previous code to find the max of nondetects:
df1[df1['Flag'] == 'U'].groupby(['Location', 'Compound']).agg({'value': 'max'})
Output:
ValueError: cannot join with no overlapping index names
Using stack gets close to the previous format, but location comes through as its own entry in the inner index with NaN elsewhere, instead of being repeated for each compound. This isn't the result I expected. Also, the values of all compounds end up stacked under 'Compound' rather than under the individual compound names. So this is also a mistaken approach.

df1.stack()
Output:
                  Compound Flag Location
0 compound1           0.36  NaN      NaN
  compound2           0.40  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3           0.20  NaN      NaN
  compound3_flag       NaN    U      NaN
  location             NaN  NaN        A
1 compound1           5.00  NaN      NaN
  compound1_flag       NaN    U      NaN
  compound2           2.20  NaN      NaN
  compound3           8.00  NaN      NaN
  compound3_flag       NaN    U      NaN
  location             NaN  NaN        B
2 compound1           9.80  NaN      NaN
  compound1_flag       NaN    J      NaN
  compound2           3.20  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3        3300.00  NaN      NaN
  location             NaN  NaN        B
3 compound1           7.20  NaN      NaN
  compound1_flag       NaN    J      NaN
  compound2        1800.00  NaN      NaN
  compound3         280.00  NaN      NaN
  location             NaN  NaN        B
4 compound1           9.00  NaN      NaN
  compound1_flag       NaN    J      NaN
  compound2        3200.00  NaN      NaN
  compound3         700.00  NaN      NaN
  location             NaN  NaN        C
5 compound1           4.50  NaN      NaN
  compound2           6.10  NaN      NaN
  compound3        1000.00  NaN      NaN
  location             NaN  NaN        C
6 compound1           7.80  NaN      NaN
  compound2           0.16  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3           0.47  NaN      NaN
  compound3_flag       NaN    J      NaN
  location             NaN  NaN        C
7 compound1           0.89  NaN      NaN
  compound2           0.30  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3           6.10  NaN      NaN
  location             NaN  NaN        D
8 compound1          14.00  NaN      NaN
  compound2           0.16  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3           0.59  NaN      NaN
  location             NaN  NaN        D
9 compound1           2.00  NaN      NaN
  compound1_flag       NaN    U      NaN
  compound2           5.60  NaN      NaN
  compound3         200.00  NaN      NaN
  location             NaN  NaN        E
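For what it's worth, here is a sketch of one way to make stack behave (the names `wide`, `long`, and `kind` are mine, and the data below is a trimmed stand-in): set location as the row index first, so the column MultiIndex covers only the compound/flag columns, and then stack just the compound level.

```python
import pandas as pd
from io import StringIO

TESTDATA = StringIO("""location compound1 compound1_flag compound2 compound2_flag
A 0.36 NaN 0.4 U
B 5 U 2.2 NaN""")
df1 = pd.read_csv(TESTDATA, sep=" ")

# Move location out of the columns before building the MultiIndex,
# so stack only reshapes the compound/flag columns.
wide = df1.set_index('location')
wide.columns = pd.MultiIndex.from_tuples(
    [(c.replace('_flag', ''), 'flag') if c.endswith('_flag') else (c, 'value')
     for c in wide.columns],
    names=['compound', 'kind'])

# Stack the compound level into the row index -> long (four-column) format.
long = wide.stack(level='compound').reset_index()
print(long[long['flag'] == 'U'].groupby(['location', 'compound'])['value'].max())
```

With location in the row index, stack no longer mixes it into the stacked values, and the earlier filter-and-groupby works on the long frame directly.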
Quote
#5
pandas is like a Swiss Army knife, and it is important to focus on what you want to achieve.

If you still want to "find the maximum value for each compound where the corresponding flag is 'U'", then maybe you can just build a simple filter and keep the dataframe as it is.


import pandas as pd

df = pd.read_csv('max_compound_data.txt', delimiter=' ')


def filter_and_agg(dataframe, flag='U', compound_id=1, agg_type='max'):
    compound, status = f'compound{compound_id}', f'compound{compound_id}_flag'
    df = dataframe[[compound, status]].dropna()
    # return, not yield: the filter produces a single value
    return df[df[status] == flag].agg({compound: agg_type}).values[0]

for i in range(1, 4):
    print(f'{i}: {filter_and_agg(df, compound_id=i)}')

# outputs:
# 1: 5.0
# 2: 3.2
# 3: 8.0

for i in range(1, 4):
    print(f'{i}: {filter_and_agg(df, flag="J", compound_id=i, agg_type="min")}')

# outputs:
# 1: 7.2
# 2: nan
# 3: 0.47

filter_and_agg(df, compound_id=2)                               # 3.2
filter_and_agg(df, compound_id=3, agg_type='sum')               # 8.2
filter_and_agg(df, flag='J', compound_id=1, agg_type='min')     # 7.2
Quote
#6
Thank you! I changed the code somewhat to incorporate the location and create a dictionary of the values to lookup for each compound per location. The following works, but I don't see the right strategy for the final piece of code.

def filter_and_agg(dataframe, flag='U', location='A', compound_id=1, agg_type='max'):
    compound, status = f'compound{compound_id}', f'compound{compound_id}_flag'
    mask = (dataframe[status] == flag) & (dataframe['location'] == location)
    return dataframe.loc[mask, compound].agg(agg_type)

max_dict = {}
for x in set(df['location']):
    max_dict[x] = {}
    for i in range(1, 4):
        max_dict[x][f'compound{i}'] = f'{filter_and_agg(df, location=x, compound_id=i)}'
Output:
{'B': {'compound1': '5.0', 'compound2': '3.2', 'compound3': '8.0'}, 'E': {'compound1': '2.0', 'compound2': 'nan', 'compound3': 'nan'}, 'C': {'compound1': 'nan', 'compound2': '0.16', 'compound3': 'nan'}, 'A': {'compound1': 'nan', 'compound2': '0.4', 'compound3': '0.2'}, 'D': {'compound1': 'nan', 'compound2': '0.3', 'compound3': 'nan'}}
I would like to use this dictionary to look up, for each row's location, the floor value for each compound, and replace any value in the dataframe that is less than that floor.

I am trying approaches like this:
for location in set(list(df.location)):
    df1 = df[df['location']==location]
    for i in range(1, 4):
        df[np.where(df1[f'compound{i}'] < float(max_dict[location][f'compound{i}']))] = max_dict[location][f'compound{i}']
This returns the error
Error:
TypeError: unhashable type: 'numpy.ndarray'
This post https://stackoverflow.com/questions/4847...n-criteria covers a similar case and uses merge for conditional replacement, but I don't see a straightforward way to use merge while comparing 'less than' values. Any suggestions for changing this code block?
Quote
#7
I've found a solution for the final block of code, which sets a 'floor' (minimum) on the dataset from the table constructed previously.

for idx, row in df.iterrows():
    for analyte in ['compound1', 'compound2', 'compound3']:
        floor = float(max_dict[row['location']][analyte])  # a NaN floor leaves the value unchanged
        if df.loc[idx, analyte] < floor:
            df.loc[idx, analyte] = floor
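(The same floor can also be applied without iterrows, by mapping each row's location to its per-compound floor and using clip. A sketch assuming a max_dict of stringified floors in the shape built above; the data here is a small stand-in:)

```python
import pandas as pd

# Stand-in data, and a floor table in the same shape as the thread's max_dict.
df = pd.DataFrame({'location': ['A', 'A', 'B'],
                   'compound1': [0.1, 2.0, 7.0]})
max_dict = {'A': {'compound1': '0.5'}, 'B': {'compound1': 'nan'}}

for analyte in ['compound1']:
    # Per-row lower bound: look up this row's location in max_dict.
    floor = df['location'].map(lambda loc: float(max_dict[loc][analyte]))
    # A NaN floor means "no floor"; use -inf so clip leaves those rows alone.
    df[analyte] = df[analyte].clip(lower=floor.fillna(float('-inf')))
```

clip aligns the lower-bound Series with the column by index, so each row is floored against its own location's value.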
Thank you for your help, perfringo!
Quote
