
 Code improvement: groupby and operation on conditionals
#1
Hello. I have data in the following format. I am looking to groupby the location, and then find the maximum value for each compound where the corresponding flag is 'U'.

Quote:
location  compound1  compound1_flag  compound2  compound2_flag  compound3  compound3_flag
A         0.36       NaN             0.4        U               0.2        U
B         5          U               2.2        NaN             8          U
B         9.8        J               3.2        U               3300       NaN
B         7.2        J               1800       NaN             280        NaN
C         9          J               3200       NaN             700        NaN
C         4.5        NaN             6.1        NaN             1000       NaN
C         7.8        NaN             0.16       U               0.47       J
D         0.89       NaN             0.30       U               6.1        NaN
D         14         NaN             0.16       U               0.59       NaN
E         2          U               5.6        NaN             200        NaN

I have a halfway-working solution, but I think it could be more readable, and hopefully there's a better way to get the result into tabular format.

The following prints the desired data, but I'm guessing there's a clearer method than nested for loops for finding the values associated with 'U' flags.
analytes = ['compound1', 'compound2', 'compound3']
flags = [f'{x}_flag' for x in analytes]
nondetect_dict = dict(zip(flags, analytes))  # flag column -> value column
by_location = data.groupby('location')
for key, group in by_location:
    for x in flags:
        # rows in this group whose flag is 'U', then the max of the matching value column
        mask = group[x].str.contains('U', regex=False, na=False)
        print(key, nondetect_dict[x], group.loc[mask, nondetect_dict[x]].max())
I tried to convert this into a dictionary, with a list comprehension replacing the second for loop, so that I could then turn it into a dataframe, but this throws a syntax error:
max_nondetect = {}
for key, group in by_location:
    max_nondetect[key] = [nondetect_dict[x], by_location.get_group(key).loc[data[x].str.contains('U', regex=False, na=False), nondetect_dict[x]].max() for x in flags]
pd.DataFrame.from_dict(max_nondetect, orient='index')
Any thoughts on a more efficient approach? Thank you in advance.
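(Aside, for anyone hitting the same error: the SyntaxError most likely comes from the bare `a, b` pair inside the list comprehension; a comprehension element that is a pair must be wrapped in parentheses as a tuple. A minimal standalone illustration with stand-in data:)

```python
flags = ['compound1_flag', 'compound2_flag']
nondetect_dict = {f: f.replace('_flag', '') for f in flags}

# pairs = [nondetect_dict[x], x for x in flags]    # SyntaxError: bare pair
pairs = [(nondetect_dict[x], x) for x in flags]    # valid: each element is a tuple
```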
Quote
#2
I think the main problem is the way the data is structured. I assume it is in a file.

I would:

- read the data from the file into a four-column structure (compound, location, value, flag)
- use pd.DataFrame to get the data needed

import pandas as pd

with open('max_value_compound.txt', 'r') as f:
    data = []
    headers = next(f)  # skip the header row
    for row in f:
        location, *rest = row.split()
        # rest holds (value, flag) pairs for compounds 1..3
        for i, compound in zip(range(0, len(rest), 2), range(1, 4)):
            data.append([compound, location, *rest[i:i+2]])

df = pd.DataFrame(data, columns=['compound', 'location', 'value', 'flag'])
df['value'] = pd.to_numeric(df['value'])  # values arrive as strings; compare as numbers, not lexicographically

# both give the same output/result
df[df['flag'] == 'U'].pivot_table(index=['location', 'compound'], values=['value'], aggfunc=max)
df[df['flag'] == 'U'].groupby(['location', 'compound']).agg({'value': 'max'})


Output:
                  value
location compound
A        2          0.4
         3          0.2
B        1            5
         2          3.2
         3            8
C        2         0.16
D        2         0.30
E        1            2
If location is not needed then one can skip 'location':

df[df['flag'] == 'U'].pivot_table(index=['compound'], values=['value'], aggfunc=max) 
df[df['flag'] == 'U'].groupby(['compound']).agg({'value': 'max'})  
which will give:

Output:
         value
compound
1            5
2          3.2
3            8
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Quote
#3
Thank you! This is fantastic. Yes, changing the way the data is structured makes it much better to interact with.

If I can ask a followup question: the data is in an existing dataframe. Is there a good way to get it into that four-column structure starting from a dataframe? From some initial digging, it looks like one approach may be to create a MultiIndex of all the compounds and flags and then stack the data. Does that sound like a good method to investigate?
Quote
#4
I've tried to replicate this reformatting from a dataframe instead of a text file, but have run into a few problems.

First, to create the dataframe from this example:
import pandas as pd
from io import StringIO

TESTDATA = StringIO("""location compound1 compound1_flag compound2 compound2_flag compound3 compound3_flag
A 0.36 NaN 0.4 U 0.2 U
B 5 U 2.2 NaN 8 U
B 9.8 J 3.2 U 3300 NaN
B 7.2 J 1800 NaN 280 NaN
C 9 J 3200 NaN 700 NaN
C 4.5 NaN 6.1 NaN 1000 NaN
C 7.8 NaN 0.16 U 0.47 J
D 0.89 NaN 0.30 U 6.1 NaN
D 14 NaN 0.16 U 0.59 NaN
E 2 U 5.6 NaN 200 NaN""")

df1 = pd.read_csv(TESTDATA, sep=" ")
Then I created a MultiIndex so I could refer to all compounds and all flags at once.
I got errors when I tried to build a MultiIndex over only some of the columns, and none of the pandas MultiIndex examples cover this usage, so I had to include 'location' in the labels as well.

cols = list(df1.columns)
labels = ['Location', 'Compound', 'Flag', 'Compound', 'Flag', 'Compound', 'Flag']
tuples = list(zip(labels, cols))  # pair each top-level label with the original column name
df1.columns = pd.MultiIndex.from_tuples(tuples)


When I try to reuse the previous code to find the max of nondetects:
df1[df1['Flag'] == 'U'].groupby(['Location', 'Compound']).agg({'value': 'max'})
Output:
ValueError: cannot join with no overlapping index names
Using stack gets close to the previous format, but location comes through as its own entry in the inner index with NaN elsewhere, instead of being repeated for each compound. This isn't the result I expected. Also, the values of all compounds end up stacked under 'Compound' rather than under the individual compound names. So this is also a mistaken approach.

df1.stack()
Output:
                  Compound Flag Location
0 compound1           0.36  NaN      NaN
  compound2           0.40  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3           0.20  NaN      NaN
  compound3_flag       NaN    U      NaN
  location             NaN  NaN        A
1 compound1           5.00  NaN      NaN
  compound1_flag       NaN    U      NaN
  compound2           2.20  NaN      NaN
  compound3           8.00  NaN      NaN
  compound3_flag       NaN    U      NaN
  location             NaN  NaN        B
2 compound1           9.80  NaN      NaN
  compound1_flag       NaN    J      NaN
  compound2           3.20  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3        3300.00  NaN      NaN
  location             NaN  NaN        B
3 compound1           7.20  NaN      NaN
  compound1_flag       NaN    J      NaN
  compound2        1800.00  NaN      NaN
  compound3         280.00  NaN      NaN
  location             NaN  NaN        B
4 compound1           9.00  NaN      NaN
  compound1_flag       NaN    J      NaN
  compound2        3200.00  NaN      NaN
  compound3         700.00  NaN      NaN
  location             NaN  NaN        C
5 compound1           4.50  NaN      NaN
  compound2           6.10  NaN      NaN
  compound3        1000.00  NaN      NaN
  location             NaN  NaN        C
6 compound1           7.80  NaN      NaN
  compound2           0.16  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3           0.47  NaN      NaN
  compound3_flag       NaN    J      NaN
  location             NaN  NaN        C
7 compound1           0.89  NaN      NaN
  compound2           0.30  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3           6.10  NaN      NaN
  location             NaN  NaN        D
8 compound1          14.00  NaN      NaN
  compound2           0.16  NaN      NaN
  compound2_flag       NaN    U      NaN
  compound3           0.59  NaN      NaN
  location             NaN  NaN        D
9 compound1           2.00  NaN      NaN
  compound1_flag       NaN    U      NaN
  compound2           5.60  NaN      NaN
  compound3         200.00  NaN      NaN
  location             NaN  NaN        E
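For what it's worth, here is a sketch of one way to make stack behave (the names `wide`, `long`, and `kind` are mine, and the data below is a trimmed stand-in): set location as the row index first, so the column MultiIndex covers only the compound/flag columns, and then stack just the compound level.

```python
import pandas as pd
from io import StringIO

TESTDATA = StringIO("""location compound1 compound1_flag compound2 compound2_flag
A 0.36 NaN 0.4 U
B 5 U 2.2 NaN""")
df1 = pd.read_csv(TESTDATA, sep=" ")

# Move location out of the columns before building the MultiIndex,
# so stack only reshapes the compound/flag columns.
wide = df1.set_index('location')
wide.columns = pd.MultiIndex.from_tuples(
    [(c.replace('_flag', ''), 'flag') if c.endswith('_flag') else (c, 'value')
     for c in wide.columns],
    names=['compound', 'kind'])

# Stack the compound level into the row index -> long (four-column) format.
long = wide.stack(level='compound').reset_index()
print(long[long['flag'] == 'U'].groupby(['location', 'compound'])['value'].max())
```

With location in the row index, stack no longer mixes it into the stacked values, and the earlier filter-and-groupby works on the long frame directly.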
Quote
#5
pandas is like a Swiss Army knife, and it is important to focus on what you want to achieve.

If you still want to "find the maximum value for each compound where the corresponding flag is 'U'", then maybe you can just build a simple filter and keep the dataframe as it is.


import pandas as pd

df = pd.read_csv('max_compound_data.txt', delimiter=' ')


def filter_and_agg(dataframe, flag='U', compound_id=1, agg_type='max'):
    compound, status = f'compound{compound_id}', f'compound{compound_id}_flag'
    df = dataframe[[compound, status]].dropna()
    # return, not yield: the filter produces a single value
    return df[df[status] == flag].agg({compound: agg_type}).values[0]

for i in range(1, 4):
    print(f'{i}: {filter_and_agg(df, compound_id=i)}')

# outputs:
# 1: 5.0
# 2: 3.2
# 3: 8.0

for i in range(1, 4):
    print(f'{i}: {filter_and_agg(df, flag="J", compound_id=i, agg_type="min")}')

# outputs:
# 1: 7.2
# 2: nan
# 3: 0.47

filter_and_agg(df, compound_id=2)                               # 3.2
filter_and_agg(df, compound_id=3, agg_type='sum')               # 8.2
filter_and_agg(df, flag='J', compound_id=1, agg_type='min')     # 7.2
Quote
#6
Thank you! I changed the code somewhat to incorporate the location and create a dictionary of the values to lookup for each compound per location. The following works, but I don't see the right strategy for the final piece of code.

def filter_and_agg(dataframe, flag='U', location='A', compound_id=1, agg_type='max'):
    compound, status = f'compound{compound_id}', f'compound{compound_id}_flag'
    mask = (dataframe[status] == flag) & (dataframe['location'] == location)
    return dataframe.loc[mask, compound].agg(agg_type)

max_dict = {}
for x in set(df['location']):
    max_dict[x] = {}
    for i in range(1, 4):
        max_dict[x][f'compound{i}'] = f'{filter_and_agg(df, location=x, compound_id=i)}'
Output:
{'B': {'compound1': '5.0', 'compound2': '3.2', 'compound3': '8.0'}, 'E': {'compound1': '2.0', 'compound2': 'nan', 'compound3': 'nan'}, 'C': {'compound1': 'nan', 'compound2': '0.16', 'compound3': 'nan'}, 'A': {'compound1': 'nan', 'compound2': '0.4', 'compound3': '0.2'}, 'D': {'compound1': 'nan', 'compound2': '0.3', 'compound3': 'nan'}}
I would like to use this dictionary to look up, for each row's location, the floor value for each compound, and replace any value in the dataframe that is less than that floor.

I am trying approaches like this:
for location in set(list(df.location)):
    df1 = df[df['location']==location]
    for i in range(1, 4):
        df[np.where(df1[f'compound{i}'] < float(max_dict[location][f'compound{i}']))] = max_dict[location][f'compound{i}']
This returns the error
Error:
TypeError: unhashable type: 'numpy.ndarray'
This post https://stackoverflow.com/questions/4847...n-criteria covers a similar case and uses merge for conditional replacement, but I don't see a straightforward way to use merge while comparing 'less than' values. Any suggestions for changing this code block?
Quote
#7
I've found a solution for the final block of code, which sets a 'floor' (minimum) on the dataset from the table constructed previously.

for idx, row in df.iterrows():
    for analyte in ['compound1', 'compound2', 'compound3']:
        floor = float(max_dict[row['location']][analyte])  # a NaN floor leaves the value unchanged
        if df.loc[idx, analyte] < floor:
            df.loc[idx, analyte] = floor
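(The same floor can also be applied without iterrows, by mapping each row's location to its per-compound floor and using clip. A sketch assuming a max_dict of stringified floors in the shape built above; the data here is a small stand-in:)

```python
import pandas as pd

# Stand-in data, and a floor table in the same shape as the thread's max_dict.
df = pd.DataFrame({'location': ['A', 'A', 'B'],
                   'compound1': [0.1, 2.0, 7.0]})
max_dict = {'A': {'compound1': '0.5'}, 'B': {'compound1': 'nan'}}

for analyte in ['compound1']:
    # Per-row lower bound: look up this row's location in max_dict.
    floor = df['location'].map(lambda loc: float(max_dict[loc][analyte]))
    # A NaN floor means "no floor"; use -inf so clip leaves those rows alone.
    df[analyte] = df[analyte].clip(lower=floor.fillna(float('-inf')))
```

clip aligns the lower-bound Series with the column by index, so each row is floored against its own location's value.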
Thank you for your help, perfringo!
Quote
