apply(pd.Series) until no more array - Printable Version

Python Forum (https://python-forum.io)
Forum: Python Coding > Data Science
Thread: apply(pd.Series) until no more array (/thread-33306.html)
apply(pd.Series) until no more array - mikisDeWitte - Apr-14-2021

Update: we found it eventually. Using the same link, a more recent answer helped us out a lot:

df2 = pd.json_normalize(df['Pollutant Levels'])

Hello,

We extract data from a MongoDB using Python. When we store this data in a dataframe, several columns come out as lists of dictionaries. Following this solution (https://stackoverflow.com/questions/38231591/split-explode-a-column-of-dictionaries-into-separate-columns-with-pandas), we started looping over these fields, calling df[field].apply(pd.Series) on each. However, we don't want to do this on one field but on several hundred, which is very slow. Also, our fields contain nested dictionaries, so we need to run apply(pd.Series) several times on a single column.

Does anyone know of a faster and/or more robust way? Some dummy data, two levels deep:

region = ['a', 'b']
data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]
data_tuples = list(zip(region, data1, data2))
df = pd.DataFrame(data_tuples, columns=['region', 'data1', 'data2'])

We want all these dictionaries in separate columns, with each new field name the concatenation of the original field name and the nested levels.
For the dummy data above, our fields would be 'data1_foo_baz', 'data1_bar_baz', 'data2_foo_baz', 'data2_bar_baz'.

This is how we currently do it:

fields_to_parse = ['data1', 'data2']

for d in fields_to_parse:
    # parse it once
    d_parsed = df[d].apply(pd.Series)
    # parse the result again
    subfields = d_parsed.columns
    for subfield in subfields:
        # disentangle
        full_parsed = d_parsed[subfield].apply(pd.Series)
        new_columns = {col: d + "_" + subfield + "_" + col
                       for col in full_parsed.columns}
        full_parsed.rename(columns=new_columns, inplace=True)
        # append
        df = pd.concat([df, full_parsed], axis=1)

# Drop the original, fully parsed columns
result = df.drop(fields_to_parse, axis=1)

Any advice? Thanks!

Mikis

RE: apply(pd.Series) until no more array - Caprone - Apr-17-2021

Hi,

You can take advantage of the json_normalize method and a list-comprehension combo:

data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]

data = [data1, data2]
col_names_prefx = ['data1_', 'data2_']  # or whatever you need...

tmp = [[pd.json_normalize(d, sep='_') for d in dlt] for dlt in data]
res = pd.concat([pd.concat(dfs, ignore_index=True).add_prefix(name)
                 for dfs, name in zip(tmp, col_names_prefx)], axis=1)

Afterwards you can obviously add any other columns as usual...
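Putting the two posts together, here is a minimal runnable sketch of the json_normalize approach applied to the thread's own dummy data. It flattens each dict column in one call per field (sep='_' joins the nested keys, add_prefix prepends the original column name) and produces exactly the column names the question asks for. The variable names parsed and result are just for illustration; it assumes a pandas version (1.0+) where json_normalize is available at the top level:

```python
import pandas as pd

# Dummy data from the thread, two levels deep.
region = ['a', 'b']
data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]
df = pd.DataFrame(list(zip(region, data1, data2)),
                  columns=['region', 'data1', 'data2'])

fields_to_parse = ['data1', 'data2']

# One json_normalize call per dict column: flattens every nesting
# level at once, however deep, so no repeated apply(pd.Series).
parsed = [pd.json_normalize(df[f].tolist(), sep='_').add_prefix(f + '_')
          for f in fields_to_parse]

# Glue the flattened pieces back onto the non-dict columns.
result = pd.concat([df.drop(columns=fields_to_parse)] + parsed, axis=1)

print(list(result.columns))
# ['region', 'data1_foo_baz', 'data1_bar_baz', 'data2_foo_baz', 'data2_bar_baz']
```

Unlike the nested apply(pd.Series) loop, this does a single vectorised pass per column, so it should scale much better to hundreds of fields.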