Python Forum
apply(pd.Series) until no more array
#1
Update: we found it eventually. Using the same link, a more recent answer helped us out a lot:
df2 = pd.json_normalize(df['Pollutant Levels'])
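For illustration, a minimal sketch of what that call does, on dummy data shaped like ours (the column name and values are invented):

```python
import pandas as pd

# a column holding flat dicts, as it might come out of MongoDB
df = pd.DataFrame({'City': ['a', 'b'],
                   'Pollutant Levels': [{'no2': 10, 'o3': 4},
                                        {'no2': 7, 'o3': 9}]})

# one vectorised call expands the whole column of dicts,
# instead of a slow per-row df[field].apply(pd.Series)
df2 = pd.json_normalize(df['Pollutant Levels'])
print(df2.columns.tolist())  # ['no2', 'o3']
```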
Hello,

We extract data from MongoDB using Python.
When we store this data in a dataframe, several columns come out as lists of dictionaries.

Following this solution (https://stackoverflow.com/questions/3823...ith-pandas), we started looping over these fields, calling "df[field].apply(pd.Series)" on each one.

However, we don't want to do this on one field but on several hundred, which is very slow.
Our fields also contain nested dictionaries, so we need to run apply(pd.Series) several times on a single column.

Does anyone know of a faster and/or more robust way?

Some dummy data with 2 levels deep:
import pandas as pd

region = ['a','b']
data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
        {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
        {'foo': {'baz': 21}, 'bar': {'baz': 45}}]
data_tuples = list(zip(region,data1, data2))
df = pd.DataFrame(data_tuples, columns=['region','data1', 'data2'])
We want all these dictionaries in separate columns, with each column name formed by concatenating the original field name and the nested keys.
For the dummy data above, the resulting columns would be 'data1_foo_baz', 'data1_bar_baz', 'data2_foo_baz' and 'data2_bar_baz'.

This is how we currently do it:
fields_to_parse = ['data1', 'data2']
for d in fields_to_parse:
    # expand the first level of dicts into columns
    d_parsed = df[d].apply(pd.Series)
    # expand each resulting column one level further
    for subfield in d_parsed.columns:
        full_parsed = d_parsed[subfield].apply(pd.Series)
        new_columns = {col: d + "_" + subfield + "_" + col
                       for col in full_parsed.columns}
        full_parsed.rename(columns=new_columns, inplace=True)
        # append the flattened columns to the dataframe
        df = pd.concat([df, full_parsed], axis=1)
# drop the original, now fully parsed columns
result = df.drop(fields_to_parse, axis=1)
Any advice?

Thanks!
Mikis
#2
Hi,
You can take advantage of the json_normalize method combined with a list comprehension:

import pandas as pd

data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]

data = [data1, data2]

col_name_prefixes = ['data1_', 'data2_']  # or whatever you need...

# flatten every dict one row at a time, then stack the rows per field
tmp = [[pd.json_normalize(d, sep='_') for d in dlt] for dlt in data]
res = pd.concat([pd.concat(frames, ignore_index=True).add_prefix(name)
                 for frames, name in zip(tmp, col_name_prefixes)], axis=1)
After that you can, of course, add any other columns as usual.
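A variant of the same idea that keeps the dataframe layout from post #1, including the region column (a sketch; sep='_' does the level concatenation and add_prefix supplies the field name):

```python
import pandas as pd

region = ['a', 'b']
data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]
df = pd.DataFrame(list(zip(region, data1, data2)),
                  columns=['region', 'data1', 'data2'])

fields_to_parse = ['data1', 'data2']
# json_normalize flattens arbitrarily deep dicts in one call per field;
# sep='_' joins the nesting levels, add_prefix prepends the field name
parsed = [pd.json_normalize(df[d], sep='_').add_prefix(d + '_')
          for d in fields_to_parse]
result = pd.concat([df.drop(columns=fields_to_parse)] + parsed, axis=1)
print(result.columns.tolist())
# ['region', 'data1_foo_baz', 'data1_bar_baz', 'data2_foo_baz', 'data2_bar_baz']
```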