Apr-14-2021, 02:27 PM
(This post was last modified: Apr-14-2021, 02:27 PM by mikisDeWitte.)
Update: we found it eventually. Using the same link, a more recent answer helped us out a lot:
df2 = pd.json_normalize(df['Pollutant Levels'])

Hello,
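For context, the call above comes from the linked Stack Overflow thread, where the column holds one dictionary per row. A minimal, self-contained sketch of that pattern (the 'Pollutant Levels' values here are made up for illustration):

```python
import pandas as pd

# Hypothetical data in the shape the linked answer deals with:
# a column whose cells are dictionaries.
df = pd.DataFrame({'Station': ['A', 'B'],
                   'Pollutant Levels': [{'a': 46, 'b': 3, 'c': 12},
                                        {'a': 36, 'b': 5, 'c': 8}]})

# json_normalize turns each dict key into its own column.
df2 = pd.json_normalize(df['Pollutant Levels'])
print(df2.columns.tolist())  # ['a', 'b', 'c']
```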
We extract data from a MongoDB database using Python.
When we store this data in a DataFrame, several of the columns come out as lists of dictionaries.
Following this solution (https://stackoverflow.com/questions/3823...ith-pandas), we started looping over these fields, calling "df[field].apply(pd.Series)" on each one.
However, we don't want to do this on one field but on several hundred, which is very slow.
Also, our fields contain nested dictionaries, so we need to run apply(pd.Series) several times on a single column.
Does anyone know of a faster and/or more robust way?
Some dummy data, nested two levels deep:

region = ['a', 'b']
data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]
data_tuples = list(zip(region, data1, data2))
df = pd.DataFrame(data_tuples, columns=['region', 'data1', 'data2'])

We want these dictionaries split into separate columns, with each new field name being the concatenation of the original field name and the nested keys.
For the dummy data above, our fields would be 'data1_foo_baz', 'data1_bar_baz', 'data2_foo_baz', 'data2_bar_baz'.
This is how we currently do it:
fields_to_parse = ['data1', 'data2']
for d in fields_to_parse:
    # Parse it once
    d_parsed = df[d].apply(pd.Series)
    # Parse the result again
    subfields = d_parsed.columns
    for subfield in subfields:
        # Disentangle
        full_parsed = d_parsed[subfield].apply(pd.Series)
        new_columns = {col: d + "_" + subfield + "_" + col
                       for col in full_parsed.columns}
        full_parsed.rename(columns=new_columns, inplace=True)
        # Append
        df = pd.concat([df, full_parsed], axis=1)

# Drop the original, fully parsed columns
result = df.drop(fields_to_parse, axis=1)

Any advice?
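Tying this back to the update at the top: pd.json_normalize can flatten all of these columns in a single pass, with sep='_' producing exactly the concatenated names we wanted. A sketch on the dummy data above:

```python
import pandas as pd

region = ['a', 'b']
data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]
df = pd.DataFrame(list(zip(region, data1, data2)),
                  columns=['region', 'data1', 'data2'])

fields_to_parse = ['data1', 'data2']

# to_dict('records') gives one nested dict per row; json_normalize
# flattens every nesting level at once, joining keys with sep='_'.
flat = pd.json_normalize(df[fields_to_parse].to_dict(orient='records'), sep='_')
result = pd.concat([df.drop(columns=fields_to_parse), flat], axis=1)

print(sorted(result.columns))
# ['data1_bar_baz', 'data1_foo_baz', 'data2_bar_baz', 'data2_foo_baz', 'region']
```

This avoids the per-column apply(pd.Series) loop entirely, which is where most of the time goes with hundreds of fields.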
Thanks!
Mikis