apply(pd.Series) until no more array - Printable Version

Python Forum (https://python-forum.io)
Forum: Python Coding > Data Science
Thread: apply(pd.Series) until no more array (/thread-33306.html)
apply(pd.Series) until no more array - mikisDeWitte - Apr-14-2021

Update: we found it eventually. Using the same link, a more recent answer helped us out a lot:

df2 = pd.json_normalize(df['Pollutant Levels'])

Hello,

We extract data from a MongoDB using Python. When we store this data in a dataframe, several columns come out as lists of dictionaries. Following this solution (https://stackoverflow.com/questions/38231591/split-explode-a-column-of-dictionaries-into-separate-columns-with-pandas), we started looping over these fields, calling df[field].apply(pd.Series) on each. However, we don't want to do this on one field but on several hundred, which is very slow. Also, our fields contain nested dictionaries, so we need to run apply(pd.Series) several times on a single column.

Does anyone know of a faster and/or more robust way? Some dummy data, two levels deep:

region = ['a', 'b']
data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]
data_tuples = list(zip(region, data1, data2))
df = pd.DataFrame(data_tuples, columns=['region', 'data1', 'data2'])

We want all these dictionaries in separate columns, with each new field name the concatenation of the original field name and the nested levels.
For the dummy data above, our fields would be 'data1_foo_baz', 'data1_bar_baz', 'data2_foo_baz', 'data2_bar_baz'.

This is how we currently do it:

fields_to_parse = ['data1', 'data2']

for d in fields_to_parse:
    # parse it once
    d_parsed = df[d].apply(pd.Series)
    # parse the result again
    subfields = d_parsed.columns
    for subfield in subfields:
        # disentangle
        full_parsed = d_parsed[subfield].apply(pd.Series)
        new_columns = {col: d + "_" + subfield + "_" + col
                       for col in full_parsed.columns}
        full_parsed.rename(columns=new_columns, inplace=True)
        # append
        df = pd.concat([df, full_parsed], axis=1)

# Drop the original, fully parsed columns
result = df.drop(fields_to_parse, axis=1)

Any advice? Thanks!

Mikis

RE: apply(pd.Series) until no more array - Caprone - Apr-17-2021

Hi,

You can take advantage of the json_normalize method and a list-comprehension combo:

data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]

data = [data1, data2]
col_names_prefx = ['data1_', 'data2_']  # or whatever you need...

tmp = [[pd.json_normalize(d, sep='_') for d in dlt] for dlt in data]
res = pd.concat([pd.concat(dfs, ignore_index=True).add_prefix(name)
                 for dfs, name in zip(tmp, col_names_prefx)], axis=1)

Afterwards you can obviously add any other columns as usual...
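Putting the two posts together, here is a minimal runnable sketch of the json_normalize approach applied to the thread's own dummy data. It flattens each dict column in one call per field (sep='_' joins the nested keys, add_prefix prepends the original column name) and produces exactly the column names the question asks for. The variable names parsed and result are just for illustration; it assumes a pandas version (1.0+) where json_normalize is available at the top level:

```python
import pandas as pd

# Dummy data from the thread, two levels deep.
region = ['a', 'b']
data1 = [{'foo': {'baz': 30}, 'bar': {'baz': 16}},
         {'foo': {'baz': 45}, 'bar': {'baz': 8}}]
data2 = [{'foo': {'baz': 15}, 'bar': {'baz': 11}},
         {'foo': {'baz': 21}, 'bar': {'baz': 45}}]
df = pd.DataFrame(list(zip(region, data1, data2)),
                  columns=['region', 'data1', 'data2'])

fields_to_parse = ['data1', 'data2']

# One json_normalize call per dict column: flattens every nesting
# level at once, however deep, so no repeated apply(pd.Series).
parsed = [pd.json_normalize(df[f].tolist(), sep='_').add_prefix(f + '_')
          for f in fields_to_parse]

# Glue the flattened pieces back onto the non-dict columns.
result = pd.concat([df.drop(columns=fields_to_parse)] + parsed, axis=1)

print(list(result.columns))
# ['region', 'data1_foo_baz', 'data1_bar_baz', 'data2_foo_baz', 'data2_bar_baz']
```

Unlike the nested apply(pd.Series) loop, this does a single vectorised pass per column, so it should scale much better to hundreds of fields.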