Python Forum

Full Version: Calculated DF column from dictionary value
I would like to create a calculated column ('new', as shown here) from the existing 'Data' column, whose cells can hold a list with a dictionary inside. It works in this code.

import pandas as pd

data = [10,[{'self': 'https://elia.atlassian.net/rest/api/3/customFieldOption/10200', 'value': 'IT-Sourced Changes 2022', 'id': '10200'}],30]
df = pd.DataFrame(data, columns=['Data'])
df['new'] = df.Data.explode().str['value']
df.head(3)
However, when I try it on an existing dataframe, I get 'ValueError: cannot reindex from a duplicate axis'. Not sure why.

https://imgur.com/a/B4qEOWa
Explode does this:
import pandas as pd

df = pd.DataFrame({"Data": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
print(df)
print(df.Data.explode())
Output:
        Data
0  [1, 2, 3]
1  [4, 5, 6]
2  [7, 8, 9]
0    1
0    2
0    3
1    4
1    5
1    6
2    7
2    8
2    9
Notice all the duplicate index values generated by explode(). When I try to add this as a column to an existing dataframe I get the same error you are seeing.
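Here is a minimal sketch of that failure, using the same toy data. The exact wording of the error message differs between pandas versions, so the code only relies on it being a ValueError.

```python
import pandas as pd

df = pd.DataFrame({"Data": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
exploded = df["Data"].explode()

# Each source row contributes three values that all share its index label.
print(exploded.index.is_unique)  # False

try:
    # pandas has to line up 9 values on 3 labels and cannot pick one per row.
    df["Explode"] = exploded
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
else:
    error_message = None
```

Checking `index.is_unique` on a series before assigning it is a quick way to see whether this problem will come up.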

What happens if I reset the index to count up from zero? That will get rid of the duplicate index values.
import pandas as pd

df = pd.DataFrame({"Data": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
series = df.Data.explode().reset_index(drop=True)
print(series)
df["Explode"] = series
print(df)
Output:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
Name: Data, dtype: object
        Data Explode
0  [1, 2, 3]       1
1  [4, 5, 6]       2
2  [7, 8, 9]       3
This works! But why does the index matter?
import pandas as pd

df = pd.DataFrame({"Data": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
series = df.Data.explode().reset_index(drop=True)
df = df[1:].copy()  # drop the first row; copy() avoids a SettingWithCopyWarning below
print(df)
df["Explode"] = series
print(df)
Output:
        Data
1  [4, 5, 6]
2  [7, 8, 9]
        Data Explode
1  [4, 5, 6]       2
2  [7, 8, 9]       3
When you add a series to an existing dataframe, pandas uses the index labels to line up the new values with the rows. Notice that the values in series start at 1, but the new column starts at 2. This is because the first index label in df is 1, and series.loc[1] == 2.
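A small sketch of that alignment, with made-up values (the column names "ByLabel" and "ByPosition" are only for illustration). Labels that are missing from the series come out as NaN, and stripping the index with to_numpy() switches to plain positional assignment.

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a", "b", "c"]})  # index labels 0, 1, 2

# Hypothetical values: labels are out of order and label 1 is missing.
values = pd.Series([30, 10], index=[2, 0])

df["ByLabel"] = values  # matched on index label, not on position
print(df)

# Removing the index with to_numpy() forces positional assignment instead.
full = pd.Series([30, 10, 20], index=[2, 0, 1])
df["ByPosition"] = full.to_numpy()
print(df)
```

Positional assignment only makes sense when the values are already in row order and there is exactly one value per row.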

So you are getting an error because pandas does not know what to do with the duplicate index values created by explode(). That makes sense. How else would you collate other than using the row index values?
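If the goal is one value per original row, one possible alternative is to skip explode() and pull the value out of each cell directly, which keeps the original index. This is only a sketch: the data and the helper first_value are made up for illustration, and it assumes that taking the first dictionary in each list is what you want.

```python
import pandas as pd

# Hypothetical data shaped like the question's column: plain values mixed
# with lists of dicts, one of which holds more than one dict.
df = pd.DataFrame({
    "Data": [
        10,
        [{"id": "1", "value": "A"}, {"id": "2", "value": "B"}],
        [{"id": "3", "value": "C"}],
        30,
    ]
})


def first_value(cell):
    """Return the 'value' of the first dict in a list cell, else None."""
    if isinstance(cell, list) and cell and isinstance(cell[0], dict):
        return cell[0].get("value")
    return None


# map() yields one result per row and keeps the original index,
# so the assignment has nothing ambiguous to align.
df["new"] = df["Data"].map(first_value)
print(df)
```

Rows without a list of dictionaries end up with a missing value in 'new', and any extra dictionaries in a list are ignored.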
This is my first time using this forum, but that not only worked, it's the best explanation I've ever gotten to fix a problem. Thank you so much.