Why does newly-formed dict only consist of last row of each year?

Mark17 · Nov-13-2023, 08:10 PM

Hi all,

I'm trying to convert two columns ('BIRTH_YEAR', 'NAME') of baby_names into a dictionary. Why does the newly-formed dictionary only consist of the last df row of each year?

baby_name_dict = {}
print(baby_names.info(), '\n')
baby_name_dict = dict(zip(baby_names.BIRTH_YEAR, baby_names.NAME))
print(f'baby_name_dict is:  {baby_name_dict}.')

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13962 entries, 0 to 13961
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   BIRTH_YEAR  13962 non-null  int64 
 1   GENDER      13962 non-null  object
 2   ETHNICTY    13962 non-null  object
 3   NAME        13962 non-null  object
 4   COUNT       13962 non-null  int64 
 5   RANK        13962 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 654.6+ KB
None 

baby_name_dict is:  {2011: 'ZEV', 2012: 'ZEV', 2013: 'Zev', 2014: 'Zev'}.

Getting myself ready for a really foolish oversight... :)

**buran** · Nov-13-2023, 08:48 PM

Dictionary keys are unique, so for each year the last value seen (i.e. the last name) wins.
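A minimal sketch of that behaviour, assuming a made-up four-row frame (the rows are invented for illustration, not taken from the real data set). `dict(zip(...))` overwrites the earlier value whenever a year repeats, whereas grouping by year first keeps every name:

```python
import pandas as pd

# Toy stand-in for baby_names; the rows are invented for illustration.
baby_names = pd.DataFrame(
    {
        "BIRTH_YEAR": [2011, 2011, 2012, 2012],
        "NAME": ["AALIYAH", "ZEV", "ABBY", "ZEV"],
    }
)

# Each repeated year overwrites the previous entry, so only the last name survives.
last_wins = dict(zip(baby_names.BIRTH_YEAR, baby_names.NAME))
print(last_wins)

# Grouping by year and collecting the names into a list keeps all of them.
names_by_year = baby_names.groupby("BIRTH_YEAR")["NAME"].apply(list).to_dict()
print(names_by_year)
```

Here `last_wins` ends up with one name per year, while `names_by_year` maps each year to the full list of names seen in that year.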

**deanhystad** · Nov-13-2023, 09:31 PM

What are you trying to do, count baby names by year? You could group your dataframe by (groupby) year and baby name and count the number of babies in each group.

Mark17 · Nov-15-2023, 08:21 PM

(Nov-13-2023, 09:31 PM)deanhystad Wrote: What are you trying to do, count baby names by year? You could group your dataframe by (groupby) year and baby name and count the number of babies in each group.

This is a good exercise... I'll work on a .groupby solution.

I'm trying to get a frequency count.

Mark17 · Nov-16-2023, 07:07 PM

.value_counts() is a start:

values = baby_names['NAME'].value_counts()
print(values)

This gets me a list (actually a series) of name frequencies. I see I can also tack on .to_dict() and get a dictionary... but the names are keys and the frequencies are values. What I really want are the frequencies as keys and lists of names as values (since multiple names often occur with the same frequency)--and then I'd want to see that for each year.
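One possible way to get that shape, sketched on invented rows and assuming "frequency" means the number of rows in which a name appears within a year. The name `names_by_freq` is hypothetical, introduced only for this example:

```python
import pandas as pd

# Toy stand-in for baby_names; the rows are invented for illustration.
baby_names = pd.DataFrame(
    {
        "BIRTH_YEAR": [2011, 2011, 2011, 2011, 2012, 2012],
        "NAME": ["ABBY", "ABBY", "ZEV", "AARON", "ABBY", "ZEV"],
    }
)

# Step 1: number of rows for each (year, name) pair.
counts = baby_names.groupby(["BIRTH_YEAR", "NAME"]).size()

# Step 2: for each year, invert name -> frequency into frequency -> [names].
names_by_freq = {}
for (year, name), freq in counts.items():
    names_by_freq.setdefault(year, {}).setdefault(freq, []).append(name)

print(names_by_freq)
```

The result is a nested dictionary: the outer keys are years, the inner keys are frequencies, and each inner value is the list of names that occur that many times in that year.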

I next tried a .groupby() solution...

print(baby_names.groupby(['BRITH_YEAR', 'NAME']))

...but I just get a groupby object:

Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018E277041C0>

If I tack on .sum(), then I get this:

Output:
                    COUNT  RANK
BRITH_YEAR NAME                
2011       AALIYAH    528   140
           AARAV       60   204
           AARON     1092   516
           ABBY        40   312
           ABDIEL      48   368
...                   ...   ...
2014       Zion        40    33
           Zissy       25    71
           Zoe        240    86
           Zoey       116   164
           Zuri        21    30

[4882 rows x 2 columns]

This is a bit confusing. 'COUNT' and 'RANK' are the last two column names. The actual number of times 'ABBY' appears is five (four in 2011 and one in 2012), so .sum() isn't counting occurrences of the names:

baby_names[baby_names['NAME'] == 'ABBY']

Output:
      BRITH_YEAR  GENDER        ETHNICTY  NAME  COUNT  RANK
1295        2011  FEMALE        HISPANIC  ABBY     10    78
2767        2011  FEMALE        HISPANIC  ABBY     10    78
4267        2011  FEMALE        HISPANIC  ABBY     10    78
6230        2011  FEMALE        HISPANIC  ABBY     10    78
7852        2012  FEMALE  ASIAN AND PACI  ABBY     11    44
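A small sketch of the difference, using toy rows modelled on the ABBY output above with fewer columns (the year column is spelled as in the `info()` listing; adjust it to match the real frame). A groupby object computes nothing until an aggregation is applied: `.sum()` adds up the numeric columns within each group, while `.size()` counts the rows in each group.

```python
import pandas as pd

# Toy rows modelled on the ABBY output, trimmed to the columns that matter here.
baby_names = pd.DataFrame(
    {
        "BIRTH_YEAR": [2011, 2011, 2011, 2011, 2012],
        "NAME": ["ABBY"] * 5,
        "COUNT": [10, 10, 10, 10, 11],
        "RANK": [78, 78, 78, 78, 44],
    }
)

# Lazy: this only records how to split the rows; no result is computed yet.
grouped = baby_names.groupby(["BIRTH_YEAR", "NAME"])

# .sum() totals COUNT and RANK within each (year, name) group.
totals = grouped.sum()
print(totals)

# .size() returns the number of rows in each (year, name) group.
occurrences = grouped.size()
print(occurrences)
```

For the 2011 group, `totals` holds 40 and 312 (four rows of 10 and of 78), which matches the ABBY row in the `.sum()` output, while `occurrences` holds the row counts: four for 2011 and one for 2012.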

Lots of stuff here... I appreciate any light you can shine on these concepts to help me understand them!

**deanhystad** · Nov-17-2023, 04:54 AM

I was thinking of something like this:

from random import choice
import pandas as pd

# Make up some data for processing
baby_names = pd.DataFrame(
    [{"Year": year, "Name": choice('ABCD')} for _ in range(100) for year in range(2000, 2005)]
)
stats = baby_names.groupby(["Year", "Name"]).agg(Count=("Name", "count"))
stats["%"] = 100 * stats["Count"] / stats.groupby("Year")["Count"].transform('sum')
stats.sort_values(by=["Year", "%"], ascending=[True, False], inplace=True)
print(stats)

Mark17 · Nov-17-2023, 05:28 PM

Interesting. Lots of stuff in there... I will study that and try to apply. Thanks so much!
