Python Forum

Greetings Pythonistas!

I am taking Colt Steele’s Udemy course on data analysis where he teaches his students how to parse .CSV files using pandas, plotly, matplotlib, and seaborn.

I’ve watched over half the content so far and completed most of the exercises as I go.

I am now applying what I have learned up this this point to a personal dataset which is not part of the sample CSV data provided by the instructor. I am charting new territory.

My dataset is from an Android app where I have been recording my research/productivity sessions based on different projects/tasks over the years. Each entry accounts for time spent and is organized according by category. For example, if I am coding a website with Django, at the end of my coding session, I enter the time I started and the time I ended and leave a note. I've been recording my productivity in this way since 2020. While I have data for the past 4 years, for now I am working with just the year 2020. Using pandas I wish to plot the data onto bar charts to visualize how my interests and productivity has changed and evolved over time.

I have a working prototype. The end product is basic. I still have some work to do. But before I continue, the biggest challenge I've identified is that I have noticed that I am clearly breaking the fundamental Pythonic principle of "DRY". Below are some code snippets I’ve written. As you are reading over it, here are the questions I have for all of you to keep in mind:

How could I plot the same bar chart with pandas but by better automating the parsing of the ‘Activity type’ (category) Series data points? There are 12 'Acitivity types' (categories). I manually calculated 5 of them. But there are 7 more. It gets granular and tiresome since I am repeating myself over and over. This is problematic.
Eventually I’d like to plot the data onto 4 different bar charts comparing different quarters of the year: Q1,Q2,Q3,Q4. Using my current novice approach where I am repeating myself, that would involve parsing those same 12 data points 4 times again and again (meaning there would 48 of pretty much the same code blocks). How might you people recommend I improve my approach with this data set?

Here is my work so far:

import pandas as pd
df2020 = pd.read_csv('data/2020-raw-sans-comment.csv')

To give a sense as to the size of my dataset, here are the total value counts of each category and the output:

df2020["Activity type"].value_counts()

Quote:Activity type
--------
Python 110
Django 106
Research (general) 102
Magick 59
Web design 25
Organizing 21
sysadmin 15
nix general 14
TM prep 12
Writing 11
Self-hypnosis 5
Visualization 1

Now to show what the table looks like in general for one specific category ('Python')

# Collect 'Activity type' column Series 'Python'
python_instances = df2020[df2020["Activity type"] == "Python"]
python_instances

Output:

Activity type	Duration	From	To
2	Python	00:33:41	2020-12-24 15:20:42	2020-12-24 15:54:23
9	Python	00:36:58	2020-12-10 12:07:59	2020-12-10 12:44:57
13	Python	00:35:37	2020-12-06 21:45:04	2020-12-06 22:20:41
16	Python	00:44:19	2020-12-03 19:15:41	2020-12-03 20:00:00
18	Python	01:52:29	2020-12-03 07:50:29	2020-12-03 09:42:58
...	...	...	...	...
454	Python	00:29:14	2020-01-15 22:00:46	2020-01-15 22:30:00
459	Python	00:29:17	2020-01-12 20:40:43	2020-01-12 21:10:00
463	Python	00:29:36	2020-01-09 22:18:24	2020-01-09 22:48:00
464	Python	00:59:56	2020-01-09 17:00:04	2020-01-09 18:00:00
465	Python	00:29:35	2020-01-07 20:40:25	2020-01-07 21:10:00

Next is where I calculate the sum time in hours of the Duration column (essentially, the time delta between From and To):

# Calculate (sum) time deltas in Duration Series and convert to hours for 'Python'
python_instances = df2020[df2020["Activity type"] == "Python"]
time_python_deltas = pd.to_timedelta(python_instances["Duration"])
total_hours_python = time_python_deltas.sum().total_seconds() / 3600

# Ditto for 'Django' 
django_instances = df2020[df2020["Activity type"] == "Django"]
time_django_deltas = pd.to_timedelta(django_instances["Duration"])
total_hours_django = time_django_deltas.sum().total_seconds() / 3600

# Ditto for 'Research (general)' 
research_instances = df2020[df2020["Activity type"] == "Research (general)"]
time_research_deltas = pd.to_timedelta(research_instances["Duration"])
total_hours_research = time_research_deltas.sum().total_seconds() / 3600

# Ditto for 'Web design' 
web_design_instances = df2020[df2020["Activity type"] == "Web design"]
time_web_design_deltas = pd.to_timedelta(web_design_instances["Duration"])
total_hours_web_design = time_web_design_deltas.sum().total_seconds() / 3600

# Ditto for 'Magick' 
magick_instances = df2020[df2020["Activity type"] == "Magick"]
time_magick_deltas = pd.to_timedelta(magick_instances["Duration"])
total_hours_magick = time_magick_deltas.sum().total_seconds() / 3600

The above concludes the parsing of 5 Activity type categories. As you can see, it's repetitive. How can I improve my approach here? There must be a better way.

Here is how I the data into bar chart using pandas:

data = {
    'Activity Type': ['Python', 'Django', 'Research', 'Web Design', 'Magick'],
    'Hours': [total_hours_python, total_hours_django, total_hours_research, total_hours_web_design, total_hours_magick]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Set the 'Category' column as the index
df.set_index('Activity Type', inplace=True)

# Plot the pie chart
df.plot.bar(y='Hours', figsize=(8, 8))

Attached are the end products so far.

I think you can load all the activities into one dataframe, group by activity, and sum the duration.

from random import random, choice
import pandas as pd
import matplotlib.pyplot as plt


df = pd.DataFrame(
    [{"Activity": choice("ABCDE"), "Duration": random() * 10} for _ in range(20)]
)
print(df)
df.groupby("Activity")["Duration"].sum().plot.pie()
plt.show()

Hi! Thank you for your reply, @deanhystad. Your snippet works well. Based on your feedback, plus based on a subsequent back and forth conversation I had with ChatGPT, here is the final product I settled with where I arrived at the right output which is much cleaner where I don't repeat myself:

# Convert 'Duration' column to timedelta format
df2020['Duration'] = pd.to_timedelta(df2020['Duration'])

# Group by 'Activity type' and sum the durations
grouped_df = df2020.groupby('Activity')['Duration'].sum()

# Convert timedelta to hours
grouped_df = grouped_df / pd.Timedelta(hours=1)

# Plot the bar chart
grouped_df.plot.bar(figsize=(8, 8))

N.B.: I didn't have to use plt.show() since I am using a Jupyter Notebook.

Drone4four

deanhystad

Drone4four