Feb-04-2024, 11:41 AM
Greetings Pythonistas!
I am taking Colt Steele’s Udemy course on data analysis where he teaches his students how to parse .CSV files using pandas, plotly, matplotlib, and seaborn.
I’ve watched over half the content so far and completed most of the exercises as I go.
I am now applying what I have learned up this this point to a personal dataset which is not part of the sample CSV data provided by the instructor. I am charting new territory.
My dataset is from an Android app where I have been recording my research/productivity sessions based on different projects/tasks over the years. Each entry accounts for time spent and is organized according by category. For example, if I am coding a website with Django, at the end of my coding session, I enter the time I started and the time I ended and leave a note. I've been recording my productivity in this way since 2020. While I have data for the past 4 years, for now I am working with just the year 2020. Using pandas I wish to plot the data onto bar charts to visualize how my interests and productivity has changed and evolved over time.
I have a working prototype. The end product is basic. I still have some work to do. But before I continue, the biggest challenge I've identified is that I have noticed that I am clearly breaking the fundamental Pythonic principle of "DRY". Below are some code snippets I’ve written. As you are reading over it, here are the questions I have for all of you to keep in mind:
Here is my work so far:
Now to show what the table looks like in general for one specific category ('Python')
Here is how I the data into bar chart using pandas:
I am taking Colt Steele’s Udemy course on data analysis where he teaches his students how to parse .CSV files using pandas, plotly, matplotlib, and seaborn.
I’ve watched over half the content so far and completed most of the exercises as I go.
I am now applying what I have learned up this this point to a personal dataset which is not part of the sample CSV data provided by the instructor. I am charting new territory.
My dataset is from an Android app where I have been recording my research/productivity sessions based on different projects/tasks over the years. Each entry accounts for time spent and is organized according by category. For example, if I am coding a website with Django, at the end of my coding session, I enter the time I started and the time I ended and leave a note. I've been recording my productivity in this way since 2020. While I have data for the past 4 years, for now I am working with just the year 2020. Using pandas I wish to plot the data onto bar charts to visualize how my interests and productivity has changed and evolved over time.
I have a working prototype. The end product is basic. I still have some work to do. But before I continue, the biggest challenge I've identified is that I have noticed that I am clearly breaking the fundamental Pythonic principle of "DRY". Below are some code snippets I’ve written. As you are reading over it, here are the questions I have for all of you to keep in mind:
- How could I plot the same bar chart with pandas but by better automating the parsing of the ‘Activity type’ (category) Series data points? There are 12 'Acitivity types' (categories). I manually calculated 5 of them. But there are 7 more. It gets granular and tiresome since I am repeating myself over and over. This is problematic.
- Eventually I’d like to plot the data onto 4 different bar charts comparing different quarters of the year: Q1,Q2,Q3,Q4. Using my current novice approach where I am repeating myself, that would involve parsing those same 12 data points 4 times again and again (meaning there would 48 of pretty much the same code blocks). How might you people recommend I improve my approach with this data set?
Here is my work so far:
import pandas as pd df2020 = pd.read_csv('data/2020-raw-sans-comment.csv')To give a sense as to the size of my dataset, here are the total value counts of each category and the output:
df2020["Activity type"].value_counts()
Quote:Activity type
--------
Python 110
Django 106
Research (general) 102
Magick 59
Web design 25
Organizing 21
sysadmin 15
nix general 14
TM prep 12
Writing 11
Self-hypnosis 5
Visualization 1
Now to show what the table looks like in general for one specific category ('Python')
# Collect 'Activity type' column Series 'Python' python_instances = df2020[df2020["Activity type"] == "Python"] python_instancesOutput:
Activity type Duration From To 2 Python 00:33:41 2020-12-24 15:20:42 2020-12-24 15:54:23 9 Python 00:36:58 2020-12-10 12:07:59 2020-12-10 12:44:57 13 Python 00:35:37 2020-12-06 21:45:04 2020-12-06 22:20:41 16 Python 00:44:19 2020-12-03 19:15:41 2020-12-03 20:00:00 18 Python 01:52:29 2020-12-03 07:50:29 2020-12-03 09:42:58 ... ... ... ... ... 454 Python 00:29:14 2020-01-15 22:00:46 2020-01-15 22:30:00 459 Python 00:29:17 2020-01-12 20:40:43 2020-01-12 21:10:00 463 Python 00:29:36 2020-01-09 22:18:24 2020-01-09 22:48:00 464 Python 00:59:56 2020-01-09 17:00:04 2020-01-09 18:00:00 465 Python 00:29:35 2020-01-07 20:40:25 2020-01-07 21:10:00Next is where I calculate the sum time in hours of the
Duration
column (essentially, the time delta between From
and To
):# Calculate (sum) time deltas in Duration Series and convert to hours for 'Python' python_instances = df2020[df2020["Activity type"] == "Python"] time_python_deltas = pd.to_timedelta(python_instances["Duration"]) total_hours_python = time_python_deltas.sum().total_seconds() / 3600 # Ditto for 'Django' django_instances = df2020[df2020["Activity type"] == "Django"] time_django_deltas = pd.to_timedelta(django_instances["Duration"]) total_hours_django = time_django_deltas.sum().total_seconds() / 3600 # Ditto for 'Research (general)' research_instances = df2020[df2020["Activity type"] == "Research (general)"] time_research_deltas = pd.to_timedelta(research_instances["Duration"]) total_hours_research = time_research_deltas.sum().total_seconds() / 3600 # Ditto for 'Web design' web_design_instances = df2020[df2020["Activity type"] == "Web design"] time_web_design_deltas = pd.to_timedelta(web_design_instances["Duration"]) total_hours_web_design = time_web_design_deltas.sum().total_seconds() / 3600 # Ditto for 'Magick' magick_instances = df2020[df2020["Activity type"] == "Magick"] time_magick_deltas = pd.to_timedelta(magick_instances["Duration"]) total_hours_magick = time_magick_deltas.sum().total_seconds() / 3600The above concludes the parsing of 5 Activity type categories. As you can see, it's repetitive. How can I improve my approach here? There must be a better way.
Here is how I the data into bar chart using pandas:
data = { 'Activity Type': ['Python', 'Django', 'Research', 'Web Design', 'Magick'], 'Hours': [total_hours_python, total_hours_django, total_hours_research, total_hours_web_design, total_hours_magick] } # Create a DataFrame df = pd.DataFrame(data) # Set the 'Category' column as the index df.set_index('Activity Type', inplace=True) # Plot the pie chart df.plot.bar(y='Hours', figsize=(8, 8))Attached are the end products so far.