Python Forum

Full Version: Parsing "aTimeLogger" Android app data to graphs using pandas
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Greetings Pythonistas!

I’ve got a data set in CSV format which spans from July 2019 up to December 2023. It is a collection of productivity intervals that I’ve meticulously recorded while doing social science research like Philosophy as well as learning how to program Python, Django, some DevOps, and a little algorithmic design. When I finish spending 25 minutes on my lunch hour on my tablet watching Udemy course content teaching the basics of Binary Search Trees, at the end of my lunch break, using an app called aTimeLogger on my Android phone I enter the “Activity” type (discrete subjects such as “Python”, “algorithms”, “Django”, “sysadmin”, or even “Magick” / “writing”). I also enter the date, the start time, and the end time. After entering a start time and end time, a time delta is automatically calculated. Then I write an annotation (1-2 sentences) which is sort of like a mental note for my future reference. See attached for the data set. Take note: I purged the “Comment” (annotation) data objects (string fields) from the data set for the sake of this forum thread.

For additional context, over the term of the data set, I’ve spent a total of ~378 hours doing something “Python” related and ~579 hours working on Django course content (or a Django based web project). Those are the two largest categories. The rest of the Activities aren’t as data dense.

I am trying to plot these activities as line graphs over time. The x-axis is the continuous element of time. The y-axis are the summed duration time deltas for every entry. If I take just the ‘Python’ activity and plot every instance as a line graph over the ~5 year span/period, the data points are noisy, busy, and hard to follow. But they still plot successfully. Here is my initilization of the data:

import pandas as pd
import matplotlib.pyplot as plt

bulk_df = pd.read_csv("data/all-comments-removed.csv",parse_dates=["From","To",])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
Here is some general info about my dataset:

bulk_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2583 entries, 0 to 2582
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype          
---  ------    --------------  -----          
 0   Activity  2583 non-null   object         
 1   Duration  2583 non-null   timedelta64[ns]
 2   From      2583 non-null   datetime64[ns] 
 3   To        2583 non-null   datetime64[ns] 
dtypes: datetime64[ns](2), object(1), timedelta64[ns](1)
memory usage: 80.8+ KB
Here is my ‘Python’ activity data parsed:

# Create a DataFrame
python = bulk_df[bulk_df["Activity"] == "Python"] #.rolling(window=90).mean()

# Plotting
python.plot(x='From', y='Duration', kind='line', marker='o',figsize=(14, 8))
Here is the output so far:

[attachment=2860]

That kind of works.

But what I really want to do is use pandas’ rolling() method and pass in integers like ‘182’ days for a half year and ‘90’ for 3 months (quarter year). I tried that . You can see my code snippet above where I tried to use .rolling(). My Jupyter Notebook didn’t like it so I commented it out. I am clearly doing something wrong.

Here was a separate attempt at a different stage in my coding session this morning where I tried to plot the “Python” activity with the original noisy data points which I tried to simulatenously contrast with a rolling window:

# Convert timedelta to hours
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Create a DataFrame
python = bulk_df[bulk_df["Activity"] == "Python"]

# Plotting
# python.plot(x='From', y='Duration_hours', kind='line', marker='o')

# Calculate the rolling mean with a window of 2
rolling_mean = python['Duration_hours'].rolling(window=2).mean()

# Plotting
python.plot(x='From', y='Duration_hours', kind='line', marker='o',figsize=(14, 8))
rolling_mean.plot(x='From', y='Duration_hours', kind='line', marker='o', label='Rolling Mean')
Here was the strange output:

[attachment=2861]

How do I tell pandas to show the quarterly and half year averages of the timedelta data point for the “Python” activity. Thanks!
Toq uote myself from earlier:

(May-12-2024, 09:15 AM)Drone4four Wrote: [ -> ]
# Create a DataFrame
python = bulk_df[bulk_df["Activity"] == "Python"] #.rolling(window=90).mean()

# Plotting
python.plot(x='From', y='Duration', kind='line', marker='o',figsize=(14, 8))

At line 2, when I invoke pandas' dt module to parse quarter years with this:

python = bulk_df[bulk_df["Activity"] == "Python"].dt.quarter
I get this traceback:

Error:
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) /tmp/ipykernel_3152351/2962891946.py in ?() 1 # Create a DataFrame 2 # python = bulk_df[bulk_df["Activity"] == "Python"] #.rolling(window=90).mean() ----> 3 python = bulk_df[bulk_df["Activity"] == "Python"].dt.quarter 4 5 # Plotting 6 python.plot(x='From', y='Duration', kind='line', marker='o',figsize=(14, 8)) /usr/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name) 5898 and name not in self._accessors 5899 and self._info_axis._can_hold_identifiers_and_holds_name(name) 5900 ): 5901 return self[name] -> 5902 return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'dt'
Can try something like this.
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt

bulk_df = pd.read_csv('all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600
# Copy so changes made to python_df dos not affect bulk_df and vice versa
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)
# Calculate rolling means using the index now
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()
# Plot
plt.figure(figsize=(14, 8))
plt.plot(python_df.index, python_df['Duration_hours'], marker='o', linestyle='', label='Raw Duration Data')
plt.plot(python_df.index, python_df['Rolling_Mean_90'], label='90-Day Rolling Average')
plt.plot(python_df.index, python_df['Rolling_Mean_182'], label='182-Day Rolling Average')
plt.title("Python Activity with Rolling Averages")
plt.xlabel("Date")
plt.ylabel("Hours Spent")
plt.legend()
plt.show()
@snippsat: Thank you! This is terrific. It works really well.

I made some light changes. For example, I commented out the “Python” activity raw duration data as well as the “Python” 182-day rolling mean line and then added the “Django” activity 90-day rolling mean. This enables me to compare the two categories. Here is how it looks now:
[attachment=2876]
I thought the average/mean was what I needed, but apparently not. The lines gyrate sporadically. That’s not really what I had in mind. To refine my intention, what I was intending to achieve is all the “Python” activity data stacked (summed together) in quarterly increments. There should only be one data point every 3 months.

There is another way of explaining what I have set out to accomplish. If I replot the data as a bar graph with:

plt.bar(python_df.index, python_df['Rolling_Mean_90'], label='Python 90-Day Rolling Average')
plt.bar(django_df.index, django_df['Rolling_Mean_90'], label='Django 90-Day Rolling Average', color="red")
As-is that parses like this:
[attachment=2877]
It looks to me like every individual data point is getting plotted. What I really want to do is stack all the “Python” and “Django” data points collected and summed together in 90 day intervals rather than all individually.

What changes do I need to make in order to achieve that?
Its really difficult to understand! Bonkself
Try this:

python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()
py_dj_Month_combined = python_df_Month.add(django_df_Month, fill_value=0)
plt.figure(figsize=(14, 8))
plt.bar(py_dj_Month_combined.index, py_dj_Month_combined, label='django-90-day Rolling Mean', color='blue')
plt.bar(python_df_Month.index, python_df_Month, label='python-90-day Rolling Mean', color='red')
plt.legend()
plt.show()
@ericxzhou! Thank you. This worked. My apologies on my delayed response.

Here is how Jupyter with pandas and matplotlib parse the code now:

[attachment=2893]

As you can see in the bottom (second) bar graph, all the data points are showing. Thanks gain ericxzhou. Although based mostly on your suggestions, I made some slight modifications. Below is my latest code snippet.

import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt

# Load the data
bulk_df = pd.read_csv('data/all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Copy and filter data for Python activity
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)

# Calculate rolling means
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()

# Plot raw data and rolling mean
plt.figure(figsize=(12, 6))
plt.plot(python_df.index, python_df['Duration_hours'], marker='o', linestyle='', label='Raw Python Duration Data')
plt.plot(python_df.index, python_df['Rolling_Mean_90'], label='90-Day Rolling Average')
# plt.plot(python_df.index, python_df['Rolling_Mean_182'], label='182-Day Rolling Average')
plt.title("Python Activity with Rolling Averages")
plt.xlabel("Date")
plt.ylabel("Hours Spent")
plt.legend()

# Resample and combine data
python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()
py_dj_Month_combined = python_df_Month.add(django_df_Month, fill_value=0)

# Plot the combined data with wider bars
plt.figure(figsize=(12, 6))
plt.bar(py_dj_Month_combined.index, py_dj_Month_combined, width=20, label='Django 90-Day Rolling Mean')  # Adjust width as needed
plt.bar(python_df_Month.index, python_df_Month, width=20, label='Python 90-Day Rolling Mean')  # Adjust width as needed
plt.legend()
plt.show()
I have returned to this same project and wish to extend my data analysis. My two latest code snippets and graphs can be found below.

Here are my questions for each pair:

  1. In the first snippet and graph, in my Jupyter Notebook pandas and matplotlib show two categories successfully thanks to the helpful feedback from other forum members. So thank you to those who have contributed to the discussion so far. But I noticed that when I change the alpha (translucency) variable, the time spent on the different categories overlap each other. How do I stack the data instead? That’s my first question.
  2. In the second snippet and graph, only one category shows up (”Magick”). How do I get the other “Research” category to show? As far as I can tell, the way I parse, modify, and cast function calls and methods against the two dataframes should work. I’ve been swapping out variable names, tried refactoring, as well as making large and small other changes without success. Who here can identify what I might be missing to get both categories to show (instead of one)? (My additional intent here is to ensure they also stack (rather than overlapping) like I have set out to do with the first graph).
First pair:
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt
 
bulk_df = pd.read_csv('data/all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Copy so changes made to python_df dos not affect bulk_df and vice versa
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)

# Calculate rolling means using the index now
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()

# Copy so changes made to django_df dos not affect bulk_df and vice versa
django_df = bulk_df[bulk_df["Activity"] == "Django"].copy()
django_df.set_index('From', inplace=True)
# Calculate rolling means using the index now
django_df['Rolling_Mean_90'] = django_df['Duration_hours'].rolling('90D').mean()
django_df['Rolling_Mean_182'] = django_df['Duration_hours'].rolling('182D').mean()

python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()
# py_dj_Month_combined = python_df_Month.add(django_df_Month, fill_value=0)

plt.figure(figsize=(14, 8))
plt.bar(python_df_Month.index, python_df_Month, label='Python 90-Day Rolling Mean',width=20, alpha=0.5) # color='red')
plt.bar(django_df_Month.index, django_df_Month, label='Django 90-Day Rolling Mean', width=20, alpha=0.5) #, color='blue')
plt.legend()
plt.title('Stacked Bar Chart for Python and Django Activities')
plt.xlabel('Date')
plt.ylabel('Hours Spent')
plt.show()
That renders as:
[attachment=2903]

Second pair:
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt

# Load the data
bulk_df = pd.read_csv('data/all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Copy and filter data for "Magick" activity and calculate rolling means
magick_df = bulk_df[bulk_df["Activity"] == "Magick"].copy()
magick_df.set_index('From', inplace=True)
magick_df['Rolling_Mean_90'] = magick_df['Duration_hours'].rolling('90D').mean()
magick_df['Rolling_Mean_182'] = magick_df['Duration_hours'].rolling('182D').mean()

# Copy and filter data for "Research (general)" activity and calculate rolling means
research_df = bulk_df[bulk_df["Activity"] == "Research (general)"].copy()
research_df.set_index('From', inplace=True)
research_df['Rolling_Mean_90'] = research_df['Duration_hours'].rolling('90D').mean()
research_df['Rolling_Mean_182'] = research_df['Duration_hours'].rolling('182D').mean()

# Resample data
magick_df_Month = magick_df['Rolling_Mean_90'].resample('MS').sum()
research_df_Month = research_df['Rolling_Mean_90'].resample('MS').sum()

# Plot the combined data with wider bars
plt.figure(figsize=(12, 6))
plt.bar(research_df_Month.index, research_df_Month, label='"Research" 90-Day Rolling Mean', width=20, alpha=0.5, color='blue')
plt.bar(magick_df_Month.index, magick_df_Month, label='"Magick" ("Philosophy") 90-Day Rolling Mean',width=20, alpha=0.5, color='red')

plt.legend()
plt.title('Stacked Bar Chart for Magick and Research Activities')
plt.xlabel('Date')
plt.ylabel('Hours Spent')
plt.show()
That shows as:
[attachment=2904]
Eureka! I got it working.

Here is my final end product along with the working code snippet.

[attachment=2922]

import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt

# Load the data
bulk_df = pd.read_csv('data/all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Copy and filter data for "Magick" activity and calculate rolling means
magick_df = bulk_df[bulk_df["Activity"] == "Magick"].copy()
magick_df.set_index('From', inplace=True)
magick_df['Rolling_Mean_90'] = magick_df['Duration_hours'].rolling('90D').mean()
magick_df['Rolling_Mean_182'] = magick_df['Duration_hours'].rolling('182D').mean()

# Copy and filter data for "Research (general)" activity and calculate rolling means
research_df = bulk_df[bulk_df["Activity"] == "Research"].copy()
research_df.set_index('From', inplace=True)
research_df['Rolling_Mean_90'] = research_df['Duration_hours'].rolling('90D').mean()
research_df['Rolling_Mean_182'] = research_df['Duration_hours'].rolling('182D').mean()

# Resample data
magick_df_Month = magick_df['Rolling_Mean_90'].resample('MS').sum()
research_df_Month = research_df['Rolling_Mean_90'].resample('MS').sum()

# Copy and filter data for "Python" activity and calculate rolling means
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()

# Copy and filter data for "Django" activity and calculate rolling means
django_df = bulk_df[bulk_df["Activity"] == "Django"].copy()
django_df.set_index('From', inplace=True)
django_df['Rolling_Mean_90'] = django_df['Duration_hours'].rolling('90D').mean()
django_df['Rolling_Mean_182'] = django_df['Duration_hours'].rolling('182D').mean()

# Resample data
python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()

# Create a common index
common_index = magick_df_Month.index.union(research_df_Month.index).union(python_df_Month.index).union(django_df_Month.index)

# Reindex all Series to the common index
magick_df_Month = magick_df_Month.reindex(common_index, fill_value=0)
research_df_Month = research_df_Month.reindex(common_index, fill_value=0)
python_df_Month = python_df_Month.reindex(common_index, fill_value=0)
django_df_Month = django_df_Month.reindex(common_index, fill_value=0)

# Plot the combined data with stacked bars
plt.figure(figsize=(14, 8))

# Plot first activity
plt.bar(magick_df_Month.index, magick_df_Month, label='"Magick" ("Philosophy")', width=20, alpha=1, color='#3D9E60')

# Plot second activity, stacked on the first
plt.bar(research_df_Month.index, research_df_Month, label='"Research"', width=20, alpha=1, color='#2A6D20', bottom=magick_df_Month)

# Plot third activity, stacked on the first and second
plt.bar(django_df_Month.index, django_df_Month, label='"Django"', width=20, alpha=1, color='#FB947E', bottom=magick_df_Month + research_df_Month)

# Plot fourth activity, stacked on the first, second, and third
plt.bar(python_df_Month.index, python_df_Month, label='"Python"', width=20, alpha=1, color='#D95134', bottom=magick_df_Month + research_df_Month + django_df_Month)

plt.legend()
plt.title('Stacked Bar Chart for Python, Django, Magick, Research Activities 90-Day Rolling Mean')
plt.xlabel('Quarter (90 day increments)')
plt.ylabel('Hours Spent')
plt.show()