Feb-01-2024, 04:24 AM
Hello,
I am trying to figure out why I can't get this to work. I have a SQL Server database that I can connect to and query via pyodbc (SQLAlchemy won't work for me), and I save the results as a CSV file. That file combines over 100 tables with about 84,000 rows each. Each row is a TIMESTAMP at 15-minute intervals (energy meters), and I am trying to sum the consumption values (already computed in the CSV as the difference from the previous VALUE for each TIMESTAMP). In this Python script, I want to generate 54 columns: the first column is the TableName from the CSV, the second is the most recent TIMESTAMP of the associated data pull, then Week_1, Week_2, Week_3 ... Week_52. The week columns should sum the Consumption column from the CSV for each week, starting from today's date and going back a full 52 weeks (a full year). Each row in the output should hold one unique table name.
Python generates a CSV with a unique table name per row, but the calculations come out close to 3,270,631, whereas the total should be 15,769 kWh for the entire week. Any help would be greatly appreciated. I did make sure TIMESTAMP is parsed as a pandas datetime, so I'm not sure what it is basing the calculations on. Here is my code:
import pandas as pd
# Load the combined_data.csv file
combined_df = pd.read_csv('combined_data.csv')
# Ensure 'TIMESTAMP' is in datetime format
combined_df['TIMESTAMP'] = pd.to_datetime(combined_df['TIMESTAMP'])
# Calculate the week number for each row
combined_df['Week_Number'] = combined_df['TIMESTAMP'].dt.strftime('%Y-%U')
# Calculate Consumption based on the previous row
combined_df['Consumption'] = combined_df.groupby('TableName')['VALUE'].diff()
# Exclude rows where Consumption is zero or negative
# (where() keeps the column numeric; apply() with pd.NA would coerce it to object dtype)
combined_df['Consumption'] = combined_df['Consumption'].where(combined_df['Consumption'] > 0)
# Keep the current week plus the three previous weeks, comparing actual
# timestamps instead of '%Y-%U' strings (pd.to_datetime('2024-04') parses
# as April 2024, not week 4, so the string comparisons filter wrongly)
today = pd.Timestamp('today').normalize()
filtered_df = combined_df[combined_df['TIMESTAMP'] >= today - pd.DateOffset(weeks=4)]
# Group by TableName and Week_Number and sum the Consumption values
weekly_sum_df = filtered_df.groupby(['TableName', 'Week_Number'])['Consumption'].sum().unstack().reset_index()
# Rename the columns to represent each week
weekly_sum_df.columns = ['TableName'] + [f'Week_{i}' for i in range(1, weekly_sum_df.shape[1])]
# Save the result to a CSV file
weekly_sum_df.to_csv('weekly_sum_by_table.csv', index=False)
print("Weekly sums by table saved to 'weekly_sum_by_table.csv'.")