Hi, I have a requirement to forecast certain values point-in-time into the future based on historic data.
Sample data set:
date_time ,no_of_tables ,total_rows_in_mill ,total_bytes_gb ,average_load_time
01/04/2019 00:00 ,4 ,10 ,2 ,1
02/04/2019 00:15 ,5 ,10.5 ,2 ,1
03/04/2019 00:30 ,4 ,12 ,2.2 ,30
04/04/2019 00:45 ,8 ,20 ,4.5 ,40
05/04/2019 01:00 ,10 ,50 ,10 ,150
06/04/2019 01:15 ,10 ,45 ,11 ,180
07/04/2019 01:30 ,11 ,48 ,10 ,200
08/04/2019 01:45 ,10 ,52 ,12 ,180
09/04/2019 02:00 ,8 ,49 ,13 ,130
As show above, these are metrics from an application. Data is bucketed into 15 minutes and shows how many tables were loading in a given 15 minute window, total rows processed in million unit, total size of data in GB. The last column is average load time in seconds and this is what I need to predict into the future at point-in-time based on the 3 variables plus the datetime field.
I have used a linear regression model and got the following.
At the moment, the model is able to predict just based on the variable fields, but not the date/time field.
Any suggestions please ?
I am new to python and machine learning, so any pointers or advise will be great.
Sample data set:
date_time ,no_of_tables ,total_rows_in_mill ,total_bytes_gb ,average_load_time
01/04/2019 00:00 ,4 ,10 ,2 ,1
02/04/2019 00:15 ,5 ,10.5 ,2 ,1
03/04/2019 00:30 ,4 ,12 ,2.2 ,30
04/04/2019 00:45 ,8 ,20 ,4.5 ,40
05/04/2019 01:00 ,10 ,50 ,10 ,150
06/04/2019 01:15 ,10 ,45 ,11 ,180
07/04/2019 01:30 ,11 ,48 ,10 ,200
08/04/2019 01:45 ,10 ,52 ,12 ,180
09/04/2019 02:00 ,8 ,49 ,13 ,130
As show above, these are metrics from an application. Data is bucketed into 15 minutes and shows how many tables were loading in a given 15 minute window, total rows processed in million unit, total size of data in GB. The last column is average load time in seconds and this is what I need to predict into the future at point-in-time based on the 3 variables plus the datetime field.
I have used a linear regression model and got the following.
import pandas as pd from sklearn import linear_model import statsmodels.api as sm df=pd.read_csv("C:/Users/ABCDE/Downloads/PyTestdata.csv") display(df) X = df[['no_of_tables','total_rows_in_mill','total_bytes_gb']] # here we have 3 variables for multiple regression. Y = df['average_load_time'] print('Intercept: \n', regr.intercept_) print('Coefficients: \n', regr.coef_) new_no_of_tables=20 new_total_rows_in_mill=80 new_total_bytes_gb=20 print ('Predicted average_load_time: \n', regr.predict([[new_no_of_tables ,new_total_rows_in_mill,new_total_bytes_gb]]))However I am not sure how to include date_time into the model. For e.g I want to be able to answer question such as "What will be the average load time if you load 20 tables, 30 million rows, 20GB of data at 10/04/2019 01:00 ?
At the moment, the model is able to predict just based on the variable fields, but not the date/time field.
Any suggestions please ?
I am new to python and machine learning, so any pointers or advise will be great.