May-07-2019, 01:52 PM
(This post was last modified: May-07-2019, 01:53 PM by mk1216. Edited 1 time in total.

*Edit Reason: change file path*)
Hi, I have a requirement to forecast certain values point-in-time into the future based on historic data.

Sample data set:

date_time ,no_of_tables ,total_rows_in_mill ,total_bytes_gb ,average_load_time

01/04/2019 00:00 ,4 ,10 ,2 ,1

02/04/2019 00:15 ,5 ,10.5 ,2 ,1

03/04/2019 00:30 ,4 ,12 ,2.2 ,30

04/04/2019 00:45 ,8 ,20 ,4.5 ,40

05/04/2019 01:00 ,10 ,50 ,10 ,150

06/04/2019 01:15 ,10 ,45 ,11 ,180

07/04/2019 01:30 ,11 ,48 ,10 ,200

08/04/2019 01:45 ,10 ,52 ,12 ,180

09/04/2019 02:00 ,8 ,49 ,13 ,130

As show above, these are metrics from an application. Data is bucketed into 15 minutes and shows how many tables were loading in a given 15 minute window, total rows processed in million unit, total size of data in GB. The last column is average load time in seconds and this is what I need to predict into the future at point-in-time based on the 3 variables plus the datetime field.

I have used a linear regression model and got the following.

At the moment, the model is able to predict just based on the variable fields, but not the date/time field.

Any suggestions please ?

I am new to python and machine learning, so any pointers or advise will be great.

Sample data set:

date_time ,no_of_tables ,total_rows_in_mill ,total_bytes_gb ,average_load_time

01/04/2019 00:00 ,4 ,10 ,2 ,1

02/04/2019 00:15 ,5 ,10.5 ,2 ,1

03/04/2019 00:30 ,4 ,12 ,2.2 ,30

04/04/2019 00:45 ,8 ,20 ,4.5 ,40

05/04/2019 01:00 ,10 ,50 ,10 ,150

06/04/2019 01:15 ,10 ,45 ,11 ,180

07/04/2019 01:30 ,11 ,48 ,10 ,200

08/04/2019 01:45 ,10 ,52 ,12 ,180

09/04/2019 02:00 ,8 ,49 ,13 ,130

As show above, these are metrics from an application. Data is bucketed into 15 minutes and shows how many tables were loading in a given 15 minute window, total rows processed in million unit, total size of data in GB. The last column is average load time in seconds and this is what I need to predict into the future at point-in-time based on the 3 variables plus the datetime field.

I have used a linear regression model and got the following.

import pandas as pd from sklearn import linear_model import statsmodels.api as sm df=pd.read_csv("C:/Users/ABCDE/Downloads/PyTestdata.csv") display(df) X = df[['no_of_tables','total_rows_in_mill','total_bytes_gb']] # here we have 3 variables for multiple regression. Y = df['average_load_time'] print('Intercept: \n', regr.intercept_) print('Coefficients: \n', regr.coef_) new_no_of_tables=20 new_total_rows_in_mill=80 new_total_bytes_gb=20 print ('Predicted average_load_time: \n', regr.predict([[new_no_of_tables ,new_total_rows_in_mill,new_total_bytes_gb]]))However I am not sure how to include date_time into the model. For e.g I want to be able to answer question such as "What will be the average load time if you load 20 tables, 30 million rows, 20GB of data at 10/04/2019 01:00 ?

At the moment, the model is able to predict just based on the variable fields, but not the date/time field.

Any suggestions please ?

I am new to python and machine learning, so any pointers or advise will be great.