predicting values at point in time

predicting values at point in time - mk1216 - May-07-2019

Hi, I need to forecast certain values at a specific point in time in the future, based on historical data.

Sample data set:
date_time,no_of_tables,total_rows_in_mill,total_bytes_gb,average_load_time
01/04/2019 00:00,4,10,2,1
02/04/2019 00:15,5,10.5,2,1
03/04/2019 00:30,4,12,2.2,30
04/04/2019 00:45,8,20,4.5,40
05/04/2019 01:00,10,50,10,150
06/04/2019 01:15,10,45,11,180
07/04/2019 01:30,11,48,10,200
08/04/2019 01:45,10,52,12,180
09/04/2019 02:00,8,49,13,130

As shown above, these are metrics from an application. The data is bucketed into 15-minute windows; each row shows how many tables were loading in that window, the total rows processed (in millions), and the total data size (in GB). The last column is the average load time in seconds, which is what I need to predict for a future point in time, based on the three variables plus the date_time field.
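
For reference, a minimal sketch of how rows like these could be loaded with date_time parsed as a real timestamp. It assumes the dates are day-first (dd/mm/yyyy), which the post does not state, and `minute_of_day` is a hypothetical column name added only for illustration.

```python
import io

import pandas as pd

# First three sample rows, written as clean CSV (no spaces around the commas).
sample = io.StringIO(
    "date_time,no_of_tables,total_rows_in_mill,total_bytes_gb,average_load_time\n"
    "01/04/2019 00:00,4,10,2,1\n"
    "02/04/2019 00:15,5,10.5,2,1\n"
    "03/04/2019 00:30,4,12,2.2,30\n"
)
df = pd.read_csv(sample)

# Assumption: day-first dates (dd/mm/yyyy); adjust the format if the data is month-first.
df["date_time"] = pd.to_datetime(df["date_time"], format="%d/%m/%Y %H:%M")

# Minutes since midnight identifies the 15-minute bucket within a day.
df["minute_of_day"] = df["date_time"].dt.hour * 60 + df["date_time"].dt.minute

print(df[["date_time", "minute_of_day", "average_load_time"]])
```

Once date_time is a datetime column, the `.dt` accessor gives numeric parts (hour, minute, day of week) that a regression can use.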

I have used a linear regression model and got the following.

import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm

df=pd.read_csv("C:/Users/ABCDE/Downloads/PyTestdata.csv")

display(df)

X = df[['no_of_tables','total_rows_in_mill','total_bytes_gb']] # here we have 3 variables for multiple regression. 
Y = df['average_load_time']

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

new_no_of_tables=20
new_total_rows_in_mill=80
new_total_bytes_gb=20

print ('Predicted average_load_time: \n', regr.predict([[new_no_of_tables ,new_total_rows_in_mill,new_total_bytes_gb]]))

However, I am not sure how to include date_time in the model. For example, I want to be able to answer a question such as: "What will the average load time be if I load 20 tables, 30 million rows, and 20 GB of data at 10/04/2019 01:00?"

At the moment, the model predicts from the three variable fields only, not the date/time field.

Any suggestions, please?
I am new to Python and machine learning, so any pointers or advice would be great.

RE: predicting values at point in time - mk1216 - May-07-2019

I missed the regression call in my previous post; here is the full program:

import pandas as pd
from sklearn import linear_model

# Load the 15-minute metrics exported from the application.
df = pd.read_csv("C:/Users/ABCDE/Downloads/PyTestdata.csv")

# display() is available in Jupyter/IPython; use print(df) in a plain script.
display(df)

# Three predictor variables for the multiple regression.
feature_cols = ['no_of_tables', 'total_rows_in_mill', 'total_bytes_gb']
X = df[feature_cols]
Y = df['average_load_time']

# with sklearn
regr_avg_load_time = linear_model.LinearRegression()
regr_avg_load_time.fit(X, Y)

print('Intercept: \n', regr_avg_load_time.intercept_)
print('Coefficients: \n', regr_avg_load_time.coef_)

new_no_of_tables = 20
new_total_rows_in_mill = 80
new_total_bytes_gb = 20

# Predict from a DataFrame with the same column names that were used for fitting.
new_X = pd.DataFrame(
    [[new_no_of_tables, new_total_rows_in_mill, new_total_bytes_gb]],
    columns=feature_cols,
)
print('Predicted average_load_time: \n', regr_avg_load_time.predict(new_X))
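
One possible way to bring date_time into the regression above is to turn the timestamp into numeric features and fit on them together with the three existing variables. The sketch below is only an illustration under several assumptions: the dates are day-first (dd/mm/yyyy), the time of day is what matters, and it is encoded as a sine/cosine pair so that 23:45 and 00:00 end up close together. The helper `add_time_features` and the columns `tod_sin` and `tod_cos` are hypothetical names, not part of the original program. Nine rows are far too few for a reliable model, so the number it prints should not be taken seriously.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample rows from the first post, written as clean CSV.
SAMPLE_CSV = """date_time,no_of_tables,total_rows_in_mill,total_bytes_gb,average_load_time
01/04/2019 00:00,4,10,2,1
02/04/2019 00:15,5,10.5,2,1
03/04/2019 00:30,4,12,2.2,30
04/04/2019 00:45,8,20,4.5,40
05/04/2019 01:00,10,50,10,150
06/04/2019 01:15,10,45,11,180
07/04/2019 01:30,11,48,10,200
08/04/2019 01:45,10,52,12,180
09/04/2019 02:00,8,49,13,130
"""

BASE_COLS = ["no_of_tables", "total_rows_in_mill", "total_bytes_gb"]
TIME_COLS = ["tod_sin", "tod_cos"]  # hypothetical feature names


def add_time_features(frame):
    """Return a copy with cyclical time-of-day columns (hypothetical helper)."""
    out = frame.copy()
    # Assumption: date_time strings are day-first (dd/mm/yyyy HH:MM).
    out["date_time"] = pd.to_datetime(out["date_time"], format="%d/%m/%Y %H:%M")
    minutes = out["date_time"].dt.hour * 60 + out["date_time"].dt.minute
    angle = 2 * np.pi * minutes / (24 * 60)
    out["tod_sin"] = np.sin(angle)
    out["tod_cos"] = np.cos(angle)
    return out


df = add_time_features(pd.read_csv(io.StringIO(SAMPLE_CSV)))

model = LinearRegression()
model.fit(df[BASE_COLS + TIME_COLS], df["average_load_time"])

# The question from the first post: 20 tables, 30 million rows, 20 GB at 10/04/2019 01:00.
query = pd.DataFrame(
    {
        "date_time": ["10/04/2019 01:00"],
        "no_of_tables": [20],
        "total_rows_in_mill": [30],
        "total_bytes_gb": [20],
    }
)
query = add_time_features(query)

predicted = model.predict(query[BASE_COLS + TIME_COLS])
print("Predicted average_load_time:", predicted[0])
```

With more history, other parts of the timestamp (for example `dt.dayofweek`) could be added in the same way. If load time depends mainly on its own recent values rather than on the time of day, a dedicated time-series model may be a better fit than plain linear regression.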