predicting values at point in time - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: predicting values at point in time (/thread-18152.html)

predicting values at point in time - mk1216 - May-07-2019

Hi,

I have a requirement to forecast certain values at a point in time in the future, based on historic data.

Sample data set:

    date_time,no_of_tables,total_rows_in_mill,total_bytes_gb,average_load_time
    01/04/2019 00:00,4,10,2,1
    02/04/2019 00:15,5,10.5,2,1
    03/04/2019 00:30,4,12,2.2,30
    04/04/2019 00:45,8,20,4.5,40
    05/04/2019 01:00,10,50,10,150
    06/04/2019 01:15,10,45,11,180
    07/04/2019 01:30,11,48,10,200
    08/04/2019 01:45,10,52,12,180
    09/04/2019 02:00,8,49,13,130

As shown above, these are metrics from an application. The data is bucketed into 15-minute windows and shows how many tables were loading in a given window, the total rows processed (in millions), and the total size of the data in GB. The last column is the average load time in seconds; this is what I need to predict at a future point in time, based on the 3 variables plus the datetime field.

I have used a linear regression model and got the following.

    import pandas as pd
    from sklearn import linear_model
    import statsmodels.api as sm

    df = pd.read_csv("C:/Users/ABCDE/Downloads/PyTestdata.csv")
    display(df)

    # here we have 3 variables for multiple regression.
    X = df[['no_of_tables', 'total_rows_in_mill', 'total_bytes_gb']]
    Y = df['average_load_time']

    print('Intercept: \n', regr.intercept_)
    print('Coefficients: \n', regr.coef_)

    new_no_of_tables = 20
    new_total_rows_in_mill = 80
    new_total_bytes_gb = 20
    print('Predicted average_load_time: \n', regr.predict([[new_no_of_tables, new_total_rows_in_mill, new_total_bytes_gb]]))

However, I am not sure how to include date_time in the model. For example, I want to be able to answer a question such as: "What will be the average load time if you load 20 tables, 30 million rows, 20GB of data at 10/04/2019 01:00?" At the moment, the model can predict based only on the variable fields, not on the date/time field. Any suggestions, please?
I am new to Python and machine learning, so any pointers or advice would be great.

RE: predicting values at point in time - mk1216 - May-07-2019

Missed the regression library call in the previous post; here is the full program:

    import pandas as pd
    from sklearn import linear_model
    import statsmodels.api as sm

    df = pd.read_csv("C:/Users/ABCDE/Downloads/PyTestdata.csv")
    display(df)  # display() is provided by Jupyter/IPython; use print(df) in a plain script

    # here we have 3 variables for multiple regression.
    X = df[['no_of_tables', 'total_rows_in_mill', 'total_bytes_gb']]
    Y = df['average_load_time']

    # with sklearn
    regr_avg_load_time = linear_model.LinearRegression()
    regr_avg_load_time.fit(X, Y)

    print('Intercept: \n', regr_avg_load_time.intercept_)
    print('Coefficients: \n', regr_avg_load_time.coef_)

    # predict the load time for a new set of input values
    new_no_of_tables = 20
    new_total_rows_in_mill = 80
    new_total_bytes_gb = 20
    print('Predicted average_load_time: \n', regr_avg_load_time.predict([[new_no_of_tables, new_total_rows_in_mill, new_total_bytes_gb]]))
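
One possible way to bring date_time into the regression, offered only as a sketch and not as an answer given in the thread: parse the column with pandas and derive numeric features from it, then fit on those alongside the three existing variables. Several things here are assumptions rather than facts from the post: the dates are read as day/month/year, the derived columns `minute_of_day` and `day_of_week` and the helper `add_time_features` are hypothetical names, and the sample rows are embedded inline instead of being read from the CSV file.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample rows copied from the post; in practice this would come from pd.read_csv on the real file.
CSV = """date_time,no_of_tables,total_rows_in_mill,total_bytes_gb,average_load_time
01/04/2019 00:00,4,10,2,1
02/04/2019 00:15,5,10.5,2,1
03/04/2019 00:30,4,12,2.2,30
04/04/2019 00:45,8,20,4.5,40
05/04/2019 01:00,10,50,10,150
06/04/2019 01:15,10,45,11,180
07/04/2019 01:30,11,48,10,200
08/04/2019 01:45,10,52,12,180
09/04/2019 02:00,8,49,13,130
"""

# Hypothetical feature set: the three original variables plus two derived time features.
FEATURES = ["no_of_tables", "total_rows_in_mill", "total_bytes_gb",
            "minute_of_day", "day_of_week"]


def add_time_features(frame):
    """Return a copy of frame with numeric columns derived from date_time."""
    out = frame.copy()
    # Assumption: dates are day/month/year, so 01/04/2019 means 1 April 2019.
    ts = pd.to_datetime(out["date_time"], format="%d/%m/%Y %H:%M")
    out["minute_of_day"] = ts.dt.hour * 60 + ts.dt.minute
    out["day_of_week"] = ts.dt.dayofweek  # Monday = 0
    return out


df = add_time_features(pd.read_csv(io.StringIO(CSV)))
model = LinearRegression().fit(df[FEATURES], df["average_load_time"])

# The question from the post: 20 tables, 30 million rows, 20 GB at 10/04/2019 01:00.
query = add_time_features(pd.DataFrame([{
    "date_time": "10/04/2019 01:00",
    "no_of_tables": 20,
    "total_rows_in_mill": 30,
    "total_bytes_gb": 20,
}]))
predicted = model.predict(query[FEATURES])[0]
print("Predicted average_load_time:", predicted)
```

With only nine rows the fitted coefficients are illustrative at best, and a plain linear term for time of day cannot capture a daily cycle. If the load time does follow one, encoding `minute_of_day` as a sine/cosine pair, or using a dedicated time-series model, may be worth trying on a larger history.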