Basic data analysis and predictions


Basic data analysis and predictions - mates - Mar-06-2020

Hi guys.
I'm fairly new to Python, but I'm trying to do a basic data analysis and prediction for my diploma thesis. The thesis is about predicting the future growth of the Chinese air transport market. I've got data on tickets sold from 1974 to 2018, and the goal is to use machine learning to extrapolate the upward trend to the year 2025 based on the historical data.

This is how far I've got.

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC



# Load dataset
filename = 'analyze_me.csv'
names = ['year', 'passengers']
dataset = read_csv(filename, names=names)



# head
print(dataset.head(50))

# Split-out validation dataset
array = dataset.values
x = array[:,1]
y = array[:,0]
X_train, X_validation, Y_train, Y_validation, = train_test_split(x, y, test_size=0.20, random_state=1)
print ()
print ()
print('x_train= ',X_train)
print ('X_validation = ',X_validation)
print('Y_train= ',Y_train)
print ('Y_validation = ',Y_validation)


# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=2, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
    
# Example of training a final classification model
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)

The first problem is that my X_train and Y_train values are not in order when I print them. I think this could hurt the accuracy of what I'm trying to achieve here, but I'm not sure.

Output:x_train=  [183613132 611439830 158013351  52277000  19520000 136721623  86040642
 266293020 390878784   1540000  53234000  17000000   5000000 352795296
  61891807  72660653 318475924  37601000  27345000  55853100   2568000
  83671798  12500000   3236000 487960477   1000000  16596100    710000
  11080000   2519000   7300000   3836000   3942000  10000000 551234509
 292160158]
X_validation =  [  1110000   1050000 119789024  47564500  51770100 436183969  31312500
 229062099 191001220]
Y_train=  [2007 2018 2006 1997 1991 2005 2003 2010 2014 1978 1998 1988 1984 2013
 2000 2001 2012 1994 1992 1999 1980 2002 1987 1981 2016 1975 1990 1974
 1989 1979 1985 1983 1982 1986 2017 2011]
Y_validation =  [1977 1976 2004 1995 1996 2015 1993 2009 2008]

The second problem is that I'm not sure whether I trained my model well enough, because I had to ditch the k-fold cross-validation technique. I was getting an error saying that n_splits is too large, even with n_splits=2.

Error:
ValueError: n_splits=2 cannot be greater than the number of members in each class.
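
A note on this error: StratifiedKFold treats every distinct target value as a class, and it raises this ValueError when every class has fewer than n_splits members. With the year as the target, each "class" occurs exactly once. The sketch below reproduces this on a small hypothetical array (not the real data) and shows that plain KFold, which does not look at the target values, splits the same rows without complaint.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy data: six "years" used as the target, each occurring once
X = np.arange(6).reshape(-1, 1)
y = np.array([1974, 1975, 1976, 1977, 1978, 1979])

# StratifiedKFold sees six classes of size 1, so no stratified split exists
try:
    list(StratifiedKFold(n_splits=2).split(X, y))
    stratified_error = None
except ValueError as exc:
    stratified_error = str(exc)
print("StratifiedKFold:", stratified_error)

# KFold ignores the target values, so it suits regression targets
folds = list(KFold(n_splits=3, shuffle=True, random_state=1).split(X))
print("KFold produced", len(folds), "folds")
```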

Here is my original dataset.

Output: Year                  Passengers
0   1974                          710000
1   1975                         1000000
2   1976                         1050000
3   1977                         1110000
4   1978                         1540000
5   1979                         2519000
6   1980                         2568000
7   1981                         3236000
8   1982                         3942000
9   1983                         3836000
10  1984                         5000000
11  1985                         7300000
12  1986                        10000000
13  1987                        12500000
14  1988                        17000000
15  1989                        11080000
16  1990                        16596100
17  1991                        19520000
18  1992                        27345000
19  1993                        31312500
20  1994                        37601000
21  1995                        47564500
22  1996                        51770100
23  1997                        52277000
24  1998                        53234000
25  1999                        55853100
26  2000                        61891807
27  2001                        72660653
28  2002                        83671798
29  2003                        86040642
30  2004                       119789024
31  2005                       136721623
32  2006                       158013351
33  2007                       183613132
34  2008                       191001220
35  2009                       229062099
36  2010                       266293020
37  2011                       292160158
38  2012                       318475924
39  2013                       352795296
40  2014                       390878784
41  2015                       436183969
42  2016                       487960477
43  2017                       551234509
44  2018                       611439830

If any of you have a suggestion as to whether I'm even going the right way, I would really appreciate it :). Thank you.

RE: Basic data analysis and predictions - jefsummers - Mar-07-2020

First, it looks like you are reversing the X and Y. Y is what is being predicted. You have Y as the years, and I don't think you are trying to predict the year.

Then, you are throwing models in there that really don't fit this problem. If you are trying to predict passengers from the year, linear regression (and/or polynomial regression) works. I suggest you read up on the algorithms on the scikit-learn website: some of these are appropriate for numeric functions like this, some are for clustering and unsupervised learning, and some are for classification.

I suggest starting with linear regression (note that the LogisticRegression you imported is a classifier, despite its name), with another option being a deep neural net.
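
As a rough illustration of the suggestion above, here is a minimal sketch that puts the year in X and the passenger count in y and fits scikit-learn's LinearRegression. The passenger values are copied from the dataset posted above; the variable names are only illustrative, and fitting on all rows without a split is a simplification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Passenger counts for 1974..2018, copied from the dataset posted above
passengers = np.array([
    710000, 1000000, 1050000, 1110000, 1540000, 2519000, 2568000, 3236000,
    3942000, 3836000, 5000000, 7300000, 10000000, 12500000, 17000000,
    11080000, 16596100, 19520000, 27345000, 31312500, 37601000, 47564500,
    51770100, 52277000, 53234000, 55853100, 61891807, 72660653, 83671798,
    86040642, 119789024, 136721623, 158013351, 183613132, 191001220,
    229062099, 266293020, 292160158, 318475924, 352795296, 390878784,
    436183969, 487960477, 551234509, 611439830,
], dtype=float)
years = np.arange(1974, 2019)

X = years.reshape(-1, 1)  # feature: year (2-D, as scikit-learn expects)
y = passengers            # target: the value being predicted

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)    # R^2 on the same rows used for fitting
print(f"slope = {model.coef_[0]:.0f} passengers/year, R^2 = {r2:.3f}")
```

An R^2 well below 1 here would be one hint that a straight line is too simple for this curve.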

RE: Basic data analysis and predictions - mates - Mar-07-2020

Thank you very much for your reply. I will look at those models.

So how do you suggest I split the dataset correctly? What data should I have as Y_train and Y_validation?

RE: Basic data analysis and predictions - jefsummers - Mar-07-2020

The standard with small datasets is an 80-20 split into train and test. If you want train, validate, and test, it would be more like 60-20-20. Recognize that you are not supposed to adjust the parameters to fix predictions on your test set. Rather, train on the training set, look at the results on the validation set, go back and adjust (to avoid overfitting, etc.), and when done prove you did a good job by running the predictions on your test set. With a small set this may be hard, so you may have to compromise and use just a validation or a test set, though you will need to explain that in your paper.
So here is an example from one of my projects:

trainval_dataset = df.sample(frac=0.8,random_state=42)
test_dataset = df.drop(trainval_dataset.index)
train_dataset = trainval_dataset.sample(frac=0.8, random_state=42)
validate_dataset = trainval_dataset.drop(train_dataset.index)
print(f"Train {train_dataset.shape} Validate {validate_dataset.shape} Test {test_dataset.shape}")

trainval_dataset is the training and validation sets combined, with test_dataset as the test set (what remains from the total after removing trainval). Then split trainval into training and validation. So you get 3 sets.
A seed of 42 is traditional and, besides being the answer to life, the universe, and everything, carries no meaning.

So for you, you really just have 2 columns in your dataframe: year and passengers. Do the split, then take the year column as X and the passengers column as Y, and plot it. If it looks linear, do a linear regression. If it does not look linear, consider polynomial.
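
The split-then-select step described above could look roughly like this. It is a sketch with a single 80/20 split; the column names year and passengers follow the first post, and the data is rebuilt inline from the posted table rather than read from the CSV.

```python
import pandas as pd

# Passenger counts for 1974..2018, copied from the dataset posted above
passengers = [
    710000, 1000000, 1050000, 1110000, 1540000, 2519000, 2568000, 3236000,
    3942000, 3836000, 5000000, 7300000, 10000000, 12500000, 17000000,
    11080000, 16596100, 19520000, 27345000, 31312500, 37601000, 47564500,
    51770100, 52277000, 53234000, 55853100, 61891807, 72660653, 83671798,
    86040642, 119789024, 136721623, 158013351, 183613132, 191001220,
    229062099, 266293020, 292160158, 318475924, 352795296, 390878784,
    436183969, 487960477, 551234509, 611439830,
]
df = pd.DataFrame({"year": range(1974, 2019), "passengers": passengers})

# 80/20 split: sample the training rows, the remaining rows are the test set
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# Sorting by year only matters for drawing a line plot;
# a least-squares fit does not depend on row order.
train = train.sort_values("year")

# Year is the input (X); passengers is the value being predicted (Y)
X_train, Y_train = train[["year"]], train["passengers"]
X_test, Y_test = test[["year"]], test["passengers"]
print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
```

A scatter plot of X_train against Y_train (for example with matplotlib) then shows whether the trend looks linear.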

RE: Basic data analysis and predictions - mates - Mar-07-2020

First of all, thank you Jeff for helping me with this. I'm trying to understand what you are telling me, which is why I'm writing a quick summary, just to be sure we are on the same page here.

My problem is that I have a small dataset, which is why I cannot work with my data and the models I put in my code. You are suggesting I just plot my data, X (years) against Y (passengers), and figure out which regression to use (linear vs. polynomial). So are we ditching machine learning altogether? Do I understand it correctly? If not, please feel free to correct my assumptions.

I plotted my data, and I think it is not linear, so I should use polynomial regression. What do you think?

Graph

P.S. rok = year, Pocet prepravenych cestujucich = passengers

First of all, thank you Jeff for helping me with this. I'm really trying to understand what you mean by splitting my dataset into 3.

So should my trainval_dataset represent the whole array, all the years and passenger numbers alike?
What should I then put in test_dataset and train_dataset?

Please ignore the last posts. I get now that I need to split the data into three, as you suggested. The part I don't clearly understand is what should be in my trainval_dataset, test_dataset and train_dataset.

Sorry for the mess with the posts; for some reason I cannot delete the last three, so please respond just to this one.

RE: Basic data analysis and predictions - jefsummers - Mar-08-2020

What I did was split off the test dataset first, leaving training and validation (trainval), then split that as well into training and validation, leaving 3 sets. Each split was an 80/20 split. I was just trying to get the 80/20 splits done to get 3 sets, and this method works.

Your graph curves upward and looks roughly exponential, so a polynomial with an order of 2 should work as a starting point.

So it looks like a polynomial regression will work. That is in the family of machine learning. The other methods you included were:
Logistic regression - kind of like linear or polynomial regression but for classifying data. For example, if looking at images of apples and oranges and deciding between them, this would be the choice.
KNN (KNeighborsClassifier) - classifies a point by the labels of its nearest neighbors. It is often confused with K-means clustering, which groups unlabeled data; I did an analysis of restaurants in Toronto and used clustering to find the restaurant districts.
Decision Tree Classifier - again used to classify, not to estimate a value.
etc.
Anything that says classifier is not used for estimating values, but rather for classifying types.
I don't know enough about some of the items you were importing to comment, but I would restrict yourself to regression types rather than classification types.

Now for other types of machine learning (again, regression counts as machine learning), you could use Keras, TensorFlow, and a deep neural network. I doubt you would get results as good, given how your curve looks; polynomial regression really looks like the way to go. But if you want to try the DNN approach, I will help with that as well. You could then show the loss (mean squared error) on your sets and pick the method that gives the best results.
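
To make "show the loss and pick the best method" concrete, here is a minimal sketch that compares a straight line with a degree-2 polynomial by mean squared error on held-out rows. Two choices here are assumptions made for illustration: the year is expressed as years since 1974 to keep the numbers small, and every fifth row is held out so the split is fixed rather than random.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Passenger counts for 1974..2018, copied from the dataset posted above
passengers = np.array([
    710000, 1000000, 1050000, 1110000, 1540000, 2519000, 2568000, 3236000,
    3942000, 3836000, 5000000, 7300000, 10000000, 12500000, 17000000,
    11080000, 16596100, 19520000, 27345000, 31312500, 37601000, 47564500,
    51770100, 52277000, 53234000, 55853100, 61891807, 72660653, 83671798,
    86040642, 119789024, 136721623, 158013351, 183613132, 191001220,
    229062099, 266293020, 292160158, 318475924, 352795296, 390878784,
    436183969, 487960477, 551234509, 611439830,
], dtype=float)
t = np.arange(len(passengers)).reshape(-1, 1)  # years since 1974

# Hold out every fifth row for validation (fixed split, for reproducibility)
val_mask = (np.arange(len(passengers)) % 5) == 2
X_train, y_train = t[~val_mask], passengers[~val_mask]
X_val, y_val = t[val_mask], passengers[val_mask]

val_mse = {}
for degree in (1, 2):
    # Polynomial regression = polynomial features + ordinary linear regression
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    val_mse[degree] = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree}: validation MSE = {val_mse[degree]:.3e}")
```

The same loop could include any other regressor, so the validation MSE gives one common yardstick across methods.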

RE: Basic data analysis and predictions - mates - Mar-08-2020

Thank you Jeff for clarifying it for me. Now I've got an idea of what I was doing wrong.

So for my data, do I need to split the data into 3 sets, just as you suggested? Or do I just need to do the polynomial regression on my original dataset without splitting?

RE: Basic data analysis and predictions - jefsummers - Mar-08-2020

Splitting will allow you to "prove your model" - create the regression using the training set, tweak the hyperparameters using validation, and prove you did it right with the test data.

Are you familiar with overfitting? That is when your model gets really good at predicting the training data but is really adjusted just for that and does poorly in predicting with the test data. That is what you want to avoid.

Using the split data helps you to avoid overfitting - if you are great with the training data but poor with validation, simplify the model.
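
One way to see overfitting in numbers is to fit polynomials of increasing degree and print the error on both sets. This is only a sketch: it uses NumPy's polyfit instead of scikit-learn, rescales the years to the range 0 to 1 (an assumption made for numerical stability), and uses an arbitrary random 36/9 split. The training error can only fall as the degree rises, so the validation error is the one to watch.

```python
import numpy as np

# Passenger counts for 1974..2018, copied from the dataset posted above
passengers = np.array([
    710000, 1000000, 1050000, 1110000, 1540000, 2519000, 2568000, 3236000,
    3942000, 3836000, 5000000, 7300000, 10000000, 12500000, 17000000,
    11080000, 16596100, 19520000, 27345000, 31312500, 37601000, 47564500,
    51770100, 52277000, 53234000, 55853100, 61891807, 72660653, 83671798,
    86040642, 119789024, 136721623, 158013351, 183613132, 191001220,
    229062099, 266293020, 292160158, 318475924, 352795296, 390878784,
    436183969, 487960477, 551234509, 611439830,
], dtype=float)
# Years rescaled to [0, 1] so higher degrees stay numerically stable
t = np.arange(len(passengers)) / (len(passengers) - 1)

rng = np.random.default_rng(42)
idx = rng.permutation(len(passengers))
train_idx, val_idx = idx[:36], idx[36:]


def mse(coeffs, rows):
    """Mean squared error of a fitted polynomial on the given rows."""
    residuals = np.polyval(coeffs, t[rows]) - passengers[rows]
    return float(np.mean(residuals ** 2))


train_mse, val_mse = [], []
for degree in range(1, 7):
    coeffs = np.polyfit(t[train_idx], passengers[train_idx], degree)
    train_mse.append(mse(coeffs, train_idx))
    val_mse.append(mse(coeffs, val_idx))
    print(f"degree {degree}: train MSE {train_mse[-1]:.3e}, "
          f"validation MSE {val_mse[-1]:.3e}")
```

If the validation MSE starts rising while the training MSE keeps falling, the lower degree is the safer choice.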

RE: Basic data analysis and predictions - mates - Mar-08-2020

OK Jeff, I've successfully split my data. The years are in random order, is that okay?

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import pandas as pd


# Load the data (the CSV has no header row, so the column names are supplied)
names = ['Rok', 'Pocet prepravenych cestujucich']
df = read_csv('analyza_casovych_radov.csv', names=names)
#print (df)

# Split
#print (df)


trainval_dataset = df.sample(frac=0.8,random_state=42)
test_dataset = df.drop(trainval_dataset.index)
train_dataset = trainval_dataset.sample(frac=0.8, random_state=42)
validate_dataset = trainval_dataset.drop(train_dataset.index)
print ()
print(f"Train {train_dataset.shape} Validate {validate_dataset.shape} Test {test_dataset.shape}")

print ()
print ()

print ('train_dataset= ')  
print (train_dataset)

print ()
print ('test_dataset= ')  
print (test_dataset)

print ()
print ('validate_dataset= ')
print (validate_dataset)

Output:Train (29, 2) Validate (7, 2) Test (9, 2)


train_dataset= 
     Rok  Pocet prepravenych cestujucich
36  2010                       266293020
19  1993                        31312500
5   1979                         2519000
40  2014                       390878784
42  2016                       487960477
21  1995                        47564500
31  2005                       136721623
32  2006                       158013351
8   1982                         3942000
15  1989                        11080000
3   1977                         1110000
23  1997                        52277000
39  2013                       352795296
35  2009                       229062099
1   1975                         1000000
13  1987                        12500000
16  1990                        16596100
41  2015                       436183969
24  1998                        53234000
25  1999                        55853100
30  2004                       119789024
26  2000                        61891807
34  2008                       191001220
43  2017                       551234509
2   1976                         1050000
0   1974                          710000
11  1985                         7300000
6   1980                         2568000
27  2001                        72660653

test_dataset= 
     Rok  Pocet prepravenych cestujucich
7   1981                         3236000
10  1984                         5000000
14  1988                        17000000
18  1992                        27345000
20  1994                        37601000
22  1996                        51770100
28  2002                        83671798
38  2012                       318475924
44  2018                       611439830

validate_dataset= 
     Rok  Pocet prepravenych cestujucich
4   1978                         1540000
12  1986                        10000000
17  1991                        19520000
9   1983                         3836000
37  2011                       292160158
29  2003                        86040642
33  2007                       183613132

So what now? Should I build a model for polynomial regression?

I also managed to plot a polynomial regression using the test, train and validation datasets.

Split_data_graph

RE: Basic data analysis and predictions - mates - Mar-08-2020

I've built a polynomial regression model based on the train dataset. But I have no clue whether the result is good or how to tweak it.

train_regression

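from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Sort by year so the fitted curve is drawn from left to right
array = train_dataset.sort_values('Rok').values
X = array[:, 0].reshape(-1, 1)  # year
y = array[:, 1].reshape(-1, 1)  # passengers

#print (X,y)

# Expand the year into polynomial terms, then fit a linear model on them
poly = PolynomialFeatures(degree=5)
X_poly = poly.fit_transform(X)

lin2 = LinearRegression()
lin2.fit(X_poly, y)

# Blue points are the training data, the red line is the fitted polynomial
plt.scatter(X, y, color='blue')
plt.plot(X, lin2.predict(poly.transform(X)), color='red')
plt.show()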
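
Since the stated goal is a forecast up to 2025, here is a hedged sketch of how the extrapolation step might look. It is not the model above: it fits on all 45 rows for brevity, uses NumPy's polyfit, and rescales the years (an assumption, for numerical stability). Fitting degrees 2 and 5 side by side is deliberate, because high-degree polynomials can behave erratically outside the range of the data.

```python
import numpy as np

# Passenger counts for 1974..2018, copied from the dataset posted above
passengers = np.array([
    710000, 1000000, 1050000, 1110000, 1540000, 2519000, 2568000, 3236000,
    3942000, 3836000, 5000000, 7300000, 10000000, 12500000, 17000000,
    11080000, 16596100, 19520000, 27345000, 31312500, 37601000, 47564500,
    51770100, 52277000, 53234000, 55853100, 61891807, 72660653, 83671798,
    86040642, 119789024, 136721623, 158013351, 183613132, 191001220,
    229062099, 266293020, 292160158, 318475924, 352795296, 390878784,
    436183969, 487960477, 551234509, 611439830,
], dtype=float)
years = np.arange(1974, 2019)


def scale(year_values):
    """Map 1974..2018 onto 0..1; later years land slightly above 1."""
    return (year_values - 1974) / (2018 - 1974)


future_years = np.arange(2019, 2026)
forecasts = {}
for degree in (2, 5):
    coeffs = np.polyfit(scale(years), passengers, degree)
    forecasts[degree] = np.polyval(coeffs, scale(future_years))

for i, year in enumerate(future_years):
    print(year,
          f"degree 2: {forecasts[2][i]:.3e}",
          f"degree 5: {forecasts[5][i]:.3e}")
```

A large gap between the two columns would be a sign that the forecast depends heavily on the chosen degree, which is worth discussing in the thesis.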