Python Forum

Basic data analysis and predictions
Hi guys.
I'm kinda new to Python, but I'm trying to do a basic data analysis and prediction for my diploma thesis. The thesis is about predicting the future growth of the Chinese air transport market. I've got data on tickets sold from 1974 to 2018, and the goal is to use machine learning to predict the ascending trend up to the year 2025 based on the historical data.

This is how far I've got.

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC



# Load dataset
filename = 'analyze_me.csv'
names = ['year', 'passengers']
dataset = read_csv(filename, names=names)



# head
print(dataset.head(50))

# Split-out validation dataset
array = dataset.values
x = array[:,1]
y = array[:,0]
X_train, X_validation, Y_train, Y_validation = train_test_split(x, y, test_size=0.20, random_state=1)
print()
print()
print('x_train= ', X_train)
print('X_validation = ', X_validation)
print('Y_train= ', Y_train)
print('Y_validation = ', Y_validation)


# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=2, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
    
# example of training a final classification model
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)
The first problem is that my X_train and Y_train values are not in order when I print them. I think this could hinder the accuracy of what I'm trying to achieve here. Not sure, guys.

Output:
x_train= [183613132 611439830 158013351 52277000 19520000 136721623 86040642 266293020 390878784 1540000 53234000 17000000 5000000 352795296 61891807 72660653 318475924 37601000 27345000 55853100 2568000 83671798 12500000 3236000 487960477 1000000 16596100 710000 11080000 2519000 7300000 3836000 3942000 10000000 551234509 292160158]
X_validation = [1110000 1050000 119789024 47564500 51770100 436183969 31312500 229062099 191001220]
Y_train= [2007 2018 2006 1997 1991 2005 2003 2010 2014 1978 1998 1988 1984 2013 2000 2001 2012 1994 1992 1999 1980 2002 1987 1981 2016 1975 1990 1974 1989 1979 1985 1983 1982 1986 2017 2011]
Y_validation = [1977 1976 2004 1995 1996 2015 1993 2009 2008]
The second problem is that I'm not sure whether I trained my model well enough, because I had to ditch the k-fold cross-validation technique. I was getting an error saying that n_splits=2 cannot be greater than the number of members in each class.

Error:
ValueError: n_splits=2 cannot be greater than the number of members in each class.
Here is my original dataset.
Output:
    Year  Passengers
0   1974      710000
1   1975     1000000
2   1976     1050000
3   1977     1110000
4   1978     1540000
5   1979     2519000
6   1980     2568000
7   1981     3236000
8   1982     3942000
9   1983     3836000
10  1984     5000000
11  1985     7300000
12  1986    10000000
13  1987    12500000
14  1988    17000000
15  1989    11080000
16  1990    16596100
17  1991    19520000
18  1992    27345000
19  1993    31312500
20  1994    37601000
21  1995    47564500
22  1996    51770100
23  1997    52277000
24  1998    53234000
25  1999    55853100
26  2000    61891807
27  2001    72660653
28  2002    83671798
29  2003    86040642
30  2004   119789024
31  2005   136721623
32  2006   158013351
33  2007   183613132
34  2008   191001220
35  2009   229062099
36  2010   266293020
37  2011   292160158
38  2012   318475924
39  2013   352795296
40  2014   390878784
41  2015   436183969
42  2016   487960477
43  2017   551234509
44  2018   611439830
If any of you has a suggestion on whether I'm even going the right way, I would really appreciate it :). Thank you
First, it looks like you are reversing the X and Ys. Y is what is being predicted. You have Y as the years, and I don't think you are trying to predict the year. That is also why StratifiedKFold complained: with the years as the class labels, each class has only one member, so a stratified split is impossible.

Then, you are throwing models in there that really don't go together. If you are trying to predict passengers from the year, linear regression (and/or polynomial regression) works. I suggest you read up on the algorithms on the scikit-learn website - some of these are appropriate for numeric functions like this, some for clustering and unsupervised learning, some are for classification.

Of the methods you have, I suggest linear regression only, with another option being a Deep Neural Net.
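To give you an idea, a bare-bones linear regression with scikit-learn would look something like this (untested sketch - I'm assuming your dataframe is called dataset and has the 'year' and 'passengers' columns from your read_csv call):

# Minimal linear regression sketch: predict passengers from year.
from sklearn.linear_model import LinearRegression

X = dataset[['year']].values      # 2D feature matrix, shape (n_samples, 1)
y = dataset['passengers'].values  # 1D target: passengers carried

model = LinearRegression()
model.fit(X, y)
print(model.predict([[2025]]))    # extrapolate to your target year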
Thank you very much for your reply. I will look at those models.

So how do you suggest I split the dataset correctly? What data should I have as Y_train and Y_validation?
Standard with small datasets is an 80-20 train/test split. If you want train, validate, and test sets, it would be more like 60-20-20. Recognize that you are not supposed to adjust the parameters to fix predictions on your test set; rather, train on the training set, see the results on validation, go back and adjust (avoid overfitting, etc.), and when done, prove you did a good job by running the predictions on your test set. With a small set this may be hard, so you may have to compromise some and use just validation or just test, though you will need to explain that in your paper.
So here is an example from one of my projects:
trainval_dataset = df.sample(frac=0.8, random_state=42)             # 80% for train+validation
test_dataset = df.drop(trainval_dataset.index)                      # remaining 20% for test
train_dataset = trainval_dataset.sample(frac=0.8, random_state=42)  # 80% of trainval
validate_dataset = trainval_dataset.drop(train_dataset.index)       # remaining 20% of trainval
print(f"Train {train_dataset.shape} Validate {validate_dataset.shape} Test {test_dataset.shape}")
trainval_dataset is the training and validation sets combined, with test_dataset as the test set (what remains from the total after removing trainval). Then split trainval into training and validation. So you get 3 sets.
A seed of 42 is traditional and, besides being the answer to life, the universe, and everything, carries no meaning.

So for you, you really just have 2 columns in your dataframe - year and passengers. Do the split, then take the year column as X and the passengers column as Y, and plot it. If it looks linear, do a linear regression. If it does not look linear, consider polynomial.
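Plotting it is only a couple of lines - something like this (sketch, again assuming df and the 'year'/'passengers' column names):

# Scatter plot of year vs passengers to judge whether the trend is linear.
import matplotlib.pyplot as plt

plt.scatter(df['year'], df['passengers'])
plt.xlabel('Year')
plt.ylabel('Passengers')
plt.show()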
First of all, thank you Jeff for helping me with this. I'm trying to understand what you are telling me, so here is a quick summary, just to be sure we are on the same page.

My problem is that I have a small dataset, which is why I cannot work with my data and the models I put in my code. You are suggesting to just plot my data, X (years) against Y (passengers), and figure out which regression to use (linear vs. polynomial). So are we ditching machine learning altogether? Do I understand it correctly? If not, please feel free to correct my assumptions.

I plotted my data, and I think it is not linear, so I should use polynomial regression. What do you think?

Graph

P.S. rok = year, Pocet prepravenych cestujucich = Passengers

First of all, thank you Jeff for helping me with this. I'm really trying to understand what you mean by splitting my dataset into 3.

So should my trainval_dataset represent the whole array? All the years and numbers of passengers alike?
What should I then put in test_dataset and train_dataset?


Please ignore my last post. I get now that I need to split the data into three, as you suggested. The part I don't clearly understand is what should be in my trainval_dataset, test_dataset and train_dataset.
What I did was split off the test dataset first, leaving training and validation (trainval), then split that as well into training and validation, giving 3 sets. Each split was an 80/20 split. I was just trying to get the 80/20 splits done to end up with 3 sets, and this method works.

Your graph looks exponential, so a polynomial with an order of 2 should work.
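A quick way to sanity-check that before building anything bigger is numpy's polyfit (sketch - I'm assuming year and passenger arrays pulled out of your training set, year in the first column):

# Fit a degree-2 polynomial and overlay it on the training points.
import numpy as np
import matplotlib.pyplot as plt

years = train_dataset.values[:, 0]
passengers = train_dataset.values[:, 1]

coeffs = np.polyfit(years, passengers, 2)   # y = a*x**2 + b*x + c
fit = np.poly1d(coeffs)

xs = np.linspace(years.min(), years.max(), 200)
plt.scatter(years, passengers, color='blue')
plt.plot(xs, fit(xs), color='red')
plt.show()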

So it looks like a polynomial regression will work. That is in the family of machine learning. The other methods you included were:
Logistic regression - kind of like linear or polynomial regression, but for classifying data. For example, if you were looking at images of apples and oranges and deciding between them, this would be the choice.
KNN/K-means clustering - used for clustering. I did an analysis of restaurants in Toronto and used it to find the restaurant districts.
Decision Tree Classifier - again, used to classify, not to estimate a value.
etc.
Anything that says classifier is not used for estimating values, but rather for classifying types.
I don't know enough about some of the items you were importing to comment, but I would restrict to regression types rather than classification types.

Now, for other types of machine learning (again, regression counts as machine learning), you could use Keras, TensorFlow, and a Deep Neural Network. I doubt you would get results as good, given how your curve looks - polynomial regression really looks like the way to go. But if you want to do the DNN approach, I will help with that as well. You could then show the loss (mean squared error) on each of your sets and pick the method that gives the best results.
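For reference, the skeleton of such a DNN would be along these lines (untested sketch - the layer sizes and epoch count are arbitrary choices on my part):

# Minimal Keras regression network: year in, passengers out.
from tensorflow import keras

# Shift the years so the inputs start near 0; scaling the passenger
# counts (the target) the same way would also help training.
X = train_dataset.values[:, 0].reshape(-1, 1).astype('float32') - 1974.0
y = train_dataset.values[:, 1].reshape(-1, 1).astype('float32')

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(1,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1)                      # one output: passengers
])
model.compile(optimizer='adam', loss='mse')    # loss = mean squared error
model.fit(X, y, epochs=500, verbose=0)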
Thank you Jeff for clarifying it for me. Now I've got an idea of what I was doing wrong.

So for my data, do I need to split them into 3 sets, just as you suggested? Or do I just do the polynomial regression on my original dataset without splitting?
Splitting will allow you to "prove your model" - create the regression using the training set, tweak the hyperparameters using validation, and prove you did it right with the test data.

Are you familiar with overfitting? That is when your model gets really good at predicting the training data, but is adjusted just for that and does poorly at predicting the test data. That is what you want to avoid.

Using the split data helps you to avoid overfitting - if you are great with the training data but poor with validation, simplify the model.
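One concrete way to watch for it: fit several polynomial degrees and compare the training error against the validation error (sketch - assumes the train/validate datasets from the split, with year in the first column and passengers in the second):

# If validation error starts rising while training error keeps falling,
# the higher-degree model is overfitting - pick the simpler one.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X_train = train_dataset.values[:, 0].reshape(-1, 1)
y_train = train_dataset.values[:, 1]
X_val = validate_dataset.values[:, 0].reshape(-1, 1)
y_val = validate_dataset.values[:, 1]

for degree in (1, 2, 3, 4, 5):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(poly.transform(X_train)))
    val_mse = mean_squared_error(y_val, model.predict(poly.transform(X_val)))
    print(f'degree {degree}: train MSE {train_mse:.3e}, validation MSE {val_mse:.3e}')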
Ok Jeff, I've successfully split my data. The years are in random order - is that okay?
# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import pandas as pd


#Split
names = ['Rok', 'Pocet prepravenych cestujucich']
df = read_csv('analyza_casovych_radov.csv', names=names)  # read_csv already returns a DataFrame


#print (df)


trainval_dataset = df.sample(frac=0.8,random_state=42)
test_dataset = df.drop(trainval_dataset.index)
train_dataset = trainval_dataset.sample(frac=0.8, random_state=42)
validate_dataset = trainval_dataset.drop(train_dataset.index)
print()
print(f"Train {train_dataset.shape} Validate {validate_dataset.shape} Test {test_dataset.shape}")

print()
print()

print('train_dataset= ')
print(train_dataset)

print()
print('test_dataset= ')
print(test_dataset)

print()
print('validate_dataset= ')
print(validate_dataset)
Output:
Train (29, 2) Validate (7, 2) Test (9, 2)

train_dataset=
     Rok  Pocet prepravenych cestujucich
36  2010                       266293020
19  1993                        31312500
5   1979                         2519000
40  2014                       390878784
42  2016                       487960477
21  1995                        47564500
31  2005                       136721623
32  2006                       158013351
8   1982                         3942000
15  1989                        11080000
3   1977                         1110000
23  1997                        52277000
39  2013                       352795296
35  2009                       229062099
1   1975                         1000000
13  1987                        12500000
16  1990                        16596100
41  2015                       436183969
24  1998                        53234000
25  1999                        55853100
30  2004                       119789024
26  2000                        61891807
34  2008                       191001220
43  2017                       551234509
2   1976                         1050000
0   1974                          710000
11  1985                         7300000
6   1980                         2568000
27  2001                        72660653

test_dataset=
     Rok  Pocet prepravenych cestujucich
7   1981                         3236000
10  1984                         5000000
14  1988                        17000000
18  1992                        27345000
20  1994                        37601000
22  1996                        51770100
28  2002                        83671798
38  2012                       318475924
44  2018                       611439830

validate_dataset=
     Rok  Pocet prepravenych cestujucich
4   1978                         1540000
12  1986                        10000000
17  1991                        19520000
9   1983                         3836000
37  2011                       292160158
29  2003                        86040642
33  2007                       183613132
So what now? Should I build a model for polynomial regression?

I also managed to plot a polynomial regression using the test, train and validation datasets.

Split_data_graph
I've built a polynomial regression model based on the train dataset. But I have no clue if the result is good or how to tweak it.

train_regression

# Polynomial regression fitted on the training set
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

array = train_dataset.values
y = array[:, 1].reshape(-1, 1)   # passengers
X = array[:, 0].reshape(-1, 1)   # year

#print (X, y)

poly = PolynomialFeatures(degree=5)
X_poly = poly.fit_transform(X)   # fit_transform both fits and transforms, so no separate fit call is needed

lin2 = LinearRegression()
lin2.fit(X_poly, y)

# The split left the rows in random order; sort by year so the curve draws cleanly
order = X[:, 0].argsort()
plt.scatter(X, y, color='blue')
plt.plot(X[order], lin2.predict(poly.transform(X))[order], color='red')
plt.show()
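I guess the next step is to score the model on the validation set and extrapolate to 2025 - maybe something like this (reusing poly and lin2 from above, and validate_dataset from the earlier split)?

# Validation error of the degree-5 fit, plus the 2025 extrapolation.
from sklearn.metrics import mean_squared_error

X_val = validate_dataset.values[:, 0].reshape(-1, 1)
y_val = validate_dataset.values[:, 1].reshape(-1, 1)

val_mse = mean_squared_error(y_val, lin2.predict(poly.transform(X_val)))
print('validation MSE:', val_mse)
print('predicted passengers in 2025:', lin2.predict(poly.transform([[2025]])))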