Mar-06-2020, 11:57 PM
Hi guys.
I'm kinda new to Python, but I'm trying to do a basic data analysis and prediction for my diploma thesis. The thesis is about predicting the future growth of the Chinese air transport market. I've got data on tickets sold from 1974 to 2018, and the goal is to use machine learning to predict the upward trend through 2025 based on the historical data.
This is how far I've got.
```python
# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load dataset
filename = 'analyze_me.csv'
names = ['year', 'passengers']
dataset = read_csv('analyze_me.csv', names=names)

# head
print(dataset.head(50))

# Split-out validation dataset
array = dataset.values
x = array[:, 1]
y = array[:, 0]
X_train, X_validation, Y_train, Y_validation = train_test_split(x, y, test_size=0.20, random_state=1)
print()
print()
print('x_train= ', X_train)
print('X_validation = ', X_validation)
print('Y_train= ', Y_train)
print('Y_validation = ', Y_validation)

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=2, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

# example of training a final classification model
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)

# fit final model
model = LogisticRegression()
model.fit(X, y)
```
The first problem is that my X_train and Y_train values are not in order when I print them. I think this could hurt the accuracy of what I'm trying to achieve here, but I'm not sure.
Output:
x_train= [183613132 611439830 158013351 52277000 19520000 136721623 86040642
266293020 390878784 1540000 53234000 17000000 5000000 352795296
61891807 72660653 318475924 37601000 27345000 55853100 2568000
83671798 12500000 3236000 487960477 1000000 16596100 710000
11080000 2519000 7300000 3836000 3942000 10000000 551234509
292160158]
X_validation = [ 1110000 1050000 119789024 47564500 51770100 436183969 31312500
229062099 191001220]
Y_train= [2007 2018 2006 1997 1991 2005 2003 2010 2014 1978 1998 1988 1984 2013
2000 2001 2012 1994 1992 1999 1980 2002 1987 1981 2016 1975 1990 1974
1989 1979 1985 1983 1982 1986 2017 2011]
Y_validation = [1977 1976 2004 1995 1996 2015 1993 2009 2008]
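The scrambled order comes from `train_test_split` itself, which shuffles the rows by default. A minimal sketch of a chronological split follows, assuming the year is meant to be the feature and the passenger count the target (the reverse of the `x`/`y` assignment in the code above). It uses only the last ten rows of the dataset to stay short.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Last ten rows of the dataset in this post (2009 to 2018).
years = np.arange(2009, 2019)
passengers = np.array([
    229062099, 266293020, 292160158, 318475924, 352795296,
    390878784, 436183969, 487960477, 551234509, 611439830,
])

# Assumption: year is the feature, passengers is the target.
# scikit-learn expects a 2-D feature array, hence the reshape.
X = years.reshape(-1, 1)
y = passengers

# shuffle=False keeps the rows in their original order, so the
# validation set is simply the most recent 20% of the years.
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.20, shuffle=False
)

print("train years:     ", X_train.ravel())
print("validation years:", X_validation.ravel())
```

Holding out the latest years is usually closer to the forecasting task than a random hold-out, since the model is then scored on data that comes after everything it was trained on.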
The second problem is that I'm not sure whether I trained my model well enough, because I had to ditch the k-fold cross-validation technique. I was getting an error saying that n_splits cannot be greater than the number of members in each class.
Error:
ValueError: n_splits=2 cannot be greater than the number of members in each class.
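A likely reason for this error is that `StratifiedKFold` treats every distinct target value as a class, and with the years as the target each class has exactly one member. The sketch below first counts the class sizes, then shows one possible alternative: a regressor scored with `TimeSeriesSplit`. Treating the year as the feature, the log of the passenger count as the target, and `LinearRegression` as the model are all assumptions made for illustration, not the only reasonable setup.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

years = np.arange(1974, 2019)
passengers = np.array([
    710000, 1000000, 1050000, 1110000, 1540000,
    2519000, 2568000, 3236000, 3942000, 3836000,
    5000000, 7300000, 10000000, 12500000, 17000000,
    11080000, 16596100, 19520000, 27345000, 31312500,
    37601000, 47564500, 51770100, 52277000, 53234000,
    55853100, 61891807, 72660653, 83671798, 86040642,
    119789024, 136721623, 158013351, 183613132, 191001220,
    229062099, 266293020, 292160158, 318475924, 352795296,
    390878784, 436183969, 487960477, 551234509, 611439830,
], dtype=float)

# With the years as the target, every "class" occurs exactly once,
# which is fewer than n_splits=2, so stratification is impossible.
class_sizes = Counter(years.tolist())
print("smallest class size:", min(class_sizes.values()))

# Assumption: year is the feature and log(passengers) is a continuous
# target, which makes this a regression problem, not a classification.
X = years.reshape(-1, 1)
y = np.log(passengers)

# TimeSeriesSplit always trains on earlier years and tests on later ones.
cv = TimeSeriesSplit(n_splits=3)
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_absolute_error"
)
print("mean absolute error per fold (log scale):", -scores)
```

Accuracy, confusion matrices, and the classifiers in the spot check are all built for discrete labels, so they would not carry over to this regression setup.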
Here is my original dataset.
Output:
   Year Passengers
0 1974 710000
1 1975 1000000
2 1976 1050000
3 1977 1110000
4 1978 1540000
5 1979 2519000
6 1980 2568000
7 1981 3236000
8 1982 3942000
9 1983 3836000
10 1984 5000000
11 1985 7300000
12 1986 10000000
13 1987 12500000
14 1988 17000000
15 1989 11080000
16 1990 16596100
17 1991 19520000
18 1992 27345000
19 1993 31312500
20 1994 37601000
21 1995 47564500
22 1996 51770100
23 1997 52277000
24 1998 53234000
25 1999 55853100
26 2000 61891807
27 2001 72660653
28 2002 83671798
29 2003 86040642
30 2004 119789024
31 2005 136721623
32 2006 158013351
33 2007 183613132
34 2008 191001220
35 2009 229062099
36 2010 266293020
37 2011 292160158
38 2012 318475924
39 2013 352795296
40 2014 390878784
41 2015 436183969
42 2016 487960477
43 2017 551234509
44 2018 611439830
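Given the table above, one simple baseline for the 2025 goal is a least-squares straight line fitted to the logarithm of the passenger counts, then extrapolated forward. This is only a sketch: it assumes roughly constant percentage growth per year, which the data may not support, and it uses `numpy.polyfit` in place of any of the scikit-learn models in the code above.

```python
import numpy as np

years = np.arange(1974, 2019)
passengers = np.array([
    710000, 1000000, 1050000, 1110000, 1540000,
    2519000, 2568000, 3236000, 3942000, 3836000,
    5000000, 7300000, 10000000, 12500000, 17000000,
    11080000, 16596100, 19520000, 27345000, 31312500,
    37601000, 47564500, 51770100, 52277000, 53234000,
    55853100, 61891807, 72660653, 83671798, 86040642,
    119789024, 136721623, 158013351, 183613132, 191001220,
    229062099, 266293020, 292160158, 318475924, 352795296,
    390878784, 436183969, 487960477, 551234509, 611439830,
], dtype=float)

# Measure time as years since 1974 to keep the fit well conditioned.
t = years - years[0]

# Fit log(passengers) = intercept + slope * t by least squares.
# Assumption: growth is roughly exponential over the whole period.
slope, intercept = np.polyfit(t, np.log(passengers), deg=1)

# Extrapolate the fitted line to 2019 through 2025.
future_years = np.arange(2019, 2026)
forecast = np.exp(intercept + slope * (future_years - years[0]))

print(f"fitted average growth per year: {np.expm1(slope):.1%}")
for year, value in zip(future_years, forecast):
    print(year, f"{value:,.0f}")
```

Because growth looks faster in the early years than in the recent ones, a single exponential fitted to all 45 points may overstate the recent trend, so the output is better read as a baseline to compare other models against than as a forecast.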
If any of you has a suggestion, or can tell me whether I'm even going the right way, I would really appreciate that :). Thank you