Basic data analysis and predictions
Python Forum (https://python-forum.io), Forum: Python Coding (https://python-forum.io/forum-7.html), Forum: Data Science (https://python-forum.io/forum-44.html), Thread: Basic data analysis and predictions (/thread-24843.html)
Basic data analysis and predictions - mates - Mar-06-2020

Hi guys. I'm kind of new to Python, but I'm trying to do a basic data analysis and prediction for my diploma thesis. The thesis is about predicting the future growth of the Chinese air transport market. I've got data about tickets sold from 1974 to 2018, and the goal is to use machine learning on the historical data to predict the upward trend up to the year 2025. This is how far I've got.

    # Load libraries
    from pandas import read_csv
    from pandas.plotting import scatter_matrix
    from matplotlib import pyplot
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    # Load dataset
    filename = 'analyze_me.csv'
    names = ['year', 'passengers', ]
    dataset = read_csv('analyze_me.csv', names=names)

    # head
    print(dataset.head(50))

    # Split-out validation dataset
    array = dataset.values
    x = array[:,1]
    y = array[:,0]
    X_train, X_validation, Y_train, Y_validation, = train_test_split(x, y, test_size=0.20, random_state=1)
    print ()
    print ()
    print('x_train= ',X_train)
    print ('X_validation = ',X_validation)
    print('Y_train= ',Y_train)
    print ('Y_validation = ',Y_validation)

    # Spot Check Algorithms
    models = []
    models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC(gamma='auto')))

    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        kfold = StratifiedKFold(n_splits=2, random_state=1, shuffle=True)
        cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
        results.append(cv_results)
        names.append(name)
        print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

    # example of training a final classification model
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_blobs
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # fit final model
    model = LogisticRegression()
    model.fit(X, y)

The first problem is that my X_train and Y_train values are not in order when I print them. I think this could hurt the accuracy of what I'm trying to achieve here, but I'm not sure. The second problem is that I'm not sure whether I trained my model well enough, because I had to ditch the k-fold cross-validation technique: I was getting an error that said n_splits are greater than 2. Here is my original dataset. If any of you have a suggestion on whether I'm even going the right way, I would really appreciate it :). Thank you
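On the n_splits error mentioned above, a minimal sketch of one likely cause and a way around it. StratifiedKFold balances class labels across folds, so it expects a classification target; when the target is a number where almost every value is unique, each "class" has a single member and scikit-learn refuses to split. Plain KFold with a regressor and a regression metric does not have that restriction. The data here is synthetic (invented for illustration, not the thread's CSV), and the choice of five folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the thread's data (values invented for illustration).
years = np.arange(1974, 2019).reshape(-1, 1)            # feature: one column of years
passengers = 2.0 * (years.ravel() - 1974) ** 2 + 50.0   # target: a made-up upward curve

# KFold only partitions row indices, so it needs no class labels.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), years, passengers,
                         cv=kfold, scoring="neg_mean_squared_error")

# scikit-learn reports the negated MSE so that "higher is better" holds for every scorer.
mse_per_fold = -scores
print(mse_per_fold.mean(), mse_per_fold.std())
```

Accuracy is a classification metric, so a regression score such as mean squared error is used here instead.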
RE: Basic data analysis and predictions - jefsummers - Mar-07-2020

First, it looks like you are reversing the X and Y. Y is what is being predicted. You have Y as the years, and I don't think you are trying to predict the year. Then, you are throwing models in there that really don't go together. If you are trying to predict passengers from the year, linear regression (and/or polynomial regression) works. I suggest you read up on the algorithms on the scikit-learn website: some of these are appropriate for numeric functions like this, some for clustering and unsupervised learning, and some are for classification. Of the methods you have, I suggest linear regression only, with another option being a Deep Neural Net.

RE: Basic data analysis and predictions - mates - Mar-07-2020

Thank you very much for your reply. I will look at those models. So how do you suggest I split the dataset correctly? What data should I have as Y_train and Y_validation?

RE: Basic data analysis and predictions - jefsummers - Mar-07-2020

Standard with small datasets is 80-20 train and test. If you want to do train, validate, and test, it would be more like 60-20-20. Recognize that you are not supposed to adjust the parameters to fix predictions on your test set. Rather, train on the training set, see the results on validation, go back and adjust (to avoid overfitting, etc.), and when done, prove you did a good job by running the predictions on your test set. With a small set this may be hard, so you may have to compromise and use just validation or just test, though you will need to explain that in your paper.
So here is an example from one of my projects:

    trainval_dataset = df.sample(frac=0.8,random_state=42)
    test_dataset = df.drop(trainval_dataset.index)
    train_dataset = trainval_dataset.sample(frac=0.8, random_state=42)
    validate_dataset = trainval_dataset.drop(train_dataset.index)
    print(f"Train {train_dataset.shape} Validate {validate_dataset.shape} Test {test_dataset.shape}")

trainval_dataset is the training and validation sets together, with test_dataset as the test set (what remains from the total after removing the trainval). Then split trainval into training and validation. So you get 3 sets. A seed of 42 is traditional and, besides being the answer to life, the universe, and everything, carries no meaning. So for you, you really just have 2 columns in your dataframe: year and passengers. Do the split, then take the year column as X and the passengers column as Y, and plot it. If it looks linear, do a linear regression. If it does not look linear, consider polynomial.

RE: Basic data analysis and predictions - mates - Mar-07-2020

First of all, thank you Jeff for helping me with this. I'm trying to understand what you are telling me, so here is a quick summary, just to be sure we are on the same page. My problem is that I have a small dataset, which is why I cannot work with my data and the models I put in my code. You are suggesting that I just plot my data, X (years) against Y (passengers), and figure out which regression to use (linear vs. polynomial). So are we ditching machine learning altogether? Do I understand it correctly? If not, please feel free to correct my assumptions. I plotted my data, and I think it is not linear, so I should use polynomial regression. What do you think?

Graph

P.S. rok = year, Pocet prepravenych cestujucich = Passengers

First of all, thank you Jeff for helping me with this. I'm really trying to understand what you mean by splitting my dataset into 3. So my trainval_dataset should represent the whole array?
The whole years and numbers of passengers alike? What should I then put in test_dataset and train_dataset?

Please ignore the last posts. I get now that I need to split the data into three, as you suggested. The part I don't clearly understand is what should be in my trainval_dataset, test_dataset and train_dataset. Sorry for the mess with the posts; for some reason I cannot delete the last three, so please react just to this one.

RE: Basic data analysis and predictions - jefsummers - Mar-08-2020

What I did was split off the test dataset first, leaving training and validation (trainval), then split that as well into training and validation, leaving 3 sets. Each split was an 80/20 split. I was just trying to get the 80/20 splits done to get 3 sets, and this method works.

Your graph looks exponential, so a polynomial with an order of 2 should work. So it looks like a polynomial regression will work. That is in the family of machine learning. The other methods you included were:

Logistic regression - kind of like linear or polynomial regression, but for classifying data. For example, if looking at images of apples and oranges and deciding between them, this would be the choice.

KNN (KNeighborsClassifier) - also a classifier; it labels a point by the classes of its nearest neighbours. It is easy to mix up with K-means clustering, which groups unlabeled data. I did an analysis of restaurants in Toronto and used clustering to find the restaurant districts.

Decision Tree Classifier - again used to classify, not to estimate a value.

Anything that says classifier is not used for estimating values, but rather for classifying types. I don't know enough about some of the items you were importing to comment, but I would restrict to regression types rather than classification types.
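As a rough sketch of the regression route described above, with year as X and passengers as Y and an order-2 polynomial: the data below is synthetic (an exact quadratic invented for illustration, not the thread's CSV), and the StandardScaler step is an added assumption that keeps the squared year values numerically well behaved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the thread's CSV: an exact quadratic, invented for illustration.
years = np.arange(1974, 2019)
t = years - 1974
passengers = 0.5 * t**2 + 3.0 * t + 10.0

X = years.reshape(-1, 1)   # X is the year (what we know)
y = passengers             # y is the passenger count (what we want to predict)

# Scale the year, expand it to [1, x, x^2], then fit ordinary least squares.
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

# Extrapolate to the years the thesis cares about.
future = np.arange(2019, 2026).reshape(-1, 1)
forecast = model.predict(future)
for year, value in zip(future.ravel(), forecast):
    print(year, round(float(value), 1))
```

Note that this predicts outside the range of the training years, so the forecast is only as good as the assumption that the same curve continues.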
Now for other types of machine learning (again, regression counts as machine learning), you could use Keras, TensorFlow, and a Deep Neural Network. I doubt you would get results as good, given how your curve looks; polynomial regression really looks like the way to go. But if you want to do the DNN approach, I will help with that as well. You could then show the loss (mean squared error) on your sets and pick the method that gives the best results.

RE: Basic data analysis and predictions - mates - Mar-08-2020

Thank you Jeff for clarifying it for me. Now I've got an idea of what I was doing wrong. So for my data, do I need to split the data into 3 sets, just as you suggested? Or do I just do the polynomial regression with my original dataset, without splitting?

RE: Basic data analysis and predictions - jefsummers - Mar-08-2020

Splitting will allow you to "prove your model": create the regression using the training set, tweak the hyperparameters using validation, and prove you did it right with the test data. Are you familiar with overfitting? That is when your model gets really good at predicting the training data, but is really adjusted just for that and does poorly at predicting the test data. That is what you want to avoid. Using the split data helps you avoid overfitting: if you are great with the training data but poor with validation, simplify the model.

RE: Basic data analysis and predictions - mates - Mar-08-2020

OK Jeff, I've successfully split my data. The years are in random order, is that okay?
    # Load libraries
    from pandas import read_csv
    from pandas.plotting import scatter_matrix
    from matplotlib import pyplot
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    import pandas as pd

    #Split
    data = pd.read_csv("analyza_casovych_radov.csv")
    names = ['Rok', 'Pocet prepravenych cestujucich', ]
    dataset = read_csv('analyza_casovych_radov.csv', names=names)
    df = pd.DataFrame(dataset)
    #print (df)
    trainval_dataset = df.sample(frac=0.8,random_state=42)
    test_dataset = df.drop(trainval_dataset.index)
    train_dataset = trainval_dataset.sample(frac=0.8, random_state=42)
    validate_dataset = trainval_dataset.drop(train_dataset.index)
    print ()
    print(f"Train {train_dataset.shape} Validate {validate_dataset.shape} Test {test_dataset.shape}")
    print ()
    print ()
    print ('train_dataset= ')
    print (train_dataset)
    print ()
    print ('test_dataset= ')
    print (test_dataset)
    print ()
    print ('validate_dataset= ')
    print (validate_dataset)

So what now? Should I build a model for polynomial regression? I also managed to plot a polynomial regression using the test, train and validation datasets.

Split_data_graph

RE: Basic data analysis and predictions - mates - Mar-08-2020

I've built a polynomial regression model based on the train dataset. But I have no clue whether the result is good or how to tweak it.
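train_regression

    # Imports this snippet needs (not shown in the original post)
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    array = train_dataset.values
    y = array[:,1].reshape(-1, 1)
    X = array[:,0].reshape(-1, 1)
    #print (X,y)

    # Expand the year column into polynomial terms up to degree 5
    poly = PolynomialFeatures(degree = 5)
    X_poly = poly.fit_transform(X)

    # Fit an ordinary linear regression on the expanded features
    lin2 = LinearRegression()
    lin2.fit(X_poly, y)

    # Sort by year so the fitted curve is drawn left to right
    order = X[:, 0].argsort()
    plt.scatter(X, y, color = 'blue')
    plt.plot(X[order], lin2.predict(poly.transform(X[order])), color = 'red')
    plt.show()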
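One way to judge whether a fit like this is good, and to tweak it, is the train/validation comparison described earlier in the thread: fit several polynomial degrees on the training set and compare mean squared error on the validation set. The following is a minimal sketch under stated assumptions. The data is synthetic (a noisy quadratic invented for illustration), the column names "year" and "passengers" are placeholders, and the helper fit_and_score is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the thread's CSV (noisy quadratic, invented for illustration).
rng = np.random.default_rng(0)
years = np.arange(1974, 2019)
t = years - 1974
df = pd.DataFrame({
    "year": years,
    "passengers": 0.5 * t**2 + 3.0 * t + 10.0 + rng.normal(0.0, 20.0, t.size),
})

# Same 80/20 then 80/20 split as in the thread.
trainval = df.sample(frac=0.8, random_state=42)
test = df.drop(trainval.index)
train = trainval.sample(frac=0.8, random_state=42)
validate = trainval.drop(train.index)

def fit_and_score(degree):
    """Fit on the training set; return (train MSE, validation MSE)."""
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree), LinearRegression())
    model.fit(train[["year"]], train["passengers"])
    train_mse = mean_squared_error(train["passengers"], model.predict(train[["year"]]))
    val_mse = mean_squared_error(validate["passengers"], model.predict(validate[["year"]]))
    return train_mse, val_mse

# The degree is the hyperparameter being tuned here.
results = {degree: fit_and_score(degree) for degree in (1, 2, 5)}
for degree, (train_mse, val_mse) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.1f}, validation MSE {val_mse:.1f}")
```

A degree whose training error keeps falling while its validation error rises is a sign of overfitting. The test set is left untouched here; it would be scored once, at the end, with the degree chosen on validation.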