Mar-07-2020, 03:54 PM
The standard with small datasets is an 80-20 train/test split. If you want train, validate, and test, it would be more like 60-20-20. Keep in mind that you are not supposed to adjust parameters to fix predictions on your test set. Rather, train on the training set, look at the results on validation, and go back and adjust (to avoid overfitting, etc.). When you are done, prove you did a good job by running predictions on your test set. With a small dataset this may be hard, so you may have to compromise and use just validation or just test, though you will need to explain that in your paper.
So here is an example from one of my projects:
trainval_dataset = df.sample(frac=0.8, random_state=42)
test_dataset = df.drop(trainval_dataset.index)
train_dataset = trainval_dataset.sample(frac=0.8, random_state=42)
validate_dataset = trainval_dataset.drop(train_dataset.index)
print(f"Train {train_dataset.shape} Validate {validate_dataset.shape} Test {test_dataset.shape}")

trainval_dataset holds the training and validation data together, and test_dataset is the test set (whatever remains of the total after removing trainval). Then trainval is split again into training and validation, so you end up with 3 sets.
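One thing to watch with the snippet above: taking 0.8 of the data and then 0.8 of that gives roughly 64-16-20 (0.8 x 0.8 = 0.64), not exactly 60-20-20. If you want 60-20-20, the second fraction has to be rescaled. Here is a rough sketch of that; the function name three_way_split, its parameters, and the stand-in dataframe are all made up for illustration and are not from my project.

```python
import pandas as pd

# Hypothetical stand-in data: 100 rows with the two columns discussed here.
df = pd.DataFrame({"year": range(1920, 2020), "population": range(100)})

def three_way_split(df, test_frac=0.2, val_frac=0.2, seed=42):
    """Split df into train/validate/test.

    test_frac and val_frac are fractions of the FULL dataset.
    Assumes df has a unique index, since drop() works by index label.
    """
    # First carve off the test set.
    trainval = df.sample(frac=1 - test_frac, random_state=seed)
    test = df.drop(trainval.index)
    # The validation set is drawn from trainval only, so rescale its
    # fraction: 0.2 of the total is 0.25 of the remaining 0.8.
    val = trainval.sample(frac=val_frac / (1 - test_frac), random_state=seed)
    train = trainval.drop(val.index)
    return train, val, test

train, val, test = three_way_split(df)
print(f"Train {train.shape} Validate {val.shape} Test {test.shape}")
```

On the 100-row stand-in this gives 60, 20, and 20 rows. On a real dataset the counts are rounded, so they may be off by a row.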
A seed of 42 is traditional; besides being the answer to life, the universe, and everything, it carries no meaning.
So for you, you really just have 2 columns in your dataframe: year and population. Do the split, then take the year column as X and the population column as Y, and plot it. If it looks linear, do a linear regression. If it does not look linear, consider a polynomial fit.
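If eyeballing the plot is inconclusive, one option is to fit both a line and a polynomial on the training set and compare the error on the validation set. This is only a sketch: the data below is invented (deliberately curved), the helper fit_and_score is a made-up name, and numpy.polyfit is just one of several ways to do the fit.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration: a curved trend plus noise.
rng = np.random.default_rng(42)
years = np.arange(1950, 2020)
t = years - years.min()
population = 1000 + 25 * t + 0.8 * t**2 + rng.normal(0, 20, size=t.size)
df = pd.DataFrame({"year": years, "population": population})

# Simple 80-20 split into train and validation for this comparison.
train = df.sample(frac=0.8, random_state=42)
val = df.drop(train.index)

def fit_and_score(train, val, degree):
    """Fit a polynomial of the given degree on train, return RMSE on val."""
    # Centre the years so the polynomial fit is numerically well behaved.
    offset = train["year"].mean()
    x_train = (train["year"] - offset).to_numpy()
    x_val = (val["year"] - offset).to_numpy()
    coeffs = np.polyfit(x_train, train["population"].to_numpy(), degree)
    pred = np.polyval(coeffs, x_val)
    return float(np.sqrt(np.mean((val["population"].to_numpy() - pred) ** 2)))

rmse_linear = fit_and_score(train, val, degree=1)
rmse_quadratic = fit_and_score(train, val, degree=2)
print(f"Validation RMSE linear: {rmse_linear:.1f} quadratic: {rmse_quadratic:.1f}")
```

Because this made-up data is curved, the degree-2 fit should come out with the lower validation error. For the plot itself, df.plot.scatter(x="year", y="population") is a quick way to look at it first (it needs matplotlib installed).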