Python Forum
Linear Regression Python3 code giving weird solutions
Dear Members,
Welcome to another noob question. I am learning to code in Python for scientific research.
I wrote Python code following the math for linear regression. When I use the data set from the website where I found the math, all my intermediate steps match the worked-out example on that website. I also checked by plotting the data in Excel that everything matches.
Next, I used a small subset of a second data set, and the Root Mean Square Error (RMSE) comes out in the 200s, while an Excel plot says the RMSE should be 0.99. However, the slope and intercept I calculate match the values in Excel. I cannot track down what is going wrong with my RMSE calculation.
Both data sets are available in my code.
Data set 1 is between lines 22-24 and is commented out.
Data set 2 (the one causing the problem) is between lines 32-34.
The RMSE code is between lines 105 and 140.

Any insight would be greatly appreciated.

'''
 Created on Fri Nov 01 2019 1:15:01 PM

  2019 Deep Sen
'''
'''
Linear Regression Math steps:
https://machinelearningmastery.com/simple-linear-regression-tutorial-for-machine-learning/

I wrote the python code so that I can understand every step. This is version 1.
Version 2 will have functions and user can enter data.
Version 3 will allow the user to point to a csv file with x,y data.
Version 4 will have a graphical user interface.
'''
# Simple Linear regression

import matplotlib.pyplot as plt
import numpy as np

# Data

# Trial data in the link given above
#depth_x = np.array([1, 2, 4, 3, 5])
#age_y = np.array([1, 3, 3, 2, 5])

# When I use the data above, all my intermediate solutions match the intermediate step solutions provided in the link above. An Excel plot of the data gives the same slope, intercept and RMSE value, so the mathematics shown is correct.

# But when I use the data from Waltham (see data block below) I get a really weird RMSE value in the two hundreds! What am I missing? The slope and intercept values match an Excel plot of the data below.

# Data in Mathematics Tool for Geologist by David Waltham, pg 21, Table 2.2

depth_x = np.array([0.5, 1.3, 2.47, 4.9, 8.2])
age_y = np.array([1020, 2376, 5008, 10203, 15986])

'''
Simple linear regression model y = b0 + b1*x
where x is the independent variable, y is the dependent variable, b0 is the intercept and b1 is the slope.
'''
#######################################

### Estimating b1, which represents the slope
'''
1) b1 can be estimated by :

b1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))**2)

b1 = sp_xy / sp_xx

where xi and yi are the ith value of x and y in an array or a list.

'''
# Mean of depth_x and age_y

average_x = np.mean(depth_x)
average_y = np.mean(age_y)
print('Average Depth: ', average_x, 'm' '\nAverage Age:', average_y, 'yr')

# Calculating difference between depth_xi and average_x

diff_x = depth_x - average_x

#print('(xi - mean x): ',diff_x)

# Calculating difference between age_yi and average_y
diff_y = age_y - average_y

#print('(yi - mean y): ', diff_y)

# Products of the differences

p_xy = diff_x * diff_y

#print('Product of difference: ', p_xy)

# Sum of products

sp_xy = np.sum(p_xy)

#print('Sum of Products: ', sp_xy)

# Sum of the squared differences between xi and mean x

sp_xx = np.sum(diff_x**2)

#print('Sum of squared differences (xi - mean x)**2: ', sp_xx)


# Calculating b1 = sp_xy / sp_xx

b1 = (sp_xy/sp_xx)

print('Slope: ', b1)

#####################################

# Estimating b0 which represents the intercept

'''
2) b0 can be estimated by:

b0 = mean(y) - b1 * mean(x)

'''
b0 = average_y - b1 * average_x

print('Intercept: ', b0)

######################################

# Calculating the Root Mean Square Error

# Calculating predicted value of y

pred_y = b0 + (b1 * depth_x)

#print('Predicted y value: ', pred_y)

# RMSE = sqrt( sum( (pred_y - yi)**2 ) / n )

# Square of difference between pred_y and yi

sqdiff_y = (pred_y - age_y)**2

#print('Square of (pred_y - yi): ', sqdiff_y)

#Sum of the Square of difference between pred_y and yi

s_sqdiff_y = np.sum(sqdiff_y)

#print('Sum of square of (pred_y - yi): ', s_sqdiff_y)

# Average of the Sum of the Square of difference between pred_y and yi

av_s_sqdiff_y = s_sqdiff_y / np.size(age_y)

#print('Average of the Sum of the Square of (pred_y - yi): ', av_s_sqdiff_y)

# Square root of Average of the Sum of the Square of difference between pred_y and yi

rmse = np.sqrt(av_s_sqdiff_y)

print('Root Mean Square Error: ', rmse)

##################################

# Plotting scatter plot of data

plt.scatter(depth_x, age_y, color='m', marker = 'o', s = 30)

# Plotting Linear fit

plt.plot(depth_x, pred_y, color='r')

plt.show()
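
One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: an Excel chart trendline displays the R-squared value (coefficient of determination), not the RMSE. If the 0.99 came from that label, the two numbers measure different things. RMSE is in the units of y (years here), while R squared is unitless and at most 1. The sketch below recomputes both for data set 2 with the same formulas as the listing above, and cross-checks the slope and intercept against np.polyfit. The names ss_res, ss_tot, r_squared, slope_np and intercept_np are introduced here for illustration and are not in the original code.

```python
import numpy as np

# Data set 2 from the post (Waltham, Table 2.2)
depth_x = np.array([0.5, 1.3, 2.47, 4.9, 8.2])
age_y = np.array([1020, 2376, 5008, 10203, 15986])

# Least-squares slope and intercept, same formulas as in the listing
diff_x = depth_x - depth_x.mean()
diff_y = age_y - age_y.mean()
b1 = np.sum(diff_x * diff_y) / np.sum(diff_x**2)
b0 = age_y.mean() - b1 * depth_x.mean()

# Predicted values and the two sums of squares
pred_y = b0 + b1 * depth_x
ss_res = np.sum((age_y - pred_y)**2)   # residual sum of squares
ss_tot = np.sum(diff_y**2)             # total sum of squares about mean(y)

# RMSE has the units of y; R squared is unitless
rmse = np.sqrt(ss_res / age_y.size)
r_squared = 1 - ss_res / ss_tot

# Cross-check: a degree-1 polyfit returns [slope, intercept]
slope_np, intercept_np = np.polyfit(depth_x, age_y, 1)

print('Slope:', b1, 'polyfit slope:', slope_np)
print('Intercept:', b0, 'polyfit intercept:', intercept_np)
print('RMSE:', rmse)
print('R squared:', r_squared)
```

If R squared comes out close to 0.99 while the RMSE stays in the 200s, both results are consistent with each other: with ages running from about 1,000 to 16,000 years, a typical miss of a few hundred years is small relative to the spread of the data.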