Python Forum
Obscure Error
#1
When I ran some Python 3.7.3 code, I got the following error.


Error:
ValueError                                Traceback (most recent call last)
<ipython-input-26-6bc623b2ec79> in <module>
      3 
      4 # Fit linear regression
----> 5 lin_reg_mod.fit(X_train, y_train)
      6 
      7 # Make prediction on the testing data

c:\users\newport_j\appdata\local\programs\python\python37\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, sample_weight)
    501         else:
    502             self.coef_, self._residues, self.rank_, self.singular_ = \
--> 503                 linalg.lstsq(X, y)
    504             self.coef_ = self.coef_.T
    505 

c:\users\newport_j\appdata\local\programs\python\python37\lib\site-packages\scipy\linalg\basic.py in lstsq(a, b, cond, overwrite_a, overwrite_b, check_finite, lapack_driver)
   1219         if info < 0:
   1220             raise ValueError('illegal value in %d-th argument of internal %s'
-> 1221                              % (-info, lapack_driver))
   1222         resids = np.asarray([], dtype=x.dtype)
   1223         if m > n:

ValueError: illegal value in 4-th argument of internal None
The code shown below produces it.

#!/usr/bin/env python
# coding: utf-8

# In[2]:


# Used for plotting data
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt

# Used for data storage and manipulation 
import numpy as np
import pandas as pd

# Used for Regression Modelling
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Used for Acc metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# For stepwise regression
import statsmodels.api as sm

# box plots
import seaborn as sns
# pairplot
from seaborn import pairplot
# Correlation plot
from statsmodels.graphics.correlation import plot_corr


# In[3]:


# Load your data 
data = pd.read_csv("NFL data.csv")


# In[4]:


# Calling .head() on the dataset shows its first rows.
# Pass a number in the parentheses to choose how many rows are returned; the default is 5.
print(data.shape)
# (12144, 18)
data.head()


# In[5]:


# check for the null values in each column
data.isna().sum()


# In[6]:


# Gives you useful info about your data
data.info()


# In[7]:


# Gives you summary statistics on your numeric columns
data.describe()


# In[8]:


# return only rows where the year is greater than 2009
current = data[(data['schedule_season'] > 2009)]


# In[9]:


# Turn off pandas' chained-assignment check so that no warning or exception is raised
pd.options.mode.chained_assignment = None  # default='warn'
# Create a column titled home or away. This column will add a 1 to the row where the New England Patriots played at home 
# and a 0 for away games.
current['home_or_away'] = np.where(current['team_home'] == 'New England Patriots', 1, 0)


# In[10]:


# Return rows where New England Patriots are either the home or away team
current2 = current.loc[(current["team_home"] == "New England Patriots") | (current["team_away"] == "New England Patriots")]

# filter to certain columns
final = current2.filter(["team_home","team_away" , "score_home","score_away" ,"weather_temperature", "home_or_away", "over_under_line"])

# merge score_away & score_home into column 'score'
final['score'] = np.where(final['team_away'] == 'New England Patriots', final['score_away'], final['score_home'])


# Before showing our final dataset we will drop any rows with NA values.
final = final.dropna()
final.head()


# In[11]:


final['2_game_avg'] = final.score.rolling(window=2).mean()
final['5_game_avg'] = final.score.rolling(window=5).mean()

final.head()


# In[12]:


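# The rolling averages are NaN for the first rows; fill them with the column means.
# numeric_only=True keeps the team-name (string) columns out of the mean.
final = final.fillna(final.mean(numeric_only=True))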


# In[13]:


# This time we're checking for outliers. Check each column's min and max to make sure the values are plausible.
final.describe()


# In[14]:


#  no warning message and no exception is raised
# pd.options.mode.chained_assignment = None  # default='warn'


# In[15]:


df = final[['weather_temperature', 'over_under_line','home_or_away', '2_game_avg','5_game_avg', 'score']]


# In[16]:


df.info()


# In[17]:


# Convert three columns to the float64 dtype
df['home_or_away'] = df['home_or_away'].astype('float64')
df['over_under_line'] = df['over_under_line'].astype('float64')
df['score'] = df['score'].astype('float64')

df.info()


# In[18]:


plt.scatter(df['weather_temperature'], df['score'], color='red')
plt.title('weather temperature Vs Score', fontsize=14)
plt.xlabel('weather_temperature', fontsize=14)
plt.ylabel('Score', fontsize=14)
plt.grid(True)


# In[19]:


plt.scatter(df['over_under_line'], df['score'], color='red')
plt.title('over_under_line Vs Score', fontsize=14)
plt.xlabel('over_under_line', fontsize=14)
plt.ylabel('Score', fontsize=14)
plt.grid(True)


# In[20]:


plt.scatter(df['2_game_avg'], df['score'], color='red')
plt.title('2 game average Vs Score', fontsize=14)
plt.xlabel('2 game average', fontsize=14)
plt.ylabel('Score', fontsize=14)
plt.grid(True)


# In[21]:


plt.scatter(df['5_game_avg'], df['score'], color='red')
plt.title('5 game average Vs Score', fontsize=14)
plt.xlabel('5 game average', fontsize=14)
plt.ylabel('Score', fontsize=14)
plt.grid(True)


# In[22]:


sns.boxplot(x ="home_or_away", y = "score", data = df, palette="Set2")


# In[23]:


corr = df.corr()
corr


# In[24]:


# More optional EDA
pairplot(df)


# In[25]:


# More optional EDA
fig= plot_corr(corr,xnames=corr.columns)


# In[26]:


X = pd.DataFrame(df, columns = ['2_game_avg', 'home_or_away'])
y = pd.DataFrame(df, columns=['score'])

# WITH a random_state parameter:
#  (Same split every time! Note you can change the random state to any integer.)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Print the first element of each object.
print(X_train.head(1))
print(X_test.head(1))
print(y_train.head(1))
print(y_test.head(1))


# In[27]:


# Create linear regression model
lin_reg_mod = LinearRegression()

# Fit linear regression
lin_reg_mod.fit(X_train, y_train)

# Make prediction on the testing data
pred = lin_reg_mod.predict(X_test)


# In[28]:


# Get the intercept and the coefficients (slopes) of the fitted model.
print(lin_reg_mod.intercept_)


print(lin_reg_mod.coef_)


# In[29]:


# Calculate the Root Mean Square Error between the actual & predicted
test_set_rmse = (np.sqrt(mean_squared_error(y_test, pred)))

# Calculate the R^2, or coefficient of determination, between the actual & predicted
test_set_r2 = r2_score(y_test, pred)

# Note that for rmse, the lower that value is, the better the fit
print(test_set_rmse)
# The closer towards 1, the better the fit
print(test_set_r2)


# In[30]:


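# Copy y_test so that adding columns does not modify the original test labels
df_results = y_test.copy()
df_results['Predicted'] = pred.ravel()
# Residual = actual - predicted
df_results['Residuals'] = df_results['score'] - df_results['Predicted']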
print(df_results)


# In[34]:


# Residual plot using df_result
fig = plt.figure(figsize=(10,7))
sns.residplot(x = "Predicted", y = "score",data = df_results, color='blue')

# Title and labels.
plt.title('Residuals', size=24)
plt.xlabel('Predicted', size=18)
plt.ylabel('Residual', size=18);


# In[33]:


# Plotting the actual vs predicted values
sns.lmplot(x='score', y='Predicted', data=df_results, fit_reg=False)
line_coords = np.arange(df_results.score.min().min(), df_results.Predicted.max().max())
plt.plot(line_coords, line_coords,  # X and y points
            color='darkorange', linestyle='--')
plt.xlabel('Actual Score', size=10)
plt.title('Actual vs. Predicted')


# In[30]:


# Plotting the residuals distribution
plt.subplots(figsize=(12, 6))
plt.title('Distribution of Residuals')
sns.distplot(df_results['Residuals'])
plt.show()


# In[35]:


df2 = df[['2_game_avg', 'home_or_away', 'score']]
corr2 = df2.corr()


# In[36]:


fig= plot_corr(corr2,xnames=corr2.columns)
The error is not clear to me. I am not that sophisticated a Python programmer, but I do not think this has to do with any coding fundamentals.

This link gives a good explanation of why this error happens and how to fix it. I just do not see how to apply it to my code.

https://stackoverflow.com/questions/6256...ng-sklearn

I cannot attach the NFL*.csv file, it is too big; even zipped it is too big.

I think the solution is in that link. I am not sure how to fix my code though.
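
Because the CSV cannot be attached, a short summary of the arrays passed to fit() may help others rule the data in or out as the cause. The following is only a sketch: summarize_for_fit is a hypothetical helper name, the demo values are made up, and it assumes NumPy is installed. In the notebook, X_train and y_train would be passed instead of the demo lists.

```python
import numpy as np


def summarize_for_fit(X, y):
    """Report shape and non-finite counts for the arrays given to fit()."""
    report = {}
    for name, values in (("X", X), ("y", y)):
        # Conversion raises an error if a column holds non-numeric values
        arr = np.asarray(values, dtype="float64")
        report[name] = {
            "shape": arr.shape,
            "nan": int(np.isnan(arr).sum()),
            "inf": int(np.isinf(arr).sum()),
        }
    return report


# Hypothetical stand-in data (not from the NFL file); one NaN is planted on purpose.
X_demo = [[24.5, 1.0], [30.0, 0.0], [float("nan"), 1.0]]
y_demo = [[27.0], [31.0], [20.0]]
print(summarize_for_fit(X_demo, y_demo))
```

If both arrays report zero NaN and inf values and the expected shapes, the data is less likely to be the source of the error.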

Any help appreciated. Thanks in advance.

Respectfully,

ErnestTBass
#2
The link that you showed suggests that you are missing the Lapack library. What is your OS, how did you install Python, etc.?
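
One way to test the Lapack side without the NFL data is to call scipy.linalg.lstsq, the function at the bottom of the traceback, on a tiny problem with a known answer. This is a minimal sketch, not a confirmed diagnosis: lstsq_smoke_test is a hypothetical helper name, and the three driver names are the ones SciPy documents for its lapack_driver parameter.

```python
def lstsq_smoke_test():
    """Run a tiny least-squares solve through each SciPy LAPACK driver."""
    try:
        from scipy import linalg
    except ImportError as exc:
        return {"scipy": f"import failed: {exc}"}

    # Points on the line y = 1 + 2*x, so the exact solution is [1.0, 2.0].
    A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
    b = [1.0, 3.0, 5.0]
    expected = [1.0, 2.0]

    results = {}
    for driver in ("gelsd", "gelsy", "gelss"):
        try:
            coef = linalg.lstsq(A, b, lapack_driver=driver)[0]
            ok = all(abs(c - e) < 1e-8 for c, e in zip(coef, expected))
            results[driver] = "ok" if ok else f"wrong answer: {coef}"
        except (ValueError, linalg.LinAlgError) as exc:
            # The error in this thread would be recorded here
            results[driver] = f"failed: {exc}"
    return results


print(lstsq_smoke_test())
```

If this fails in the same way on clean input, the problem lies in the SciPy/Lapack installation rather than in the notebook code.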
#3
The OS is Windows 10 Professional, 64 bit

Python version is 3.7.3
numpy is 1.18.1
pandas is 1.0.1
statsmodels is 0.11.0
matplotlib is 3.1.3
sklearn is 0.21.2

I do not remember how I installed Python; it is version 3.7.3. The important modules and their versions are shown above.

I believe it has something to do with the sklearn module.

Note that the link I presented in the first post says the issue was solved, but it does not say how.
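
When the install method is not remembered, the interpreter path and the versions that actually get imported can be printed from inside the same notebook. This is a rough sketch: module_versions is a hypothetical helper name, and scipy is added to the list as an assumption because the traceback passes through it.

```python
import importlib
import sys

# Import names, not PyPI names (scikit-learn is imported as "sklearn")
MODULES = ["numpy", "scipy", "pandas", "statsmodels", "matplotlib", "sklearn"]


def module_versions(names):
    """Map each module name to its __version__, or to a failure message."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:
            # A broken install can fail in ways other than ImportError
            versions[name] = f"import failed: {exc}"
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


print("interpreter:", sys.executable)
print("python:", sys.version.split()[0])
for name, version in module_versions(MODULES).items():
    print(f"{name}: {version}")
```

The interpreter path usually hints at how Python was installed, for example a path under an Anaconda folder versus one under AppData\Local\Programs\Python as in the traceback.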

Respectfully,

ErnestTBass
#4
I created a virtual environment and the various modules/libraries are:

Python - 3.6.1
numpy - 1.19.1
pandas - 1.1.1
statsmodels - 0.11.1
matplotlib - 3.3.1
sklearn - 0.23.2

The program ran perfectly with no errors, so the cause must lie in a difference in at least one module version between the previous post and this one. I became suspicious when it appeared, in the forums I checked, that the program sometimes ran and sometimes did not. I can only guess what and where the error is. The OS was again Windows 10 Professional.
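
To see which versions changed between the two environments, the two lists posted in this thread can be compared directly. This is a small sketch: version_diff is a hypothetical helper name and the dictionaries simply restate the versions reported in the posts.

```python
# Versions as reported in this thread for the failing and the working setup.
failing = {
    "python": "3.7.3",  # posted as "3.73"; assumed to mean 3.7.3
    "numpy": "1.18.1",
    "pandas": "1.0.1",
    "statsmodels": "0.11.0",
    "matplotlib": "3.1.3",
    "sklearn": "0.21.2",
}
working = {
    "python": "3.6.1",
    "numpy": "1.19.1",
    "pandas": "1.1.1",
    "statsmodels": "0.11.1",
    "matplotlib": "3.3.1",
    "sklearn": "0.23.2",
}


def version_diff(old, new):
    """Return {name: (old_version, new_version)} for every entry that differs."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


changed = version_diff(failing, working)
for name, (before, after) in changed.items():
    print(f"{name}: {before} -> {after}")
print(f"{len(changed)} of {len(failing)} entries differ")
```

Every listed entry differs, and scipy (where the traceback ends) is not listed in either post, so this comparison alone cannot isolate the culprit. Changing one package at a time in a fresh virtual environment would narrow it down.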

What could possibly have been the error?

Respectfully,

ErnestTBass
#5
Sorry, this goes beyond what I can diagnose. The link that you posted says that another user solved the same error by using the Anaconda distribution (or perhaps Miniconda). There seems to be an issue with your install of the Lapack library, but I have no way to reproduce the error on my system. At least you can run the code in the virtualenv...