Python Forum
stats model OLS question/issue
#1
Good Afternoon All,

I am a newbie with Python and coding and trying to model something. I am importing a CSV file and trying to use the OLS function to run a linear regression model. This is the code I'm using:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import statsmodels.formula.api as smf
import sys
from IPython.display import display, HTML


print(sys.path)

chem_film_data_df = pd.read_csv (r"chem_film_data.csv")
columns_to_keep = ['Date', 'pass_fail', 'C-IC_6MU', 'Nitric_Acid', 'Etch_Rate']
chem_film_data_df = chem_film_data_df[columns_to_keep]
print(chem_film_data_df)

display(HTML(chem_film_data_df.head().to_html()))

print("Number of rows in the dataset =", len(chem_film_data_df))

#plt.plot(chem_film_data_df['C-IC_6MU'], chem_film_data_df['pass_fail'],'o')
#plt.title('Test', fontsize=20)
#plt.xlabel('C-IC 6MU Test')
#plt.ylabel('pass-fail test')
#plt.show()

chem_film_model = smf.ols("pass_fail ~ C-IC_6MU + Nitric_Acid + Etch_Rate", chem_film_data_df).fit()
print(chem_film_model.summary())

I've added a few lines of code that allowed me to verify that it was importing the correct information from the CSV, which it is, but when it gets to line 27 it outputs an error. The commented out code was a test I ran to verify that data from the CSV was being imported and I was able to use it.

The test output is correct:
Output:
['C:\\Users\\JA21877\\Desktop\\python', 'C:\\Program Files\\Python38\\python38.zip', 'C:\\Program Files\\Python38\\DLLs', 'C:\\Program Files\\Python38\\lib', 'C:\\Program Files\\Python38', 'C:\\Users\\JA21877\\AppData\\Roaming\\Python\\Python38\\site-packages', 'C:\\Users\\JA21877\\AppData\\Roaming\\Python\\Python38\\site-packages\\win32', 'C:\\Users\\JA21877\\AppData\\Roaming\\Python\\Python38\\site-packages\\win32\\lib', 'C:\\Users\\JA21877\\AppData\\Roaming\\Python\\Python38\\site-packages\\Pythonwin', 'C:\\Program Files\\Python38\\lib\\site-packages', 'C:\\Users\\JA21877\\AppData\\Roaming\\Python\\Python38\\site-packages\\IPython\\extensions']
          Date  pass_fail  C-IC_6MU  Nitric_Acid  Etch_Rate
0    10/1/2018          1    5.9050       87.191   0.309628
1    10/4/2018          1    5.9050       87.191   0.309628
2    10/4/2018          1    5.9050       87.191   0.354437
3    10/4/2018          1    5.9050       87.191   0.222378
4    10/4/2018          1    5.9050       87.191   0.133427
..         ...        ...       ...          ...        ...
220  7/13/2020          1    5.9710       91.839   0.211808
221  7/17/2020          1    6.0131       92.466   0.235234
222  7/22/2020          1    6.4769       94.186   0.342308
223  7/24/2020          1    6.4777       88.799   0.316139
224  7/29/2020          1    6.4337       94.687   0.314921

[225 rows x 5 columns]
<IPython.core.display.HTML object>
Number of rows in the dataset = 225
But then I receive this message:
Error:
Traceback (most recent call last):
  File "C:\Users\JA21877\Desktop\python\chem_film_predictive_model.py", line 27, in <module>
    chem_film_model = smf.ols("pass_fail ~ C-IC_6MU + Nitric_Acid + Etch_Rate", chem_film_data_df).fit()
  File "C:\Users\JA21877\AppData\Roaming\Python\Python38\site-packages\statsmodels\base\model.py", line 168, in from_formula
    tmp = handle_formula_data(data, None, formula, depth=eval_env,
  File "C:\Users\JA21877\AppData\Roaming\Python\Python38\site-packages\statsmodels\formula\formulatools.py", line 64, in handle_formula_data
    result = dmatrices(formula, Y, depth, return_type='dataframe',
  File "C:\Users\JA21877\AppData\Roaming\Python\Python38\site-packages\patsy\highlevel.py", line 309, in dmatrices
    (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
  File "C:\Users\JA21877\AppData\Roaming\Python\Python38\site-packages\patsy\highlevel.py", line 167, in _do_highlevel_design
    return build_design_matrices(design_infos, data,
  File "C:\Users\JA21877\AppData\Roaming\Python\Python38\site-packages\patsy\build.py", line 893, in build_design_matrices
    rows_checker.check(value.shape[0], name, origin)
  File "C:\Users\JA21877\AppData\Roaming\Python\Python38\site-packages\patsy\build.py", line 795, in check
    raise PatsyError(msg, origin)
patsy.PatsyError: Number of rows mismatch between data argument and C (225 versus 1)
    pass_fail ~ C-IC_6MU + Nitric_Acid + Etch_Rate
                ^
[Finished in 3.1s]
Any idea what is going on? I'm not sure why it's saying there is a row mismatch when the printed output shows the expected rows and columns. Thank you in advance for your help!
#2
I suspect the problem is the - in the column name; see the docs, - is meaningful when writing formulae. Remove the - from the column name (C-IC_6MU).
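A minimal sketch of two ways to apply this, using a small made-up data frame in place of chem_film_data.csv (the values and the replacement name C_IC_6MU are only illustrative). The first option renames the column; the second keeps the original name and wraps it in Q(), the quoting helper that patsy makes available inside formulas.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for chem_film_data.csv (illustrative values only).
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "pass_fail": rng.integers(0, 2, size=n),
    "C-IC_6MU": rng.normal(6.0, 0.3, size=n),
    "Nitric_Acid": rng.normal(90.0, 3.0, size=n),
    "Etch_Rate": rng.normal(0.3, 0.05, size=n),
})

# Option 1: rename the column so it is a plain Python identifier.
renamed = df.rename(columns={"C-IC_6MU": "C_IC_6MU"})
renamed_model = smf.ols(
    "pass_fail ~ C_IC_6MU + Nitric_Acid + Etch_Rate", data=renamed
).fit()

# Option 2: keep the column name and quote it with Q() in the formula.
quoted_model = smf.ols(
    'pass_fail ~ Q("C-IC_6MU") + Nitric_Acid + Etch_Rate', data=df
).fit()

print(renamed_model.params)
print(quoted_model.params)
```

Both fits use the same data, so the coefficients should agree; only the label of the affected term differs.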
#3
(Aug-05-2020, 12:48 AM)scidam Wrote: I suspect the problem is the - in the column name; see the docs, - is meaningful when writing formulae. Remove the - from the column name (C-IC_6MU).

That worked. Thank you. I didn't think that would be an issue because when I used the file to simply plot the data there was no issue with the naming convention using the "-" character.

It's not really relevant to my problem, but do you know why it would work with the matplotlib module and not with the statsmodels module? As a newbie, it would be helpful to understand the difference so I can try to avoid making similar mistakes in the future.

Thank you again for your help!
#4
(Aug-05-2020, 12:02 PM)russoj5 Wrote: It's not really relevant to my problem, but do you know why it would work with the matplotlib module and not with the statsmodels module?

There is no magic in the - character; it is just one character of a string. Statsmodels internally parses the formula it is given (through patsy, as the traceback shows). It looks for symbols such as +, ~ and - in the string representing the formula. When it finds -, it treats the surrounding alphanumeric substrings C and IC_6MU as separate factor/column names (but your data frame doesn't have such columns). All this behavior is implemented so that the formula syntax stays close to R's.
When you call plt.plot(chem_film_data_df['C-IC_6MU'], chem_film_data_df['pass_fail'],'o'), only the pandas selection engine is involved: chem_film_data_df['C-IC_6MU'] and chem_film_data_df['pass_fail'] are looked up by their exact string keys and return iterables (pandas.Series instances), and these iterables are passed to the plot function.
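To make the contrast concrete, here is a small sketch, assuming patsy is the parser behind the formula (as the traceback suggests). It asks patsy which terms it sees on the right-hand side of the formula, and then does the plain pandas lookup on a tiny made-up data frame.

```python
import pandas as pd
from patsy import ModelDesc

formula = "pass_fail ~ C-IC_6MU + Nitric_Acid + Etch_Rate"

# Parse the formula only; no data is needed for this step.
desc = ModelDesc.from_formula(formula)
rhs_terms = [term.name() for term in desc.rhs_termlist]
print(rhs_terms)  # "C-IC_6MU" does not appear as a single term

# pandas treats the same text as an opaque dictionary-style key.
df = pd.DataFrame({"C-IC_6MU": [5.905, 5.971], "pass_fail": [1, 1]})
series = df["C-IC_6MU"]
print(series.tolist())
```

As far as I can tell, C is also the name of patsy's built-in categorical helper, which would explain why the error reports a row count for C instead of complaining about a missing name.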


AFAIK, Matplotlib doesn't perform similar parsing.