stats model OLS question/issue
Python Forum (https://python-forum.io), Forum: Data Science (https://python-forum.io/forum-44.html), Thread: /thread-28819.html
stats model OLS question/issue - russoj5 - Aug-04-2020

Good Afternoon All,

I am a newbie with Python and coding and trying to model something. I am importing a CSV file and trying to use the OLS function to run a linear regression model. This is the code I'm using:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    import scipy as sp
    import statsmodels.formula.api as smf
    import sys
    from IPython.display import display, HTML

    print(sys.path)

    chem_film_data_df = pd.read_csv(r"chem_film_data.csv")
    columns_to_keep = ['Date', 'pass_fail', 'C-IC_6MU', 'Nitric_Acid', 'Etch_Rate']
    chem_film_data_df = chem_film_data_df[columns_to_keep]
    print(chem_film_data_df)

    display(HTML(chem_film_data_df.head().to_html()))
    print("Number of rows in the dataset =", len(chem_film_data_df))

    #plt.plot(chem_film_data_df['C-IC_6MU'], chem_film_data_df['pass_fail'],'o')
    #plt.title('Test', fontsize=20)
    #plt.xlabel('C-IC 6MU Test')
    #plt.ylabel('pass-fail test')
    #plt.show()

    chem_film_model = smf.ols("pass_fail ~ C-IC_6MU + Nitric_Acid + Etch_Rate", chem_film_data_df).fit()
    print(chem_film_model.summary())

I've added a few lines of code that allowed me to verify that it was importing the correct information from the CSV, which it is, but when it gets to line 27 (the smf.ols call) it outputs an error. The commented-out code was a test I ran to verify that data from the CSV was being imported and that I was able to use it.

The test output is correct: [output not preserved]

But then I receive this message: [error message not preserved]

Any idea what is going on? I'm not sure why it's saying there is a row mismatch, when the output of the rows/columns shows they are the same.

Thank you in advance for your help!

RE: stats model OLS question/issue - scidam - Aug-05-2020

I suspected that you used - in the column name. See the docs: - is meaningful when writing formulae. Remove the - from the column name (C-IC_6MU).
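
One way to apply this advice without editing the CSV by hand is to clean the column names after loading. The snippet below is only a minimal sketch: `sanitize_column_name` is a hypothetical helper (not part of pandas or statsmodels) that replaces any character that is not a letter, digit, or underscore with an underscore, so the resulting name can be written directly in a formula.

```python
import re


def sanitize_column_name(name):
    """Return a formula-safe version of a column name (hypothetical helper).

    Every character that is not a letter, digit, or underscore becomes "_".
    A leading digit gets an underscore prefix so the result is a valid identifier.
    """
    cleaned = re.sub(r"\W", "_", name)
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


columns_to_keep = ['Date', 'pass_fail', 'C-IC_6MU', 'Nitric_Acid', 'Etch_Rate']
print([sanitize_column_name(c) for c in columns_to_keep])
```

With pandas, the helper could be applied as `chem_film_data_df = chem_film_data_df.rename(columns=sanitize_column_name)`, after which the formula would read `"pass_fail ~ C_IC_6MU + Nitric_Acid + Etch_Rate"`. If the original name has to stay, the formula language also offers a quoting function, `Q("C-IC_6MU")`, which may be worth trying instead of renaming.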

RE: stats model OLS question/issue - russoj5 - Aug-05-2020

(Aug-05-2020, 12:48 AM)scidam Wrote: I suspected that you used

That worked. Thank you. I didn't think that would be an issue because when I used the file to simply plot the data there was no issue with the naming convention using the "-" character.

It's not really relevant to my problem, but do you know why it would work when using the matplotlib module and not with the statsmodels module? As a newbie, it would be helpful to understand the differences so I can try to avoid making similar mistakes in the future.

Thank you again for your help!

RE: stats model OLS question/issue - scidam - Aug-06-2020

(Aug-05-2020, 12:02 PM)russoj5 Wrote: It's not really relevant to my problem, but do you know why it would work when using the matplotlib module and not with the statsmodels module?

There is no magic with the - character; it is just one character of a string. Statsmodels internally parses the formula given. It looks for +, ~ and - symbols in the string representing the formula. When statsmodels finds -, it treats the surrounding alphanumeric substrings C and IC_6MU as factor/column names (but your data frame doesn't have such columns). All this behavior is implemented in statsmodels to bring it closer to R-like (formula) syntax.

When you call plt.plot(chem_film_data_df['C-IC_6MU'], chem_film_data_df['pass_fail'],'o'), only the pandas selection engine works: you get chem_film_data_df['C-IC_6MU'] and chem_film_data_df['pass_fail'], which are iterables (pandas.Series instances), and these iterables are passed to the plot function.

AFAIK, Matplotlib doesn't perform similar parsing.
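
The difference described above can be illustrated with the standard library alone. This is a rough sketch, not statsmodels' actual parser: it uses Python's own `tokenize` module as a stand-in to show how a formula string splits into names and operators, and a plain dict as a stand-in for column lookup by key. The helper name `name_and_operator_tokens` and the sample values are made up for illustration.

```python
import io
import tokenize


def name_and_operator_tokens(formula):
    """Split a formula string into name, number, and operator tokens.

    Uses Python's tokenizer as a stand-in for a formula parser (an assumption
    for illustration; this is not statsmodels' own code).
    """
    reader = io.StringIO(formula).readline
    keep = {tokenize.NAME, tokenize.NUMBER, tokenize.OP}
    return [tok.string for tok in tokenize.generate_tokens(reader) if tok.type in keep]


# Inside a formula, "-" is an operator, so C-IC_6MU becomes three tokens.
print(name_and_operator_tokens("pass_fail ~ C-IC_6MU + Nitric_Acid"))
# → ['pass_fail', '~', 'C', '-', 'IC_6MU', '+', 'Nitric_Acid']

# As a lookup key, the same text is an opaque string; nothing parses it.
columns = {"C-IC_6MU": [1.0, 2.0], "pass_fail": [0, 1]}  # made-up sample values
print(columns["C-IC_6MU"])
# → [1.0, 2.0]
```

The second half mirrors what happens in the plotting call: `chem_film_data_df['C-IC_6MU']` is a lookup by exact string, so the "-" never gets interpreted.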