Python Forum
Coding Mechanics
#1
I have the following code snippet below:
import pandas as pd
import statsmodels.formula.api as sms

fat = pd.read_csv('https://s3-us-west-2.amazonaws.com/static-resources.zybooks.com/fat.csv')

# Response variable
Y = fat['body_fat_percent']

# Generates the linear regression model
# Multiple predictor variables are joined with +
model = sms.ols('Y ~ triceps_skinfold_thickness_mm + midarm_circumference_cm + thigh_circumference_cm', data = fat).fit()

# Prints a list of the fitted values for each sample
print(model.fittedvalues)
What I want to know is how Python knows to map the variable Y, which is set to a column in a dataframe in line 7, to the "Y" reference in the string in the ols method call on line 11. As far as I can tell, I am just setting Y and then doing nothing with it, but somehow Python knows to reference it in the method call.
#2
Look at

https://www.statsmodels.org/stable/gener...i.ols.html

and

https://www.statsmodels.org/stable/examp...g-formulas

So, in your code the first argument to sms.ols() is a string (it can also be a Formula object, but in your case it is a str):

'Y ~ triceps_skinfold_thickness_mm + midarm_circumference_cm + thigh_circumference_cm'

This is the formula. On the left-hand side of ~ is Y (the dependent variable in the model).

For more info on the patsy formula language look here: https://patsy.readthedocs.io/en/latest/
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#3
Thanks so much for your response! I guess what I don't understand is how the code inside the ols method knows that "Y" = "body_fat_percent". I don't pass anything into that method to tell it to make that connection, but somehow it knows what Y is. Prior to the method call, I assigned the column to a variable called Y. But I don't pass that variable to the ols method. So how does the code inside the method know what Y is?
#4
(Dec-02-2020, 11:30 AM)321brian Wrote: I don't pass anything into that method to tell it to make that connection, but somehow it knows what Y is. Prior to the method call, I assigned the column to a variable called Y. But I don't pass that variable to the ols method. So how does the code inside the method know what Y is?
But you bind the name Y to the column fat['body_fat_percent'] on line 7!
It uses patsy as the language to describe the model. It parses the string 'Y ~ triceps_skinfold_thickness_mm + midarm_circumference_cm + thigh_circumference_cm' and uses the name Y you defined, as well as the other columns from the fat DataFrame.

If you prefer you can drop line 7 and just use
model = sms.ols('body_fat_percent ~ triceps_skinfold_thickness_mm + midarm_circumference_cm + thigh_circumference_cm', data = fat).fit()
Here's again the link to patsy package docs: https://patsy.readthedocs.io/en/latest/
#5
Well, I think we are getting closer. What happens to the data frame when you do this binding operation? See to me, Y = df['columnname'] would only create a new address in memory and store a copy of that column object or pass the address in memory and create a new reference. But in either case, nothing would change about the column object. So for example let's say I have Y = 7. Nothing changes about 7 to tell me that I assigned it to Y. I would not be able to inspect 7 to figure this out. But 7 isn't a variable, I point it out to help explain my confusion. So let's say I did this: Y = 7, and then wrote A = Y. I would not be able to inspect Y in any way to know that I assigned it to A, at least in the programming languages I am familiar with. If I did this B = Y, C = Y, D = Y, I could not inspect Y to know that it had been assigned to B, C, or D.

But, if I understand your point, does the act of assigning df['columnname'] to a variable materially change df so that I can inspect it to see where it is assigned? Or is Y a special global variable? Or is there some kind of container that all this code is running in that allows you to see all the variables and what they've been assigned? Maybe this is a question about the way scoping works in Python. In C#, this code would not work unless Y was declared elsewhere as a static, global variable. But taken as is, Y is a variable locally scoped to the calling code and would not be accessible to the code inside the method call without being explicitly passed into the method.

So my question is not about the patsy language...it's the mechanics of variable allocation and scope I think.
#6
(Dec-02-2020, 12:45 PM)321brian Wrote: See to me, Y = df['columnname'] would only create a new address in memory and store a copy of that column object or pass the address in memory and create a new reference.

OK, I think you need to read this, about names in python

https://nedbatchelder.com/text/names1.html
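
To make the point about names concrete, here is a minimal sketch (not taken from the linked article). The list is a made-up stand-in for the DataFrame column, and names_bound_to is a hypothetical helper written just for this illustration. It suggests that assignment only adds a name for an existing object: the object itself is not copied or changed, and the name lives in a namespace, which is essentially a dict from names to objects.

```python
def names_bound_to(obj, namespace):
    """Return the names in `namespace` that refer to exactly this object."""
    return sorted(name for name, value in namespace.items() if value is obj)


# Made-up numbers standing in for fat['body_fat_percent'].
column = [12.5, 18.3, 22.1]

# Assignment binds a second name to the same object; nothing is copied.
Y = column
same_object = Y is column

# The list keeps no record of its names. The namespace does, and any code
# that can get hold of the namespace can search it.
namespace = {"column": column, "Y": Y, "unrelated": [12.5, 18.3, 22.1]}
names = names_bound_to(column, namespace)
print(same_object, names)
```

So you cannot inspect the column to learn that it is called Y, but code that has access to the namespace can look Y up by name.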

And certainly there is more to it, specifically how patsy parses the string and looks up the names in the calling code's namespace. That's why I keep pointing at the docs.

Or maybe I don't understand your question 100% and someone else could step in...
#7
I dunno. The patsy reference has a lot of content in it. So I'm not sure where in that content I find my answer. The other links cover a lot about scoping and assignments. Maybe somewhere in there I am missing an implication that ties back.

Y = df['somecolumn']

def ols(somestring, dataframe):
    for column in dataframe:
        # somehow I check to see if this column was assigned to a name "Y"??
        pass
#8
My understanding is as follows:
1. patsy parses the formula string (or Formula object).
2. patsy inspects the execution environment to resolve the names found in step 1. The code for this is in https://github.com/pydata/patsy/blob/mas...sy/eval.py Maybe there are other parts too (e.g. here, where it uses the EvalEnvironment class to look up any variables referenced in termlists that cannot be found in data_iter_maker, e.g. your case of Y, which is a name not found in the dataframe passed as the data argument), but this looks like the main part of the code that deals with the matter.
I really don't want to go into details there. It's really very low level and, as the comment at the beginning states:
Quote:# Utilities that require an over-intimate knowledge of Python's execution environment.
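
For a rough idea of the mechanism, here is a minimal sketch using only the standard library. It is not patsy's actual code: lookup_in_caller and fit_demo are hypothetical names, the numbers are made up, and the "data first, then the caller's variables" order is an assumption made for this illustration. It shows that a function can reach the variables of the code that called it through the call stack, without them being passed as arguments.

```python
import inspect


def lookup_in_caller(name, data):
    """Resolve `name` from `data` first, then from the caller's variables."""
    if name in data:
        return data[name]
    # The frame one level up the call stack belongs to whoever called us.
    caller = inspect.currentframe().f_back
    if name in caller.f_locals:
        return caller.f_locals[name]
    if name in caller.f_globals:
        return caller.f_globals[name]
    raise NameError(f"{name!r} not found in data or in the calling code")


def fit_demo():
    # Plays the role of the script that calls ols(); values are made up.
    data = {"thigh_circumference_cm": [49.8, 54.3, 51.7]}
    Y = [11.9, 22.8, 18.7]  # never passed to lookup_in_caller
    from_caller = lookup_in_caller("Y", data)
    from_data = lookup_in_caller("thigh_circumference_cm", data)
    return from_caller, from_data


response, predictor = fit_demo()
print(response, predictor)
```

In C# terms, this is closer to reflection over the caller's stack frame than to ordinary scoping, which is why Y does not need to be a global or an argument.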