Evaluate dataset with logistic regression

chisox721 · (This post was last modified: Jun-05-2019, 06:33 PM by chisox721.)

To start I'll just say that I do a lot of work with Python but I'm venturing into new territory with math/data plotting so bear with me. My dataset includes 4 columns - person, x, y coordinates and a binary response to those coordinates. With this data I'm looking to do a few different things.

- Return a probability value for each set of x,y coordinates
- Create some sort of graph (heatmap/density?) that will show the likelihood of 0/1 for areas of the graph
- Evaluate subsets of the data using the 'person' column

Based on the research I've done, LogisticRegression from sklearn.linear_model seems to be the best way to go about this (I have also toyed with pyGAM). As my script shows, the furthest I've gotten is running the "predict_proba" function on the dataset, but either I'm doing something wrong elsewhere or I just don't know how to interpret the results, because they seem way off. If anyone can help me with this I would really appreciate it.

        
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

#'frame' is the full DataFrame loaded earlier (not shown here)
data_df = frame[['person','x_value','y_value','binary_result']]

#Create a scatter plot of the x,y coordinates with regard to their binary result
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(1, 1, 1)

bin_res = [0,1]
bin_col = ['r','g']

for res,col in zip(bin_res,bin_col):
    plot_df = data_df[(data_df['binary_result'] == res)]
    ax.scatter(plot_df['x_value'], plot_df['y_value'], c=col, marker='.')

plt.show()

#Execute logistic regression on the dataset
x = data_df[['x_value','y_value']]
y = data_df[['binary_result']]

log_reg = linear_model.LogisticRegression(solver='lbfgs').fit(x,np.ravel(y))

#Predicted class (0/1) and class probabilities for each row;
#the predict_proba columns follow the order of log_reg.classes_
predictions = log_reg.predict(x)
predict_a = log_reg.predict_proba(x)

print(predict_a)

*FYI I've also asked this question on StackOverflow

michalmonday · Jun-06-2019, 02:17 PM

Hi, could you provide the dataset?

chisox721 · (This post was last modified: Jun-06-2019, 02:35 PM by chisox721.)

Full dataset is >100,000 rows but here is a sample

        
  person  x_value  y_value  binary_result
0  Larry  -0.8308   1.8334            1.0
1  Jason   0.0220   1.5786            1.0
2  Tommy  -1.1826   1.9428            0.0
3  Frank  -1.4240   2.2711            0.0
4  Brian  -0.9892   1.7922            0.0

EDIT: Didn't realize you could include attachments. Sample file now included.

sample.csv (Size: 137.08 KB / Downloads: 1)

michalmonday · (This post was last modified: Jun-06-2019, 02:41 PM by michalmonday.)

Is it going to be possible to observe the seemingly way off results using this sample? Or would a bigger sample be required?

Btw isn't it making predictions on the same dataset that was used for training?

        
x = data_df[['x_value','y_value']]
y = data_df[['binary_result']]

log_reg = linear_model.LogisticRegression(solver='lbfgs').fit(x,np.ravel(y))

predictions = log_reg.predict(x)

That's what it looks like...

So that would be the reason if by any chance the results seem too accurate.

chisox721 · Jun-06-2019, 02:44 PM

I included an excel file that should contain enough of the data but what are your thoughts on this? When I print "predict_a" my expectation is that it will return the probability of the respective x,y values being 1 vs 0. If that is right then there is an issue with my code because, for example, the coordinate pair (-4,-4) should have a probability of 0 and that doesn't seem to be the case.
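
For reference, here is a rough sketch of how the predict_proba output can be read and why a point like (-4,-4) may not come out near 0. The data below is synthetic (an assumption, not the real dataset): the label is 1 inside a box around the origin and 0 everywhere else. The helper add_squares is a hypothetical name used only for this illustration. With only raw x and y as inputs, logistic regression makes the log-odds a linear function of x and y, so it cannot describe a closed region; adding squared terms is one possible way around that, not necessarily the right one for the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data (assumption): the label is 1 inside
# a box around the origin and 0 everywhere else.
rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(5000, 2))
y = ((np.abs(X[:, 0]) < 1) & (np.abs(X[:, 1]) < 1.5)).astype(int)

# Model on raw x, y: the log-odds are linear in x and y.
raw_model = LogisticRegression(solver="lbfgs").fit(X, y)

# predict_proba returns one column per class, ordered as in classes_,
# so look up which column belongs to class 1 instead of guessing.
col_one = list(raw_model.classes_).index(1)

far_point = np.array([[-4.0, -4.0]])
centre = np.array([[0.0, 0.0]])
p_far_raw = raw_model.predict_proba(far_point)[0, col_one]
p_centre_raw = raw_model.predict_proba(centre)[0, col_one]


def add_squares(A):
    """Append x^2 and y^2 so the model can describe a closed region."""
    return np.column_stack([A, A ** 2])


quad_model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(add_squares(X), y)
p_far_quad = quad_model.predict_proba(add_squares(far_point))[0, col_one]
p_centre_quad = quad_model.predict_proba(add_squares(centre))[0, col_one]

print("raw features:     P(1) at (-4,-4) =", p_far_raw, " at (0,0) =", p_centre_raw)
print("squared features: P(1) at (-4,-4) =", p_far_quad, " at (0,0) =", p_centre_quad)
```

On data shaped like this, the raw-feature model gives roughly the same probability everywhere (close to the overall share of 1s), while the model with squared terms can separate the centre from the far corner.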

michalmonday · (This post was last modified: Jun-06-2019, 02:51 PM by michalmonday.)

I'll take a look at the dataset in a moment, but in general you'd normally have 2 separate datasets with their corresponding binary results (also referred to as "labels"):
1. Training one (used to fit the logistic regression model)
2. Testing one (used to verify how accurate the predictions are)

If you have 1 big dataset, you could split it (scikit-learn has a function specifically for that: train_test_split in sklearn.model_selection).

In the code you posted it seems that the same data is used to train the model and then it's being used for making predictions. This will probably result in "fake" great performance of the classifier.
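
A minimal sketch of such a split, assuming scikit-learn's train_test_split and synthetic data in place of the real file (the labels, the 25% test size and the random seeds are all illustrative choices, not taken from the thread):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the label is more likely 1 when x + y is large.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

# Hold back 25% of the rows; stratify keeps the 0/1 ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training part only.
model = LogisticRegression(solver="lbfgs").fit(X_train, y_train)

# Accuracy on rows the model has seen vs. rows it has not seen.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("test accuracy: ", test_accuracy)
```

The test accuracy is the number to trust when judging the classifier. For a model as simple as logistic regression on two features the two numbers are often close, so the split mostly guards against fooling yourself rather than changing the fit.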

chisox721 · Jun-06-2019, 03:01 PM

That makes sense but for whatever reason I didn't think it was the proper approach based on what I read on similar studies using R. Maybe I need to use a different method/approach? This is a quote from an article that discusses the same thing I'm attempting to do; it probably does a better job of explaining what I'm looking to accomplish.

Quote: Our goal is to use GAMs to learn about each umpire. To start, we grabbed pitch-level data from BS. Next, we fit a GAM for each umpire to identify the likelihood of taken pitches being called a strike and extrapolated from this model the percent chance a taken pitch is called a strike on each part of the plate. Finally, we compared each umpire’s estimated zone with one estimated on all umpires across the major leagues to roughly identify where each umpire has called either fewer or more strikes.
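
The workflow in that quote could be sketched roughly as follows. This is only an illustration under several assumptions: it uses LogisticRegression with squared terms instead of the GAM the article describes, the data is synthetic (only the column names match the sample above), and the helpers features, fit_model, probability_surface and plot_surface are hypothetical names made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic frame with the thread's column names (assumption, not real data):
# a 1 becomes less likely the further a point is from the origin.
rng = np.random.default_rng(2)
n = 3000
data_df = pd.DataFrame({
    "person": rng.choice(["Larry", "Jason", "Tommy"], size=n),
    "x_value": rng.uniform(-3, 3, size=n),
    "y_value": rng.uniform(-3, 3, size=n),
})
r2 = data_df["x_value"] ** 2 + data_df["y_value"] ** 2
p_true = 1.0 / (1.0 + np.exp(-(4.0 - 2.0 * r2)))
data_df["binary_result"] = (rng.random(n) < p_true).astype(float)


def features(x, y):
    """x, y plus squared terms, so the fitted zone can be a closed region."""
    return np.column_stack([x, y, x ** 2, y ** 2])


def fit_model(df):
    """Fit one logistic regression on the rows of df."""
    X = features(df["x_value"].to_numpy(), df["y_value"].to_numpy())
    return LogisticRegression(solver="lbfgs", max_iter=1000).fit(
        X, df["binary_result"].to_numpy()
    )


# Regular grid covering the plotted area.
grid_x, grid_y = np.meshgrid(np.linspace(-3, 3, 61), np.linspace(-3, 3, 61))


def probability_surface(model):
    """P(binary_result == 1) at every grid point, shaped like the grid."""
    col = list(model.classes_).index(1.0)
    X_grid = features(grid_x.ravel(), grid_y.ravel())
    return model.predict_proba(X_grid)[:, col].reshape(grid_x.shape)


# One surface for everyone, one per person, and the per-person difference.
overall_surface = probability_surface(fit_model(data_df))
person_surfaces = {
    person: probability_surface(fit_model(group))
    for person, group in data_df.groupby("person")
}
differences = {p: s - overall_surface for p, s in person_surfaces.items()}


def plot_surface(surface, title):
    """Draw a surface as a heatmap (matplotlib is imported only when called)."""
    import matplotlib.pyplot as plt
    plt.pcolormesh(grid_x, grid_y, surface, shading="auto")
    plt.colorbar(label="value")
    plt.title(title)
    plt.xlabel("x_value")
    plt.ylabel("y_value")
    plt.show()
```

Calling plot_surface(overall_surface, "All persons") would draw the overall heatmap, and plot_surface(differences["Larry"], "Larry minus overall") would show where one person's estimated zone sits above or below the overall one. Swapping fit_model for a pyGAM fit would be closer to the article, but that part is not shown here.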
