
 Evaluate dataset with logistic regression
To start I'll just say that I do a lot of work with Python, but I'm venturing into new territory with math/data plotting, so bear with me. My dataset has four columns: person, x coordinate, y coordinate, and a binary response to those coordinates. With this data I'm looking to do a few different things.

- Return a probability value for each set of x,y coordinates
- Create some sort of graph (heatmap/density?) that shows the likelihood of 0/1 across areas of the plot
- Evaluate subsets of the data using the 'person' column

Based on the research I've done, sklearn.linear_model's LogisticRegression seems to be the best way to go about this (I've also toyed with pyGAM). As my script shows, the furthest I've gotten is running the "predict_proba" function on the dataset, but either I'm doing something wrong elsewhere or I just don't know how to interpret the results, because they seem way off. If anyone can help me with this I would really appreciate it.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# 'frame' is the full DataFrame loaded earlier in the script
data_df = frame[['person','x_value','y_value','binary_result']]

#Create a scatter plot of the x,y coordinates, colored by binary result
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(1, 1, 1)

bin_res = [0,1]
bin_col = ['r','g']

for res,col in zip(bin_res,bin_col):
    plot_df = data_df[data_df['binary_result'] == res]
    ax.scatter(plot_df['x_value'], plot_df['y_value'], c=col, marker='.')

#Execute logistic regression on the dataset
x = data_df[['x_value','y_value']]
y = data_df[['binary_result']]

log_reg = linear_model.LogisticRegression(solver='lbfgs').fit(x, np.ravel(y))

predictions = log_reg.predict(x)
predict_a = log_reg.predict_proba(x)
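For the heatmap/density idea, one common approach is to evaluate predict_proba over a grid of x,y points and draw a filled contour. A minimal sketch of that idea, using synthetic stand-in data since I can't run it against the real CSV here:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real x/y/binary columns (assumption: the
# label depends on position, as in the original dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

log_reg = LogisticRegression(solver='lbfgs').fit(X, y)

# Evaluate predict_proba on a grid covering the data range
xx, yy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
proba = log_reg.predict_proba(grid)[:, 1].reshape(xx.shape)

# Filled contour: green where P(1) is high, red where it's low
fig, ax = plt.subplots(figsize=(4, 4))
cs = ax.contourf(xx, yy, proba, levels=20, cmap='RdYlGn')
fig.colorbar(cs, ax=ax, label='P(binary_result = 1)')
```

With a plain logistic regression the contours will be parallel straight bands (the decision surface is linear in x,y); a GAM would let them curve.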

*FYI I've also asked this question on StackOverflow
Hi, could you provide the dataset?
Full dataset is >100,000 rows but here is a sample

   person  x_value  y_value  binary_result
0  Larry  -0.8308   1.8334            1.0
1  Jason   0.0220   1.5786            1.0
2  Tommy  -1.1826   1.9428            0.0
3  Frank  -1.4240   2.2711            0.0
4  Brian  -0.9892   1.7922            0.0
EDIT: Didn't realize you could include attachments. Sample file now included.
Attachment: sample.csv (137.08 KB)
Is it going to be possible to observe the seemingly way-off results using this sample? Or would a bigger sample be required?

Btw isn't it making predictions on the same dataset that was used for training?
x = data_df[['x_value','y_value']]
y = data_df[['binary_result']]
log_reg = linear_model.LogisticRegression(solver='lbfgs').fit(x,np.ravel(y))
predictions = log_reg.predict(x)
That's what it looks like...

So that would be the reason if by any chance the results seem too accurate.
I included a sample file that should contain enough of the data, but what are your thoughts on this? When I run "predict_a", my expectation is that it will return the probability of the respective x,y values being 1 vs. 0. If that's right, then there's an issue with my code because, for example, a coordinate pair of -4,-4 should have a probability close to 0, and that doesn't seem to be the case.
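One thing worth checking when reading predict_proba output: it returns one column per class, ordered like log_reg.classes_, so "the probability of 1" is the column matching class 1, not necessarily column 0. A sketch with made-up data showing how to query a single coordinate like (-4, -4):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data standing in for the real x/y columns
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X.sum(axis=1) > 0).astype(int)

log_reg = LogisticRegression(solver='lbfgs').fit(X, y)

# predict_proba returns one column per class, in classes_ order
print(log_reg.classes_)                    # e.g. [0 1]
p = log_reg.predict_proba([[-4.0, -4.0]])  # shape (1, 2)

# Pick the column that corresponds to class 1
p1 = p[0, list(log_reg.classes_).index(1)]
print(p1)  # should be near 0 for this synthetic data
```

If the probabilities still look wrong after confirming the column order, the issue is more likely in the features or the model fit itself.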
I'll take a look at the dataset in a moment, but in general you'd normally have 2 separate datasets with their corresponding binary results (also referred to as "labels"):
1. Training one (used to fit the logistic regression model)
2. Testing one (used to verify how accurate the predictions are)

If you have 1 big dataset, you could split it (scikit has a method especially for that).

In the code you posted, the same data is used both to train the model and to make predictions. This will probably result in deceptively good performance of the classifier.
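The scikit method I mentioned for splitting is train_test_split. A minimal sketch with dummy data (column names and sizes are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Dummy data standing in for the real x/y/binary columns
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Hold out 25% of the rows for testing; stratify keeps the 0/1
# balance roughly the same in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

log_reg = LogisticRegression(solver='lbfgs').fit(X_train, y_train)
print(log_reg.score(X_test, y_test))  # accuracy on rows the model never saw
```

The score on the held-out rows is the honest estimate of how well the model generalizes; the score on the training rows will almost always be higher.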
That makes sense, but for whatever reason I didn't think it was the proper approach based on what I read in similar studies using R. Maybe I need to use a different method/approach? Here's a quote from an article that discusses the same thing I'm attempting to do; it probably does a better job of explaining what I'm looking to accomplish.

Quote:Our goal is to use GAMs to learn about each umpire. To start, we grabbed pitch-level data from BS. Next, we fit a GAM for each umpire to identify the likelihood of taken pitches being called a strike and extrapolated from this model the percent chance a taken pitch is called a strike on each part of the plate. Finally, we compared each umpire’s estimated zone with one estimated on all umpires across the major leagues to roughly identify where each umpire has called either fewer or more strikes.
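Translated to this dataset, that one-GAM-per-umpire approach maps onto the 'person' column: fit one model per person and compare each against a model fit on everyone. A sketch of the per-person loop, using a dummy DataFrame shaped like the four columns from the original post (the real fits could use pyGAM instead of LogisticRegression):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Dummy frame shaped like the original person/x/y/binary columns
rng = np.random.default_rng(3)
n = 600
data_df = pd.DataFrame({
    'person': rng.choice(['Larry', 'Jason', 'Tommy'], size=n),
    'x_value': rng.normal(size=n),
    'y_value': rng.normal(size=n),
})
data_df['binary_result'] = (
    data_df['x_value'] + data_df['y_value'] > 0).astype(int)

# One model fit on everyone, for comparison
overall = LogisticRegression(solver='lbfgs').fit(
    data_df[['x_value', 'y_value']], data_df['binary_result'])

# One model per person, mirroring the one-model-per-umpire idea
models = {}
for person, sub in data_df.groupby('person'):
    X = sub[['x_value', 'y_value']]
    y = sub['binary_result']
    models[person] = LogisticRegression(solver='lbfgs').fit(X, y)

print(sorted(models))
```

Comparing each person's predict_proba surface against the overall model's surface would then show where that person deviates from the aggregate, which is essentially what the article does per umpire.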
