Evaluate dataset with logistic regression

chisox721 — Joined: Oct 2017
#1  Jun-05-2019, 06:32 PM (last modified Jun-05-2019, 06:33 PM by chisox721)

To start I'll just say that I do a lot of work with Python, but I'm venturing into new territory with math/data plotting, so bear with me.

My dataset includes 4 columns: person, x coordinate, y coordinate, and a binary response to those coordinates. With this data I'm looking to do a few different things:

- Return a probability value for each set of x,y coordinates
- Create some sort of graph (heatmap/density?) that will show the likelihood of 0/1 for areas of the graph
- Evaluate subsets of the data using the 'person' column

Based on the research I've done, sklearn.linear_model.LogisticRegression seems to be the best way to go about this (I have also toyed with pyGAM). As my script shows, the furthest I've gotten is running predict_proba on the dataset, but either I'm doing something wrong elsewhere or I just don't know how to interpret the results, because they seem way off. If anyone can help me with this I would really appreciate it.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# 'frame' is the full DataFrame loaded from the attached csv
data_df = frame[['person', 'x_value', 'y_value', 'binary_result']]

# Create a scatter plot of the x,y coordinates with regard to their binary result
fig = plt.figure(figsize=(4, 4))
ax = fig.add_subplot(1, 1, 1)
bin_res = [0, 1]
bin_col = ['r', 'g']
for res, col in zip(bin_res, bin_col):
    plot_df = data_df[data_df['binary_result'] == res]
    ax.scatter(plot_df['x_value'], plot_df['y_value'], c=col, marker='.')
plt.show()

# Execute logistic regression on the dataset
x = data_df[['x_value', 'y_value']]
y = data_df[['binary_result']]
log_reg = linear_model.LogisticRegression(solver='lbfgs').fit(x, np.ravel(y))
predictions = log_reg.predict(x)
predict_a = log_reg.predict_proba(x)
print(predict_a)
```

FYI, I've also asked this question on StackOverflow.

michalmonday — Joined: May 2019
#2  Jun-06-2019, 02:17 PM

Hi, could you provide the dataset?

chisox721
#3  Jun-06-2019, 02:35 PM (last modified Jun-06-2019, 02:35 PM by chisox721)

The full dataset is >100,000 rows, but here is a sample:

```
  person  x_value  y_value  binary_result
0  Larry  -0.8308   1.8334            1.0
1  Jason   0.0220   1.5786            1.0
2  Tommy  -1.1826   1.9428            0.0
3  Frank  -1.4240   2.2711            0.0
4  Brian  -0.9892   1.7922            0.0
```

EDIT: Didn't realize you could include attachments. Sample file now included: sample.csv (137.08 KB)

michalmonday
#4  Jun-06-2019, 02:35 PM (last modified Jun-06-2019, 02:41 PM by michalmonday)

Is it going to be possible to observe the seemingly way-off results using this sample, or would a bigger sample be required? By the way, isn't it making predictions on the same dataset that was used for training?
```python
x = data_df[['x_value', 'y_value']]
y = data_df[['binary_result']]
log_reg = linear_model.LogisticRegression(solver='lbfgs').fit(x, np.ravel(y))
predictions = log_reg.predict(x)
```

That's what it looks like... so that would be the reason if, by any chance, the results seem too accurate.

chisox721
#5  Jun-06-2019, 02:44 PM

I included a file that should contain enough of the data, but what are your thoughts on this? When I run predict_proba, my expectation is that it will return the probability of the respective x,y values being 1 vs 0. If that is right, then there is an issue with my code because, for example, a coordinate pair (-4, -4) should have a probability of 0, and that doesn't seem to be the case.

michalmonday
#6  Jun-06-2019, 02:51 PM (last modified Jun-06-2019, 02:51 PM by michalmonday)

I'll take a look at the dataset in a moment, but in general you'd normally have 2 separate datasets with their corresponding binary results (also referred to as "labels"):

1. A training one (used to fit the logistic regression model)
2. A testing one (used to verify how accurate the predictions are)

If you have 1 big dataset, you can split it (scikit-learn has a method especially for that). In the code you posted, the same data is used to train the model and then to make predictions. This will probably result in "fake" great performance of the classifier.

chisox721
#7  Jun-06-2019, 03:01 PM

That makes sense, but for whatever reason I didn't think it was the proper approach based on what I read in similar studies using R. Maybe I need to use a different method/approach?
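The train/test split described above can be sketched as follows. Since the attached sample.csv isn't reproduced here, this sketch builds a synthetic DataFrame with the same column names as the posted sample; the frame contents and the 25% test fraction are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the attached sample.csv, same column layout
rng = np.random.default_rng(0)
n = 1000
frame = pd.DataFrame({
    'person': rng.choice(['Larry', 'Jason', 'Tommy'], size=n),
    'x_value': rng.normal(0, 1, size=n),
    'y_value': rng.normal(2, 1, size=n),
})
# Make the label loosely depend on position so the fit is non-trivial
frame['binary_result'] = (frame['x_value'] + rng.normal(0, 1, size=n) > 0).astype(float)

X = frame[['x_value', 'y_value']]
y = frame['binary_result']

# Hold out 25% of the rows for testing; fit only on the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
log_reg = LogisticRegression(solver='lbfgs').fit(X_train, y_train)

# Score on rows the model has never seen
print('test accuracy:', log_reg.score(X_test, y_test))
```

Scoring on the held-out rows gives a more honest estimate of how well the fitted boundary generalizes than scoring on the training rows themselves.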
This is a quote from an article that discusses the same thing I'm attempting to do; it probably does a better job of explaining what I'm looking to accomplish:

Quote:
Our goal is to use GAMs to learn about each umpire. To start, we grabbed pitch-level data from BS. Next, we fit a GAM for each umpire to identify the likelihood of taken pitches being called a strike and extrapolated from this model the percent chance a taken pitch is called a strike on each part of the plate. Finally, we compared each umpire's estimated zone with one estimated on all umpires across the major leagues to roughly identify where each umpire has called either fewer or more strikes.
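The heatmap the opening post asks for can be sketched by evaluating predict_proba on a grid of x,y points and drawing the result. This is a minimal sketch with synthetic data standing in for sample.csv; the grid extents, colormap, and output filename are all assumptions.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless-safe backend for this sketch
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the attached sample.csv
rng = np.random.default_rng(1)
n = 2000
X = rng.normal([0, 2], [1.5, 1.5], size=(n, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, size=n) > 1).astype(int)

log_reg = LogisticRegression(solver='lbfgs').fit(X, y)

# Evaluate P(binary_result == 1) on a grid covering the plane
xx, yy = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-3, 7, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])
proba = log_reg.predict_proba(grid)[:, 1].reshape(xx.shape)

# Draw the probability surface as a heatmap
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.pcolormesh(xx, yy, proba, cmap='RdYlGn', vmin=0, vmax=1, shading='auto')
fig.colorbar(im, ax=ax, label='P(binary_result == 1)')
ax.set_xlabel('x_value')
ax.set_ylabel('y_value')
fig.savefig('probability_heatmap.png')
```

Because plain logistic regression fits a linear boundary, this heatmap is a smooth sigmoid ramp across the plane; a GAM (as in the quoted article) would instead produce a surface with local structure, which is why the per-umpire zones there are estimated with GAMs.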


