Is it possible to recombine values in python?

synthex · Nov-29-2018, 10:42 AM

If I have posted in the wrong section, please let me know.
I was advised to post this question on a Python forum when I asked it on the SQL MSDN forum:
https://social.msdn.microsoft.com/Forums...ransactsql
Here is the question; I will try to state it more clearly.

I have data (data1.csv) for building a multiple regression model. The number of predictors (X's) varies: today there may be 10, tomorrow 20, because I have different models.

The main characteristic of the model is R^2, which should preferably be between 0.8 and 1.

Suppose I create a model manually, but its R^2 is small. Is it possible in Python to recombine the values of all variables until R^2 becomes larger, or reaches the maximum possible on this data?

By recombining values I mean deleting rows and substituting values, i.e., iterating.

To make this clearer, here is an example.

data1.csv (I can't attach a CSV file, so I uploaded it to a web share)
data1.csv

Dependent variable: PT_POOR (the Y variable)

I ran the regression manually and got R^2=,043 (which is very bad).

Let's delete Curt's row from the data and run the regression again. Without Curt, R^2=,40, which is better but still not ideal.

In this example, it was enough to remove a single row (Curt's row).
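The single-row case above can be sketched as a leave-one-out loop: refit the regression with each row dropped in turn and see which deletion raises R^2 the most. This is only a minimal sketch, assuming numpy is available. The data is made up (it is not data1.csv), and the helper names `r_squared` and `r_squared_without_each_row` are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def r_squared_without_each_row(X, y):
    """Refit the model n times, each time leaving out one row."""
    n = len(y)
    scores = []
    for i in range(n):
        keep = np.arange(n) != i          # boolean mask that drops row i
        scores.append(r_squared(X[keep], y[keep]))
    return np.array(scores)

# Hypothetical data: 30 rows, 2 predictors, one row with a grossly wrong Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=30)
y[5] += 40.0                              # plant the "Curt-like" row at index 5

full = r_squared(X, y)
loo = r_squared_without_each_row(X, y)
best_row_to_drop = int(np.argmax(loo))
print("R^2 with all rows:", round(full, 3))
print("row whose removal helps most:", best_row_to_drop)
print("R^2 without that row:", round(float(loo[best_row_to_drop]), 3))
```

Keep in mind that a loop like this only finds the row whose removal flatters R^2 the most. It does not tell you whether that row is actually wrong, so every dropped row still needs a reason beyond "the fit got better".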

But there are other cases where the values that interfere with building a model are scattered across the dataset.

Let's look at example 2 (data2orig.csv), the original dataset.

Without these values (the empty cells in data2.csv), the model has R^2=,70782713, which is good.
Note: the deleted values are replaced by mean substitution. So a similar recombination of values is needed until the maximum possible R^2 is obtained.
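Mean substitution as described here can be sketched in a few lines: each flagged cell is replaced by the mean of the unflagged cells in the same column, and the removed values are kept in a separate array. This is a minimal sketch assuming numpy. The small array and the mask below are invented for illustration, and `mean_substitute` is a hypothetical helper name.

```python
import numpy as np

def mean_substitute(X, bad_mask):
    """Replace flagged cells with the mean of the unflagged cells in their column."""
    X = np.asarray(X, dtype=float)
    masked = np.where(bad_mask, np.nan, X)      # hide the flagged cells
    col_means = np.nanmean(masked, axis=0)      # column means ignoring hidden cells
    return np.where(bad_mask, col_means, X)     # col_means broadcasts across rows

# Hypothetical 3x2 table with one suspicious cell in the second column.
X = np.array([[1.0,   100.0],
              [2.0, 11500.0],
              [3.0,   300.0]])
bad_mask = np.array([[False, False],
                     [False, True],
                     [False, False]])

cleaned = mean_substitute(X, bad_mask)
removed = np.where(bad_mask, X, np.nan)         # only the values taken out
print(cleaned)
print(removed)
```

Here the flagged cell becomes 200.0, the mean of 100.0 and 300.0, while `removed` holds the original 11500.0 and NaN everywhere else.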

Note that the deleted values are not lost: the values taken out of the analysis can be viewed in a new file generated by Python, "thebadvalue.csv".
So the output consists of the following tables:

1.) the cleaned input table (data2.csv without the "bad" values; in our case, from example 2)
data2.csv

2.) a table with the beta coefficients (@slope) and R^2, in matlab.csv

| Variable | b | R^2 |
|---|---|---|
| Intercept | 45,41402 | 0,707827 |
| POP_CHNG | -0,29028 | 0,707827 |
| N_EMPLD | 0,00176 | 0,707827 |
| TAX_RATE | 2,18822 | 0,707827 |
| PT_PHONE | -0,28344 | 0,707827 |
| AGE | -0,26575 | 0,707827 |
| PT_RURAL | 0,081 | 0,707827 |

Example of "thebadvalue.csv":
POP_CHNG N_EMPLD PT_POOR TAX_RATE PT_PHONE PT_RURAL AGE
Benton
Cannon
Carrol
Cheatheam
Cumberland
DeKalb
Dyer
Gibson 3040
Greene
Hawkins
Haywood
Henry
Houston
Humphreys
Jackson
Johnson
Lawrence
McNairy
Madison
Marshall
Maury
Montgomery
Morgan
Sevier
Shelby 11500
Sullivan
Trousdale 100
Unicoi
Wayne 100
Weakley

How can I do this?

Sorry for my English; I am not a native speaker.

***ichabod801*** · Nov-29-2018, 02:26 PM

This should be possible in Python, but I'm not sure if it's been implemented in a package yet. This is a common issue in statistics (variable selection). Common methods include forward, backward, and stepwise selection. Those depend on the statistical features of the model. However, the more statistical analysis you do on the model, the more you reduce your theoretical degrees of freedom. I was taught that it's generally better to go to subject matter experts to get some idea of which variables are more important or more likely to affect the model. Of course, this can run into the Moneyball problem of there being common biases among the "experts."

Anyway, once you understand the details of the selection methods, you could write a Python program to repeatedly generate the models, compare the model features (such as R-squared), select the next variable, and possibly repeat. You would need a good understanding of the statistical selection methods and of the statistical packages in Python (numpy and pandas, at least).
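As a rough illustration of such a loop, here is a minimal sketch of greedy forward selection using numpy only. It scores candidates by adjusted R-squared, which is a swapped-in criterion chosen for brevity: classical forward/stepwise procedures usually rely on F-tests or p-values. The data and the names `adjusted_r2`, `forward_select`, and `x0`..`x4` are all hypothetical.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an OLS fit with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_select(X, y, names):
    """Add one predictor at a time while adjusted R^2 keeps improving."""
    chosen, remaining = [], list(range(X.shape[1]))
    best = -np.inf
    while remaining:
        # Score every candidate model: current predictors plus one more.
        scores = {j: adjusted_r2(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:        # no candidate improves the score
            break
        best = scores[j_best]
        chosen.append(j_best)
        remaining.remove(j_best)
    return [names[j] for j in chosen], best

# Hypothetical data: only x0 and x2 actually drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=60)
names = ["x0", "x1", "x2", "x3", "x4"]

selected, score = forward_select(X, y, names)
print("selected:", selected)
print("adjusted R^2:", round(score, 3))
```

Even with a penalized score, a greedy search like this can still pick up irrelevant predictors by chance, which is one reason the advice to consult subject matter experts still applies.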

**buran** · (This post was last modified: Nov-29-2018, 02:57 PM by buran.)

In addition to overfitting the model, without a scientific/expert background that can explain the causality between the dependent and independent variable(s), you can run into the spurious relationship problem.

***ichabod801*** · Nov-29-2018, 03:17 PM

Also be wary of trying to increase R-squared. Adding a variable to the model never decreases R-squared, and in practice almost always increases it. You can literally add the price of tea in China to any model, and it will increase the R-squared for that model. So R-squared cannot be the only thing you are looking at.
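This is easy to check numerically. The sketch below, assuming numpy and made-up data, fits a model with one real predictor and then adds a column of pure noise (the hypothetical `tea_price`). For least squares with an intercept, R-squared of the larger model cannot be lower than that of the nested smaller one; adjusted R-squared is printed alongside because it penalizes the extra term and may go down.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) of an OLS fit with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

# Hypothetical data: y depends on x only; tea_price is unrelated noise.
rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)
tea_price = rng.normal(size=(n, 1))

r2_small, adj_small = r2_and_adjusted(x, y)
r2_big, adj_big = r2_and_adjusted(np.hstack([x, tea_price]), y)

print("R^2:          ", r2_small, "->", r2_big)
print("adjusted R^2: ", adj_small, "->", adj_big)
```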

synthex · Nov-30-2018, 07:56 AM

Thank you, I will think about it.

Sandberry · (This post was last modified: Sep-22-2023, 01:31 PM by Sandberry.)

(Nov-29-2018, 03:17 PM)ichabod801 Wrote: Also be wary of trying to increase R-squared. Anything you add to the model will increase R-squared. You can literally add the price of tea in China to any model, and it will increase the R-squared for that model. So R-squared cannot be the only thing you are looking at.

What's the worst case scenario if R-squared is increased? Is the model done with in that case?

***ichabod801*** · Nov-30-2018, 07:58 PM

(Nov-30-2018, 07:42 PM)Sandberry Wrote: What's the worst case scenario if R-squared is increased? Is the model done with in that case?

The model could be weaker because of too many variables; in the worst case, it could be such a weak model that it's useless. It may match your test data very well, but it won't match other data as well, lowering its predictive value and undermining any conclusions drawn from the parameters of the model.
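The effect can be seen by comparing R-squared on the rows used for fitting with R-squared on rows the model has never seen. This is a minimal sketch assuming numpy; the data is synthetic, only the first column truly affects y, and the helper names `fit_ols` and `r2_on` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r2_on(X, y, beta):
    """R^2 of already-fitted coefficients evaluated on (X, y)."""
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

# Hypothetical data: 1 real predictor plus 14 irrelevant ones, few fitting rows.
rng = np.random.default_rng(3)
n_fit, n_new, n_noise = 20, 200, 14
X = rng.normal(size=(n_fit + n_new, 1 + n_noise))
y = 2.0 * X[:, 0] + rng.normal(size=n_fit + n_new)
X_fit, y_fit = X[:n_fit], y[:n_fit]
X_new, y_new = X[n_fit:], y[n_fit:]

beta_big = fit_ols(X_fit, y_fit)            # all 15 predictors
beta_small = fit_ols(X_fit[:, :1], y_fit)   # the real predictor only

print("big model,   fitting rows:", r2_on(X_fit, y_fit, beta_big))
print("big model,   new rows:    ", r2_on(X_new, y_new, beta_big))
print("small model, fitting rows:", r2_on(X_fit[:, :1], y_fit, beta_small))
print("small model, new rows:    ", r2_on(X_new[:, :1], y_new, beta_small))
```

The big model always scores at least as well as the small one on the fitting rows, since it contains the small model as a special case. What matters is how each one holds up on the new rows.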
