Is it possible to recombine values in python? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Is it possible to recombine values in python? (/thread-14423.html)
Is it possible to recombine values in python? - synthex - Nov-29-2018

If I posted in the wrong branch, please let me know. I was advised to ask this question on a Python forum after I asked it on the SQL MSDN forum:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/a5fdcc0f-1941-4cee-9413-c1963b6da935/is-it-possible-to-recombine-values-in-sql?forum=transactsql

I will try to state the question more clearly. I have data (data1.csv) for building a multiple regression model. Today there may be 10 X's (i.e. predictors), tomorrow 20, because I have different models. The main characteristic of the model is R^2, which should preferably be between 0.8 and 1. Suppose I build a model manually, but its R^2 is small. Is it possible to recombine the values of all variables in Python until R^2 becomes greater, or reaches the maximum possible on this data? By recombining values I mean deleting rows and substituting values, i.e. iterating.

To make this clearer, here is an example, data1.csv (I can't attach a CSV file, so I uploaded it to a web share): data1.csv

Dependent variable: PT_POOR (the Y variable).

I ran the regression manually and got R^2 = 0,043 (very bad). Let's delete Curt's row from the data and run the regression again. Without Curt, R^2 = 0,40: better, but not ideal.

In this example it was enough to remove just one row (Curt's row). But there are other cases where the values that interfere with building a model are scattered across the dataset. Let's look at example 2 (data2orig.csv, the original dataset). Without these values (the empty cells in data2.csv) the model has R^2 = 0,70782713, which is good. Note: the deleted values are replaced by mean substitution.

So a similar recombination of values is needed until the maximum possible R^2 is obtained. Note that the deleted values are not lost: the values that were taken out of the analysis can be viewed in a new file generated by Python, "thebadvalue.csv".

So the output should consist of the following tables:

1.) The cleaned input table (data2.csv without the "bad" values; in our case from example 2): data2.csv
2.) A table with the beta coefficients (@slope) and R^2, in matlab.csv:

            b          R^2
Intercept   45,41402   0,707827
POP_CHNG    -0,29028   0,707827
N_EMPLD     0,00176    0,707827
TAX_RATE    2,18822    0,707827
PT_PHONE    -0,28344   0,707827
AGE         -0,26575   0,707827
PT_RURAL    0,081      0,707827

Example of "thebadvalue.csv":

POP_CHNG N_EMPLD PT_POOR TAX_RATE PT_PHONE PT_RURAL AGE
Benton
Cannon
Carrol
Cheatheam
Cumberland
DeKalb
Dyer
Gibson 3040
Greene
Hawkins
Haywood
Henry
Houston
Humphreys
Jackson
Johnson
Lawrence
McNairy
Madison
Marshall
Maury
Montgomery
Morgan
Sevier
Shelby 11500
Sullivan
Trousdale 100
Unicoi
Wayne 100
Weakley

How can I do this? Sorry for my English; I am not a native speaker.

RE: Is it possible to recombine values in python? - ichabod801 - Nov-29-2018

This should be possible in Python, but I'm not sure if it's been implemented in a package yet. This is a common issue in statistics (variable selection). Common methods include forward, backward, and stepwise selection. Those depend on the statistical features of the model. However, the more statistical analysis you do on the model, the more you reduce your theoretical degrees of freedom. I was taught that it's generally better to go to subject matter experts to get some idea of which variables are more important or more likely to affect the model. Of course, this can run into the Moneyball problem of there being common biases among the "experts."

Anyway, once you understand the details of the selection methods, you could write a Python program to repeatedly generate the models, compare the model features (such as R-squared), select the next variable, and possibly repeat. You would need a good understanding of the statistical selection methods, and of the statistical packages in Python (numpy and pandas, at least).

RE: Is it possible to recombine values in python?
- buran - Nov-29-2018

In addition to overfitting the model, without a scientific/expert background that can explain the causality between the dependent and independent variable(s), you can run into the spurious relationship problem.

RE: Is it possible to recombine values in python? - ichabod801 - Nov-29-2018

Also be wary of trying to increase R-squared. Anything you add to the model will increase R-squared, or at least never lower it. You can literally add the price of tea in China to any model, and it will increase the R-squared for that model. So R-squared cannot be the only thing you are looking at.

RE: Is it possible to recombine values in python? - synthex - Nov-30-2018

Thank you, I will think about it.

RE: Is it possible to recombine values in python? - Sandberry - Nov-30-2018

(Nov-29-2018, 03:17 PM)ichabod801 Wrote: Also be wary of trying to increase R-squared. Anything you add to the model will increase R-squared, or at least never lower it. You can literally add the price of tea in China to any model, and it will increase the R-squared for that model. So R-squared cannot be the only thing you are looking at.

What's the worst case scenario if R-squared is increased? Is the model done with in that case?

RE: Is it possible to recombine values in python? - ichabod801 - Nov-30-2018

(Nov-30-2018, 07:42 PM)Sandberry Wrote: What's the worst case scenario if R-squared is increased? Is the model done with in that case?

The model could be weaker because of too many variables. In the worst case it could be such a weak model that it's useless. It may match the data you fit it to very well, but it won't match other data as well, lowering its predictive value and weakening any conclusions drawn from the parameters of the model.
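
The point about R-squared can be checked with a small experiment. The following is only a minimal sketch, assuming NumPy is installed and using synthetic data (not the data1.csv or data2.csv files from the first post). The helper names `r_squared` and `adjusted_r_squared` are made up for this illustration. It fits an ordinary least squares model, then refits it with ten extra columns of pure noise, and prints the in-sample R-squared next to the adjusted R-squared, which penalizes the number of predictors.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares coefficients
    resid = y - A @ beta
    ss_res = float(resid @ resid)                  # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total sum of squares
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_rows, n_predictors):
    """Adjusted R^2: penalizes R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)


# Synthetic data (an assumption for illustration): y depends on two predictors.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=2.0, size=n)

# Ten extra columns that have nothing to do with y ("price of tea in China").
noise = rng.normal(size=(n, 10))

r2_base = r_squared(x, y)
r2_noise = r_squared(np.column_stack([x, noise]), y)

print("R^2, 2 real predictors:        ", round(r2_base, 4))
print("R^2, plus 10 noise predictors: ", round(r2_noise, 4))
print("adjusted R^2, 2 predictors:    ", round(adjusted_r_squared(r2_base, n, 2), 4))
print("adjusted R^2, 12 predictors:   ", round(adjusted_r_squared(r2_noise, n, 12), 4))
```

The in-sample R-squared of the larger model cannot be lower than that of the smaller one, because the smaller model is a special case of the larger one. That is why R-squared alone is a poor target to maximize; comparing adjusted R-squared, or R-squared on data that was held out from fitting, gives a fairer picture.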