Nov-23-2017, 05:14 PM
(This post was last modified: Nov-23-2017, 06:06 PM by asahdkhaled.)
Important: I also asked this question on Cross Validated, but it was not answered there because it did not seem to be mathematical (though I think it is). If anything about my question is unclear, feel free to ask and I will gladly give more details :)
Can someone tell me how to format code here? The T symbol isn't working for me.
I have a column with continuous values. I want to find out which distribution describes my column best, for example whether the column is normally distributed.
As I see it, there are 4 different approaches, shown in the code:
Let's assume that output.values() holds the values I want to use for the kstest.
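import numpy as np
from scipy.stats import norm, kstest, ks_2samp

# materialize the values once so NumPy receives a real array (needed on Python 3)
values = np.asarray(list(output.values()), dtype=float)
var, std, mean, length = np.var(values), np.std(values), np.mean(values), len(values)
mini, maxi = values.min(), values.max()
# standardized lower and upper bounds of the data
a, b = (mini - mean) / std, (maxi - mean) / std
# random reference samples using the range / moments estimated from the data
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
# fit a normal distribution (norm.fit, not n.fit: n does not exist yet) and freeze it
loc, scale = norm.fit(values)
n = norm(loc=loc, scale=scale)
# possibility 1: ks_2samp against a random normal sample
print(ks_2samp(values, norm2))
# possibility 2: kstest vs. n.cdf of the fitted normal
print(kstest(values, n.cdf))
# possibility 3: kstest vs. 'norm' (the standard normal, loc=0 and scale=1)
print(kstest(values, 'norm'))
# possibility 4: kstest vs. 'norm' with parameters
print(kstest(values, 'norm', (mean, std)))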
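To make the difference between the four possibilities concrete, here is a minimal, self-contained sketch. It does not use my real column: the `values` array is synthetic data drawn from a normal distribution with mean 5 and standard deviation 2 (an assumption made only for illustration), and the names `rng`, `frozen` and `res1` to `res4` are made up for this example.

```python
import numpy as np
from scipy.stats import norm, kstest, ks_2samp

# Assumption: synthetic stand-in for the column, drawn from N(5, 2)
rng = np.random.default_rng(0)
values = rng.normal(loc=5.0, scale=2.0, size=500)

mean, std = np.mean(values), np.std(values)
loc, scale = norm.fit(values)           # maximum-likelihood estimates
frozen = norm(loc=loc, scale=scale)     # frozen normal with the fitted parameters

# Possibility 1: two-sample test against a freshly drawn random normal sample
res1 = ks_2samp(values, rng.normal(mean, std, len(values)))
# Possibility 2: one-sample test against the CDF of the fitted normal
res2 = kstest(values, frozen.cdf)
# Possibility 3: one-sample test against the standard normal N(0, 1)
res3 = kstest(values, 'norm')
# Possibility 4: one-sample test against a normal with explicit parameters
res4 = kstest(values, 'norm', args=(mean, std))

for label, res in [("ks_2samp", res1), ("fitted cdf", res2),
                   ("'norm' default", res3), ("'norm' with args", res4)]:
    print(label, res.statistic, res.pvalue)
```

In this sketch possibilities 2 and 4 should give the same statistic, because norm.fit returns the sample mean and the (ddof=0) standard deviation. Possibility 3 compares against a standard normal, so it only makes sense for data that has already been standardized. Possibility 1 changes from run to run because the reference sample is random.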
Which of these approaches do you think is correct (also for other distributions, such as uniform)? Or what would be your preferred way to do it?
My general question: how should I determine the distribution of my column? By drawing a sample from a candidate distribution and running "ks_2samp", or by testing the column directly against a specific distribution with "kstest"?
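One way the "test the column directly" variant could look for several candidate distributions is sketched below. This is only an illustration under assumptions: the helper `rank_candidates`, the candidate list and the synthetic `sample` are all made up for this example, and ranking by the KS statistic is just one possible criterion, not an established recommendation.

```python
import numpy as np
from scipy import stats

def rank_candidates(values, candidates=("norm", "uniform", "expon")):
    """Fit each candidate to the data and rank by KS statistic (smaller is closer)."""
    results = []
    for name in candidates:
        dist = getattr(stats, name)          # look up the distribution by name
        params = dist.fit(values)            # fitted parameters, e.g. (loc, scale)
        statistic, pvalue = stats.kstest(values, name, args=params)
        results.append((name, statistic, pvalue))
    return sorted(results, key=lambda r: r[1])

# Assumption: synthetic data standing in for the real column
rng = np.random.default_rng(1)
sample = rng.normal(10.0, 3.0, size=1000)

ranking = rank_candidates(sample)
for name, statistic, pvalue in ranking:
    print(name, statistic, pvalue)
```

One caveat with this approach: the p-values from kstest assume the reference distribution was fixed in advance. When the parameters are estimated from the same data, as here, the p-values tend to be too large, so they are better read as a rough ranking than as a strict test.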