Python Forum
Python Kolmogorov Test - Printable Version


Python Kolmogorov Test - asahdkhaled - Nov-23-2017

Important: I also asked this question on Cross Validated, but they declined to answer because it did not seem mathematical enough (although I think it is). If anything about my question is unclear, feel free to ask and I will gladly give more details :)

Can someone tell me how to format code here? The T symbol isn't working for me.



I have a column of continuous values and want to find out which distribution describes it best, for example whether the column is normally distributed.
I see 4 different approaches, shown in the code:

Let's assume that output.values() holds the values I want to use for the KS test:

var, std, mean, length = np.var(output.values()), np.std(output.values()), np.mean(output.values()), len(output.values())
mini, maxi = min(output.values()), max(output.values())
a, b = (min(output.values()) - mean) / std, (max(output.values()) - mean) / std
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(output.values())
n = norm(loc=loc, scale=scale)
#possibility 1: ks_2samp
print ks_2samp(output.values(),norm2)
#possibility 2: kstest vs n.cdf()
print kstest(output.values(), n.cdf)
#possibility 3:kstest vs. 'norm'
print kstest(output.values(), 'norm')
#possibility 4: kstest vs. 'norm' with parameters
print kstest(output.values(),'norm', (mean,std))


Which of these approaches do you think is correct (also for other distributions, such as uniform)? Or what would be your preferred way to do it?


My general question: should I determine the distribution of my column by drawing a sample from a candidate distribution and running "ks_2samp", or by testing the column directly against a specific distribution with "kstest"?
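
To make the two options in the general question concrete, here is a minimal sketch. The array `values` is a hypothetical stand-in for output.values() (synthetic data, not from the thread), and the seeds are arbitrary. It contrasts a one-sample `kstest` against a fully specified normal CDF with a two-sample `ks_2samp` against random draws from that same normal.

```python
import numpy as np
from scipy.stats import ks_2samp, kstest

# Hypothetical stand-in for output.values(); not data from the thread.
rng = np.random.default_rng(0)
values = rng.normal(loc=20.0, scale=5.0, size=200)

mean, std = np.mean(values), np.std(values)

# One-sample test: compare the data with a fully specified normal CDF.
# The result is deterministic for a given data set.
one_sample = kstest(values, "norm", args=(mean, std))

# Two-sample test: compare the data with a random draw from that normal.
# Each draw gives its own result, so the test is repeated over several seeds.
two_sample_pvalues = [
    ks_2samp(values, np.random.default_rng(seed).normal(mean, std, len(values))).pvalue
    for seed in range(5)
]

print(one_sample)
print(two_sample_pvalues)
```

The two-sample route adds sampling noise from the simulated draw, while the one-sample route does not. Note also that when the parameters are estimated from the same data being tested, as here, the standard KS p-value tends to be too large, so it is best read as a rough indication only.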




RE: Python Kolmogorov Test - heiner55 - Nov-25-2017

It is difficult to help you, because we cannot test your program without sample data.


RE: Python Kolmogorov Test - asahdkhaled - Nov-25-2017

OK, here is the same code with example data :)


data= [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13, 14, 16, 25, 21]
var, std, mean, length = np.var(data), np.std(data), np.mean(data), len(data)
mini, maxi = min(data), max(data)
a, b = (min(data) - mean) / std, (max(data) - mean) / std
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(data)
n = norm(loc=loc, scale=scale)
#possibility 1: ks_2samp
print ks_2samp(data,norm2)
#possibility 2: kstest vs n.cdf()
print kstest(data, n.cdf)
#possibility 3:kstest vs. 'norm'
print kstest(data, 'norm')
#possibility 4: kstest vs. 'norm' with parameters
print kstest(data,'norm', (mean,std))



RE: Python Kolmogorov Test - Larz60+ - Nov-25-2017

loc, scale = n.fit(data)
n = norm(loc=loc, scale=scale)
No value is specified for n in the first step.


RE: Python Kolmogorov Test - asahdkhaled - Nov-25-2017

from scipy.stats import norm as n
import numpy as np
from scipy.stats import *


data= [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13, 14, 16, 25, 21]
var, std, mean, length = np.var(data), np.std(data), np.mean(data), len(data)
mini, maxi = min(data), max(data)
a, b = (min(data) - mean) / std, (max(data) - mean) / std
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(data)
n_array = norm(loc=loc, scale=scale)
#possibility 1: ks_2samp
print ks_2samp(data,norm2)
#possibility 2: kstest vs n.cdf()
print kstest(data, n_array.cdf)
#possibility 3:kstest vs. 'norm'
print kstest(data, 'norm')
#possibility 4: kstest vs. 'norm' with parameters
print kstest(data,'norm', (mean,std))
Ah yeah, sorry. The first n is just the norm object imported from scipy.stats.
The second n is just a variable; I renamed it to n_array.
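
The naming described above can be sketched as follows. This is only an illustration, assuming the same example data; the variable names `fitted` and `result` are made up here. It shows that the alias `n` and the `norm` name exported by scipy.stats refer to the same object, and it uses Python 3 `print()` calls.

```python
from scipy import stats
from scipy.stats import norm as n  # the alias used in the post above

# The alias and the name exported by scipy.stats are the same distribution object.
print(n is stats.norm)

data = [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14,
        15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13,
        14, 16, 25, 21]

# fit() estimates loc (mean) and scale (standard deviation) from the data.
loc, scale = stats.norm.fit(data)

# Calling norm(...) "freezes" the distribution with those parameters;
# its cdf method can then be passed to kstest.
fitted = stats.norm(loc=loc, scale=scale)
result = stats.kstest(data, fitted.cdf)
print(result)
```

Importing the names explicitly, rather than combining an alias with a star import, makes it easier to see where each name comes from when only part of a script is posted.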


RE: Python Kolmogorov Test - Larz60+ - Nov-25-2017

It still must be defined, and have an initial value


RE: Python Kolmogorov Test - asahdkhaled - Nov-25-2017

I don't get what you mean. The code works; n_array gets assigned the new norm distribution. My question is just how best to apply the KS test. There are 4 possibilities: which one is the best?


RE: Python Kolmogorov Test - Larz60+ - Nov-25-2017

It fails when I try to run on:
loc, scale = n.fit(data)
n is not defined (at least not in the snippet you provide) at this point.


RE: Python Kolmogorov Test - asahdkhaled - Nov-26-2017

OK, strange. For me it works like that. Which Python version do you have?


RE: Python Kolmogorov Test - heiner55 - Nov-26-2017

Your program runs fine on my PC:

#!/usr/bin/python3
from scipy.stats import norm as n
import numpy as np
from scipy.stats import *
 
 
data= [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13, 14, 16, 25, 21]
var, std, mean, length = np.var(data), np.std(data), np.mean(data), len(data)

mini, maxi = min(data), max(data)
a, b = (min(data) - mean) / std, (max(data) - mean) / std

uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(data)
n_array = norm(loc=loc, scale=scale)

#possibility 1: ks_2samp
print(ks_2samp(data,norm2))
#possibility 2: kstest vs n.cdf()
print(kstest(data, n_array.cdf))
#possibility 3:kstest vs. 'norm'
print(kstest(data, 'norm'))
#possibility 4: kstest vs. 'norm' with parameters
print(kstest(data,'norm', (mean,std)))
Output:
Ks_2sampResult(statistic=0.17500000000000004, pvalue=0.53129868234519217)
KstestResult(statistic=0.12434546981368133, pvalue=0.53568744474974306)
KstestResult(statistic=0.99999999999999933, pvalue=0.0)
KstestResult(statistic=0.12434546981368133, pvalue=0.53568744474974306)
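
Two patterns in this output can be checked with a short sketch (the variable names `fitted`, `with_args` and `standard` are illustrative only). Possibilities 2 and 4 give the same statistic because `norm.fit` returns the same estimates as `np.mean` and `np.std`. Possibility 3 gives a statistic near 1 because `'norm'` without parameters means the standard normal N(0, 1). Exact p-values may differ between SciPy versions, so only the statistics are compared.

```python
import numpy as np
from scipy.stats import kstest, norm

data = [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14,
        15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13,
        14, 16, 25, 21]

mean, std = np.mean(data), np.std(data)
loc, scale = norm.fit(data)

# norm.fit gives the maximum-likelihood estimates, which match np.mean and
# np.std (with the default ddof=0), so possibilities 2 and 4 coincide.
print(np.isclose(loc, mean), np.isclose(scale, std))
fitted = kstest(data, norm(loc=loc, scale=scale).cdf)
with_args = kstest(data, "norm", args=(mean, std))
print(fitted.statistic, with_args.statistic)

# Without parameters, 'norm' is the standard normal N(0, 1). Every value in
# data is at least 8, where that CDF is essentially 1, so the statistic is
# close to 1 and the p-value close to 0.
print(norm.cdf(min(data)))
standard = kstest(data, "norm")
print(standard)
```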