Python Kolmogorov Test

asahdkhaled · (This post was last modified: Nov-23-2017, 06:06 PM by asahdkhaled.)

Important: I also asked this question on Cross Validated, but they declined to answer because it did not seem mathematical (though I think it is). So if there is a problem with my question, feel free to ask and I will gladly give more details :)

Can someone tell me how to format the code here?

The T symbol isn't working for me.

I have a column with continuous values. I want to find out which distribution describes my column best, for example whether it is normally distributed.
I see 4 different approaches, shown in the code:

Let's assume that output.values() holds the values I want to use for the kstest:

var, std, mean, length = np.var(output.values()), np.std(output.values()), np.mean(output.values()), len(output.values())
mini, maxi = min(output.values()), max(output.values())
a, b = (min(output.values()) - mean) / std, (max(output.values()) - mean) / std
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(output.values())
n = norm(loc=loc, scale=scale)
#possibility 1: ks_2samp
print ks_2samp(output.values(),norm2)
#possibility 2: kstest vs n.cdf()
print kstest(output.values(), n.cdf)
#possibility 3:kstest vs. 'norm'
print kstest(output.values(), 'norm')
#possibility 4: kstest vs. 'norm' with parameters
print kstest(output.values(),'norm', (mean,std))

Which of these approaches do you think is correct (also for other distributions, such as uniform)? Or what is your preferred way to do it?

My general question: how should I determine the distribution of my column? By drawing a sample from a candidate distribution and comparing with "ks_2samp", or by testing the column directly against a specific distribution with "kstest"?
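A minimal sketch of the "test the column directly" route, assuming a hypothetical column (`sample`, generated here only for illustration) and two candidate distributions. Each candidate is fitted to the data and then passed to `kstest` with the fitted parameters. The names `candidates` and `results` are made up for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical column of continuous values (not the data from this thread).
rng = np.random.default_rng(0)
sample = rng.normal(loc=20.0, scale=7.0, size=200)

# Candidate distributions to compare; extend this dict as needed.
candidates = {"norm": stats.norm, "uniform": stats.uniform}

results = {}
for name, dist in candidates.items():
    # fit() returns the estimated parameters, here (loc, scale).
    params = dist.fit(sample)
    # One-sample KS test against the fitted distribution.
    results[name] = stats.kstest(sample, name, args=params)
    print(name, params, results[name])
```

A smaller KS statistic means the fitted candidate follows the empirical distribution more closely. Two caveats: the KS p-value assumes a fully specified distribution, so it tends to be too optimistic when the parameters are estimated from the same data, and comparing against a randomly drawn sample with `ks_2samp` adds sampling noise that changes from run to run.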


heiner55 · Nov-25-2017, 05:57 AM

It is difficult to help you, because we cannot test your program without sample data.

asahdkhaled · Nov-25-2017, 11:23 AM

OK, here is the same code with example data :)

data= [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13, 14, 16, 25, 21]
var, std, mean, length = np.var(data), np.std(data), np.mean(data), len(data)
mini, maxi = min(data), max(data)
a, b = (min(data) - mean) / std, (max(data) - mean) / std
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(data)
n = norm(loc=loc, scale=scale)
#possibility 1: ks_2samp
print ks_2samp(data,norm2)
#possibility 2: kstest vs n.cdf()
print kstest(data, n.cdf)
#possibility 3:kstest vs. 'norm'
print kstest(data, 'norm')
#possibility 4: kstest vs. 'norm' with parameters
print kstest(data,'norm', (mean,std))

**Larz60+** · Nov-25-2017, 12:23 PM

loc, scale = n.fit(data)
n = norm(loc=loc, scale=scale)

No value specified for n in first step

asahdkhaled · (This post was last modified: Nov-25-2017, 12:33 PM by asahdkhaled.)

from scipy.stats import norm as n
import numpy as np
from scipy.stats import *


data= [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13, 14, 16, 25, 21]
var, std, mean, length = np.var(data), np.std(data), np.mean(data), len(data)
mini, maxi = min(data), max(data)
a, b = (min(data) - mean) / std, (max(data) - mean) / std
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(data)
n_array = norm(loc=loc, scale=scale)
#possibility 1: ks_2samp
print ks_2samp(data,norm2)
#possibility 2: kstest vs n.cdf()
print kstest(data, n_array.cdf)
#possibility 3:kstest vs. 'norm'
print kstest(data, 'norm')
#possibility 4: kstest vs. 'norm' with parameters
print kstest(data,'norm', (mean,std))

Ah yes, sorry. The first n is just the norm distribution from scipy.
The second n is just a variable; I renamed it to n_array.
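A small sketch of the distinction described here, assuming only a few of the values from the post. `norm` is the distribution object from `scipy.stats`; calling it with `loc` and `scale` returns a "frozen" distribution with those parameters fixed. The name `frozen` is illustrative.

```python
from scipy.stats import norm

# First few values from the example data in this thread.
data = [36, 22, 24, 21, 22, 18, 14, 24]

# norm.fit estimates loc (mean) and scale (standard deviation).
loc, scale = norm.fit(data)

# A frozen distribution: the parameters are stored in the object.
frozen = norm(loc=loc, scale=scale)

# Both calls evaluate the same CDF at x = 20.
print(frozen.cdf(20.0), norm.cdf(20.0, loc=loc, scale=scale))
```

Importing `norm` under an alias and then rebinding that alias to a frozen distribution works only once; after the rebinding, the alias no longer has a `fit` method, so a separate name avoids the confusion.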

**Larz60+** · Nov-25-2017, 12:51 PM

It still must be defined, and have an initial value

asahdkhaled · Nov-25-2017, 05:16 PM

I don't get what you mean. The code is working; n_array gets filled by the new norm distribution. My question is just how best to apply the KS test. There are 4 possibilities; which one is the best?

**Larz60+** · Nov-25-2017, 07:03 PM

It fails when I try to run it, on this line:

loc, scale = n.fit(data)

n is not defined (at least not in the snippet you provide) at this point.

asahdkhaled · Nov-26-2017, 10:46 AM

OK, strange. For me it's working like that. What Python version do you have?

heiner55 · (This post was last modified: Nov-26-2017, 11:04 AM by heiner55.)

Your program runs well on my PC:

#!/usr/bin/python3
from scipy.stats import norm as n
import numpy as np
from scipy.stats import *
 
 
data= [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13, 14, 16, 25, 21]
var, std, mean, length = np.var(data), np.std(data), np.mean(data), len(data)

mini, maxi = min(data), max(data)
a, b = (min(data) - mean) / std, (max(data) - mean) / std

uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(data)
n_array = norm(loc=loc, scale=scale)

#possibility 1: ks_2samp
print(ks_2samp(data,norm2))
#possibility 2: kstest vs n.cdf()
print(kstest(data, n_array.cdf))
#possibility 3:kstest vs. 'norm'
print(kstest(data, 'norm'))
#possibility 4: kstest vs. 'norm' with parameters
print(kstest(data,'norm', (mean,std)))

Output:
Ks_2sampResult(statistic=0.17500000000000004, pvalue=0.53129868234519217)
KstestResult(statistic=0.12434546981368133, pvalue=0.53568744474974306)
KstestResult(statistic=0.99999999999999933, pvalue=0.0)
KstestResult(statistic=0.12434546981368133, pvalue=0.53568744474974306)
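A minimal sketch of how these results relate, assuming the same `data` list and a recent SciPy. Possibility 3 compares the raw values against the standard normal N(0, 1), which is why its statistic is close to 1. Possibilities 2 and 4 agree because `norm.fit` returns the maximum-likelihood estimates, which coincide with `np.mean` and `np.std` (with the default `ddof=0`). The variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm, kstest

data = [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24,
        14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21,
        15, 13, 14, 16, 25, 21]

# Maximum-likelihood fit: loc is the mean, scale is the ddof=0 std.
loc, scale = norm.fit(data)

# Possibility 2: frozen distribution's CDF as a callable.
res_frozen = kstest(data, norm(loc=loc, scale=scale).cdf)

# Possibility 4: distribution name plus explicit parameters.
res_args = kstest(data, "norm", args=(np.mean(data), np.std(data)))

# Possibility 3: raw data against N(0, 1); every value lies far in the
# right tail, so the statistic is almost 1.
res_default = kstest(data, "norm")

# Standardizing first makes the N(0, 1) comparison equivalent to 2 and 4.
z = (np.asarray(data) - loc) / scale
res_z = kstest(z, "norm")

print(res_frozen.statistic, res_args.statistic,
      res_default.statistic, res_z.statistic)
```

Possibility 1 depends on a random draw from `np.random.normal`, so its statistic and p-value change from run to run. Since `loc` and `scale` are estimated from the same data, the p-values for possibilities 2 and 4 are likely too optimistic; a test designed for estimated parameters, such as the Lilliefors test, is often suggested in that situation.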
