Python Forum
Python Kolmogorov Test
#1
Important: I also asked this question on Cross Validated, but they declined to answer because it didn't seem mathematical to them (I think it is). If anything about my question is unclear, feel free to ask and I will gladly give more details :)

Can someone tell me how to format code here? The T symbol isn't working for me.



I have a column with continuous values and want to find out which distribution describes it best, for example whether it is normally distributed.
As I see it, there are 4 different approaches, shown in the code:

Let's assume that output.values() holds the values I want to use for the KS test.

var, std, mean, length = np.var(output.values()), np.std(output.values()), np.mean(output.values()), len(output.values())
mini, maxi = min(output.values()), max(output.values())
a, b = (min(output.values()) - mean) / std, (max(output.values()) - mean) / std
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(output.values())
n = norm(loc=loc, scale=scale)
#possibility 1: ks_2samp
print ks_2samp(output.values(),norm2)
#possibility 2: kstest vs n.cdf()
print kstest(output.values(), n.cdf)
#possibility 3:kstest vs. 'norm'
print kstest(output.values(), 'norm')
#possibility 4: kstest vs. 'norm' with parameters
print kstest(output.values(),'norm', (mean,std))


Which of these approaches do you think is correct (also for other distributions, such as uniform)? Or what is your preferred way to do it?

My general question: how should I determine the distribution of my column? By drawing a random sample from a candidate distribution and comparing it with "ks_2samp", or by testing the column directly against a specific distribution with "kstest"?

Reply
#2
It is difficult to help you, because we cannot test your program without sample data.
Reply
#3
OK, here is the same code with example data :)


data= [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13, 14, 16, 25, 21]
var, std, mean, length = np.var(data), np.std(data), np.mean(data), len(data)
mini, maxi = min(data), max(data)
a, b = (min(data) - mean) / std, (max(data) - mean) / std
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(data)
n = norm(loc=loc, scale=scale)
#possibility 1: ks_2samp
print ks_2samp(data,norm2)
#possibility 2: kstest vs n.cdf()
print kstest(data, n.cdf)
#possibility 3:kstest vs. 'norm'
print kstest(data, 'norm')
#possibility 4: kstest vs. 'norm' with parameters
print kstest(data,'norm', (mean,std))
Reply
#4
loc, scale = n.fit(data)
n = norm(loc=loc, scale=scale)
No value is specified for n in the first step.
Reply
#5
from scipy.stats import norm as n
import numpy as np
from scipy.stats import *


data= [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13, 14, 16, 25, 21]
var, std, mean, length = np.var(data), np.std(data), np.mean(data), len(data)
mini, maxi = min(data), max(data)
a, b = (min(data) - mean) / std, (max(data) - mean) / std
uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(data)
n_array = norm(loc=loc, scale=scale)
#possibility 1: ks_2samp
print ks_2samp(data,norm2)
#possibility 2: kstest vs n.cdf()
print kstest(data, n_array.cdf)
#possibility 3:kstest vs. 'norm'
print kstest(data, 'norm')
#possibility 4: kstest vs. 'norm' with parameters
print kstest(data,'norm', (mean,std))
Ah yes, sorry. The first n is just norm from scipy.stats, imported under the name n.
The second n was just a variable. I renamed it to n_array.
Reply
#6
It still must be defined, and have an initial value
Reply
#7
I don't get what you mean. The code works: n_array is assigned the new normal distribution built from the fitted parameters. My question is just how best to apply the KS test. There are 4 possibilities; which one is the best?
Reply
#8
It fails when I try to run on:
loc, scale = n.fit(data)
n is not defined (at least not in the snippet you provide) at this point.
Reply
#9
OK, strange. For me it works like that. Which Python version do you have?
Reply
#10
Your program runs well on my PC:

#!/usr/bin/python3
from scipy.stats import norm as n
import numpy as np
from scipy.stats import *
 
 
data= [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14, 15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13, 14, 16, 25, 21]
var, std, mean, length = np.var(data), np.std(data), np.mean(data), len(data)

mini, maxi = min(data), max(data)
a, b = (min(data) - mean) / std, (max(data) - mean) / std

uniform, norm2 = np.random.uniform(mini, maxi, length), np.random.normal(mean, std, length)
loc, scale = n.fit(data)
n_array = norm(loc=loc, scale=scale)

#possibility 1: ks_2samp
print(ks_2samp(data,norm2))
#possibility 2: kstest vs n.cdf()
print(kstest(data, n_array.cdf))
#possibility 3:kstest vs. 'norm'
print(kstest(data, 'norm'))
#possibility 4: kstest vs. 'norm' with parameters
print(kstest(data,'norm', (mean,std)))
Output:
Ks_2sampResult(statistic=0.17500000000000004, pvalue=0.53129868234519217)
KstestResult(statistic=0.12434546981368133, pvalue=0.53568744474974306)
KstestResult(statistic=0.99999999999999933, pvalue=0.0)
KstestResult(statistic=0.12434546981368133, pvalue=0.53568744474974306)
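
A note on reading that output, as a minimal sketch rather than a definitive answer. The second and fourth results are identical because norm.fit returns the maximum-likelihood estimates, which for a normal distribution are the sample mean and the standard deviation with ddof=0, i.e. the same numbers np.mean and np.std give. The third result has a statistic of about 1 because kstest(data, 'norm') without parameters tests against the standard normal N(0, 1), and every value in the data is far above 0. The first result depends on the random draw in norm2, so it changes from run to run. The variable names below (fitted_cdf, with_args, standard) are my own, chosen for illustration:

```python
import numpy as np
from scipy.stats import kstest, norm

data = [36, 22, 24, 21, 22, 18, 14, 24, 28, 8, 22, 16, 16, 26, 17, 24, 24, 14,
        15, 24, 21, 20, 19, 17, 13, 13, 17, 30, 17, 11, 45, 15, 19, 21, 15, 13,
        14, 16, 25, 21]

# Maximum-likelihood fit of a normal distribution: loc is the sample mean,
# scale is the standard deviation computed with ddof=0.
loc, scale = norm.fit(data)
mean, std = np.mean(data), np.std(data)

# Possibility 2: pass the CDF of a frozen, fitted normal distribution.
fitted_cdf = kstest(data, norm(loc=loc, scale=scale).cdf)

# Possibility 4: name the distribution and pass its parameters via args.
with_args = kstest(data, "norm", args=(mean, std))

# Possibility 3: no parameters, so the reference is the standard normal N(0, 1).
standard = kstest(data, "norm")

print("fitted cdf :", fitted_cdf.statistic)
print("with args  :", with_args.statistic)
print("N(0, 1)    :", standard.statistic)
```

So possibilities 2 and 4 are two spellings of the same test, and possibility 3 only makes sense for data that is already standardized. One caveat: when the parameters are estimated from the same data that is being tested, the plain KS p-value tends to be too optimistic; the Lilliefors variant of the test is designed for that situation.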
Reply

