[prev in list] [next in list] [prev in thread] [next in thread] 

List:       jakarta-commons-dev
Subject:    [jira] [Comment Edited] (MATH-1179) kolmogorovSmirnovTest poor performance in monteCarloP method
From:       "Phil Steitz (JIRA)" <jira () apache ! org>
Date:       2015-06-30 0:01:05
Message-ID: JIRA.12761624.1418570258000.51982.1435622465423 () Atlassian ! JIRA
[Download RAW message or body]


    [ https://issues.apache.org/jira/browse/MATH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606534#comment-14606534 \
] 

Phil Steitz edited comment on MATH-1179 at 6/30/15 12:00 AM:
-------------------------------------------------------------

I did a little more research.  The 2-sample case is different because  the test \
statistic has a discrete distribution.  This makes exact computation possible, which \
we do for small m,n.  We use a naively computed Smirnov approximation for large n*m \
and Monte Carlo for "moderate" n*m by default.  There are two things we can do to \
improve this:

#  I can't find a freely available reference, but I am chasing down dead trees for a \
much better exact computation algorithm that I have seen referenced [1].  What the \
exactP code does now is simple and correct but ridiculously slow.  I conjecture that \
the code that I have not been able to figure out in R may be using the algorithm in \
[1] for small n,m.  By implementing a better exact algorithm, we can use it up to n * \
m <= 10000, which is the cut R uses.  That basically eliminates the need for the \
Monte Carlo stuff.  #  The sum computed in approximateP has known bad numerical \
properties and our code does not really do anything to correct for this.  The magic \
numbers in some of the code you reference above have to do with continuity \
corrections.  We need to research this a little more and apply the corrections to the \
computation of the sum.

[1] Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two \
sample Kolmogorov–Smirnov  criterion D_mn(m<=n) Selected  Tables  in  Mathematical  \
Statistics 80–129  American Mathematical Society


was (Author: psteitz):
I did a little more research.  The 2-sample case is different because  the test \
statistic has a discrete distribution.  This makes exact computation possible, which \
we do for small m,n.  We use a naively computed Smirnov approximation for large n*m \
and Monte Carlo for "moderate" n*m by default.  There are two things we can do to \
improve this:

 1.  I can't find a freely available reference, but I am chasing down dead trees for \
a much better exact computation algorithm that I have seen referenced [1].  What the \
exactP code does now is simple and correct but ridiculously slow.  I conjecture that \
the code that I have not been able to figure out in R may be using the algorithm in \
[1] for small n,m.  By implementing a better exact algorithm, we can use it up to n * \
m <= 10000, which is the cut R uses.  That basically eliminates the need for the \
Monte Carlo stuff.  2.  The sum computed in approximateP has known bad numerical \
properties and our code does not really do anything to correct for this.  The magic \
numbers in some of the code you reference above have to do with continuity \
corrections.  We need to research this a little more and apply the corrections to the \
computation of the sum.

[1] Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two \
sample Kolmogorov–Smirnov  criterion D_mn(m<=n) Selected  Tables  in  Mathematical  \
Statistics 80–129  American Mathematical Society

> kolmogorovSmirnovTest poor performance in monteCarloP method
> ------------------------------------------------------------
> 
> Key: MATH-1179
> URL: https://issues.apache.org/jira/browse/MATH-1179
> Project: Commons Math
> Issue Type: Bug
> Reporter: Gilad
> Fix For: 4.0
> 
> Attachments: KSTest-JavaAndR.txt, KSTestSnippet.txt
> 
> 
> I'm using the kolmogovSmirnovTest method to calculate pvalues.
> However, when i try running the test on two double[] of sizes 5 and 45 the results \
> take over 10 seconds to calculate. This seems very long, whereas in R it takes a \
> few miliseconds for the same calculation. I'd be very happy to hear any comment you \
> may have on the subject. Gilad



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic