[prev in list] [next in list] [prev in thread] [next in thread]
List: jakarta-commons-dev
Subject: [jira] [Comment Edited] (MATH-1179) kolmogorovSmirnovTest poor performance in monteCarloP method
From: "Phil Steitz (JIRA)" <jira () apache ! org>
Date: 2015-06-30 0:01:05
Message-ID: JIRA.12761624.1418570258000.51982.1435622465423 () Atlassian ! JIRA
[Download RAW message or body]
[ https://issues.apache.org/jira/browse/MATH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606534#comment-14606534 \
]
Phil Steitz edited comment on MATH-1179 at 6/30/15 12:00 AM:
-------------------------------------------------------------
I did a little more research. The 2-sample case is different because the test \
statistic has a discrete distribution. This makes exact computation possible, which \
we do for small m,n. We use a naively computed Smirnov approximation for large n*m \
and Monte Carlo for "moderate" n*m by default. There are two things we can do to \
improve this:
# I can't find a freely available reference, but I am chasing down dead trees for a \
much better exact computation algorithm that I have seen referenced [1]. What the \
exactP code does now is simple and correct but ridiculously slow. I conjecture that \
the code that I have not been able to figure out in R may be using the algorithm in \
[1] for small n,m. By implementing a better exact algorithm, we can use it up to n * \
m <= 10000, which is the cut R uses. That basically eliminates the need for the \
Monte Carlo stuff. # The sum computed in approximateP has known bad numerical \
properties and our code does not really do anything to correct for this. The magic \
numbers in some of the code you reference above have to do with continuity \
corrections. We need to research this a little more and apply the corrections to the \
computation of the sum.
[1] Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two \
sample Kolmogorov–Smirnov criterion D_mn(m<=n) Selected Tables in Mathematical \
Statistics 80–129 American Mathematical Society
was (Author: psteitz):
I did a little more research. The 2-sample case is different because the test \
statistic has a discrete distribution. This makes exact computation possible, which \
we do for small m,n. We use a naively computed Smirnov approximation for large n*m \
and Monte Carlo for "moderate" n*m by default. There are two things we can do to \
improve this:
1. I can't find a freely available reference, but I am chasing down dead trees for \
a much better exact computation algorithm that I have seen referenced [1]. What the \
exactP code does now is simple and correct but ridiculously slow. I conjecture that \
the code that I have not been able to figure out in R may be using the algorithm in \
[1] for small n,m. By implementing a better exact algorithm, we can use it up to n * \
m <= 10000, which is the cut R uses. That basically eliminates the need for the \
Monte Carlo stuff. 2. The sum computed in approximateP has known bad numerical \
properties and our code does not really do anything to correct for this. The magic \
numbers in some of the code you reference above have to do with continuity \
corrections. We need to research this a little more and apply the corrections to the \
computation of the sum.
[1] Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two \
sample Kolmogorov–Smirnov criterion D_mn(m<=n) Selected Tables in Mathematical \
Statistics 80–129 American Mathematical Society
> kolmogorovSmirnovTest poor performance in monteCarloP method
> ------------------------------------------------------------
>
> Key: MATH-1179
> URL: https://issues.apache.org/jira/browse/MATH-1179
> Project: Commons Math
> Issue Type: Bug
> Reporter: Gilad
> Fix For: 4.0
>
> Attachments: KSTest-JavaAndR.txt, KSTestSnippet.txt
>
>
> I'm using the kolmogovSmirnovTest method to calculate pvalues.
> However, when i try running the test on two double[] of sizes 5 and 45 the results \
> take over 10 seconds to calculate. This seems very long, whereas in R it takes a \
> few miliseconds for the same calculation. I'd be very happy to hear any comment you \
> may have on the subject. Gilad
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic