
List:       wekalist
Subject:    Re: [Wekalist] Re: the number of folds
From:       Harri Saarikoski <harri.saarikoski@gmail.com>
Date:       2010-12-12 5:40:05
Message-ID: AANLkTimDru0dpFo+iDUStRaBvudwSKQG53DXJr4_-V5V@mail.gmail.com



2010/12/12 Ken Bloom <kbloom@gmail.com>

> On Sat, 11 Dec 2010 14:57:31 -0800, MaryLee wrote:
> 
> > Hi,
> > 
> > I want to understand the relationship between the number of folds in
> > k-fold cross-validation and the accuracy of the decision tree algorithm.
> > I know exactly what cross-validation does: it divides the data set and
> > uses one of the partitions for testing and the rest for training the
> > algorithm. The problem is that I have two data sets, and when I increase
> > the number of folds, the accuracy increases on the first data set while
> > it decreases on the other. When I continue increasing the number of
> > folds, the accuracy still increases on the first data set while it still
> > decreases on the second.
> 
> 
A layman's explanation is that if we take learning machines to have an
s-shaped learning curve, then on the first set this learning curve has not
yet peaked, and the additional training instances (from increasing folds)
still contribute to the features->classes pattern that classifiers seek to
define. This is what usually happens.
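
For illustration, here is a minimal sketch of such a fold sweep using the
Weka Java API (the file name mydata.arff, the choice of J48 and the fixed
seed are assumptions for the example, not part of the point above):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FoldSweep {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);    // class = last attribute
        for (int k = 2; k <= 10; k++) {
            Evaluation eval = new Evaluation(data);
            // each training fold holds (k-1)/k of the data, so larger k
            // means more training instances behind every prediction
            eval.crossValidateModel(new J48(), data, k, new Random(1));
            System.out.printf("k=%2d  accuracy=%.2f%%%n", k, eval.pctCorrect());
        }
    }
}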

The second set is different if and only if increasing k at *several* steps
(2..n) *systematically* decreases the accuracy. If (or rather when) it
does, because it is not uncommon, we note the following: since evaluation
accuracy ultimately depends on the similarity of the individual instances
in the training vs test folds, *cv with fewer folds is exposed to more
randomness of sampling*. That is, if you run 2-fold cv (with 50% training
volume), there is a higher likelihood that those two folds *happen* to be
alike (-> higher accuracy) than that any 9 training folds of a 10-fold cv
are alike the 10th, test fold. For an extreme example, take 4 instances
that are 2 duplicates of the same 2 instances (a-orig, a-copy, b-orig,
b-copy). The likelihood that the duplicates are dealt into opposite folds
(a/b-orig in training, a/b-copy in test -> high accuracy) is in this case
as great as the likelihood that the a* instances form the training fold and
the b* instances the test fold (-> lower accuracy).

-> The likelihood of an accuracy decrease/increase as a function of k can
be calculated from the pairwise similarity of the instances sampled into
the training vs test folds.
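
As a rough sketch of how that pairwise similarity could be measured with
the Weka API (nearest-neighbour Euclidean distance is an assumed choice
here, and the code presumes a nominal class attribute is already set):

import java.util.Random;
import weka.core.EuclideanDistance;
import weka.core.Instances;

public class FoldSimilarity {
    // Mean distance from each test instance to its nearest training
    // instance; smaller values mean the folds are more alike.
    static double meanNearestDistance(Instances data, int k, long seed) throws Exception {
        Instances copy = new Instances(data);
        copy.randomize(new Random(seed)); // same sampling step as cv
        copy.stratify(k);                 // requires a nominal class
        EuclideanDistance dist = new EuclideanDistance(copy);
        double total = 0;
        int count = 0;
        for (int fold = 0; fold < k; fold++) {
            Instances train = copy.trainCV(k, fold);
            Instances test = copy.testCV(k, fold);
            for (int i = 0; i < test.numInstances(); i++) {
                double nearest = Double.MAX_VALUE;
                for (int j = 0; j < train.numInstances(); j++) {
                    nearest = Math.min(nearest,
                        dist.distance(test.instance(i), train.instance(j)));
                }
                total += nearest;
                count++;
            }
        }
        return total / count;
    }
}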

The size of the sets (as with folds) matters greatly to the settling /
jittering of accuracy: the smaller the set, the more likely the accuracy is
to jitter up and down (and that in itself is nothing to be concerned
about). The laws of probability say that as the amount of training data
grows towards infinity, each classifier's accuracy approaches some
asymptotic level (see e.g. the Wikipedia entry on 'Bayes optimal
classifier'). If the evaluation reliability of various k is in doubt,
higher k is obviously preferred, since its training folds are closer in
size to the full training set (whose overfitting tendency against unseen
test instances is the true test for overfit).

-> Multiple k-fold iterations (varying the cv sampling seed: the -S
parameter of the StratifiedRemoveFolds filter, or of cv on the Weka command
line) could be run to confirm that the average accuracy across those
iterations begins to follow the 'normal' tendency of accuracy increasing
with k.
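
A minimal sketch of that repeated cv with the Weka Java API (equivalent in
spirit to varying -S; J48 and the seed range 1..runs are assumptions). The
standard deviation across seeds also quantifies the jitter discussed above:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class RepeatedCV {
    static void report(Instances data, int k, int runs) throws Exception {
        double[] acc = new double[runs];
        for (int seed = 1; seed <= runs; seed++) {
            Evaluation eval = new Evaluation(data);
            // a new Random seed changes the cv sampling, like -S
            eval.crossValidateModel(new J48(), data, k, new Random(seed));
            acc[seed - 1] = eval.pctCorrect();
        }
        double mean = 0, var = 0;
        for (double a : acc) mean += a;
        mean /= runs;
        for (double a : acc) var += (a - mean) * (a - mean);
        System.out.printf("k=%d  mean=%.2f%%  stddev=%.2f%n",
            k, mean, Math.sqrt(var / runs));
    }
}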

best, Harri

ps. The classifier selection (J48 -> something else) suggested here is in
my view trivial in a case where volume and sampling define how high the
accuracy is and how much it varies.


> The first data set is normal -- it's what you hope will happen. With more
> data, the learning algorithm is learning more *useful* information about
> the data set. To deploy this algorithm in the real world, you should try
> to have a large enough data set that the change in accuracy isn't very
> dramatic.
> 
> The second data set is an example of overfitting. With more data, the
> learning algorithm is learning more information about the training set,
> but that information is *useless* because it doesn't reflect what's going
> on in the testing set. You need to rethink what you're doing with the
> second data set, either by picking a different learning algorithm that's
> more resistant to overfitting, by taking a logical look at what features
> you're using to learn and determining which ones might be superfluous
> (and causing the overfitting), or by applying automatic feature selection.
> 
> 
> 
> --
> Chanoch (Ken) Bloom. PhD candidate. Linguistic Cognition Laboratory.
> Department of Computer Science. Illinois Institute of Technology.
> http://www.iit.edu/~kbloom1/
> 
> 
> 
> _______________________________________________
> Wekalist mailing list
> Send posts to: Wekalist@list.scms.waikato.ac.nz
> List info and subscription status:
> https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
> List etiquette:
> http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
>  
> 
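
For reference, the automatic feature selection Ken suggests could be
sketched with Weka's AttributeSelectedClassifier, which performs the
selection inside each training fold rather than on the full set (the
CfsSubsetEval evaluator and GreedyStepwise search are assumed choices):

import java.util.Random;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class FeatureSelectionCV {
    static double accuracyWithSelection(Instances data, int k) throws Exception {
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new CfsSubsetEval()); // correlation-based subset evaluator
        asc.setSearch(new GreedyStepwise());   // simple greedy attribute search
        asc.setClassifier(new J48());          // the base decision tree
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, k, new Random(1));
        return eval.pctCorrect();
    }
}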


-- 
-----------------
Harri M.T. Saarikoski
CEO, IdealX Corporation
Espoo, Finland
www.idealpredictions.com (English)
www.idealpredictions.com/fi (Suomi)



_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@list.scms.waikato.ac.nz
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

