List:       wekalist
Subject:    Re: [Wekalist] avoiding random sampling in Cross-Validation
From:       Eibe Frank <eibe () waikato ! ac ! nz>
Date:       2019-05-17 6:06:34
Message-ID: 83D134BA-89A7-43CD-A6D5-D9DFF33876CF () waikato ! ac ! nz

The percentage split evaluation option (or a separately defined test set) is your
only straightforward option. Under "More options…" in the Classify panel of the
Explorer, you can tick "Preserve order for % Split". This will prevent shuffling of
the data.
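
As a rough illustration of what an order-preserving percentage split amounts to,
here is a minimal sketch in plain Java. It does not call the Weka API: the class
name OrderedSplit, its method names and the rounding rule are assumptions made
for illustration only.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Weka: splits row indices without shuffling.
class OrderedSplit {

    // Number of leading rows that go into the training set.
    // Rounding to the nearest row is an assumption of this sketch.
    static int trainSize(int numRows, double trainPercent) {
        if (numRows < 0 || trainPercent < 0.0 || trainPercent > 100.0) {
            throw new IllegalArgumentException("invalid row count or percentage");
        }
        return (int) Math.round(numRows * trainPercent / 100.0);
    }

    // Returns {trainIndices, testIndices}; the original row order is kept,
    // so every test row comes after every training row.
    static int[][] split(int numRows, double trainPercent) {
        int cut = trainSize(numRows, trainPercent);
        int[] train = new int[cut];
        int[] test = new int[numRows - cut];
        for (int i = 0; i < cut; i++) {
            train[i] = i;
        }
        for (int i = cut; i < numRows; i++) {
            test[i - cut] = i;
        }
        return new int[][] {train, test};
    }

    public static void main(String[] args) {
        int[][] parts = split(10, 66.0);
        System.out.println("train rows: " + Arrays.toString(parts[0]));
        System.out.println("test rows:  " + Arrays.toString(parts[1]));
    }
}
```

Because the rows are never shuffled, the test set is always the chronologically
last part of the file, provided the file itself is sorted by time.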

Normally k-fold cross-validation is problematic in this kind of scenario even if you
don't shuffle, because it's generally not informative to evaluate how well you predict
the past from the future.
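
A scheme that only ever predicts forward in time is walk-forward evaluation with
an expanding training window. The sketch below is plain Java and not a Weka
feature: WalkForward, folds and the gap parameter are hypothetical names. The gap
drops the rows just before each test block so that windowed features do not
straddle the train/test boundary; the value 29 used in main assumes one row per
minute and the 30-minute lookback from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Weka: contiguous, forward-only folds.
class WalkForward {

    // Splits the rows into numFolds + 1 contiguous blocks. Fold k tests on
    // block k and trains on all rows before it, minus `gap` rows at the end.
    // Each fold is {trainStart, trainEnd, testStart, testEnd}, ends exclusive.
    static List<int[]> folds(int numRows, int numFolds, int gap) {
        if (numFolds < 1 || gap < 0 || numRows < numFolds + 1) {
            throw new IllegalArgumentException("invalid fold configuration");
        }
        List<int[]> result = new ArrayList<>();
        int blocks = numFolds + 1;
        for (int k = 1; k <= numFolds; k++) {
            int testStart = (int) ((long) k * numRows / blocks);
            int testEnd = (int) ((long) (k + 1) * numRows / blocks);
            int trainEnd = testStart - gap;
            if (trainEnd <= 0) {
                continue; // the gap swallowed all training rows; skip this fold
            }
            result.add(new int[] {0, trainEnd, testStart, testEnd});
        }
        return result;
    }

    public static void main(String[] args) {
        // 300 rows, 4 folds, 29 rows dropped before each test block.
        for (int[] fold : folds(300, 4, 29)) {
            System.out.println(Arrays.toString(fold));
        }
    }
}
```

Each index range could then be turned into a training file and a separately
defined test set, and the per-fold results averaged by hand.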

Cheers,
Eibe

> On 16/05/2019, at 7:18 AM, Jonathan Shore <jonathan.shore@gmail.com> wrote:
> 
> The features in my data sets are derived from windows on time series. A problem
> presented by K-fold cross-validation implementations is that they often determine
> the K folds as random samples from the data set. This presents a problem for
> time-series-derived features, in that one may use features in training that overlap
> with the lookback period for features in testing. ML models tend to exploit these
> overlaps, leading to a model that does not generalize.
> For example, feature "A" is derived from a prior 30min window on a sequence of 1min
> events. Assuming we produced features and labels across all minutes in our
> time series (each with a 30min lookback), for a given data row there would be 29
> other rows that shared part of the time window with the data row in question. With
> random sampling, the training set may contain some of those 30 rows and the test set
> the rest. The information overlap leads to overfitting.
> Long story short, is there a way to override or configure the cross-validation in
> Weka to use contiguous regions in each of the folds? While there may be a bit of
> overlap between folds at the boundaries, it is largely mitigated by avoiding random
> sampling.
> Thanks
> --
> Jonathan Shore
> 
> _______________________________________________
> Wekalist mailing list
> Send posts to: Wekalist@list.waikato.ac.nz
> To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
> List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

