
List:       wekalist
Subject:    Re: [Wekalist] Filter + NB != multiclassifer(filter + NB) = NB // car.arff
From:       Eibe Frank <eibe@waikato.ac.nz>
Date:       2018-05-29 2:53:49
Message-ID: 325835CF-C7BA-48E2-909D-82FBECB2EE2E@waikato.ac.nz

> I have a dataset (car.arff) with 7 attributes. For this dataset,
> NaiveBayes gives Correctly Classified Instances of 85.5324 %.
> 
> 1) To this dataset I apply the filter (CFS):
> Preprocess -> Filter -> filters -> supervised -> attribute ->
> AttributeSelection (CfsSubsetEval + BestFirst). As a result I get 2
> attributes (safety and class). Then I apply NaiveBayes to this subset of
> attributes and get Correctly Classified Instances of 70.0231 %. I
> confirmed this result using my own implementation of CFS and NaiveBayes.
> 
> 2) To the original dataset I apply the meta classifier:
> Classify -> classifiers -> meta -> AttributeSelectedClassifier
> (NaiveBayes, CfsSubsetEval + BestFirst). As a result I get the following:
> 
> - the number of selected attributes - the same as for the filter (OK)
> - the classifier model - the same as for filter + NaiveBayes (OK)
> - Correctly Classified Instances = 85.5324 % (???)
> - Confusion Matrix - the same as for NaiveBayes (???)
> - the rest of statistics - the same as for NaiveBayes (???)
> 
> Why do these two approaches to feature selection differ so much? (Or why
> does the meta classifier select features while the Correctly Classified
> Instances, Confusion Matrix, and the rest of the statistics are the same
> as for NaiveBayes without feature selection?)
> 
> 
> Weka-3-8-2.exe (the same problem occurs with Weka-3-6-x).
> car.arff added as an attachment.

How did you estimate performance? The model that WEKA shows in the Classify panel is
the model built from the full dataset as loaded into the Preprocess panel. That is
why the classifier model is the same.
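
If it helps to see this outside the GUI, here is a minimal sketch using the Weka Java
API (the class name FullDataModel is made up, and car.arff is assumed to be in the
working directory) that reproduces the model printed in the Classify panel:

    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.meta.AttributeSelectedClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FullDataModel {
      public static void main(String[] args) throws Exception {
        // Load the data exactly as the Preprocess panel does.
        Instances data = DataSource.read("car.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // CFS + BestFirst wrapped around NaiveBayes, as in your GUI setup.
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new CfsSubsetEval());
        asc.setSearch(new BestFirst());
        asc.setClassifier(new NaiveBayes());

        // The model built from the full dataset is what the Classify
        // panel prints, regardless of the evaluation option chosen.
        asc.buildClassifier(data);
        System.out.println(asc);
      }
    }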

If you use cross-validation or percentage-split evaluation, the models that are
actually evaluated are built from subsets of the full dataset. Perhaps CFS
decided to use all attributes on those subsets.
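
Continuing inside the same main method as in the sketch above (fold count and random
seed are arbitrary), those per-fold models are built and discarded internally by the
Evaluation class; only the aggregated statistics are returned:

    // 10-fold CV builds ten models on ten different training splits;
    // none of them is the full-data model printed above.
    weka.classifiers.Evaluation eval = new weka.classifiers.Evaluation(data);
    eval.crossValidateModel(asc, data, 10, new java.util.Random(1));
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toMatrixString());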

In WEKA 3.8.2/3.9.2 there is an option available under "More options…" that allows
you to print out all the models built from the different training splits during the
evaluation process.
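
Outside the GUI, a rough equivalent is to rebuild the classifier on each training
fold yourself and print it, which shows directly which attributes CFS selects on
each split (again inside the same main method; 10 folds and seed 1 are just examples):

    java.util.Random rand = new java.util.Random(1);
    Instances randData = new Instances(data);
    randData.randomize(rand);
    randData.stratify(10);
    for (int i = 0; i < 10; i++) {
      Instances train = randData.trainCV(10, i, rand);
      AttributeSelectedClassifier foldModel = new AttributeSelectedClassifier();
      foldModel.setEvaluator(new CfsSubsetEval());
      foldModel.setSearch(new BestFirst());
      foldModel.setClassifier(new NaiveBayes());
      foldModel.buildClassifier(train);
      System.out.println("=== Training split " + (i + 1) + " ===");
      System.out.println(foldModel);
    }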

Note that applying supervised feature selection to the full dataset and then running
a cross-validation (or similar) on the reduced data is not an appropriate method for
obtaining performance estimates. Feature selection needs to be evaluated as part of
the learning process. Otherwise, you can get optimistic performance estimates
(although, in your case, the performance estimate seems to be very pessimistic).
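
As a sketch of the contrast (same main method; fully qualified names used so no
extra imports are needed): selecting attributes on the full data first and then
cross-validating the reduced set is the biased protocol, while cross-validating the
AttributeSelectedClassifier keeps selection inside each training fold:

    // Biased protocol: CFS sees the full dataset, including every later
    // test fold, before the cross-validation starts.
    weka.filters.supervised.attribute.AttributeSelection sel =
        new weka.filters.supervised.attribute.AttributeSelection();
    sel.setEvaluator(new CfsSubsetEval());
    sel.setSearch(new BestFirst());
    sel.setInputFormat(data);
    Instances reduced = weka.filters.Filter.useFilter(data, sel);

    weka.classifiers.Evaluation biased = new weka.classifiers.Evaluation(reduced);
    biased.crossValidateModel(new NaiveBayes(), reduced, 10, new java.util.Random(1));

    // Proper protocol: selection is repeated inside every training fold.
    weka.classifiers.Evaluation proper = new weka.classifiers.Evaluation(data);
    proper.crossValidateModel(asc, data, 10, new java.util.Random(1));

    System.out.println("Selection outside CV: " + biased.pctCorrect() + " %");
    System.out.println("Selection inside CV:  " + proper.pctCorrect() + " %");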

Cheers,
Eibe
_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@list.waikato.ac.nz
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

