
List:       wekalist
Subject:    Re: [Wekalist] Has anyone converted the Reuters - 21578 to the ARFF
From:       Peter Reutemann <fracpete () waikato ! ac ! nz>
Date:       2006-12-19 21:11:34
Message-ID: 45885586.6010900 () waikato ! ac ! nz

> I am using WEKA for data mining in my Master's thesis research. I want
> to add an ontology to the classifier. My datasets are Reuters-21578, 20 Newsgroups and
> SemCor 2.0. But I have some difficulties with the ARFF format. Actually, I have no
> idea how to convert the three datasets to their corresponding ARFF files.

Normally, one has a class attribute and a string attribute that contains
the complete text of an article/text file. E.g., the Reuters-21578
dataset could look like this ("..." denotes omissions):

@relation reuters21578

@attribute contents string
@attribute class {acq,corn,crude,earn,grain,...}

@data
'SHAD FAVORS SHORTENING DISCLOSURE ...',acq
'FIRST MICHIGAN BANCORP <FMBC.O> 3RD QTR ...',earn
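
A file in the layout shown above can be generated from raw texts with a few lines of plain Java. The following is only a minimal sketch, not part of Weka: the class name `TextToArff` and its methods are made up for illustration, it assumes each document is already available as a (text, label) pair, and it assumes the labels are simple tokens without spaces or commas. The backslash escapes are the ones Weka's ARFF reader should undo when it loads a quoted string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class TextToArff {

    /** Wraps a value in single quotes and backslash-escapes characters that would break the row. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('\'').toString();
    }

    /** Builds the ARFF text from parallel lists of document texts and class labels. */
    static String toArff(String relation, List<String> texts, List<String> labels) {
        if (texts.size() != labels.size()) {
            throw new IllegalArgumentException("texts and labels differ in length");
        }
        // The nominal class attribute lists every distinct label once (sorted).
        TreeSet<String> classes = new TreeSet<>(labels);
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute contents string\n");
        sb.append("@attribute class {").append(String.join(",", classes)).append("}\n\n");
        sb.append("@data\n");
        for (int i = 0; i < texts.size(); i++) {
            sb.append(quote(texts.get(i))).append(',').append(labels.get(i)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input; in practice the texts would be read from the dataset files.
        List<String> texts = Arrays.asList("SHAD FAVORS SHORTENING DISCLOSURE", "IT'S A 3RD QTR REPORT");
        List<String> labels = Arrays.asList("acq", "earn");
        System.out.print(toArff("demo", texts, labels));
    }
}
```

Reading the individual articles and their labels out of the original dataset files is a separate step and depends on each dataset's format.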

One then uses the StringToWordVector filter to turn the string attribute
into word counts (see the Javadoc of this filter for more information),
since most classifiers can't process string attributes directly.
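
To make the idea of "word counts" concrete, here is a simplified illustration in plain Java of what such a conversion produces for a single document. It is not Weka's implementation: the class `WordCounts` and its tokenization rule (lower-casing, splitting on anything that is not a letter or digit) are assumptions made for this sketch, and the real filter (weka.filters.unsupervised.attribute.StringToWordVector) has its own tokenizer and options. As far as I know, the filter outputs word presence by default and needs its -C option to output actual counts, so check the Javadoc for the version in use.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordCounts {

    /** Lower-cases the text, splits it on runs of non-alphanumeric characters, and counts each token. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a delimiter
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each distinct word would become one numeric attribute in the filtered dataset.
        System.out.println(count("FIRST MICHIGAN BANCORP 3RD QTR NET, first quarter"));
    }
}
```

In the filtered dataset, every distinct word across all documents becomes its own numeric attribute, and each document becomes one (mostly sparse) row of such values.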

HTH

Cheers, Peter
-- 
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174


