
List:       abiword-dev
Subject:    recent OTS changes
From:       Nadav Rotem <nadavrotem () mail ! ru>
Date:       2003-07-20 17:19:46

In the past few days I made quite a few changes to the Open Text
Summarizer (OTS); the most noticeable one is the stemming support.

http://www.google.com/search?q=stemming

Stemming is the ability to take a word such as "running" and trace it
back to its original form, the word "run". We use this feature to
group together all of the derivatives of a certain word. Once these
words are grouped together, OTS will know that the subject of the
article is the single word "run" and not ("running", "run", "runnable"
...). Now, when we encounter a sentence that talks about some "runner",
we will know that it is related to the topic of the article.

Previous versions could not do this because "running" != "run".
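
To make the grouping idea concrete, here is a minimal sketch in Python.
It is not OTS code: the three suffix rules, the word list and the
function names are all made up for illustration. It only shows how the
count for "run" changes once the derivatives are folded into one stem.

```python
from collections import Counter

# Hypothetical (suffix, replacement) rules, chosen only so that the
# "run" family from the text collapses; these are NOT the OTS rules.
RULES = [("nning", "n"), ("nner", "n"), ("nnable", "n")]

def stem(word):
    """Rewrite the first matching suffix; return the word unchanged otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def count_terms(words, use_stemming):
    """Count occurrences, optionally keyed by stem instead of surface form."""
    return Counter(stem(w) if use_stemming else w for w in words)

words = ["run", "running", "runner", "runnable", "dog"]
print(count_terms(words, use_stemming=False)["run"])  # exact-match count
print(count_terms(words, use_stemming=True)["run"])   # count after grouping
```

Without stemming every form is counted separately, so "run" looks no
more important than "dog"; with stemming the four forms share one key.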

The stemming support is now integrated into OTS. To allow stemming I had
to create a new file format that holds the stemming rules. The format
is XML based; I used libxml2 (and not expat) for it.

File example:

<?xml version="1.0"?>
<dictionary lang="english">
  <stemmer>
    <post>
        <word>sses:s</word>
        <word>ies:i</word>
...
...
...
In this file you can see that in the stemmer part I defined a postfix
rule: for every word that ends with "sses", that ending is replaced
with a single "s". This file also holds the contents of the previous
.dic file under <dictionary>.

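As a rough sketch of how such postfix rules could be read and applied,
here is a Python version using the standard library XML parser (OTS
itself uses libxml2). The embedded file contains only the two rules
shown above; the closing tags, the function names and the
"first matching rule wins" behaviour are my assumptions, not the actual
OTS implementation.

```python
import xml.etree.ElementTree as ET

# Only the two <word> rules come from the example; the closing tags are
# assumed so that the fragment parses as a complete document.
XML = """<?xml version="1.0"?>
<dictionary lang="english">
  <stemmer>
    <post>
        <word>sses:s</word>
        <word>ies:i</word>
    </post>
  </stemmer>
</dictionary>
"""

def load_post_rules(xml_text):
    """Return the postfix rules as a list of (suffix, replacement) pairs."""
    root = ET.fromstring(xml_text)
    rules = []
    for node in root.findall("./stemmer/post/word"):
        suffix, _, replacement = node.text.strip().partition(":")
        rules.append((suffix, replacement))
    return rules

def apply_post_rules(word, rules):
    """Replace the suffix named by the first matching rule (assumed order)."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

rules = load_post_rules(XML)
for word in ("caresses", "ponies", "cat"):
    print(word, "->", apply_post_rules(word, rules))
```

Note that the result of a rule is a stem used for grouping, so it does
not have to be a dictionary word.
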
The rules are not easy to define. Try to come up with a rule that will
trim the word "dogs" but won't touch the word "was". One way is to add
a rule <>was:was<> that maps "was" to itself.
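
The "dogs" versus "was" problem can be sketched like this in Python.
The rule "s:" (strip a trailing "s") is hypothetical, and so is the
assumption that rules are tried in order and the first match wins; the
point is only that an identity rule placed first shields the exception.

```python
def parse_rule(text):
    """Split a 'suffix:replacement' string into a (suffix, replacement) pair."""
    suffix, _, replacement = text.partition(":")
    return suffix, replacement

def stem(word, rules):
    """Apply the first rule whose suffix matches; later rules are skipped."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

# Hypothetical rule sets: "s:" is not taken from the OTS dictionary.
naive = [parse_rule("s:")]
guarded = [parse_rule("was:was"), parse_rule("s:")]

for word in ("dogs", "was"):
    print(word, "naive:", stem(word, naive), "guarded:", stem(word, guarded))
```

In this sketch the guard matches suffixes only, so any other word that
ends in "was" is left untouched as well.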


I made LOTS of changes to the code and I am afraid to run valgrind on
it :-)

So please do check it and start porting your old .dic files to the new
format.





