'Tutorial: writing stemming rules in Open Text Summarizer'


List:       abiword-dev
Subject:    Tutorial: writing stemming rules in Open Text Summarizer
From:       Nadav Rotem <nadavrotem () mail ! ru>
Date:       2003-07-24 5:27:48

In the past few weeks, "stemming" support was added to the Open Text
Summarizer.
Stemming is the ability to take a word such as "running" and trace it
back to its base form, "run". We use this feature to group together all
of the derivatives of a certain stem. For OTS, keywords equal ideas; we
need to group words together to recognize that "I ran" and "I am
running" express similar ideas.

The stemming process is governed by a set of rules. At the moment there
are two main rule groups: prefix and postfix. Each rule is defined as
["replace this" : "with that"], a pair of strings separated by a colon.
The <postfix> rules try to match the end of the word, while the <prefix>
rules try to match the beginning. The program tries each of the rules,
from top to bottom, until one matches. It applies ONLY ONE rule from
each group.
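
The matching logic described above can be sketched in a few lines of
Python. This is only an illustration, not the OTS source: the function
names, the "un" prefix rule, and the choice to run the prefix group
before the postfix group are all assumptions made for the example.

```python
# Minimal sketch of the rule matching described above (not the OTS source).
# Rules are (match, replacement) pairs, tried top to bottom.

def apply_first_match(word, rules, at_end):
    """Apply the first rule in the group that matches, then stop."""
    for match, replacement in rules:
        if at_end and word.endswith(match):
            return word[:len(word) - len(match)] + replacement
        if not at_end and word.startswith(match):
            return replacement + word[len(match):]
    return word  # no rule in this group matched


def stem(word, prefix_rules, postfix_rules):
    # Assumption: the prefix group runs before the postfix group.
    word = apply_first_match(word, prefix_rules, at_end=False)
    return apply_first_match(word, postfix_rules, at_end=True)


prefix_rules = [("un", "")]                    # hypothetical rule
postfix_rules = [("sses", "s"), ("ing", "")]

print(stem("undoing", prefix_rules, postfix_rules))  # do
print(stem("glasses", prefix_rules, postfix_rules))  # glas
```

Because each group stops at its first match, the order of the rules
inside a group decides the result.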

The stem rules are defined in en.xml (or the XML file named after any
other language code). They look like this:

<prefix>
	<rule>replaceThis:withThis</rule>
</prefix> 
<postfix>
	<rule>sses:s</rule>
	<rule>ing:</rule>
	<rule>went:go</rule>
</postfix> 

In this example file the program will replace "sses" at the end of a
word with "s", remove "ing" from the end of a word, and replace the
word "went" with "go" (as a postfix rule, this also applies to any word
ending in "went").

With these rules the program can tell that:

stem("went") == stem("going") == stem("go") == "go"
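
As a rough check of the equality above, the example rules can be loaded
and applied with a short Python script. The <rules> wrapper element and
the helper names are assumptions made so the fragment parses on its own;
the real layout of en.xml may differ.

```python
import xml.etree.ElementTree as ET

# Assumption: the <rules> wrapper exists only so this fragment parses;
# the real layout of en.xml is not shown in this message.
FRAGMENT = """
<rules>
  <prefix>
    <rule>replaceThis:withThis</rule>
  </prefix>
  <postfix>
    <rule>sses:s</rule>
    <rule>ing:</rule>
    <rule>went:go</rule>
  </postfix>
</rules>
"""


def load_rules(xml_text, group):
    """Return the (match, replacement) pairs of one group, in file order."""
    root = ET.fromstring(xml_text.strip())
    rules = []
    for node in root.findall(f"./{group}/rule"):
        # Split on the first colon; "ing:" gives an empty replacement.
        match, _, replacement = (node.text or "").partition(":")
        rules.append((match, replacement))
    return rules


def stem_postfix(word, rules):
    """Apply only the first postfix rule that matches the end of the word."""
    for match, replacement in rules:
        if match and word.endswith(match):
            return word[:-len(match)] + replacement
    return word


postfix = load_rules(FRAGMENT, "postfix")
print([stem_postfix(w, postfix) for w in ("went", "going", "go")])
# ['go', 'go', 'go']
```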

As Alan said, "There are some grammar rules for this but because English
is such a bastard language they can be quite unreliable."

For example, we can't automatically drop the "s" at the end of a word
to remove the plural: first, the word might end with "es", and second,
it may be a word such as "was". One trick would be to place the "es"
rule before the "s" and "e" rules, and perhaps to start with a list of
words that break our algorithm.
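
Both tricks are easy to try out in Python. In this sketch the exception
list is a hypothetical mechanism (the rule format shown earlier has no
such element), and the rule lists are made up for the illustration.

```python
# Hypothetical exception list, checked before any rule is tried.
EXCEPTIONS = {"was", "is", "this"}

# "es" sits above "s", so the longer ending is tried first.
GOOD_ORDER = [("es", ""), ("s", "")]
BAD_ORDER = [("s", ""), ("es", "")]


def stem(word, rules, exceptions=frozenset()):
    """Return the word unchanged if it is an exception, else apply one rule."""
    if word in exceptions:
        return word
    for match, replacement in rules:
        if word.endswith(match):
            return word[:len(word) - len(match)] + replacement
    return word


print(stem("boxes", GOOD_ORDER, EXCEPTIONS))  # box
print(stem("boxes", BAD_ORDER, EXCEPTIONS))   # boxe
print(stem("cats", GOOD_ORDER, EXCEPTIONS))   # cat
print(stem("was", GOOD_ORDER, EXCEPTIONS))    # was
print(stem("was", GOOD_ORDER))                # wa
```

Even in the good order, the "es" rule is too greedy for some words
("houses" becomes "hous"), which is where the exception list earns its
place.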

You can go wild with the list because the cost stays O(N), where N is
the number of words in the article. We already have O(N^2) in some
other place.

In order to fully support the 24+ languages that OTS supports,
we need to define the rules for each language to make this connection.
I know that for many languages this feature is critical (Russian, for example).


Shalom,
Nadav

