List: kinosearch
Subject: [KinoSearch] Write a custom analyzer/tokenizer
From: gslin () gslin ! org (Gea-Suan Lin)
Date: 2007-07-28 3:35:16
Message-ID: 20070728102532.GA92380 () gslin ! org
Hello all,
I want to write a custom analyzer/tokenizer for CJK UTF-8 strings in
KinoSearch, but I don't know how.
In fact, I have already written one for Plucene:
http://search.cpan.org/dist/Plucene-Analysis-UTF8/
http://code.google.com/p/plucene-analysis-utf8/
The algorithm is very simple. When a string has the UTF-8 flag on, we can
use a regular expression to extract tokens from it, and then generate a
list of unigrams and bigrams:
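my %tok;
my $c = '';                           # previous token, used to build bigrams
while ($text =~ /([a-z\d]+|\S)/g) {   # a run of a-z/digits, or one non-space character
    my $t = $1;                       # copy $1 before running another match
    next if $t =~ /\p{P}|\p{Z}/;      # skip punctuation and separators
    $tok{$t} = 1;                     # unigram
    $tok{$c . $t} = 1;                # bigram: previous token + current token
    $c = $t;
}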
Then keys %tok will be the list.
--
* Gea-Suan Lin (public key: Using https://keyserver.pgp.com/ to search)
* If you cannot convince them, confuse them. -- Harry S Truman