
List:       kinosearch
Subject:    [KinoSearch] Write a custom analyzer/tokenizer
From:       gslin () gslin ! org (Gea-Suan Lin)
Date:       2007-07-28 3:35:16
Message-ID: 20070728102532.GA92380 () gslin ! org

Hello all,

I want to write a custom analyzer/tokenizer for CJK UTF-8 strings in
KinoSearch, but I don't know how.

In fact, I have already written one for Plucene:

http://search.cpan.org/dist/Plucene-Analysis-UTF8/
http://code.google.com/p/plucene-analysis-utf8/

The algorithm is very simple. When a string has the UTF-8 flag on, we
can use a regular expression to extract tokens from it, and then
generate a list of unigrams and bigrams:

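    my %tok;       # set of unigrams and bigrams
    my $c = '';    # previous token; empty before the first one
    while ($text =~ /([a-z\d]+|\S)/g) {
        my $t = $1;                     # copy before the next match
        next if $t =~ /\p{P}|\p{Z}/;    # skip punctuation and separators
        $tok{$t} = 1;                   # unigram
        $tok{$c . $t} = 1;              # bigram: previous token + current
        $c = $t;
    }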

Then keys %tok gives the list of tokens.
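
For illustration, here is a minimal self-contained sketch of the same
idea wrapped in a function. The name cjk_tokens and the sample string
are hypothetical, not part of Plucene or KinoSearch, and the sketch
assumes the text has already been decoded to Perl characters and
lowercased:

```perl
use strict;
use warnings;
use utf8;    # the sample string below contains literal CJK characters

# Hypothetical helper: returns the unigram and bigram tokens of $text.
# Assumes $text is a decoded (character) string, already lowercased.
sub cjk_tokens {
    my ($text) = @_;
    my %tok;
    my $prev = '';
    while ($text =~ /([a-z\d]+|\S)/g) {
        my $t = $1;
        next if $t =~ /\p{P}|\p{Z}/;    # skip punctuation and separators
        $tok{$t} = 1;                   # unigram
        $tok{$prev . $t} = 1;           # bigram; same as the unigram for the first token
        $prev = $t;
    }
    return sort keys %tok;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' | ', cjk_tokens('perl 搜尋引擎')), "\n";
```

Note that $prev is not reset when punctuation is skipped, so a bigram
can span a punctuation mark, just as in the loop above.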

-- 
* Gea-Suan Lin  (public key: Using https://keyserver.pgp.com/ to search)
* If you cannot convince them, confuse them.           -- Harry S Truman
