'Re: Korean character set in analysis'

List:       lucene-dev
Subject:    Re: Korean character set in analysis
From:       "Yiyi Sun" <yiyisun () yahoo ! com>
Date:       2001-11-29 12:47:41

Hi,

I have written three classes for Simplified Chinese: ChineseAnalyzer,
ChineseFilter and ChineseTokenizer. If you are interested in them, I can
upload them.

For the compound words, you can use a dictionary to extract the nouns from
each sentence.
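
As a rough sketch of that idea (plain Java, not Lucene code; the class name
CompoundSplitter and the tiny dictionary are made up for illustration): since
next() returns only one token per call, a filter can split each compound with
the dictionary, queue the parts, and hand them out one call at a time. The
greedy longest-match split is just one simple choice, and a real analyzer
would probably need something smarter.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical helper, not part of Lucene: splits compounds with a noun dictionary. */
class CompoundSplitter {
    private final Iterator<String> input;                     // upstream tokens
    private final Set<String> nouns;                          // assumed noun dictionary
    private final Queue<String> pending = new ArrayDeque<>(); // parts not yet returned

    CompoundSplitter(Iterator<String> input, Set<String> nouns) {
        this.input = input;
        this.nouns = nouns;
    }

    /** Returns the next token, or null at end of input (mirrors a next()-style stream). */
    String next() {
        // Refill the queue from the upstream tokens only when it has run dry.
        while (pending.isEmpty() && input.hasNext()) {
            pending.addAll(split(input.next(), nouns));
        }
        return pending.poll();
    }

    /** Greedy longest-match split; returns the whole word if it cannot be fully covered. */
    static List<String> split(String word, Set<String> nouns) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            while (end > start && !nouns.contains(word.substring(start, end))) {
                end--;
            }
            if (end == start) {
                return Arrays.asList(word); // no dictionary match here: keep the word as-is
            }
            parts.add(word.substring(start, end));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> nouns = new HashSet<>(Arrays.asList("foot", "ball"));
        CompoundSplitter s =
            new CompoundSplitter(Arrays.asList("football", "xyz").iterator(), nouns);
        for (String t = s.next(); t != null; t = s.next()) {
            System.out.println(t); // prints foot, ball, xyz on separate lines
        }
    }
}
```

The same buffering should carry over to a filter that wraps a TokenStream and
returns Token objects instead of strings.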

Cheers!

Yiyi Sun

----- Original Message -----
From: "Junshik, Jeon" <locus@nextel.co.kr>
To: <lucene-dev@jakarta.apache.org>
Sent: Thursday, November 29, 2001 12:15 AM
Subject: Korean character set in analysis


> Hello,
>
> I've been testing Lucene indexing and searching for Korean-language
> documents, but the Korean character set is currently not supported.
>
> So I've changed some code to make it work with the Korean character set.
>
>
> In the file
> "jakarta-lucene\src\java\org\apache\lucene\analysis\standard\StandardTokenizer.jj":
>
> JavaCC option part..
> -----------------------------------------------------------------
> options {
>   STATIC = false;
> //IGNORE_CASE = true;
> //BUILD_PARSER = false;
>   UNICODE_INPUT = true; // <== change: uncommented for the Korean character set
>   USER_CHAR_STREAM = true;
>   OPTIMIZE_TOKEN_MANAGER = true;
> //DEBUG_TOKEN_MANAGER = true;
> }
>
> in TOKEN
> -----------------------------------------------------------------
> | < #LETTER:   // unicode letters
>       [
>        "\u0041"-"\u005a",
>        "\u0061"-"\u007a",
>        "\u00c0"-"\u00d6",
>        "\u00d8"-"\u00f6",
>        "\u00f8"-"\u00ff",
>        "\u0100"-"\u1fff",
>        "\u3040"-"\u318f",
>        "\u3300"-"\u337f",
>        "\u3400"-"\u3d2d",
>        "\u4e00"-"\u9fff",
>        "\uac00"-"\ud7a3",   // <== change: added (Korean character set in Unicode)
>        "\uf900"-"\ufaff"
>       ]
>   >
>
> I hope these changes can be added to the CVS repository.
>
>
> Another question is how to analyze compound words.
>
> A compound word consists of several nouns. I want to index every noun in a
> compound word after analysis,
> but the current TokenStream class has only a "public Token next()" method.
>
> Could you let me know how to solve this?
>
> Regards,
>
> Junshik, Jeon (locus@nextel.co.kr)
>
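
For reference, the range added to #LETTER above, "\uac00"-"\ud7a3", matches
the precomposed Hangul syllables in Unicode. A quick way to sanity-check that
in plain Java is sketched below; the class name HangulCheck and its methods
are only illustrative and are not part of Lucene or the grammar.

```java
/** Illustrative check of the range added to the grammar; not Lucene code. */
class HangulCheck {
    /** True if c falls inside U+AC00..U+D7A3, the range added to #LETTER. */
    static boolean inAddedRange(char c) {
        return c >= 0xAC00 && c <= 0xD7A3;
    }

    /** Number of code points covered by the added range. */
    static int rangeSize() {
        return 0xD7A3 - 0xAC00 + 1;
    }

    public static void main(String[] args) {
        char first = (char) 0xAC00; // first precomposed Hangul syllable
        System.out.println(inAddedRange(first));
        // The JDK's own Unicode tables place this character in the Hangul Syllables block.
        System.out.println(
            Character.UnicodeBlock.of(first) == Character.UnicodeBlock.HANGUL_SYLLABLES);
        System.out.println(Character.isLetter(first));
        System.out.println(rangeSize()); // 11172
    }
}
```

Note that this range covers only precomposed syllables; whether Hangul Jamo
are matched depends on the other ranges in the grammar.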

