List: lucene-dev
Subject: Re: Analyzers for various languages
From: "Che Dong" <chedong () hotmail ! com>
Date: 2002-12-31 6:59:19
For Asian languages (Chinese, Korean, Japanese), bigram-based word segmentation is an
easy way to solve the word segmentation problem. Bigram-based segmentation works like
this: C1C2C3C4 => C1C2 C2C3 C3C4 (each C# is a single CJK character). I think this
would let a StandardTokenizer handle mixed-language content such as Chinese/English
or Japanese/French.
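To make the C1C2C3C4 => C1C2 C2C3 C3C4 rule concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes. The class name BigramSegmenter and the choice to keep a lone character as its own term are my assumptions for illustration, and it assumes one Java char per CJK character (no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSegmenter {
    // Split a run of CJK characters into overlapping two-character terms.
    // Assumption: every character fits in one Java char (BMP only).
    public static List<String> bigrams(String run) {
        List<String> terms = new ArrayList<>();
        if (run.length() == 1) {
            terms.add(run); // assumption: a lone character is kept as its own term
            return terms;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            terms.add(run.substring(i, i + 2)); // characters i and i+1
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中文分词")); // prints [中文, 文分, 分词]
    }
}
```

Because neighbouring terms share a character, any two-character query word matches wherever that pair occurs in the text, with no dictionary needed. The price is a larger index than dictionary-based segmentation would produce.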
In CJKTokenizer (modified from StopTokenizer) I use a one-character buffer to remember
the previous CJK character and build the overlapping term (Ci-1 + Ci). But in
StandardTokenizer I still don't know how to produce: T1T2T3T4 => T1T2 T2T3 T3T4
(each T# is a single-CJK-character term).
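The one-character buffer idea can be sketched over a plain java.io.Reader for mixed content. This is only an illustration under my own assumptions, not the actual CJKTokenizer source and not a change to StandardTokenizer: the class name MixedBigramSketch, the isCjk test (a rough check on a few Unicode blocks), and the lowercasing of non-CJK words are all hypothetical choices.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class MixedBigramSketch {
    // Rough CJK test (an assumption): Han, Hiragana, Katakana, Hangul syllables.
    static boolean isCjk(int c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    public static List<String> tokenize(Reader in) throws IOException {
        List<String> terms = new ArrayList<>();
        StringBuilder word = new StringBuilder(); // pending non-CJK word
        int prev = -1;            // one-character buffer: previous CJK character, or -1
        boolean prevUsed = false; // true once prev has appeared in a bigram
        while (true) {
            int c = in.read();
            if (c != -1 && isCjk(c)) {
                flush(word, terms);
                if (prev != -1) {
                    terms.add("" + (char) prev + (char) c); // overlap term: previous + current
                }
                prevUsed = prev != -1;
                prev = c;
                continue;
            }
            // Leaving a CJK run: a character that never got a partner stands alone.
            if (prev != -1 && !prevUsed) {
                terms.add(String.valueOf((char) prev));
            }
            prev = -1;
            if (c != -1 && Character.isLetterOrDigit(c)) {
                word.append((char) Character.toLowerCase(c));
            } else {
                flush(word, terms); // separator or end of input ends the word
            }
            if (c == -1) {
                return terms;
            }
        }
    }

    private static void flush(StringBuilder word, List<String> terms) {
        if (word.length() > 0) {
            terms.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) throws IOException {
        // prints [lucene, 中文, 文分, 分词, test]
        System.out.println(tokenize(new StringReader("Lucene 中文分词 test")));
    }
}
```

The only state carried between characters is the previous CJK character and a flag saying whether it has already been emitted in a pair, so the same approach should fit any tokenizer that sees one character or one single-character token at a time.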
For more articles on word segmentation for Asian languages:
http://www.google.com/search?q=chinese+word+segment+bigram
Regards
Che, Dong
----- Original Message -----
From: "Eric Isakson" <Eric.Isakson@sas.com>
To: <lucene-dev@jakarta.apache.org>
Sent: Saturday, December 07, 2002 12:40 AM
Subject: Analyzers for various languages
> Hi All,
>
> I want to volunteer to help get language modules organized into the CVS and builds.
>
> I've been lurking on the lists here for a couple of months, working with and
> getting familiar with Lucene. I'm investigating the use of Lucene to support our
> help system's fulltext search requirements. I have to build indices for multiple
> languages. I just poked around the CVS archives and found only the German, Russian
> and standard (English) analyzers in the core and nothing in the sandbox. In the list
> archives I've found many references to folks using Lucene for several other
> languages. I did find the CJKTokenizer, Dutch and French analyzers and have put
> those into my tests. Is there somewhere these analyzers are organized that I might
> get a hold of the sources for other languages to build into my toolset? There were
> a couple mentioned that several of you appear to be using that I can't find the
> sources for (most notably http://www.halyava.ru/do/org.apache.lucene.analysis.zip
> which gives a "Cannot find server" error).
>
> In order to meet the requirements for my product these are the languages I have to
> support:
>
> Must Support
> ------------
> English
> Japanese
> Chinese
> Korean
> French
> German
> Italian
> Polish
>
> Not Sure Yet
> ------------
> Czech
> Danish
> Hebrew
> Hungarian
> Russian
> Spanish
> Swedish
>
> I understand the issues that were raised about putting language modules in the core
> and then not being able to support them, but it seems they have not been put
> anywhere. I would be willing to try to get them into a central place where people
> can access them, or help someone who is already working on that. I can't commit
> today to being able to maintain or bugfix contributions, but should my company
> adopt Lucene as our search engine (which seems likely at this point) I'll do what I
> can to contribute back any fixes we make. I also have a personal interest in the
> project since I've found Lucene quite interesting to be working with and I've
> enjoyed learning about internationalizing Java apps.
>
> I'll volunteer to help gather and organize these somewhere if I were given
> committer rights to the appropriate area and folks would be willing to send me
> their language modules.
>
> I recall some discussion about moving language modules out of the core, but I don't
> think any decisions were made about where to put them (perhaps this is why they
> aren't in the CVS at all). I was thinking perhaps give each language a sandbox
> project or create language packages in the core build that could be enabled via
> settings in the build.properties file. Using the build.properties file could allow
> us to create a jar for each language during the core build so folks could install
> just the language modules they want, and if a language module starts breaking due to
> changes in the core it could easily be turned off until fixes were made to that
> module. I can start working on a setup like this in my local source tree next week
> using the existing language modules in the core if you all think this would be a
> good approach. If not, does anyone have a proposal for where these belong so we can
> get some movement on getting them committed to CVS?
>
> Regards,
> Eric
> --
> Eric D. Isakson        SAS Institute Inc.
> Application Developer  SAS Campus Drive
> XML Technologies       Cary, NC 27513
> (919) 531-3639         http://www.sas.com
>
>
>