'Re: Analyzers for various languages'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       lucene-dev
Subject:    Re: Analyzers for various languages
From:       "Che Dong" <chedong () hotmail ! com>
Date:       2002-12-31 6:59:19
[Download RAW message or body]

For asian language, Chinese Korean Japanese,  bigram based word segment is easy way \
to solve the word segment problem.  Bigram based word segment is:  C1C2C3C4  =>  C1C2 \
C2C3 C3C4  (C# is single CJK charator term) I think the the make a StandardTokenizer \
can handle multi language mixed content : Chinese/English, Japanese/French mixed \
content. 

In CJKTokenizer(modify from StopTokenizer) I use one char buffer remember previous \
CJK charactor to make overlap term(Ci + Ci-1)。 but in StandardTokenizer I still \
don't know how to make: T1T2T3T4 => T1T2 T2T3 T3T4.  (T# is single CJK charator term)

for more article on word segment for asian languages:
http://www.google.com/search?q=chinese+word+segment+bigram

Regards

Che, Dong
----- Original Message ----- 
From: "Eric Isakson" <Eric.Isakson@sas.com>
To: <lucene-dev@jakarta.apache.org>
Sent: Saturday, December 07, 2002 12:40 AM
Subject: Analyzers for various languages


> Hi All, 
> 
> I want to volunteer to help get language modules organized into the CVS and builds.
> 
> I've been lurking on the lists here for a couple months and working with and \
> getting familiar with Lucene. I'm investigating the use of lucene to support our \
> help system's fulltext search requirements. I have to build indices for multiple \
> languages. I just poked around the CVS archives and found only the German, Russian \
> and standard(English) analyzers in the core and nothing in the sandbox. In the list \
> archives I've found many references to folks using Lucene for several other \
> languages. I did find the CJKTokenizer, Dutch and French analyzers and have put \
> those into my tests. Is there somewhere these analyzers are organized that I might \
> get a hold of the sources for other languages to build into my toolset? There were \
> a couple mentioned that several of you appear to be using that I can't find the \
> sources for (most notably http://www.halyava.ru/do/org.apache.lucene.analysis.zip \
> <http://www.halyava.ru/do/org.apache.lucene.analysis.zip>  which gives a "Cannot \
> find server" error).  
> In order to meet the requirements for my product these are the languages I have to \
> support:  
> Must Support 
> ------------ 
> English
> Japanese 
> Chinese 
> Korean 
> French 
> German 
> Italian 
> Polish 
> 
> Not Sure Yet 
> ------------ 
> Czech 
> Danish 
> Hebrew 
> Hungarian 
> Russian 
> Spanish 
> Swedish 
> 
> I understand the issues that were raised about putting language modules in the core \
> and then not being able to support them, but it seems they have not been put \
> anywhere. I would be willing to try and get them into a central place that people \
> can access them or help someone that is already working on that. I can't commit \
> today to being able to maintain or bugfix contributions, but should my company \
> adopt Lucene as our search engine (which seems likely at this point) I'll do what I \
> can to contribute back any fixes we make. I also have a personal interest in the \
> project since I've found Lucene quite interesting to be working with and I've \
> enjoyed learning about internationalizing java apps. 
> I'll volunteer to help gather and organize these somewhere if I were given \
> committer rights to the appropriate area and folks would be willing to send me \
> their language modules.  
> I recall some discussion about moving language modules out of the core, but I don't \
> think any decisions were made about where to put them (perhaps this is why they \
> aren't in the CVS at all). I was thinking perhaps give each language a sandbox \
> project or create language packages in the core build that could be enabled via \
> settings in the build.properties file. Using the build.properties file could allow \
> us to create a jar for each language during the core build so folks could install \
> just the language modules they want and if a language module starts breaking due to \
> changes in the core it could easily be turned off until fixes were made to that \
> module. I can start working on a setup like this in my local source tree next week \
> using the existing language modules in the core if you all think this would be a \
> good approach. If not, does anyone have a proposal for where these belong so we can \
> get some movement on getting them committed to CVS? 
> Regards,
> Eric
> -- 
> Eric D. Isakson        SAS Institute Inc. 
> Application Developer  SAS Campus Drive 
> XML Technologies       Cary, NC 27513 
> (919) 531-3639         http://www.sas.com <http://www.sas.com>  
> 
> 
> 


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic