'Re: Query text Tokenize issue'


List:       lucene-user
Subject:    Re: Query text Tokenize issue
From:       Erik Hatcher <erik () ehatchersolutions ! com>
Date:       2005-07-28 9:02:44
Message-ID: 8502390A-59AC-4461-A800-FC2DCEA21073 () ehatchersolutions ! com


On Jul 27, 2005, at 7:26 PM, Indu Abeyaratna wrote:

>
> I have a field indexed as keyword, with two records:
> "J400-C-V1-S10-T1" and "J400-C-V-S10-T1"
>
> When I search for "J400-C-V1-S10-T1", it returns the matching record,
> but when I search for "J400-C-V-S10-T1" it doesn't return the matching
> one.
>
> Further, I found that "J400-C-V-S10-T1" is incorrectly tokenised to
> "J400-C" and "V-S10-T1", but nothing like that happened to
> "J400-C-V1-S10-T1".
>
> This happens when there is a combination like "?-?-", and it gets
> tokenised into "?" and "?-".
>
> I attached a test case for further clarification.
>
> I am using StandardAnalyzer and the query parser.
>
> Is this a bug in Lucene or JavaCC? Or am I missing something here?
> Any suggestion to get around this?

It's not a "bug" per se, but rather just how StandardAnalyzer
works.  StandardAnalyzer is a general-purpose text analyzer, and
cannot reasonably keep identifiers like these intact and also deal
with the much more common scenario of "hyphenated-text" that should
be split into separate tokens.

As Otis said, this is really the job for a custom analyzer.
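
To make the trade-off concrete, here is a minimal sketch in plain Java. It
does not use the Lucene API: the class PartNumberTokens and its two helper
methods are hypothetical names invented for illustration, and the splitting
rule is a deliberately simplified stand-in, not StandardAnalyzer's actual
grammar. It only shows why one rule cannot serve both prose and part
numbers, which is the gap a custom analyzer for that field would fill.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these helpers are not Lucene classes.
class PartNumberTokens {

    // Simplified stand-in for a general-purpose text rule:
    // lowercase the input and break it at whitespace and hyphens.
    static List<String> splitOnHyphens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase().split("[\\s-]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // What an identifier field needs: the whole value as one token,
    // unchanged, so it matches the keyword stored in the index.
    static List<String> keepWhole(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(text);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitOnHyphens("hyphenated-text"));
        System.out.println(keepWhole("J400-C-V-S10-T1"));
    }
}
```

With the prose rule, "hyphenated-text" becomes two searchable words; with
the identifier rule, "J400-C-V-S10-T1" stays a single token. A custom
analyzer would apply the second behaviour to the query text for that one
field and leave ordinary text fields alone.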

     Erik

