
List:       lucene-user
Subject:    Re: How to retain % sign next to number during tokenization
From:       Amitesh Kumar <amiteshk116@gmail.com>
Date:       2023-09-21 13:24:33
Message-ID: CAD41JwM+wQrAj0OgjK7gtirpq+_mQfWdhYMNpLqzGW55yew9+A@mail.gmail.com


Thank you! I will give it a try and share my findings with you all
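
In the meantime, a quick token dump like the one below makes it easy to see
whether the % survives a given chain (a minimal sketch, assuming a recent
Lucene: StandardAnalyzer stands in for the real chain, the field name "body"
is arbitrary, and the empty stop-word set just keeps every token visible):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.CharArraySet;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class TokenDump {
    public static void main(String[] args) throws Exception {
      // StandardAnalyzer = StandardTokenizer + LowerCaseFilter (+ stop
      // words, disabled here); swap in the analyzer under test.
      try (Analyzer analyzer = new StandardAnalyzer(CharArraySet.EMPTY_SET);
           TokenStream ts = analyzer.tokenStream("body", "My score was 50%")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term); // prints: my / score / was / 50 -- the % is gone
        }
        ts.end();
      }
    }
  }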

Regards
Amitesh

On Thu, Sep 21, 2023 at 08:18 Uwe Schindler <uwe@thetaphi.de> wrote:

> The problem with WhitespaceTokenizer is that it splits only on
> whitespace. If you have text like "This is, was some test." then you get
> tokens like "is," and "test.", including the punctuation.
>
> This is the reason why StandardTokenizer is normally used for
> human-readable text. WhitespaceTokenizer is normally only used for
> special content like token lists (e.g., tags) or unique identifiers.
>
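A quick way to see the difference described above, contrasting the two
tokenizers on that sentence (a minimal sketch, assuming a recent Lucene; the
empty stop-word set for StandardAnalyzer just keeps every token visible):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.CharArraySet;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class CompareTokenizers {
    public static void main(String[] args) throws Exception {
      // WhitespaceAnalyzer: [This] [is,] [was] [some] [test.] -- punctuation kept
      // StandardAnalyzer:   [this] [is] [was] [some] [test]   -- punctuation stripped
      for (Analyzer analyzer : new Analyzer[] {
          new WhitespaceAnalyzer(), new StandardAnalyzer(CharArraySet.EMPTY_SET)}) {
        try (TokenStream ts = analyzer.tokenStream("f", "This is, was some test.")) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            System.out.print("[" + term + "] ");
          }
          ts.end();
          System.out.println();
        }
        analyzer.close();
      }
    }
  }
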
> As a quick workaround while still keeping the %, you can add a CharFilter
> like MappingCharFilter before the tokenizer that replaces the "%" char
> with something else that is not stripped off. As this is done for both
> indexing and searching, it does not hurt you. How about a "percent
> emoji"? :-)
>
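A minimal sketch of that CharFilter approach (recent-Lucene API assumed;
"_pct_" is an arbitrary placeholder, chosen because the underscore is
ExtendNumLet in UAX#29 word-breaking and therefore stays glued to the
adjacent digits in StandardTokenizer):

  import java.io.Reader;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.charfilter.MappingCharFilter;
  import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
  import org.apache.lucene.analysis.standard.StandardTokenizer;

  public class PercentAwareAnalyzer extends Analyzer {

    // Rewrite "%" before tokenization; "50%" then reaches the tokenizer
    // as "50_pct_" and survives as a single token.
    private static final NormalizeCharMap PERCENT_MAP;
    static {
      NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
      builder.add("%", "_pct_");
      PERCENT_MAP = builder.build();
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
      // Applied before the tokenizer, at both index and query time.
      return new MappingCharFilter(PERCENT_MAP, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer source = new StandardTokenizer();
      TokenStream result = new LowerCaseFilter(source);
      return new TokenStreamComponents(source, result);
    }
  }

With this, "My score was 50%" indexes the token 50_pct_ and the query "50%"
is analyzed to the same token, so the bare "50" of Doc-1 no longer matches.
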
> Another common "workaround" is shown in some Solr default configurations
> typically used for product search: those use WhitespaceTokenizer,
> followed by WordDelimiterFilter. WDF is then able to remove accents and
> handle stuff like product numbers correctly. There you can probably make
> sure that "%" survives.
>
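A rough sketch of that chain (Lucene 7.x signatures assumed -- newer
versions added an adjustInternalOffsets argument to the constructor, so
check the javadocs of your version; classifying '%' as DIGIT is my
assumption for keeping "50%" in one piece):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.WhitespaceTokenizer;
  import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
  import org.apache.lucene.analysis.miscellaneous.WordDelimiterIterator;

  import static org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter.DIGIT;

  public class PercentWdfAnalyzer extends Analyzer {

    // Clone the default char-type table (covers chars 0..255) and mark
    // '%' as DIGIT: "50%" then stays one "number" token, while '-' is
    // still a delimiter, so "40-50%" yields "40" and "50%".
    private static final byte[] TYPE_TABLE =
        WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE.clone();
    static {
      TYPE_TABLE['%'] = DIGIT;
    }

    private static final int FLAGS =
        WordDelimiterGraphFilter.GENERATE_WORD_PARTS
            | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS;

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer source = new WhitespaceTokenizer();
      TokenStream result =
          new WordDelimiterGraphFilter(source, TYPE_TABLE, FLAGS, null);
      return new TokenStreamComponents(source, result);
    }
  }

In Solr the same idea is the types="wdfftypes.txt" option of
WordDelimiterGraphFilterFactory, with a line like "% => DIGIT" in that file.
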
> Uwe
>
> Am 20.09.2023 um 22:42 schrieb Amitesh Kumar:
> > Thanks Mikhail!
> >
> > I have tried all the other tokenizers from Lucene 4.4. In the case of
> > WhitespaceTokenizer, it loses romanizing of special chars like '-', etc.
> >
> >
> > On Wed, Sep 20, 2023 at 16:39 Mikhail Khludnev <mkhl@apache.org> wrote:
> >
> >> Hello,
> >> Check the whitespace tokenizer.
> >>
> >> On Wed, Sep 20, 2023 at 7:46 PM Amitesh Kumar <amiteshk116@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> I am facing a requirement change to get the % sign retained in
> >>> searches, e.g.:
> >>>
> >>> Sample search docs:
> >>> 1. Number of boys 50
> >>> 2. My score was 50%
> >>> 3. 40-50% for pass score
> >>>
> >>> Search query: 50%
> >>> Expected results: Doc-2 and Doc-3, i.e.
> >>> "My score was 50%"
> >>> "40-50% for pass score"
> >>>
> >>> Actual result: All 3 documents (because the tokenizer strips off the %
> >>> during both indexing and searching, and hence matches every doc with
> >>> 50 in it).
> >>>
> >>> On the implementation front, I am using a set of filters like
> >>> LowerCaseFilter, EnglishPossessiveFilter, etc. in addition to the
> >>> base tokenizer StandardTokenizer.
> >>>
> >>> Per my analysis, StandardTokenizer strips off the % sign, hence the
> >>> behavior. Has someone faced a similar requirement? Any help/guidance
> >>> is highly appreciated.
> >>>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


