List:       lucene-dev
Subject:    Re: How to retain % sign next to number during tokenization
From:       Amitesh Kumar <amiteshk116 () gmail ! com>
Date:       2023-07-18 13:56:28
Message-ID: CAD41JwPXSs9VCi91SWYNre4_M51xK-j0WGtSDbT2DuZnR7r8eg () mail ! gmail ! com

Sorry for duplicating the question.

On Tue, Jul 18, 2023 at 19:09 Amitesh Kumar <amitesh116@gmail.com> wrote:

> I have a requirement change: the % sign must be retained in searches, e.g.
>
> Sample search docs:
> 1. Number of boys 50
> 2. My score was 50%
> 3. 40-50% for pass score
>
> Search query: 50%
> Expected results: Doc-2 and Doc-3, i.e.
> 1. My score was 50%
> 2. 40-50% for pass score
>
> Actual result: All 3 documents
> (possibly because the tokenizer strips off the % during both indexing and
> searching, and hence matches all docs with 50 in them.)
>
> On the implementation front, I am using a set of filters such as
> LowerCaseFilter and EnglishPossessiveFilter, in addition to the base
> tokenizer, StandardTokenizer.
>
> Per my analysis, StandardTokenizer strips off the % sign, hence the
> behavior. Has anyone faced a similar requirement? Any help/guidance is
> highly appreciated.
>
> Regards
> Amitesh
> --
> Regards,
> Amitesh
> Sent from Gmail Mobile
> (Please ignore typos)
>
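To make the behavior described above concrete, here is a minimal sketch in
plain Python. It does not use Lucene; the two regex-based functions are
hypothetical stand-ins, one for a tokenizer that discards "%" and one for a
tokenizer that keeps a "%" attached to the preceding digits. The document
texts are the three samples from the question, and search() is an assumed
helper that requires every query token to be present in the document.

```python
import re

# The three sample documents from the question, keyed by doc number.
DOCS = {
    1: "Number of boys 50",
    2: "My score was 50%",
    3: "40-50% for pass score",
}


def strip_percent_tokens(text):
    # Rough stand-in for a tokenizer that drops punctuation such as "%".
    return re.findall(r"[a-z0-9]+", text.lower())


def keep_percent_tokens(text):
    # Same idea, but a "%" directly after digits stays part of the token.
    return re.findall(r"[0-9]+%|[a-z0-9]+", text.lower())


def search(query, tokenize):
    # A doc matches when it contains every token produced from the query.
    query_tokens = set(tokenize(query))
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if query_tokens <= set(tokenize(text))
    )


print(search("50%", strip_percent_tokens))  # all three docs match
print(search("50%", keep_percent_tokens))   # only docs 2 and 3 match
```

One trade-off is visible in the sketch: once "50%" is indexed as a single
token, a plain query for 50 no longer matches docs 2 and 3. In Lucene itself,
possible directions (assumptions to verify against your version, not a tested
recipe) include rewriting "%" to a word with MappingCharFilter before
StandardTokenizer runs, or swapping in WhitespaceTokenizer or a
PatternTokenizer, with the same analyzer applied at index and query time.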
-- 
Regards
Amitesh
