'Re: Index for text with space'

List:       solr-user
Subject:    Re: Index for text with space
From:       Andy C <andycsolr () gmail ! com>
Date:       2021-10-25 13:28:50
Message-ID: CAAxJCMbo2jK73zRb7uTQ8puzZumEXgabkX34s-BqQyZtd1Rnfw () mail ! gmail ! com

I would think your problem goes beyond 1- and 2-character words not being
indexed.

With your current field type definition, if someone searches for "can" it
will retrieve documents that contain any word that starts with "can". So
"candidate", "canadian", "cantina", etc.

Is this really the desired search behavior?
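
If it is not, a field type along the following lines may be closer to what
is wanted. This is only a sketch: the name "text_words" is a placeholder that
does not appear anywhere in this thread, and it has not been tested against
the original data. It keeps the tokenizer and filters from the posted config
but leaves out EdgeNGramFilterFactory, so whole words of any length are
indexed and "can" should no longer match "candidate".

```xml
<!-- Sketch only: "text_words" is a hypothetical name, not from the thread. -->
<!-- A single <analyzer> with no type attribute is used at both index and query time. -->
<fieldType name="text_words" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If prefix matching is wanted after all, the alternative mentioned earlier in
the thread is to keep the filter and add preserveOriginal="true" to it.
Either change affects index-time analysis, so the data would need to be
reindexed.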

On Mon, Oct 25, 2021 at 8:48 AM Dave <hastings.recursive@gmail.com> wrote:

> You can preprocess the query to remove anything not indexed (less than 3
> characters), but that initial schema decision was a mistake, and should be
> remedied and reindexed.
>
> > On Oct 25, 2021, at 8:36 AM, son hoang <sonhoangnz@gmail.com> wrote:
> >
> > Is there any way in the query so that I do not need to reindex the
> whole data?
> >
> >> On 2021/10/23 15:39:18, Walter Underwood <wunder@wunderwood.org>
> wrote:
> >> Agreed. There is a simple fix. Index all the words. Also, stop using
> EdgeNgramFilter.
> >> That is only used for completion, not word search.
> >>
> >> wunder
> >> Walter Underwood
> >> wunder@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>>> On Oct 23, 2021, at 4:31 AM, Dave <hastings.recursive@gmail.com>
> wrote:
> >>>
> >>> Why ever would you not index less than three characters?
> >>> "To be or not to be"
> >>> Seems like a significant search
> >>>
> >>>> On Oct 23, 2021, at 7:28 AM, son hoang <sonhoangnz@gmail.com> wrote:
> >>>>
> >>>> Yep, words less than 3 chars will not be indexed. But if "Al Abbas"
> text can be separated into a token "Abbas" (and "Al"  but it is not counted
> as a token as it has 2 chars only) then we can apply OR condition in the
> query?
> >>>>
> >>>>> On 2021/10/22 14:37:51, Andy C <andycsolr@gmail.com> wrote:
> >>>>> The issue looks to me to be with the use of EdgeNGramFilterFactory
> in your
> >>>>> field type. You have configured it with minGramSize="3" and have not
> >>>>> specified preserveOriginal="true".
> >>>>>
> >>>>> So words less than 3 characters will not be indexed, and therefore
> can't be
> >>>>> searched.
> >>>>>
> >>>>> See
> >>>>>
> https://solr.apache.org/guide/8_8/filter-descriptions.html#edge-n-gram-filter
> >>>>>
> >>>>> - Andy -
> >>>>>
> >>>>>> On Fri, Oct 22, 2021 at 10:12 AM son hoang <sonhoangnz@gmail.com>
> wrote:
> >>>>>>
> >>>>>> Thanks, Thamiz
> >>>>>>
> >>>>>> It seems that I have index=StandardTokenizerFactory causing the
> issue
> >>>>>>
> >>>>>> I do not want to re-index. Is there any solution ? Should I have
> query
> >>>>>> "OR" so that the search can return  "Al Abbas" when I have  "Al
> Abbas" in
> >>>>>> the query field  (eg: there is a OR match "Abbas" ?
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> On 2021/10/21 07:56:20, Thamizhazhagan B <Thamizhazhagan.X.B@kp.org
> >
> >>>>>> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Create a copy field as below and use this copyfield in your query..
> >>>>>>>
> >>>>>>> <copyField source="_name" dest="itemFullName"/>
> >>>>>>> <field name="itemFullName" type="itemFullName_type" stored="true" indexed="true" termVectors="true" termPositions="true" termOffsets="true"/>
> >>>>>>>
> >>>>>>> <fieldType name="itemFullName_type" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100" multiValued="false">
> >>>>>>>  <analyzer type="index">
> >>>>>>>    <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>>    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
> >>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>  </analyzer>
> >>>>>>>  <analyzer type="query">
> >>>>>>>    <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>>    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
> >>>>>>>    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
> >>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>  </analyzer>
> >>>>>>> </fieldType>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Thamizh
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: son hoang <sonhoangnz@gmail.com>
> >>>>>>> Sent: Thursday, October 21, 2021 8:19 AM
> >>>>>>> To: users@solr.apache.org
> >>>>>>> Subject: Index for text with space
> >>>>>>>
> >>>>>>>
> >>>>>>> Hello
> >>>>>>>
> >>>>>>> I have a config like this:
> >>>>>>>
> >>>>>>> <fieldtype name="tok" class="solr.TextField" positionIncrementGap="100">
> >>>>>>>   <analyzer type="index">
> >>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>     <filter class="solr.ASCIIFoldingFilterFactory"/>
> >>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
> >>>>>>>   </analyzer>
> >>>>>>>   <analyzer type="query">
> >>>>>>>     <tokenizer class="solr.StandardTokenizerFactory" />
> >>>>>>>     <filter class="solr.ASCIIFoldingFilterFactory"/>
> >>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>     <!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/> -->
> >>>>>>>   </analyzer>
> >>>>>>> </fieldtype>
> >>>>>>>
> >>>>>>> Using this config:
> >>>>>>>
> >>>>>>> 1. When I search for "Abbas", the result for "Al Abbas" appears.
> >>>>>>>
> >>>>>>> 2. When I search for "Al Abbas" in the search field, I get no
> results.
> >>>>>>>
> >>>>>>> It seems that "Al Abbas" is not indexed. What I should do in the
> config
> >>>>>> so #2 can return the result
> >>>>>>>
> >>>>>>> Many thanks
> >>>>>>>
> >>>>>>
> >>>>>
> >>
> >>
>

