'Re: Searching words with spaces for word without spaces in solr'

List:       solr-user
Subject:    Re: Searching words with spaces for word without spaces in solr
From:       sunshine glass <sunshineglassof2day () gmail ! com>
Date:       2014-07-31 17:21:30
Message-ID: CAFy2D-eiYoK2uu6db2QxRt=21S16FJw+KmSYP9jpjnTC_05b1A () mail ! gmail ! com

*Point 1:*
On Thu, Jul 31, 2014 at 9:32 PM, Dyer, James <James.Dyer@ingramcontent.com>
 wrote:

> If a user is searching on "ice cream" but your index has "icecream", you
> can treat this like a spelling error.  WordBreakSolrSpellChecker would
> identify the fact that while "ice cream" is not in your index, "icecream"
> is, and then you can re-query for the corrected version without the space.
>

What if I have 1M records for "ice cream" and the same number for "icecream"?
Then this trick will not work, because neither form is missing from the
index. What I want in this case is that whether I search for "ice cream" or
"icecream", Solr returns all 2M results.
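
To make the requirement concrete, here is a toy Python model of the behaviour
I am after. It is not Solr code: the normalize() and search() helpers are
made up for illustration, and the titles are the three documents from my test
index quoted further down.

```python
import re

def normalize(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def search(query, titles):
    """Return every title whose normalized form equals the normalized query."""
    key = normalize(query)
    return [t for t in titles if normalize(t) == key]

titles = ["Ice Cream", "Icecream", "Ice-cream"]

# Both spellings of the query should find all three documents.
print(search("ice cream", titles))  # ['Ice Cream', 'Icecream', 'Ice-cream']
print(search("icecream", titles))   # ['Ice Cream', 'Icecream', 'Ice-cream']
```

So the result count should not depend on which of the two forms is more
common in the index.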

*Point 2:*
On Thu, Jul 31, 2014 at 9:32 PM, Dyer, James <James.Dyer@ingramcontent.com>
 wrote:
> The problem with solving this with analyzers, is that you can analyze
> "ice-cream" as either "ice cream" or "icecream" (split or catenate on
> hyphen).  You can even analyze "IceCream > Ice Cream" (catenate on case
> change).  But how is your analyzer going to know that "icecream" should
> index as two tokens: "ice" "cream" ?  You're asking analysis to do too much
> in this case. This is where spellcheck can bridge the gap.

I don't want "icecream" to be indexed as "ice" and "cream". I agree that
this is not feasible. What I am looking for is to create shingles at
query time as well. In other words, when querying "ice cream", can't Solr
search for "ice" OR "cream" OR "icecream"?
That is, form shingles at query time.
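
Roughly what I mean, as a client-side Python sketch. This only imitates what
I understand ShingleFilterFactory to emit with maxShingleSize="2",
outputUnigrams="true" and tokenSeparator=""; it is not Solr's implementation,
and shingles() and or_query() are names I made up. It also ignores stemming
and query escaping.

```python
def shingles(tokens, max_size=2, separator=""):
    """Emit each unigram followed by the joined n-grams starting at it."""
    out = []
    for i in range(len(tokens)):
        for size in range(1, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
    return out

def or_query(field, user_input):
    """Build a query that ORs every unigram and shingle on one field."""
    tokens = user_input.lower().split()
    terms = list(dict.fromkeys(shingles(tokens)))  # de-duplicate, keep order
    return " OR ".join(f"{field}:{term}" for term in terms)

print(shingles(["ice", "cream"]))
# ['ice', 'icecream', 'cream']
print(or_query("title", "ice cream"))
# title:ice OR title:icecream OR title:cream
```

The point is that the concatenated token should actually take part in the
search, instead of only "ice" and "cream" as in the parsed query quoted below.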

There is a long list of such words in my index, so I don't want to implement
this via the synonym filter factory.


On Thu, Jul 31, 2014 at 9:32 PM, Dyer, James <James.Dyer@ingramcontent.com>
wrote:

> If a user is searching on "ice cream" but your index has "icecream", you
> can treat this like a spelling error.  WordBreakSolrSpellChecker would
> identify the fact that while "ice cream" is not in your index, "icecream"
> is, and then you can re-query for the corrected version without the space.
>
> The problem with solving this with analyzers, is that you can analyze
> "ice-cream" as either "ice cream" or "icecream" (split or catenate on
> hyphen).  You can even analyze "IceCream > Ice Cream" (catenate on case
> change).  But how is your analyzer going to know that "icecream" should
> index as two tokens: "ice" "cream" ?  You're asking analysis to do too much
> in this case.  This is where spellcheck can bridge the gap.
>
> Of course, if you have a discrete list of words you want split like this,
> then you can do it with analysis using index-time synonyms.  In this case,
> you need to provide it with the list.  See
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> for more information.
>
> James Dyer
> Ingram Content Group
> (615) 213-4311
>
>
> -----Original Message-----
> From: sunshine glass [mailto:sunshineglassof2day@gmail.com]
> Sent: Thursday, July 31, 2014 10:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Searching words with spaces for word without spaces in solr
>
> I am not clear on this. This link is about spell checking. Can you
> elaborate?
>
>
> On Wed, Jul 30, 2014 at 9:17 PM, Dyer, James <James.Dyer@ingramcontent.com>
> wrote:
>
> > In addition to the analyzer configuration you're using, you might want to
> > also use WordBreakSolrSpellChecker to catch possible matches that can't
> > easily be solved through analysis.  For more information, see the section
> > for it at https://cwiki.apache.org/confluence/display/solr/Spell+Checking
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> > -----Original Message-----
> > From: sunshine glass [mailto:sunshineglassof2day@gmail.com]
> > Sent: Wednesday, July 30, 2014 9:38 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Searching words with spaces for word without spaces in solr
> >
> > This is the new configuration:
> >
> >     <fieldType name="text" class="solr.TextField"
> >                positionIncrementGap="100">
> >       <analyzer type="index">
> >         <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >         <tokenizer class="solr.StandardTokenizerFactory"/>
> >         <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> >                 outputUnigrams="true" tokenSeparator=""/>
> >         <filter class="solr.WordDelimiterFilterFactory"
> >                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                 catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >         <filter class="solr.LowerCaseFilterFactory"/>
> >         <filter class="solr.SnowballPorterFilterFactory"
> >                 language="English" protected="protwords.txt"/>
> >         <filter class="solr.SynonymFilterFactory"
> >                 synonyms="stemmed_synonyms_text_prime_index.txt"
> >                 ignoreCase="true" expand="true"/>
> >       </analyzer>
> >       <analyzer type="query">
> >         <tokenizer class="solr.StandardTokenizerFactory"/>
> >         <filter class="solr.LowerCaseFilterFactory"/>
> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                 words="stopwords_text_prime_search.txt"
> >                 enablePositionIncrements="true"/>
> >         <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> >                 outputUnigrams="true" tokenSeparator=""/>
> >         <filter class="solr.WordDelimiterFilterFactory"
> >                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                 catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
> >         <filter class="solr.SnowballPorterFilterFactory"
> >                 language="English" protected="protwords.txt"/>
> >       </analyzer>
> >     </fieldType>
> > >
> > >
> > These are current docs in my index:
> >
> > <result name="response" numFound="3" start="0">
> > <doc>
> > <str name="id">2</str>
> > <str name="title">Icecream</str>
> > <long name="_version_">1475063961342705664</long>
> > </doc>
> > <doc>
> > <str name="id">3</str>
> > <str name="title">Ice-cream</str>
> > <long name="_version_">1475063961344802816</long>
> > </doc>
> > <doc>
> > <str name="id">1</str>
> > <str name="title">Ice Cream</str>
> > <long name="_version_">1475063961203245056</long>
> > </doc>
> > </result>
> > </response>
> >
> > Query:
> >
> > http://localhost:8983/solr/collection1/select?q=title:ice+cream&debug=true
> >
> > Response:
> >
> > <result name="response" numFound="2" start="0">
> > <doc>
> > <str name="id">1</str>
> > <str name="title">Ice Cream</str>
> > <long name="_version_">1475063961203245056</long>
> > </doc>
> > <doc>
> > <str name="id">3</str>
> > <str name="title">Ice-cream</str>
> > <long name="_version_">1475063961344802816</long>
> > </doc>
> > </result>
> > <lst name="debug">
> > <str name="rawquerystring">title:ice cream</str>
> > <str name="querystring">title:ice cream</str>
> > <str name="parsedquery">
> > (+(title:ice DisjunctionMaxQuery((title:cream))))/no_coord
> > </str>
> > <str name="parsedquery_toString">+(title:ice (title:cream))</str>
> > <lst name="explain">
> > <str name="1">
> > 0.875 = (MATCH) sum of: 0.4375 = (MATCH) weight(title:ice in 0)
> > [DefaultSimilarity], result of: 0.4375 = score(doc=0,freq=2.0 =
> > termFreq=2.0 ), product of: 0.70710677 = queryWeight, product of: 1.0 =
> > idf(docFreq=2, maxDocs=3) 0.70710677 = queryNorm 0.61871845 = fieldWeight
> > in 0, product of: 1.4142135 = tf(freq=2.0), with freq of: 2.0 =
> > termFreq=2.0 1.0 = idf(docFreq=2, maxDocs=3) 0.4375 = fieldNorm(doc=0)
> > 0.4375 = (MATCH) weight(title:cream in 0) [DefaultSimilarity], result of:
> > 0.4375 = score(doc=0,freq=2.0 = termFreq=2.0 ), product of: 0.70710677 =
> > queryWeight, product of: 1.0 = idf(docFreq=2, maxDocs=3) 0.70710677 =
> > queryNorm 0.61871845 = fieldWeight in 0, product of: 1.4142135 =
> > tf(freq=2.0), with freq of: 2.0 = termFreq=2.0 1.0 = idf(docFreq=2,
> > maxDocs=3) 0.4375 = fieldNorm(doc=0)
> > </str>
> > <str name="3">
> > 0.70710677 = (MATCH) sum of: 0.35355338 = (MATCH) weight(title:ice in 2)
> > [DefaultSimilarity], result of: 0.35355338 = score(doc=2,freq=1.0 =
> > termFreq=1.0 ), product of: 0.70710677 = queryWeight, product of: 1.0 =
> > idf(docFreq=2, maxDocs=3) 0.70710677 = queryNorm 0.5 = fieldWeight in 2,
> > product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 1.0 =
> > idf(docFreq=2, maxDocs=3) 0.5 = fieldNorm(doc=2) 0.35355338 = (MATCH)
> > weight(title:cream in 2) [DefaultSimilarity], result of: 0.35355338 =
> > score(doc=2,freq=1.0 = termFreq=1.0 ), product of: 0.70710677 =
> > queryWeight, product of: 1.0 = idf(docFreq=2, maxDocs=3) 0.70710677 =
> > queryNorm 0.5 = fieldWeight in 2, product of: 1.0 = tf(freq=1.0), with
> > freq of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=2, maxDocs=3) 0.5 =
> > fieldNorm(doc=2)
> > </str>
> > </lst>
> >
> > Still not working ????
> >
> >
> > On Fri, May 30, 2014 at 9:21 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> > > > I'd spend some time with the admin/analysis page to understand the exact
> > > > tokenization going on here. For instance, sequencing the
> > > > ShingleFilterFactory before WordDelimiterFilterFactory may produce
> > > > "interesting" results. And then throwing the Snowball factory at it and
> > > > putting synonyms in front.... I suspect you're not indexing or searching
> > > > what you think you are.
> > > >
> > > > Second, what happens when you query with &debug=query? That'll show you
> > > > what the search string looks like.
> > > >
> > > > If that doesn't help, please post the results of looking at those things
> > > > here, that'll provide some information for us to work with.
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Fri, May 30, 2014 at 3:32 AM, sunshine glass <
> > > sunshineglassof2day@gmail.com> wrote:
> > >
> > > > Hi Folks,
> > > >
> > > > Any updates ??
> > > >
> > > >
> > > > On Wed, May 28, 2014 at 12:13 PM, sunshine glass <
> > > > sunshineglassof2day@gmail.com> wrote:
> > > >
> > > > > Dear Team,
> > > > >
> > > > > > How can I handle compound word searches in Solr?
> > > > > > How can I search for "hand bag" if I have "handbag" in my index? When
> > > > > > using a shingle filter in the query analyzer, the query "ice cube"
> > > > > > creates three tokens: "ice", "cube" and "icecube". Only "ice" and
> > > > > > "cube" are searched, but not "icecube", i.e. it is not working for
> > > > > > the pair even though I am using the shingle filter.
> > > > >
> > > > > Here's the schema config.
> > > > >
> > > > >
> > > > >    1.  <fieldType name="text" class="solr.TextField"
> > > > >    positionIncrementGap="100">
> > > > >    2.       <analyzer type="index">
> > > > >    3.         <filter class="solr.SynonymFilterFactory"
> > > > >    synonyms="synonyms_text_prime_index.txt" ignoreCase="true"
> > > > expand="true"/>
> > > > >    4.         <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > > > >    5.         <tokenizer class="solr.StandardTokenizerFactory"/>
> > > > >    6.          <filter class="solr.ShingleFilterFactory"
> > > > >    maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
> > > > >    7.          <filter class="solr.WordDelimiterFilterFactory"
> > > > >    catenateWords="1" catenateNumbers="1" catenateAll="1"
> > > > preserveOriginal="1"
> > > > >    generateWordParts="1" generateNumberParts="1"/>
> > > > >    8.         <filter class="solr.LowerCaseFilterFactory"/>
> > > > >    9.         <filter class="solr.SnowballPorterFilterFactory"
> > > > >    language="English" protected="protwords.txt"/>
> > > > >    10.       </analyzer>
> > > > >    11.       <analyzer type="query">
> > > > >    12.         <tokenizer class="solr.StandardTokenizerFactory"/>
> > > > >    13.         <filter class="solr.SynonymFilterFactory"
> > > > >    synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > > > >    14.         <filter class="solr.ShingleFilterFactory"
> > > > >    maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
> > > > >    15.         <filter class="solr.WordDelimiterFilterFactory"
> > > > >    preserveOriginal="1"/>
> > > > >    16.         <filter class="solr.LowerCaseFilterFactory"/>
> > > > >    17.         <filter class="solr.SnowballPorterFilterFactory"
> > > > >    language="English" protected="protwords.txt"/>
> > > > >    18.       </analyzer>
> > > > >    19.     </fieldType>
> > > > >
> > > > >    Any help is appreciated.
> > > > >
> > > > >
> > > >
> > >
> >
>

