'Re: Problems to get suggestions from an intermediate word using AnalyzingSuggester'


List:       lucene-user
Subject:    Re: Problems to get suggestions from an intermediate word using AnalyzingSuggester
From:       Michael McCandless <lucene () mikemccandless ! com>
Date:       2013-03-26 11:43:18
Message-ID: CAL8PwkbfPgzYfr1bSYwJuA8nCaq2OPakKsY+JLLxbF_bvSPBYA () mail ! gmail ! com

AnalyzingSuggester only matches by prefix, by design.
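To illustrate why the "google" lookup in the quoted message comes back empty, here is a minimal sketch in plain Java. It uses no Lucene classes; `PrefixOnlyDemo`, `analyze`, `lookup` and `SEP` are made-up names. It assumes the analyzed form of an entry is its tokens joined by a separator, roughly as in the automaton dump quoted below, and that a lookup succeeds only when the analyzed query is a prefix of that form.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixOnlyDemo {
    // Hypothetical stand-in for the token separator seen in the automaton dump.
    static final char SEP = '\u0100';

    // Toy analyzer (an assumption, not the poster's tokenizer):
    // split on dots and join the tokens with SEP.
    public static String analyze(String text) {
        return String.join(String.valueOf(SEP), text.split("\\."));
    }

    // Return every entry whose analyzed form starts with the analyzed query.
    public static List<String> lookup(List<String> entries, String query) {
        String key = analyze(query);
        List<String> hits = new ArrayList<>();
        for (String entry : entries) {
            if (analyze(entry).startsWith(key)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> entries = List.of("www.google.com");
        System.out.println(lookup(entries, "www"));    // [www.google.com]
        System.out.println(lookup(entries, "google")); // []
    }
}
```

The second lookup is empty because "google" is not a prefix of the analyzed entry, which begins with "www".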

You can try AnalyzingInfixSuggester, which currently exists only as two
alternative patches on
https://issues.apache.org/jira/browse/LUCENE-4845
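
For reference, one simple way to get infix-style matches with nothing but prefix lookups is to register each entry under every suffix of its token sequence, so a lookup can start at any token. The sketch below is plain Java with hypothetical names (`TokenSuffixDemo`, `add`, `lookup`); it is not taken from either patch on the issue and makes no claim about how they work.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TokenSuffixDemo {
    // Maps each token-suffix key (e.g. "google com") to the original entries.
    private final TreeMap<String, Set<String>> index = new TreeMap<>();

    // Register the entry under every suffix of its dot-separated tokens.
    public void add(String entry) {
        String[] tokens = entry.split("\\.");
        for (int i = 0; i < tokens.length; i++) {
            String key = String.join(" ", Arrays.copyOfRange(tokens, i, tokens.length));
            index.computeIfAbsent(key, k -> new TreeSet<>()).add(entry);
        }
    }

    // Prefix lookup over the suffix keys.
    public Set<String> lookup(String query) {
        String prefix = String.join(" ", query.split("\\."));
        Set<String> hits = new TreeSet<>();
        // Keys sharing a prefix are contiguous in the sorted map,
        // so scan from the prefix and stop at the first non-match.
        for (Map.Entry<String, Set<String>> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            hits.addAll(e.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        TokenSuffixDemo demo = new TokenSuffixDemo();
        demo.add("www.google.com");
        System.out.println(demo.lookup("google")); // [www.google.com]
        System.out.println(demo.lookup("oogle"));  // []
    }
}
```

Matches still begin at a token boundary, so "oogle" finds nothing; the trade-off is one index key per token of every entry.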

And please post back any feedback you have on the issue ... as the
issue stands I don't think either approach will be committed any time
soon.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Mar 26, 2013 at 3:45 AM, Andres Garcia <hgarcia@fi.upm.es> wrote:
> Hi all,
>
>
> My use case is very simple, given a string I would like to suggest all the
> possible urls that contain that string (given the limitations of the
> tokenizer and suggester). So far I have created a custom analyzer and
> tokenizer to parse urls, and that analyzer is used to create an
> AnalyzingSuggester object. When I look for a suggestion using a prefix of a
> url it works fine. However when I use an in between word I don't get any
> suggestion.
>
>
> Let's see my test case. I have a unique suggestion entry "www.google.com"
> in my TermFreq array.  If I search a suggestion for "www" it returns the
> url. If I search a suggestion for "google" the result is empty.
>
>
> My tokenizer splits the suggestion entry into the following tuples
> (token,offset): (www,0:3),(google,4:10),(com,11:14). Please note that I'm
> getting rid of the dots
>
>
> The automaton created for this entry is:
>
> state 0 [reject]: w -> 1
> state 1 [reject]: w -> 2
> state 2 [reject]: w -> 3
> state 3 [reject]: \\U00000100 -> 4
> state 4 [reject]: g -> 5
> state 5 [reject]: o -> 6
> state 6 [reject]: o -> 7
> state 7 [reject]: g -> 8
> state 8 [reject]: l -> 9
> state 9 [reject]: e -> 10
> state 10 [reject]: \\U00000100 -> 11
> state 11 [reject]: c -> 12
> state 12 [reject]: o -> 13
> state 13 [reject]: m -> 14
> state 14 [accept]:
>
>
> When I print the fst I get this: "wwwgooglecom"
>
>
> The automaton created for "google":
>
> Initial state: 0
> state 0 [reject]: g -> 1
> state 1 [reject]: o -> 2
> state 2 [reject]: o -> 3
> state 3 [reject]: g -> 4
> state 4 [reject]: l -> 5
> state 5 [reject]: e -> 6
> state 6 [accept]:
>
>
> I think I have a problem with my tokenizer (I'm not an expert) and this is
> affecting the creation of the first automaton. I really don't know how to
> get this fixed, any advice?
>
>
> best regards!

