List: lucene-user
Subject: Re: Problems to get suggestions from an intermediate word using AnalyzingSuggester
From: Michael McCandless <lucene () mikemccandless ! com>
Date: 2013-03-26 11:43:18
Message-ID: CAL8PwkbfPgzYfr1bSYwJuA8nCaq2OPakKsY+JLLxbF_bvSPBYA () mail ! gmail ! com
AnalyzingSuggester only matches by prefix, by design.
You can try AnalyzingInfixSuggester, which currently exists only as two
alternative patches on
https://issues.apache.org/jira/browse/LUCENE-4845
And please post back any feedback you have on the issue ... as the
issue stands I don't think either approach will be committed any time
soon.
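
To illustrate the difference, here is a minimal sketch in plain Java (no
Lucene dependency). It is not the AnalyzingSuggester implementation and not
necessarily what either patch on the issue does; it only mimics the idea
that the analyzed entry is one path of tokens joined by a separator label,
and that a prefix-only lookup must start walking at the first token. The
class name, the dot-splitting tokenize() helper, and the use of U+0100 as
the separator (taken from the quoted automaton dump) are assumptions made
for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixVsInfixSketch {
    // Assumed stand-in for the separator label that appears between tokens
    // in the quoted automaton dump.
    static final char SEP = '\u0100';

    // Hypothetical stand-in for the custom URL tokenizer: split on dots.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\.")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Analyzed form: the tokens joined by the separator label.
    static String analyzedForm(List<String> tokens) {
        return String.join(String.valueOf(SEP), tokens);
    }

    // Prefix-only lookup: the analyzed query must match the analyzed entry
    // starting at its very first character.
    static boolean prefixMatch(String entry, String query) {
        return analyzedForm(tokenize(entry)).startsWith(analyzedForm(tokenize(query)));
    }

    // One simple way to get infix behavior: also try the query against the
    // suffix of the entry that starts at each token boundary.
    static boolean infixMatch(String entry, String query) {
        List<String> tokens = tokenize(entry);
        String analyzedQuery = analyzedForm(tokenize(query));
        for (int i = 0; i < tokens.size(); i++) {
            String suffix = analyzedForm(tokens.subList(i, tokens.size()));
            if (suffix.startsWith(analyzedQuery)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String entry = "www.google.com";
        System.out.println(prefixMatch(entry, "www"));    // true
        System.out.println(prefixMatch(entry, "google")); // false
        System.out.println(infixMatch(entry, "google"));  // true
    }
}
```

In this sketch "google" fails the prefix lookup because the entry's path
begins with "w", which matches what the quoted test case reports. The
tokenizer output itself looks fine; the limitation is in where the lookup
is allowed to start.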
Mike McCandless
http://blog.mikemccandless.com
On Tue, Mar 26, 2013 at 3:45 AM, Andres Garcia <hgarcia@fi.upm.es> wrote:
> Hi all,
>
>
> My use case is very simple: given a string, I would like to suggest all the
> possible URLs that contain that string (given the limitations of the
> tokenizer and suggester). So far I have created a custom analyzer and
> tokenizer to parse URLs, and that analyzer is used to create an
> AnalyzingSuggester object. When I look for a suggestion using a prefix of a
> URL it works fine. However, when I use an in-between word I don't get any
> suggestions.
>
>
> Let's see my test case. I have a unique suggestion entry "www.google.com"
> in my TermFreq array. If I search a suggestion for "www" it returns the
> url. If I search a suggestion for "google" the result is empty.
>
>
> My tokenizer splits the suggestion entry into the following tuples
> (token,offset): (www,0:3),(google,4:10),(com,11:14). Please note that I'm
> getting rid of the dots.
>
>
> The automaton created for this entry is:
>
> state 0 [reject]: w -> 1
> state 1 [reject]: w -> 2
> state 2 [reject]: w -> 3
> state 3 [reject]: \\U00000100 -> 4
> state 4 [reject]: g -> 5
> state 5 [reject]: o -> 6
> state 6 [reject]: o -> 7
> state 7 [reject]: g -> 8
> state 8 [reject]: l -> 9
> state 9 [reject]: e -> 10
> state 10 [reject]: \\U00000100 -> 11
> state 11 [reject]: c -> 12
> state 12 [reject]: o -> 13
> state 13 [reject]: m -> 14
> state 14 [accept]:
>
>
> When I print the FST I get this: "wwwgooglecom"
>
>
> The automaton created for "google":
>
> Initial state: 0
> state 0 [reject]: g -> 1
> state 1 [reject]: o -> 2
> state 2 [reject]: o -> 3
> state 3 [reject]: g -> 4
> state 4 [reject]: l -> 5
> state 5 [reject]: e -> 6
> state 6 [accept]:
>
>
> I think I have a problem with my tokenizer (I'm not an expert) and this is
> affecting the creation of the first automaton. I really don't know how to
> get this fixed, any advice?
>
>
> best regards!
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org