List: lucene-dev
Subject: Re: [jira] [Commented] (LUCENE-8516) Make WordDelimiterGraphFilter a Tokenizer
From: Michael Sokolov <msokolov () gmail ! com>
Date: 2018-09-30 19:42:26
Message-ID: CAGUSZHDY_SOEBJ0afOTPN6uZ9Aua9humbCm=40F7927cNp99vg () mail ! gmail ! com
My current usage of this filter requires it to be a filter, since I need to
precede it with other filters. I think the idea of not touching offsets
preserves more flexibility, and since the offsets are already unreliable,
we wouldn't be losing much.
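(Not Lucene code, and not how WordDelimiterGraphFilter is actually implemented — just a toy sketch contrasting the two offset policies under discussion, assuming a simple split on non-alphanumeric characters. The function names and the split rule are illustrative only.)

```python
import re

def split_with_amended_offsets(token, start):
    # Sketch of the current behavior: each sub-token gets offsets
    # computed from its own position inside the original token.
    return [(m.group(), start + m.start(), start + m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", token)]

def split_with_original_offsets(token, start):
    # Sketch of Alan's proposal: every sub-token keeps the offsets
    # of the whole, undelimited original token.
    end = start + len(token)
    return [(m.group(), start, end)
            for m in re.finditer(r"[A-Za-z0-9]+", token)]

# "Wi-Fi" starting at offset 10 in the input text:
print(split_with_amended_offsets("Wi-Fi", 10))
# [('Wi', 10, 12), ('Fi', 13, 15)]
print(split_with_original_offsets("Wi-Fi", 10))
# [('Wi', 10, 15), ('Fi', 10, 15)]
```

Under the second policy the offsets are coarser but always point at a real span of the original input, which is what keeps highlighting from breaking.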
On Sun, Sep 30, 2018, 11:32 AM Alan Woodward (JIRA) <jira@apache.org> wrote:
>
> [
> https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633406#comment-16633406
> ]
>
> Alan Woodward commented on LUCENE-8516:
> ---------------------------------------
>
> Another solution would be for WordDelimiterGraphFilter to no longer amend
> offsets. So all token parts would be stored with the offsets of the
> original undelimited token.
>
> > Make WordDelimiterGraphFilter a Tokenizer
> > -----------------------------------------
> >
> > Key: LUCENE-8516
> > URL: https://issues.apache.org/jira/browse/LUCENE-8516
> > Project: Lucene - Core
> > Issue Type: Task
> > Reporter: Alan Woodward
> > Assignee: Alan Woodward
> > Priority: Major
> > Attachments: LUCENE-8516.patch
> >
> >
> > Being able to split tokens up at arbitrary points in a filter chain, in
> effect adding a second round of tokenization, can cause any number of
> problems when trying to keep token streams to their contract. The most
> common offender here is the WordDelimiterGraphFilter, which can produce
> broken offsets in a wide range of situations.
> > We should make WDGF a Tokenizer in its own right, which should preserve
> all the functionality we need, but make reasoning about the resulting
> tokenstream much simpler.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>