List:       lucene-dev
Subject:    Re: [jira] [Commented] (LUCENE-8516) Make WordDelimiterGraphFilter a Tokenizer
From:       Michael Sokolov <msokolov () gmail ! com>
Date:       2018-09-30 19:42:26
Message-ID: CAGUSZHDY_SOEBJ0afOTPN6uZ9Aua9humbCm=40F7927cNp99vg () mail ! gmail ! com

My current usage of this filter requires it to remain a filter, since I need
to precede it with other filters; a rough sketch of that kind of chain
follows. I think the idea of not touching offsets preserves more flexibility,
and since the offsets are already unreliable, we wouldn't be losing much.
(There is a quick sketch of the proposed offset behavior below the quote.)
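
To illustrate, here is a stripped-down sketch of the kind of chain I mean. It
is not my actual analyzer: ASCIIFoldingFilter just stands in for the filters
I run ahead of WDGF, and the flag choices are arbitrary.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

public final class DelimiterChainAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // A filter that has to run before WDGF; if WDGF became a Tokenizer,
    // there would be no way to keep a step like this ahead of it.
    TokenStream sink = new ASCIIFoldingFilter(source);
    sink = new WordDelimiterGraphFilter(sink,
        WordDelimiterGraphFilter.GENERATE_WORD_PARTS
            | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE,
        null); // no protected words
    return new TokenStreamComponents(source, sink);
  }
}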

On Sun, Sep 30, 2018, 11:32 AM Alan Woodward (JIRA) <jira@apache.org> wrote:

> 
> [
> https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633406#comment-16633406
>  ]
> 
> Alan Woodward commented on LUCENE-8516:
> ---------------------------------------
> 
> Another solution would be for WordDelimiterGraphFilter to no longer amend
> offsets.  So all token parts would be stored with the offsets of the
> original undelimited token.
> 
> > Make WordDelimiterGraphFilter a Tokenizer
> > -----------------------------------------
> > 
> > Key: LUCENE-8516
> > URL: https://issues.apache.org/jira/browse/LUCENE-8516
> > Project: Lucene - Core
> > Issue Type: Task
> > Reporter: Alan Woodward
> > Assignee: Alan Woodward
> > Priority: Major
> > Attachments: LUCENE-8516.patch
> > 
> > 
> > Being able to split tokens up at arbitrary points in a filter chain, in
> effect adding a second round of tokenization, can cause any number of
> problems when trying to keep token streams to their contract.  The most
> common offender here is the WordDelimiterGraphFilter, which can produce
> broken offsets in a wide range of situations.
> > We should make WDGF a Tokenizer in its own right, which should preserve
> all the functionality we need, but make reasoning about the resulting
> tokenstream much simpler.
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
> 
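
To make the no-amended-offsets idea concrete, here is a quick hypothetical
demo against the sketch above (class and field names are made up):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetDemo {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new DelimiterChainAnalyzer(); // the sketch above
    // Print every emitted token with its start/end offsets.
    try (TokenStream ts = analyzer.tokenStream("body", "Wi-Fi router")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term + " " + offset.startOffset() + ".."
            + offset.endOffset());
      }
      ts.end();
    }
  }
}

Today this prints Wi 0..2, Fi 3..5, router 6..12. Under Alan's proposal the
two parts of "Wi-Fi" would both keep the undelimited token's offsets, 0..5,
so anything consuming offsets would always point back at the whole original
token.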

