'[jira] Updated: (LUCENE-2167) Implement StandardTokenizer with the'

List:       solr-dev
Subject:    [jira] Updated: (LUCENE-2167) Implement StandardTokenizer with the
From:       "Robert Muir (JIRA)" <jira () apache ! org>
Date:       2010-06-30 14:21:52
Message-ID: 26290295.133851277907712103.JavaMail.jira () thor


     [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2167:
--------------------------------

    Attachment: LUCENE-2167.patch

ok here is a patch file. before applying it, you have to run these commands:

{noformat}
# original grammar -> ClassicTokenizerImpl
svn move modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImplOrig.java \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicTokenizerImpl.java
svn move modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImplOrig.jflex \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicTokenizerImpl.jflex

# this one is not needed, this patch becomes the new grammar
svn delete modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl31.java
svn delete modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl31.jflex

# expose the old tokenizer, not just via Version, but also as ClassicAnalyzer/Tokenizer/Filter
svn copy modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicAnalyzer.java
svn copy modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicTokenizer.java
svn copy modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardFilter.java \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicFilter.java
svn copy modules/analysis/common/src/test/org/apache/lucene/analysis/core/TestStandardAnalyzer.java \
  modules/analysis/common/src/test/org/apache/lucene/analysis/core/TestClassicAnalyzer.java

# temporarily edit solr/src/java/org/apache/solr/analysis/StandardFilterFactory.java
# (change the $Id hossman.... to just $Id$)

# apply the patch.
{noformat}

if you want to iterate on the patch, make your changes and generate a patch with
'svn diff --no-diff-deleted'.

some notes:
* patch is against 4.0, but i think we can do this in 3.1. all the back compat is
  preserved, etc. we just gotta figure a few things out. all the tests pass though.
* The patch is large mainly because of the DFA size. I have some concerns about
  this... the email/url stuff seems to be the culprit, as the UAX#29 generated class
  is only 12KB, about the same size as our existing StandardTokenizer.
* I gave backwards compat (you get the old behavior) with Version, but also set up
  ClassicAnalyzer/Tokenizer/Filter for those that want the...not so
  international-friendly old version, for its company identification, etc.
* I modified token types for ICU to be more consistent with this.
* StandardFilter is currently a no-op for the new grammar. In my opinion this is a
  place to implement the 'more sophisticated' logic that the standard mentions for
  certain scripts. We can use token types (IDEOGRAPHIC, SOUTHEAST_ASIAN) to drive
  this. This way the StandardAnalyzer is a reasonable tokenizer for most languages.
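
To make the StandardFilter idea a bit more concrete, here is a rough, self-contained sketch of dispatching on token type. It is not the Lucene TokenFilter API and not what the patch does: the class and method names are hypothetical, only the type labels IDEOGRAPHIC and SOUTHEAST_ASIAN come from the notes above, and overlapping bigrams are just one assumed example of 'more sophisticated' handling. It also assumes the filter sees a run of ideographic characters as one term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: plain Java, no Lucene classes, not part of the patch.
final class TokenTypeDispatchSketch {

    // Type labels taken from the notes; how tokens arrive here is an assumption.
    static final String IDEOGRAPHIC = "IDEOGRAPHIC";
    static final String SOUTHEAST_ASIAN = "SOUTHEAST_ASIAN";

    /** Post-processes one term according to the type the tokenizer assigned to it. */
    static List<String> process(String type, String term) {
        if (IDEOGRAPHIC.equals(type)) {
            // Assumed strategy: index overlapping bigrams of the ideographic run.
            return bigrams(term);
        }
        if (SOUTHEAST_ASIAN.equals(type)) {
            // Placeholder: a real filter might segment words here; this sketch passes through.
            return Collections.singletonList(term);
        }
        // Every other type is left untouched, like the current no-op behavior.
        return Collections.singletonList(term);
    }

    /** Splits a term into overlapping pairs of code points; short terms are returned as-is. */
    static List<String> bigrams(String term) {
        int[] codePoints = term.codePoints().toArray();
        if (codePoints.length < 2) {
            return Collections.singletonList(term);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < codePoints.length; i++) {
            out.add(new String(codePoints, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Three ideographs written as escapes to stay encoding-neutral.
        System.out.println(process(IDEOGRAPHIC, "\u65E5\u672C\u8A9E"));
        System.out.println(process("ALPHANUM", "lucene"));
    }
}
```

The point of keying on the type rather than re-inspecting characters is that the grammar has already classified the token, so the filter stays a cheap switch and unknown types fall through unchanged.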

So, not completely sure this is the best approach, but it is one... the patch is
still rough around the edges but at least now we can iterate more easily on it.


> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
> 
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/analyzers
> Affects Versions: 3.1
> Reporter: Shyamal Prasad
> Assignee: Robert Muir
> Priority: Minor
> Attachments: LUCENE-2167-jflex-tld-macro-gen.patch,
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch,
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch,
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, standard.zip
> Original Estimate: 0.5h
> Remaining Estimate: 0.5h
> 
> It would be really nice for StandardTokenizer to adhere straight to the standard as
> much as we can with jflex. Then its name would actually make sense. Such a
> transition would involve renaming the old StandardTokenizer to EuropeanTokenizer,
> as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff can
> stay with that EuropeanTokenizer, and it could be used by the european analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

