'Re: Test code for regex queries'


List:       lucene-dev
Subject:    Re: Test code for regex queries
From:       Paul Elschot <paul.elschot () xs4all ! nl>
Date:       2005-11-26 16:11:08
Message-ID: 200511261711.08143.paul.elschot () xs4all ! nl

On Friday 25 November 2005 11:14, Erik Hatcher wrote:
> 
> On 24 Nov 2005, at 20:26, Erik Hatcher wrote:
> >> There are some older regex implementations in java, but I
> >> have no idea about the licences and the availability.
> >> Doesn't apache have one somewhere?
> >
> > Two actually!  ORO and Regexp.  Here's ORO -
> > <http://jakarta.apache.org/oro/> (link to Regexp from there)
> >
> > I'll dig into those soon and see what useful goodies lurk within.
> 
>  From perusing the API via Javadocs, Regexp mentioned just what we  
> need, but I didn't see the same sort of thing with ORO.  So I pulled  
> down Jakarta Regexp and dropped it in.  I had to add a getter for a  
> package protected internal "prefix" to REProgram, but once I did  
> that, here are some passing tests...
> 
>      assertEquals(1, getPrefix("a[bc]*"));
>      assertEquals(2, getPrefix("a\\$[bc]*"));
>      assertEquals(0, getPrefix("r?over"));
> 
> 
>    private int getPrefix(String expression) {
>      REProgram program = new RECompiler().compile(expression);
>      char[] prefix = program.getPrefix();
>      return prefix == null ? 0 : prefix.length;
>    }
> 
> Quite promising!  The REProgram has the full parse tree as  
> "instructions", so it'd be possible to use this for clever rotation  
> also, I believe.  I'm sure Regexp doesn't support the full Perl5  
> syntax that Java's regex package does, but it seems to be good enough  
> for the basic regex syntax.
> 
> A couple of issues... 1) to use this additional library,
> (Span)RegexQuery should be pulled into contrib/regex, 2) It'd be a little  
> awkward to use Jakarta Regexp to determine the prefix and potentially  
> be used for rotation logic, and then use JDK regex for the actual  
> matching.  I have no data to say which has faster matching, or  
> other pros/cons, just that it could potentially mismatch.  I'm  
> inclined to swap completely to Jakarta Regexp for matching as well,  
> at least for the time being in order to keep things in sync and  
> benefit from more clever term enumeration.  The time saved in term  
> enumeration seems likely to more than make up for matching speed  
> differences.
> 
> Thoughts?

I think I'll add a prefix regex facility to the surround language first.
That is, I'll try and make a token definition for a prefixed regex in the 
surround parser.
The operator I mentioned before is too verbose. Perhaps
double quotes can be used to enclose a regex, possibly prefixed
by literal characters. The prefix may also need quotes.
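
As a rough illustration of the token shape described above, here is a
minimal sketch in Java. It assumes a hypothetical syntax in which an
optional unquoted alphanumeric prefix is followed by a regex in double
quotes; the class name PrefixedRegexToken and the exact prefix
characters are assumptions, not part of the surround parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for a token such as ab"[cd]*" (prefix ab, regex [cd]*). */
final class PrefixedRegexToken {
    // Assumed shape: optional literal prefix, then a double-quoted regex
    // that itself contains no double quote.
    private static final Pattern TOKEN =
        Pattern.compile("([A-Za-z0-9]*)\"([^\"]*)\"");

    final String prefix;
    final String regex;

    private PrefixedRegexToken(String prefix, String regex) {
        this.prefix = prefix;
        this.regex = regex;
    }

    /** Returns the parsed token, or null when the text does not have that shape. */
    static PrefixedRegexToken parse(String text) {
        Matcher m = TOKEN.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new PrefixedRegexToken(m.group(1), m.group(2));
    }
}
```

A quoted prefix, if that turns out to be needed, would require a second
quoted group in the assumed pattern.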

Double quotes would work to enclose a regex when they
have no special meaning in a regex, which I think is the case.
Using another regex package for the actual parsing and matching
might then be possible by overriding some methods in the parser.
I hope the regex compiler and matcher from java.util.regex have
interfaces, otherwise interfaces will have to be made to allow
different regex implementations.
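
For what it's worth, java.util.regex.Pattern and Matcher are final
classes rather than interfaces, so an abstraction would indeed have to
be introduced. A minimal sketch of what that could look like follows;
the names RegexEngine, CompiledRegex and JdkRegexEngine are
hypothetical and only the JDK-backed implementation is shown.

```java
import java.util.regex.Pattern;

/** Hypothetical abstraction over a regex implementation; not a Lucene or JDK type. */
interface RegexEngine {
    CompiledRegex compile(String expression);
}

/** Hypothetical compiled form that can test a whole term. */
interface CompiledRegex {
    boolean matches(String term);
}

/** Adapter backed by java.util.regex. */
final class JdkRegexEngine implements RegexEngine {
    @Override
    public CompiledRegex compile(String expression) {
        final Pattern pattern = Pattern.compile(expression);
        // Matcher.matches() requires the entire term to match the pattern.
        return term -> pattern.matcher(term).matches();
    }
}
```

An adapter for Jakarta Regexp would implement the same two interfaces,
so the parser could switch engines without other changes.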

To have term rotation built into a query parser requires some
way to know which field is being queried, e.g. an overridable
getFieldQuery(fieldName, queryTerm) method in the parser
that rotates the queryTerm when rotation is needed
for the field.
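
A minimal sketch of such a hook, under several assumptions: the method
returns a rewritten term string rather than a Query, rotation follows a
permuterm-like scheme in which a term with a single '*' wildcard, X*Y,
becomes the prefix Y$X, and '$' marks the end of a term. The classes
SketchParser and RotatingParser are hypothetical and stand in for the
real parser.

```java
import java.util.Set;

/** Hypothetical parser base class exposing the overridable hook. */
class SketchParser {
    /** Default behaviour: the term is used as given. */
    protected String getFieldQuery(String fieldName, String queryTerm) {
        return queryTerm;
    }
}

/** Rotates single-wildcard terms, but only for fields indexed with rotated terms. */
class RotatingParser extends SketchParser {
    static final char END = '$'; // assumed end-of-term marker

    private final Set<String> rotatedFields;

    RotatingParser(Set<String> rotatedFields) {
        this.rotatedFields = rotatedFields;
    }

    @Override
    protected String getFieldQuery(String fieldName, String queryTerm) {
        if (!rotatedFields.contains(fieldName)) {
            return queryTerm;
        }
        int star = queryTerm.indexOf('*');
        // Terms with no wildcard, or more than one, are left unchanged in this sketch.
        if (star < 0 || star != queryTerm.lastIndexOf('*')) {
            return queryTerm;
        }
        // X*Y becomes Y$X, which can then be run as a prefix query.
        return queryTerm.substring(star + 1) + END + queryTerm.substring(0, star);
    }
}
```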

Regards,
Paul Elschot


