List:       lucene-dev
Subject:    Re: Lucene does NOT use UTF-8.
From:       Doug Cutting <cutting () apache ! org>
Date:       2005-08-31 17:04:35
Message-ID: 4315E323.90702 () apache ! org

Wolfgang Hoschek wrote:
> I don't know if it matters for Lucene usage. But if using
> CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a
> significant problem, it's probably due to startup/init time of these
> methods for individually converting many small strings, not inherently
> due to UTF-8 usage. I'm confident that a custom UTF-8 implementation
> can almost completely eliminate these issues. I've done this before for
> binary XML with great success, and it could certainly be done for
> lucene just as well. Bottom line: It's probably an issue that can be
> dealt with via proper impl; it probably shouldn't dictate design
> directions.

Good point.  Currently Lucene already has its own (buggy) UTF-8 
implementation for performance, so that wouldn't really be a big change.

The big question now seems to be whether the stored character sequence 
lengths should be in bytes or characters.  Bytes might be fast and 
simple (whether we implement our own UTF-8 in Java or not) but are not 
back-compatible.  So do we bite the bullet and make a very incompatible 
change to index formats?  Or do we make these counts be Unicode
characters (which is mostly back-compatible) and make the code a bit 
more awkward?  Some implementations would be nice to see just how 
awkward things get.
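
To make the bytes-versus-characters distinction concrete, here is a small
sketch, not taken from the Lucene sources.  The class and method names are
made up for illustration, and it uses the JDK's standard UTF-8 encoder
rather than Lucene's own.  It prints the three candidate "lengths" for a
few strings: Java chars (UTF-16 code units), Unicode code points, and
UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene.
public class LengthPrefixDemo {

    // Length in Java chars, i.e. UTF-16 code units (what String.length() reports).
    static int charCount(String s) {
        return s.length();
    }

    // Length in Unicode code points; a surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in bytes of the standard UTF-8 encoding produced by the JDK.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "lucene",               // ASCII only
            "caf\u00e9",            // one 2-byte character
            "\u65e5\u672c\u8a9e",   // three 3-byte characters
            "\uD835\uDD4F"          // one code point outside the BMP (surrogate pair)
        };
        for (String s : samples) {
            System.out.println(s
                + ": chars=" + charCount(s)
                + " codePoints=" + codePointCount(s)
                + " utf8Bytes=" + utf8ByteCount(s));
        }
    }
}
```

The three counts agree only for ASCII.  With a byte count a reader can
skip a string without decoding it; with a character count it has to
decode as it goes to find where the string ends, which is where the
awkwardness would come from.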

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
