List: lucene-dev
Subject: Fwd: Standard or Modified UTF-8?
From: Marvin Humphrey <marvin () rectangular ! com>
Date: 2005-08-27 14:04:28
Message-ID: 0BFDC271-7560-4019-B1C6-2AA871AE4DA0 () rectangular ! com
Greets,
It was suggested that I move this to the developers list from the
users list...
-- Marvin Humphrey
Begin forwarded message:
From: Marvin Humphrey <marvin@rectangular.com>
Date: August 26, 2005 4:51:27 PM PDT
To: java-user@lucene.apache.org
Subject: Standard or Modified UTF-8?
Reply-To: java-user@lucene.apache.org
Greets,
As part of my attempt to speed up Plucene and establish index
compatibility between Plucene and Java Lucene, I'm porting
InputStream and OutputStream to XS (the C API for accessing Perl's
guts), and I believe I have found a documentation bug in the
file-format spec at...
http://lucene.apache.org/java/docs/fileformats.html
"Lucene writes unicode character sequences using the standard UTF-8
encoding."
Snooping the code in OutputStream, it looks like you are writing
modified UTF-8 -- NOT standard -- because the null character (U+0000)
is written using the two-byte form (0xC0 0x80) rather than as a
single 0x00 byte.
    else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
      writeByte((byte)(0xC0 | (code >> 6)));
      writeByte((byte)(0x80 | (code & 0x3F)));
    }
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
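To make the difference concrete, here is a minimal sketch that encodes a lone U+0000 both ways. It does not use Lucene's OutputStream; as a stand-in for modified UTF-8 it relies on java.io.DataOutputStream.writeUTF, which is documented to write that encoding. The class and method names (Utf8NullDemo, standardUtf8, modifiedUtf8, hex) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene.
class Utf8NullDemo {

    // Standard UTF-8: U+0000 is encoded as the single byte 0x00.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as produced by DataOutputStream.writeUTF.
    // writeUTF prefixes the data with a two-byte length, which is dropped here.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);
            out.flush();
            byte[] raw = buffer.toByteArray();
            return Arrays.copyOfRange(raw, 2, raw.length);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0"; // a string holding only U+0000
        System.out.println("standard: " + hex(standardUtf8(nul)));
        System.out.println("modified: " + hex(modifiedUtf8(nul)));
    }
}
```

Running main should print "standard: 00" and "modified: c0 80". The second result matches what the quoted branch computes for code == 0: 0xC0 | (0 >> 6) is 0xC0 and 0x80 | (0 & 0x3F) is 0x80.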
Can someone please confirm that the intention is to write modified
UTF-8?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/