List: kinosearch
Subject: [KinoSearch] Highlighter and UTF-8 in 0.14
From: marvin () rectangular ! com (Marvin Humphrey)
Date: 2006-11-16 15:32:04
Message-ID: F43BC091-C4C7-4FE2-B711-3DB92AF69D69 () rectangular ! com
On Nov 16, 2006, at 2:15 PM, Eric LIAO Kehuang wrote:
> Thanks Marvin. If I may voice some opinion on this: Having read
> most of the discussions on the forum regarding KS support for
> Unicode, I strongly agree that the all-Unicode approach is the way
> to go. ISO-8859-1 users can easily prepare for indexing and
> convert data back to ISO-8859-1 using Encode for displaying results
> on their web sites. That takes only a little effort on top of the
> out-of-the-box KS functionality. I feel the advantage of all-
> Unicode much outweighs their effort to add the transcoding layer at
> both ends. In today's world we have multilingual text to search,
> stored in many different non-Unicode encodings.  It's no longer
> just "Western European" languages.  Unicode is one good way to tie
> it all together.
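The transcoding layer Eric describes is a thin decode/encode round-trip at each end. A minimal sketch, in Python for illustration only (the Perl original would use Encode's decode/encode; the function names here are made up for the example):

```python
def to_unicode(raw_bytes: bytes) -> str:
    # Decode ISO-8859-1 (Latin-1) source data to Unicode before indexing.
    return raw_bytes.decode("iso-8859-1")

def to_latin1(text: str) -> bytes:
    # Re-encode search results for display on a Latin-1 web page.
    return text.encode("iso-8859-1")

raw = b"r\xe9sum\xe9"            # "résumé" in ISO-8859-1
doc = to_unicode(raw)            # index this Unicode string
assert doc == "résumé"
assert to_latin1(doc) == raw     # round-trips losslessly
```

Since ISO-8859-1 maps byte-for-byte onto the first 256 Unicode code points, the round trip is lossless for any Latin-1 input.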
I agree. I'm not looking forward to the support burden that's going
to be a side-effect of forcing noobs and everyone else into Unicode-
land, though. :(
> Can you confirm that aside from Highlighter, indexing/searching of
> utf-8 data works correctly in 0.14? Or, is the tokenization also
> broken?
Tokenization of UTF-8 is broken. And that's precisely the problem.
If I fix it, by forcing everything into UTF-8 up-front, then analysis
chains won't produce the same output as they do now. That's bad,
because if you have an existing index which was prepared using one
analyzer behavior, and you search it using another analyzer
behavior, you'll get incorrect results.
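A toy illustration of that mismatch, in Python for concreteness (KinoSearch itself is Perl; the two tokenizers here are stand-ins for the old and new analyzer behavior, not KS APIs):

```python
import re

def byte_tokenizer(data: bytes):
    # ASCII-only \w on raw bytes, roughly how a non-UTF-8-aware
    # analyzer sees multi-byte characters as non-word bytes.
    return re.findall(rb"\w+", data)

def unicode_tokenizer(text: str):
    # Unicode-aware \w keeps accented words intact.
    return re.findall(r"\w+", text)

# Index built with the old (byte-oriented) behavior:
indexed_terms = set(byte_tokenizer("café".encode("utf-8")))  # {b'caf'}

# Query analyzed with the new (Unicode-aware) behavior:
query_terms = {t.encode("utf-8") for t in unicode_tokenizer("café")}

# The query term never matches anything in the index:
assert not (indexed_terms & query_terms)
```

The indexed terms and the query terms simply live in different vocabularies, so searches silently come up empty.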
> (For example, would 0.14 break a French word like "immédiat" into
> "imm" and "diat"?)
Assuming immédiat is encoded using UTF-8, then the exact misbehavior
would depend on your locale settings, but it's almost certain to be
wrong for any Unicode code point above 127. Right now, the SvUTF8
flag gets stripped by TokenBatch, so the analysis chain always thinks
it's getting 8-bit data. "use locale" is in effect within Tokenizer,
so just how the tokenizing regex misbehaves varies.
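That byte-level splitting is easy to reproduce. Here's a Python sketch modeling the "C"-locale case (Python's bytes-mode \w is ASCII-only, much like a byte-oriented regex run over data whose SvUTF8 flag has been stripped):

```python
import re

utf8_bytes = "immédiat".encode("utf-8")   # b'imm\xc3\xa9diat'

# Byte-oriented \w+ treats the two bytes of "é" (0xC3 0xA9) as
# non-word characters, so the word splits in two:
tokens = re.findall(rb"\w+", utf8_bytes)
assert tokens == [b"imm", b"diat"]

# A Unicode-aware tokenizer keeps the word intact:
assert re.findall(r"\w+", "immédiat") == ["immédiat"]
```

Under other locale settings the split points can differ, but any code point above 127 is at risk of being mangled one way or another.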
> I really look forward to 0.20 :)
At some point I'll start making devel releases of 0.20 available.
People who use it must be prepared to reindex with every upgrade,
since the file format will be in flux.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/