List:       kinosearch
Subject:    [KinoSearch] Highlighter and UTF-8 in 0.14
From:       marvin@rectangular.com (Marvin Humphrey)
Date:       2006-11-16 15:32:04
Message-ID: F43BC091-C4C7-4FE2-B711-3DB92AF69D69@rectangular.com


On Nov 16, 2006, at 2:15 PM, Eric LIAO Kehuang wrote:

> Thanks Marvin.  If I may voice an opinion on this: having read most
> of the discussions on the forum regarding KS support for Unicode, I
> strongly agree that the all-Unicode approach is the way to go.
> ISO-8859-1 users can easily decode their data before indexing and
> convert results back to ISO-8859-1 using Encode for display on
> their web sites.  That takes only a little effort on top of the
> out-of-the-box KS functionality.  I feel the advantage of
> all-Unicode far outweighs the effort of adding a transcoding layer
> at both ends.  In today's world we have multilingual text to
> search, in a variety of non-Unicode encodings; it's no longer just
> "Western European" languages.  Unicode is one good way to tie it
> all together.

I agree.  I'm not looking forward to the support burden that's going
to be a side effect of forcing noobs and everyone else into
Unicode-land, though.  :(
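
For anyone who wants to see what that transcoding layer looks like,
here's a minimal sketch of what Eric describes -- the Latin-1 sample
data is invented, but the Encode calls are the standard ones:

    use strict;
    use warnings;
    use Encode qw( decode encode );

    # Latin-1 bytes from a legacy data source (hypothetical sample).
    my $latin1_bytes = "caf\xE9 cr\xE8me";

    # Decode to Perl's internal Unicode representation before indexing.
    my $text = decode( 'iso-8859-1', $latin1_bytes );

    # ... feed $text to the indexer, run searches, etc. ...

    # Re-encode excerpts/hits back to Latin-1 for a legacy web page.
    my $html_ready = encode( 'iso-8859-1', $text );

That's the whole "layer": one decode on the way in, one encode on the
way out.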

> Can you confirm that aside from Highlighter, indexing/searching of  
> utf-8 data works correctly in 0.14?  Or, is the tokenization also  
> broken?

Tokenization of UTF-8 is broken.  And that's precisely the problem.
If I fix it by forcing everything into UTF-8 up front, then analysis
chains won't produce the same output as they do now.  That's bad,
because if you have an existing index which was prepared using one
analyzer behavior and you search it using another analyzer behavior,
you'll get incorrect results.
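
The practical upshot is that the analyzer has to be consistent at
index-time and search-time.  Something like this (0.1x-style API from
memory -- the paths and language choice are placeholders):

    use KinoSearch::InvIndexer;
    use KinoSearch::Searcher;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $analyzer
        = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

    # Index-time...
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/path/to/invindex',
        create   => 1,
        analyzer => $analyzer,
    );

    # ... and search-time must use the same analyzer behavior, or the
    # query tokens won't line up with what's in the index.
    my $searcher = KinoSearch::Searcher->new(
        invindex => '/path/to/invindex',
        analyzer => $analyzer,
    );

If the analyzer's behavior changes between releases, $searcher above
is effectively using a different analyzer than the one that built the
index, even though the code hasn't changed.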

> (For example, would 0.14 break a French word like "immédiat" into
> "imm" and "diat"?)

Assuming "immédiat" is encoded as UTF-8, the exact misbehavior
depends on your locale settings, but it's almost certain to be wrong
for any Unicode code point above 127.  Right now, the SvUTF8 flag
gets stripped by TokenBatch, so the analysis chain always thinks it's
getting 8-bit data.  "use locale" is in effect within Tokenizer, so
just how the tokenizing regex misbehaves varies.
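
You can see the byte-vs-character problem in isolation, outside of
KinoSearch (a toy demonstration; the default C locale is assumed
here, whereas Tokenizer's "use locale" makes the results vary):

    use strict;
    use warnings;
    use Encode qw( decode_utf8 );

    my $bytes = "imm\xC3\xA9diat";      # UTF-8 bytes, SvUTF8 flag off
    my $chars = decode_utf8($bytes);    # same text, SvUTF8 flag on

    # With the flag off, the regex engine sees two bytes (0xC3 0xA9)
    # instead of one "é".  Under the C locale neither byte matches \w,
    # so a tokenizing pattern splits the word in two.
    my @byte_tokens = $bytes =~ /(\w+)/g;   # ("imm", "diat")
    my @char_tokens = $chars =~ /(\w+)/g;   # ("immédiat")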

> I really look forward to 0.20 :)

At some point I'll start making devel releases of 0.20 available.
People who use them must be prepared to reindex with every upgrade,
since the file format will be in flux.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


