List:       kde-i18n-doc
Subject:    Re: Fwd: Re: baloo_queryparser... again
From:       Denis Steckelmacher <steckdenis () yahoo ! fr>
Date:       2014-09-14 12:29:41
Message-ID: 54158A35.5030504 () yahoo ! fr

On 09/14/2014 11:36 AM, Franklin Weng wrote:
> The idea of the space separation seems to be suitable for
> English/Latin/Western languages (please correct me if I'm wrong), but
> for Chinese, the problem is that a character = a word, and a Chinese
> sentence may be mixed with numbers or even English words.  For example,
>
> English: "I have $1 star|stars."
> Chinese can be: "我有 $1 顆星|我有$1顆星"
>
> English: "rated as $1; rated $1; score|scored $1"
> Chinese can be: "得分為 $1;得 $1 分;得$1分"
>
> Here $1 in Chinese may be a one-digit number, a two-digit number, or
> even more. Do spaces matter here?
>
> English: "written|created|composed by $1; author is $1; by $1"
> Chinese can be: "由 $1 撰寫;作者為 $1;由 $1 作曲"
>
> Here $1 can be two, three or four Chinese words (characters), or English
> names.  Do spaces matter here?
>
> English: "$1 is $2;$1 $2"
> Chinese: "$1 是 $2;$1 $2;$1 為 $2"
>
> If the answer to space separation is "N", what would it do when it meets
> "$1 $2" or "$1 是 $2" where $1 and $2 can be anything (Chinese
> characters, English words, numbers, etc)?
>
> The problems of Chinese would also apply to the Japanese and Korean
> languages, since we all use "square words".
>

Hmm, this is very interesting. I had not thought about numbers.
Currently, if you ask for a split at each letter, English words and
numbers are also split after each character. I have to fix that.

By the way, I have a question: if you have a mixture of Chinese 
characters and English words in a sentence, are the English words or 
numbers separated from the rest using spaces (for instance, "我有 4 顆 
星") or not ("我有4顆星")?

Both cases should be easy to handle (provided that QChar has a method
that tells me whether a character is a "western letter" or a square
character). Once this is fixed, assume that the parser adds spaces
around each Chinese character. Your pattern therefore needs to be
"我 有 $1 顆 星" (note the space between each Chinese character). This
pattern will match two Chinese words (which the user will type without
spaces), then a number (which can consist of several digits), then two
more Chinese words.
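
To illustrate the intended splitting, here is a rough sketch in Python.
It is not the actual baloo_queryparser code: the function names are
made up for this example, and the Unicode ranges used to detect "square
characters" are a simplifying assumption that is far from exhaustive.
Each square character becomes its own term, while runs of western
letters and digits stay together.

```python
def is_square_char(ch):
    """Rough check for a CJK "square character" (assumed, incomplete ranges)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def split_terms(text):
    """Split text into terms: one per square character, one per western run."""
    terms = []
    current = ""  # western letters/digits accumulated so far
    for ch in text:
        if is_square_char(ch):
            # A square character ends any pending western run
            # and is emitted as a term of its own.
            if current:
                terms.append(current)
                current = ""
            terms.append(ch)
        elif ch.isspace():
            # Whitespace only separates terms; it is never a term itself.
            if current:
                terms.append(current)
                current = ""
        else:
            # Western letter, digit or other symbol: keep the run together.
            current += ch
    if current:
        terms.append(current)
    return terms


# Both spellings of the user's query give the same list of terms.
print(split_terms("我有4顆星"))
print(split_terms("我有 42 顆星"))
```

With this kind of splitting, the user's input yields the same terms
whether or not the number is surrounded by spaces, which is why the
pattern only has to be written in the spaced form.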

Denis
