From kde-i18n-doc Sun Sep 14 12:29:41 2014
From: Denis Steckelmacher
Date: Sun, 14 Sep 2014 12:29:41 +0000
To: kde-i18n-doc
Subject: Re: Fwd: Re: baloo_queryparser... again
Message-Id: <54158A35.5030504 () yahoo ! fr>
X-MARC-Message: https://marc.info/?l=kde-i18n-doc&m=141069905611003

On 09/14/2014 11:36 AM, Franklin Weng wrote:
> The idea of the space separation seems to be suitable for
> English/Latin/Western languages (please correct me if I'm wrong), but
> for Chinese, the problem would be, in Chinese a character = a word, and
> a Chinese sentence may mixed with numbers even English words. For example,
>
> English: "I have $1 star|stars."
> Chinese can be: "我有 $1 顆星|我有$1顆星"
>
> English: "rated as $1; rated $1; score|scored $1"
> Chinese can be: "得分為 $1;得 $1 分;得$1分"
>
> Here $1 in Chinese may contain one-digit number, two-digit number, or
> even more, not necessary. Does spaces here matter?
>
> English: written|created|composed by $1; author is $1; by $1"
> Chinese can be: "由 $1 撰寫;作者為 $1;由 $1 作曲"
>
> Here $1 can be two, three or four Chinese words(characters), or English
> names. Does spaces here matter?
>
> English: "$1 is $2;$1 $2"
> Chinese: "$1 是 $2;$1 $2;$1 為 $2"
>
> If the answer to space separation is "N", what would it do when it meet
> "$1 $2" or "$1 是 $2" where $1 and $2 can be anything (Chinese
> characters, English words, numbers, etc)?
>
> The problems of Chinese would also apply to Japanese and Korean
> languages, since we all used "square words".
>

Humm, this is very interesting. I did not think about numbers, and
currently, if you ask for a split at each letter, English words and
numbers are also split after each character. I have to fix that.

By the way, I have a question: if you have a mixture of Chinese
characters and English words in a sentence, are the English words or
numbers separated from the rest using spaces (for instance, "我有 4 顆
星") or not ("我有4顆星")?

Both cases should be easy to handle (provided that QChar has a method
that tells me whether a character is a "western letter" or a square
character).

Once this is fixed, consider that the parser adds spaces around each
Chinese character. Your pattern therefore needs to be "我 有 $1 顆 星"
(note the space between each Chinese character). This pattern will match
two Chinese words (which the user will type without spaces), then a
number (which can consist of several digits), then two more Chinese
words.

Denis
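
P.S. A minimal sketch of the splitting behaviour described above, written
in Python for brevity rather than in the parser's C++. It is not the
baloo_queryparser code: the function names and the code point ranges used
to recognise "square" characters are assumptions made for illustration,
and a real implementation would probably rely on Qt's own character
classification instead. Each square character is surrounded by spaces,
while runs of western letters and digits are left intact.

```python
def is_square_char(ch: str) -> bool:
    """Rough test for a "square" (CJK) character.

    The ranges are an illustrative assumption and do not cover every
    CJK block.
    """
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def space_out_square_chars(text: str) -> str:
    """Put spaces around each square character; keep other runs together."""
    pieces = []
    for ch in text:
        # Pad square characters; copy everything else unchanged.
        pieces.append(f" {ch} " if is_square_char(ch) else ch)
    # Collapse the duplicated spaces and trim both ends.
    return " ".join("".join(pieces).split())


if __name__ == "__main__":
    # Both spellings of the query end up in the same spaced-out form.
    print(space_out_square_chars("我有4顆星"))
    print(space_out_square_chars("我有 42 顆星"))
```

With this normalisation, a query typed with or without spaces around the
number yields the same sequence of words, which is why a pattern written
as "我 有 $1 顆 星" can cover both spellings.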