List: icu-bugrfe
Subject: [icu-bug] incoming/2692
From: jtcsv () jtcsv ! com
Date: 2003-02-13 0:08:23
schererm moved PR#2692 from incoming to collation
URL: http://www.jtcsv.com/cgi-bin/icu-bugs?findid=2692
====> ORIGINAL MESSAGE FOLLOWS <====
From: kentk@cs.chalmers.se
Date: Thu Feb 6 09:07:49 2003
Subject: collation rules for hu (Hungarian) [RESUBMIT]
Full_Name: Kent Karlsson
Version:
OS: all
PROJECT: ICU4C,ICU4J and ICU4JNI
JAVA:
Submission from: (NULL) (129.16.214.213)
Resubmitted, due to some error corrections
---------------------------------------------------
//// Hungarian (hu, hu_*; hun); magyar, ungerska
//// aá ä? b c [cs] d (([dz]? [dzs]?)) eé f g [gy] h ií j k l [ly] m n [ny]
//// oó ö[o-dbl-acute][õ] p q r s [sz] t ty uú ü[u-dbl-acute][û] v w x y z [zs]
//// ccs as cscs? ggy as gygy? lly as lyly? nny as nyny? ssz as szsz?
//// tty as tyty? zzs as zszs? According to Hungarian dictionaries: YES!
//// EXPERIMENTAL, not sure ICU can handle this properly; but it should handle it...
//// That is, ssz (e.g.) should be ordered sz followed by sz, and sz should be
//// ordered as a separate letter (after s); similarly for the other Hungarian digraphs.
///
//(here an intentional "gy" is denoted as "G"):
// there are words that contain
// GG (written ggy resp. gy-gy) [meggyes]
// (collation ok with the rules below; if they work at all...)
// there are *composite* words that contain
// GG (written gygy resp. gy-gy) [jegygyürü, egygyermekes]
// (not a problem case, but a SHY does not hurt)
// there are *composite* words that contain
// gG (written ggy resp. g-gy) [régiséggyűjtő, üveggyár]
// (a soft hyphen between the g and the gy [i.e. at the subword boundary]
// would stop the ggy as gygy (GG) rule below)
// there are composite words that contain
// Gg (written gyg resp. gy-g) [hegygerinc]
// (not a problem case, but a SHY does not hurt)
// [similarly for the other digraphs; words are collated (and hyphenated)
// according to the intentional spelling, not the actual spelling]
//// dz, DZ, dzs, DZS? MAKES NO noticeable DIFFERENCE!
//// Ä? ae, oe, (ue)? [casefirst ?]? [backwards 2]?
//// g-comma? l-comma? n-comma, n-acute, n-tilde? t-comma? etc.?
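
The ordering that the comments above ask for can be sketched outside ICU. The following Python is only an illustration of the intended result at the primary level, not of how ICU evaluates the rules below. The names ALPHABET, letters and sort_key are made up for this sketch, input is assumed to use precomposed (NFC) characters, and dz/dzs are left untailored, as in the rules that follow.

```python
SOFT_HYPHEN = "\u00ad"

# Primary letters, following the alphabet listed in the comments above.
# Assumption: dz and dzs are not treated as letters (no noticeable difference).
ALPHABET = ["a", "b", "c", "cs", "d", "e", "f", "g", "gy", "h", "i", "j", "k",
            "l", "ly", "m", "n", "ny", "o", "ö", "p", "q", "r", "s", "sz", "t",
            "ty", "u", "ü", "v", "w", "x", "y", "z", "zs"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

DIGRAPHS = ["cs", "gy", "ly", "ny", "sz", "ty", "zs"]
# "ccs" stands for "cs" + "cs", "ggy" for "gy" + "gy", and so on.
DOUBLED = {d[0] + d: (d, d) for d in DIGRAPHS}
# Accent differences are below the primary level, so fold them away here.
FOLD = str.maketrans("áéíóúőű", "aeiouöü")


def letters(word):
    """Split a word into Hungarian letters according to the intended spelling."""
    text = word.lower().translate(FOLD)
    out = []
    i = 0
    while i < len(text):
        if text[i] == SOFT_HYPHEN:
            # A SHY is ignored, but it keeps its neighbours from contracting.
            i += 1
        elif text[i:i + 3] in DOUBLED:
            out.extend(DOUBLED[text[i:i + 3]])
            i += 3
        elif text[i:i + 2] in DIGRAPHS:
            out.append(text[i:i + 2])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return out


def sort_key(word):
    """Primary-level key; characters outside ALPHABET sort after all letters."""
    return [RANK.get(letter, len(ALPHABET)) for letter in letters(word)]


print(letters("meggyes"))                        # ggy is read as gy + gy
print(letters("üveg\u00adgyár"))                 # SHY keeps g and gy apart
print(sorted(["csak", "cukor"], key=sort_key))   # c sorts before cs
```

With a soft hyphen at the subword boundary, "üveg-gyár" is split as g followed by gy, whereas "meggyes" is split as gy followed by gy, which is the distinction the comments describe.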
"[normalization on]"
" & AE" // order ae-ligature as a variant of AE
" << \u00E6" // LATIN SMALL LETTER AE
" <<< \u00C6" // LATIN CAPITAL LETTER AE
" & C" //
" < cs" // ccs (not at a word boundary within a composite word) is ordered as cscs
" <<< \uFF43\uFF53"
" <<< Cs" //
" <<< \uFF23\uFF53"
" <<< CS" //
" <<< \uFF23\uFF33"
" << c\u030C" //
" <<< \uFF43\u030C" // FULLWIDTH LATIN SMALL LETTER C with COMBINING CARON
" <<< C\u030C" //
" <<< \uFF23\u030C" // FULLWIDTH LATIN CAPITAL LETTER C with COMBINING CARON
" << c\u0301" //
" <<< \uFF43\u0301" // FULLWIDTH LATIN SMALL LETTER C with COMBINING ACUTE ACCENT
" <<< C\u0301" //
" <<< \uFF23\u0301" // FULLWIDTH LATIN CAPITAL LETTER C with COMBINING ACUTE ACCENT
" & CSCS" ////// reordered below...; AFTER anchoring; cscs<<<ccs<<<CSCS<<<CCS? or sim.?
" <<< ccs" //
" <<< \uFF43\uFF43\uFF53"
" <<< cscs" //
" <<< \uFF43\uFF53\uFF43\uFF53"
" <<< CCS" //
" <<< \uFF23\uFF23\uFF33"
" <<< CSCS" //
" <<< \uFF23\uFF33\uFF23\uFF33"
" & G" //
" < gy" // ggy (not at a word boundary within a composite word) is ordered as gygy
" <<< \uFF47\uFF59"
" <<< Gy" //
" <<< \uFF27\uFF59"
" <<< GY" //
" <<< \uFF27\uFF39"
" & GYGY" ////// reordered below...; AFTER anchoring
" <<< ggy" //
" <<< \uFF47\uFF47\uFF59"
" <<< gygy" //
" <<< \uFF47\uFF59\uFF47\uFF59"
" <<< GGY" //
" <<< \uFF27\uFF27\uFF39"
" <<< GYGY" //
" <<< \uFF27\uFF39\uFF27\uFF39"
" & L" //
" < ly" // lly (not at a word boundary within a composite word) is ordered as lyly
" <<< \uFF4C\uFF59"
" <<< Ly" //
" <<< \uFF2C\uFF59"
" <<< LY" //
" <<< \uFF2C\uFF39"
" & LYLY" ////// reordered below...; AFTER anchoring
" <<< lly" //
" <<< \uFF4C\uFF4C\uFF59"
" <<< lyly" //
" <<< \uFF4C\uFF59\uFF4C\uFF59"
" <<< LLY" //
" <<< \uFF2C\uFF2C\uFF39"
" <<< LYLY" //
" <<< \uFF2C\uFF39\uFF2C\uFF39"
//// "'t" is short for "het" (the) in Dutch (and Afrikaans?) and is ordered as "t", and
//// "'n" is short for "ein" (one, you) in Afrikaans and is ordered as "n"
"& \u2019n" // note that \u2019 should be ignored at levels 1-3
" = \u0149" // LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
" & N" //
" < ny" // nny (not at a word boundary within a composite word) is ordered as nyny
" <<< \uFF4E\uFF59"
" <<< Ny" //
" <<< \uFF2E\uFF59"
" <<< NY" //
" <<< \uFF2E\uFF39"
" & NYNY" ////// reordered below...; AFTER anchoring
" <<< nny" //
" <<< \uFF4E\uFF4E\uFF59"
" <<< nyny" //
" <<< \uFF4E\uFF59\uFF4E\uFF59"
" <<< NNY" //
" <<< \uFF2E\uFF2E\uFF39"
" <<< NYNY" //
" <<< \uFF2E\uFF39\uFF2E\uFF39"
" & OE" // order oe-ligature as a variant of OE?
" << \u0153" // LATIN SMALL LIGATURE OE
" <<< \u0152" // LATIN CAPITAL LIGATURE OE
// Ö and long-Ö [and Õ, Ô which are sometimes used as approximations to long-ö]
" & O" // after O and Ó
" < o\u0308" // LATIN SMALL LETTER O with COMBINING DIAERESIS
" <<< \uFF4F\u0308" // FULLWIDTH LATIN SMALL LETTER O with COMBINING DIAERESIS
" <<< O\u0308" // LATIN CAPITAL LETTER O with COMBINING DIAERESIS
" <<< \uFF2F\u0308" // FULLWIDTH LATIN CAPITAL LETTER O with COMBINING DIAERESIS
" << o\u030B" // LATIN SMALL LETTER O with COMBINING DOUBLE ACUTE ACCENT
" <<< \uFF4F\u030B" // FULLWIDTH LATIN SMALL LETTER O with COMBINING DOUBLE ACUTE ACCENT
" <<< O\u030B" // LATIN CAPITAL LETTER O with COMBINING DOUBLE ACUTE ACCENT
" <<< \uFF2F\u030B" // FULLWIDTH LATIN CAPITAL LETTER O with COMBINING DOUBLE ACUTE ACCENT
" << o\u0303" // LATIN SMALL LETTER O with COMBINING TILDE
" <<< \uFF4F\u0303" // FULLWIDTH LATIN SMALL LETTER O with COMBINING TILDE
" <<< O\u0303" // LATIN CAPITAL LETTER O with COMBINING TILDE
" <<< \uFF2F\u0303" // FULLWIDTH LATIN CAPITAL LETTER O with COMBINING TILDE
" << o\u0302" // LATIN SMALL LETTER O with COMBINING CIRCUMFLEX ACCENT
" <<< \uFF4F\u0302" // FULLWIDTH LATIN SMALL LETTER O with COMBINING CIRCUMFLEX ACCENT
" <<< O\u0302" // LATIN CAPITAL LETTER O with COMBINING CIRCUMFLEX ACCENT
" <<< \uFF2F\u0302" // FULLWIDTH LATIN CAPITAL LETTER O with COMBINING CIRCUMFLEX ACCENT
" & S" //
" < sz" // ssz (not at a word boundary within a composite word) is ordered as szsz
" <<< \uFF53\uFF5A"
" <<< Sz" //
" <<< \uFF33\uFF5A"
" <<< SZ" //
" <<< \uFF33\uFF3A"
" << s\u030C" //
" <<< \uFF53\u030C" // FULLWIDTH LATIN SMALL LETTER S with COMBINING CARON
" <<< S\u030C" //
" <<< \uFF33\u030C" // FULLWIDTH LATIN CAPITAL LETTER S with COMBINING CARON
" << s\u0301" //
" <<< \uFF53\u0301" // FULLWIDTH LATIN SMALL LETTER S with COMBINING ACUTE ACCENT
" <<< S\u0301" //
" <<< \uFF33\u0301" // FULLWIDTH LATIN CAPITAL LETTER S with COMBINING ACUTE ACCENT
" & SZSZ" ////// reordered below...; AFTER anchoring
" <<< ssz" //
" <<< \uFF53\uFF53\uFF5A"
" <<< szsz" //
" <<< \uFF53\uFF5A\uFF53\uFF5A"
" <<< SSZ" //
" <<< \uFF33\uFF33\uFF3A"
" <<< SZSZ" //
" <<< \uFF33\uFF3A\uFF33\uFF3A"
" & T" //
" < ty" // tty (not at a word boundary within a composite word) is ordered as tyty
" <<< \uFF54\uFF59"
" <<< Ty" //
" <<< \uFF34\uFF59"
" <<< TY" //
" <<< \uFF34\uFF39"
" & TYTY" ////// reordered below...; AFTER anchoring
" <<< tty" //
" <<< \uFF54\uFF54\uFF59"
" <<< tyty" //
" <<< \uFF54\uFF59\uFF54\uFF59"
" <<< TTY" //
" <<< \uFF34\uFF34\uFF39"
" <<< TYTY" //
" <<< \uFF34\uFF39\uFF34\uFF39"
// Ü and long-Ü [and U-tilde and Û which are sometimes used as approximations to long-ü]
" & U" // after U and Ú
" < u\u0308" // LATIN SMALL LETTER U with COMBINING DIAERESIS
" <<< \uFF55\u0308" // FULLWIDTH LATIN SMALL LETTER U with COMBINING DIAERESIS
" <<< U\u0308" // LATIN CAPITAL LETTER U with COMBINING DIAERESIS
" <<< \uFF35\u0308" // FULLWIDTH LATIN CAPITAL LETTER U with COMBINING DIAERESIS
" << u\u030B" // LATIN SMALL LE
====> MESSAGE TRUNCATED AT 8192 <====
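
The digraph blocks in the rules above (cs, gy, ly, ny, sz, ty) repeat one pattern: a primary difference after the base letter, then tertiary differences for the capitalized and fullwidth forms. As a rough sketch of that pattern only, the Python below regenerates the first seven lines of such a block. The helpers fullwidth, escape and digraph_rules are hypothetical names for this example and are not part of ICU; the accented variants and the doubled forms (ccs, cscs, ...) are not covered.

```python
def fullwidth(text):
    """Map printable ASCII to the FULLWIDTH forms (offset 0xFEE0, e.g. c -> U+FF43)."""
    return "".join(chr(ord(ch) + 0xFEE0) if "!" <= ch <= "~" else ch
                   for ch in text)


def escape(text):
    """Write non-ASCII characters as \\uXXXX escapes, as in the rule strings."""
    return "".join(ch if ord(ch) < 0x80 else "\\u%04X" % ord(ch)
                   for ch in text)


def digraph_rules(digraph):
    """Return the quoted rule lines for one digraph, e.g. "cs" after "C"."""
    lines = ['" & %s"' % digraph[0].upper()]
    forms = [digraph.lower(), digraph.capitalize(), digraph.upper()]
    for n, form in enumerate(forms):
        # Only the first form gets a primary difference from the base letter.
        op = "<" if n == 0 else "<<<"
        lines.append('" %s %s"' % (op, form))
        lines.append('" <<< %s"' % escape(fullwidth(form)))
    return lines


for line in digraph_rules("cs"):
    print(line)
```

Generating the blocks this way would make it harder for one digraph to end up with a different set of variants than the others, which is easy to miss when the lines are written by hand.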
_______________________________________________
icu-bugrfe mailing list
icu-bugrfe@oss.software.ibm.com
http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu-bugrfe