List:       icu-bugrfe
Subject:    Notification: incoming/1706
From:       jtcsv () jtcsv ! com
Date:       2002-02-24 22:52:48

ICU bug tracking authorized access notification

yves moved PR#1706 from incoming to returned
Message summary for PR#1706
	From: yves@realnames.com
	Subject: u_charType() returns incorrect results
	Date: unknown
	0 replies 	0 followups
	Notes: fixed by j1709


====> ORIGINAL MESSAGE FOLLOWS <====

From: yves@realnames.com
To: jtcsv@jtcsv.com
Subject: u_charType() returns incorrect results

Full_Name: Yves Arrouye
Version: cvs
OS: all
ICU_Component: data
project: ICU4C
Submission from: (NULL) (63.251.238.8)
Submitted by: yves


Following up on my report about Unicode 1.0 names not appearing: I am writing a
simple program to dump the character categories that contain code points with no
Unicode 2.0 name, so I know which ones to look for when I change u_charName. I
wrote:

anselm-linux% cat no20name.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* memset */
#include <unicode/uchar.h>

static const char *catnames[] = {
    "unassigned",
    "uppercase letter",
    "lowercase letter",
    "titlecase letter",
    "modifier letter",
    "other letter",
    "non spacing mark",
    "enclosing mark",
    "combining spacing mark",
    "decimal digit number",
    "letter number",
    "other number",
    "space separator",
    "line separator",
    "paragraph separator",
    "control",
    "format",
    "private use area",
    "surrogate",
    "dash punctuation",   
    "start punctuation",
    "end punctuation",
    "connector punctuation",
    "other punctuation",
    "math symbol",
    "currency symbol",
    "modifier symbol",
    "other symbol",
    "initial punctuation",
    "final punctuation"
};

int main(void) {
    UChar32 cp;
    int i, bad[U_CHAR_CATEGORY_COUNT];

    memset(bad, 0, sizeof(bad));
 
    for (cp = 0; cp <= UCHAR_MAX_VALUE; ++cp) {
        int cat;

        if (!bad[cat = u_charType(cp)]) {
            char name[128];
            UErrorCode status = U_ZERO_ERROR;

            /* no Unicode name for cp: report the first such code point per category */
            if (!u_charName(cp, U_UNICODE_CHAR_NAME, name, sizeof(name), &status)) {
                printf("%d <%s> (U+%04X)\n", cat, catnames[cat], (unsigned)cp);
                bad[cat] = 1;
            }
        }
    }
}

When I run this against today's head, I get:

gabier% ./no20name
15 <control> (U+0000)
12 <space separator> (U+0009)
14 <paragraph separator> (U+000A)
13 <line separator> (U+000C)
0 <unassigned> (U+0220)
18 <surrogate> (U+D800)
17 <private use area> (U+E000)
gabier% 

I am puzzled by the categorization of U+0009, U+000A, and U+000C, which
UnicodeData.txt lists as category Cc (U_CONTROL_CHAR, value 15) but which are
reported here as 12, 14, and 13. I think there is a problem in the compiled
Unicode data.

As a cross-check, printing u_charType(0x0009) on its own also gives 12 instead
of 15.

YA

