'Re: Kanji hyphen character'


List:       icu4c-support
Subject:    Re: Kanji hyphen character
From:       Markus Scherer <markus.scherer () jtcsv ! com>
Date:       2002-10-14 16:36:59

ICU converters mark mappings with whether they are roundtrip mappings (|0), fallback
mappings (one-way mappings from Unicode to codepage but not back, |1), or reverse
fallbacks (one-way cp->Unicode, |3). Fallback mappings except for fallbacks from
Private-Use code points are turned off by default. They should not be used for text
that is to be processed further (because they map not to the same but only a
"similar" character) but only for display purposes. You can turn on fallback mappings
with the ucnv_setFallback() API, see unicode/ucnv.h.
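
To make the flag semantics concrete, here is a minimal sketch in plain C that parses
one mapping line in the .ucm style quoted in this thread and decides whether it would
be used in the Unicode-to-codepage direction. It does not call ICU. The names
UcmMapping, parse_ucm_line and usable_from_unicode are made up for illustration, and
flags other than |0, |1 and |3 are simply treated as unusable here.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed mapping line, e.g. "<UFF0D> \x42\x60 |1" (hypothetical type). */
typedef struct {
    unsigned long code_point;   /* e.g. 0xFF0D */
    unsigned char bytes[4];     /* codepage byte sequence */
    int byte_count;             /* number of valid entries in bytes[] */
    int flag;                   /* 0 roundtrip, 1 fallback, 3 reverse fallback */
} UcmMapping;

/* Parse a single mapping line. Returns 1 on success, 0 if the line does not
 * look like "<Uxxxx> \xNN... |n". */
int parse_ucm_line(const char *line, UcmMapping *out)
{
    unsigned int cp = 0, b = 0;
    int n = 0, flag = -1;
    const char *p = line;

    assert(line != NULL && out != NULL);

    /* "<Uxxxx>"; n stays 0 unless the closing '>' was matched. */
    if (sscanf(p, " <U%x>%n", &cp, &n) != 1 || n == 0) return 0;
    p += n;

    /* One or more "\xNN" byte escapes. */
    out->byte_count = 0;
    for (;;) {
        n = 0;
        if (sscanf(p, " \\x%2x%n", &b, &n) != 1 || n == 0) break;
        if (out->byte_count == 4) return 0;
        out->bytes[out->byte_count++] = (unsigned char)b;
        p += n;
    }
    if (out->byte_count == 0) return 0;

    /* Trailing "|n" precision flag. */
    if (sscanf(p, " |%d", &flag) != 1) return 0;

    out->code_point = cp;
    out->flag = flag;
    return 1;
}

/* Private-Use ranges: U+E000..U+F8FF plus planes 15 and 16. */
static int is_private_use(unsigned long cp)
{
    return (cp >= 0xE000UL && cp <= 0xF8FFUL) ||
           (cp >= 0xF0000UL && cp <= 0xFFFFDUL) ||
           (cp >= 0x100000UL && cp <= 0x10FFFDUL);
}

/* Would this mapping be used when converting from Unicode?
 * Roundtrips always; fallbacks only if switched on or from Private Use. */
int usable_from_unicode(const UcmMapping *m, int use_fallback)
{
    if (m->flag == 0) return 1;
    if (m->flag == 1) return use_fallback || is_private_use(m->code_point);
    return 0;   /* |3 is codepage->Unicode only; other flags not modeled */
}
```

Feeding it the line "<UFF0D> \x42\x60 |1" yields code point FF0D, bytes 42 60 and
flag 1, and usable_from_unicode() reports it as unusable until use_fallback is
nonzero.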

For more information see
http://oss.software.ibm.com/icu/userguide/conversion-data.html
More details are also below.

Best regards,
markus

Mohammad Sajid , Noida wrote:

> Hi,
> I am trying to convert a file from "UTF-8" encoding to
> "ibm-930" (EBCDIC_STATEFUL Katakana-Kanji Host Mixed).
> It converts properly, but the Kanji hyphen character,
> whose code point is 0xFF0D, is not converted properly...
> the byte sequence for this character in "ibm-930" is \x42\x60.
> but I am getting \xFE\xFE.

... which is the EBCDIC-mixed substitution character. Substitution is the default
error handling behavior.

> One more thing: I checked in the icu data folder, and the file ibm-930.ucm
> defines all character's byte sequences...
> it contains lines like...
> ....
> ....
> <UFF0A> \x42\x5C |0
> <UFF0B> \x42\x4E |0
> <UFF0C> \x42\x6B |0
> <UFF0D> \x42\x60 |1

The |1 indicates a fallback mapping, i.e., U+FF0D is similar to, but not the same
character as, ibm-930's 4260. If you look further, you will see that there is a
roundtrip mapping
<U2212> \x42\x60 |0
which means that U+2212 and ibm-930's 4260 are really the same character. You can map
two code points to a single byte sequence, but when mapping back to Unicode, you can
only map to one of the code points, and that one has the roundtrip marker.

U+2212 and U+FF0D are

2212;MINUS SIGN;Sm;0;ET;;;;;N;;;;;
FF0D;FULLWIDTH HYPHEN-MINUS;Pd;0;ET;<wide> 002D;;;;N;;;;;

They are related, but not even closely enough to have a compatibility decomposition
from one to the other. Instead, U+FF0D has a compatibility decomposition to the ASCII
hyphen-minus U+002D.
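
The decomposition can be read straight off the two UnicodeData.txt records quoted
above: it is the sixth semicolon-separated field (index 5 when counting from zero).
A small sketch in plain C that pulls out one field follows; unicode_data_field is a
made-up helper name, not part of ICU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero-based index of the decomposition field in a UnicodeData.txt record. */
#define UNICODE_DATA_DECOMPOSITION 5

/* Copy field `index` of a semicolon-separated record into buf.
 * Returns 1 on success, 0 if the field is missing or does not fit. */
int unicode_data_field(const char *record, int index, char *buf, size_t buf_size)
{
    const char *start = record;
    const char *end;
    size_t len;

    assert(record != NULL && buf != NULL && buf_size > 0);

    /* Skip `index` separators to reach the start of the wanted field. */
    while (index > 0) {
        start = strchr(start, ';');
        if (start == NULL) return 0;
        ++start;
        --index;
    }

    /* The field ends at the next ';' or at the end of the record. */
    end = strchr(start, ';');
    if (end == NULL) end = start + strlen(start);

    len = (size_t)(end - start);
    if (len >= buf_size) return 0;
    memcpy(buf, start, len);
    buf[len] = '\0';
    return 1;
}
```

For the U+2212 record the decomposition field comes back empty, while for the U+FF0D
record it is "<wide> 002D", i.e. a compatibility mapping to U+002D and not to U+2212.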

As said above, you can turn on fallback mappings with ucnv_setFallback() - but you
need to be aware that you are losing information in such mappings.
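
The information loss can be illustrated with a toy converter in plain C. This is
only a sketch under loud assumptions: it holds just the seven double-byte entries
quoted in this thread, ignores the shift states of the stateful ibm-930 encoding, and
does not call ICU. The names Mapping, toy_from_unicode and toy_to_unicode are
invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long code_point;
    unsigned char bytes[2];
    int flag;   /* 0 = roundtrip, 1 = fallback (Unicode -> codepage only) */
} Mapping;

/* Only the entries quoted in this thread; not the real ibm-930 table. */
static const Mapping kTable[] = {
    { 0xFF0A, { 0x42, 0x5C }, 0 },
    { 0xFF0B, { 0x42, 0x4E }, 0 },
    { 0xFF0C, { 0x42, 0x6B }, 0 },
    { 0xFF0D, { 0x42, 0x60 }, 1 },
    { 0x2212, { 0x42, 0x60 }, 0 },
    { 0xFF0E, { 0x42, 0x4B }, 0 },
    { 0xFF0F, { 0x42, 0x61 }, 0 }
};
static const size_t kTableSize = sizeof kTable / sizeof kTable[0];

/* Unicode -> bytes. Returns 1 if a mapping was used; otherwise writes the
 * FE FE substitution sequence and returns 0. */
int toy_from_unicode(unsigned long cp, int use_fallback, unsigned char out[2])
{
    size_t i;

    assert(out != NULL);
    for (i = 0; i < kTableSize; ++i) {
        const Mapping *m = &kTable[i];
        if (m->code_point == cp &&
            (m->flag == 0 || (m->flag == 1 && use_fallback))) {
            out[0] = m->bytes[0];
            out[1] = m->bytes[1];
            return 1;
        }
    }
    out[0] = 0xFE;
    out[1] = 0xFE;
    return 0;
}

/* Bytes -> Unicode, using roundtrip entries only.
 * Returns U+FFFD if the toy table has no roundtrip entry for the bytes. */
unsigned long toy_to_unicode(const unsigned char in[2])
{
    size_t i;

    assert(in != NULL);
    for (i = 0; i < kTableSize; ++i) {
        const Mapping *m = &kTable[i];
        if (m->flag == 0 && m->bytes[0] == in[0] && m->bytes[1] == in[1]) {
            return m->code_point;
        }
    }
    return 0xFFFDUL;
}
```

With use_fallback set to 0, U+FF0D comes out as FE FE. With it set to 1, U+FF0D comes
out as 42 60, and converting 42 60 back gives U+2212, so the original U+FF0D is gone.
In ICU itself the switch is the ucnv_setFallback() call on an open converter.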

> <UFF0E> \x42\x4B |0
> <UFF0F> \x42\x61 |0
> ....
> ....
> Can you please tell me the meaning of the 1 in <UFF0D> \x42\x60 |1
> because, as far as I checked, all characters in this 1 category
> fail in conversion.
