'Re: RFR: 6928542: Chinese characters in RTF are not decoded [v7]'


List:       openjdk-2d-dev
Subject:    Re: RFR: 6928542: Chinese characters in RTF are not decoded [v7]
From:       Prasanta Sadhukhan <psadhukhan () openjdk ! org>
Date:       2023-10-26 12:47:35
Message-ID: Kyv1SbJe-ftf4oZUvN1QokUWBN7z5IjSU3uux65Pf7U=.77c78d92-d103-43ee-9b26-cde5bc58681d () github ! com

On Thu, 21 Sep 2023 16:21:05 GMT, Ichiroh Takiguchi <itakiguchi@openjdk.org> wrote:

> > The "character set of font" (font charset) table was created from the "Rich Text Format \
> > Specification 1.9.1": \
> > https://interoperability.blob.core.windows.net/files/Archive_References/[MSFT-RTF].pdf
> >  It refers to wingdi.h:
> > https://learn.microsoft.com/en-us/windows/win32/api/wingdi/ns-wingdi-textmetrica
> > 
> > Test files and testcase are in bugid \
> > [JDK-6928542](https://bugs.openjdk.org/browse/JDK-6928542) 
> > Additional change:
> > Special character `\line` should be converted to `\n`
> > 
> > Additional information:
> > 
> > Add 2 hash tables
> > - fcharsetToCP: Predefined conversion table keyed by the `fcharset` control word with \
> > its number, mapping the control word to a Java charset name; `fcharset0` maps to the \
> > `windows-1252` Java charset name
> > - fcharsetTable: Per-RTF-file conversion table keyed by the `f` control word with its \
> > number, mapping integer font numbers to font Charsets. In the case of \
> > `{\f0\fnil\fcharset0 Segoe UI;}`, `0` maps to the Java Charset `windows-1252` 
> > When an RTF character set control word (like `\mac`) is used, an unmappable character \
> > returns \u0000 and is not written into the RTF text. When the fcharset control word is \
> > used, an unmappable character returns \uFFFD (the same as the decoder's replacement \
> > character), and \u0000 is used for DBCS lead byte detection. If an `f` or `par` control \
> > word appears while a lead byte remains in the decoder's byte buffer, that byte is \
> > treated as an invalid character and \uFFFD is written into the RTF text. 
> > If `f` control word is used without `fcharset`, `translationTable` char array is \
> > used. If `f` control word is used with `fcharset`, predefined Java Charset name \
> > is used (if missing, ISO8859_1 is used for fallback). 
> > **Note:** The following GitHub Actions job failed:
> > linux-cross-compile / build (riscv64). I opened the following JBS issue.
> > > [JDK-8314624](https://bugs.openjdk.org/browse/JDK-8314624) GHA: RISC-V \
> > > cross-build was failed
> 
> Ichiroh Takiguchi has updated the pull request incrementally with one additional \
> commit since the last revision: 
> 6928542: Chinese characters in RTF are not decoded

For me the added regression test still fails with the fix on Windows 10. Is there anything \
more I need to do as a prerequisite?


Read data
=========
Gr\\u00fcezi -  Switzerland 0
\\u0082\\u00b1\\u0082\\u00f1\\u0082\\u00c9\\u0082\\u00bf\\u0082\\u00cd - Japanese 128
\\u00be\\u00c8\\u00b3\\u00e7\\u00c7\\u00cf\\u00bc\\u00bc\\u00bf\\u00e4 - Korean 129
\\u00c4\\u00e3\\u00ba\\u00c3 - China 134
\\u00bbO\\u00c6W - Traditional Chinese - Taiwan 136
\\u00e3\\u00e5\\u00e9\\u00e1 \\u00f3\\u00ef\\u00f5 - Greek 161
A\\u00f0a\\u00e7 - Turkish (Tree) 162
\\u00fe - Vietnam currency 163
\\u00f9\\u00c8\\u00d1\\u00ec\\u00e5\\u00c9\\u00ed - Hebrew 177
\\u00e3\\u00d1\\u00cd\\u00c8\\u00c7 - Arabic 178
A\\u00e8i\\u00fb - Lithuanian (Thank you) 186
\\u00c7\\u00e4\\u00f0\\u00e0\\u00e2\\u00f1\\u00f2\\u00e2\\u00f3\\u00e9\\u00f2\\u00e5 - Russian 204
\\u00ca\\u00c7\\u00d1\\u00ca\\u00b4\\u00d5 - Thailand 222
cze\\uc48f - Polish 238

Expected data
=============
Gr\\u00fcezi -  Switzerland 0
\\u3053\\u3093\\u306b\\u3061\\u306f - Japanese 128
\\uc548\\ub155\\ud558\\uc138\\uc694 - Korean 129
\\u4f60\\u597d - China 134
\\u81fa\\u7063 - Traditional Chinese - Taiwan 136
\\u03b3\\u03b5\\u03b9\\u03b1 \\u03c3\\u03bf\\u03c5 - Greek 161
A\\u011fa\\u00e7 - Turkish (Tree) 162
\\u20ab - Vietnam currency 163
\\u05e9\\u05b8\\u05c1\\u05dc\\u05d5\\u05b9\\u05dd - Hebrew 177
\\u0645\\u0631\\u062d\\u0628\\u0627 - Arabic 178
A\\u010di\\u016b - Lithuanian (Thank you) 186
\\u0417\\u0434\\u0440\\u0430\\u0432\\u0441\\u0442\\u0432\\u0443\\u0439\\u0442\\u0435 - Russian 204
\\u0e2a\\u0e27\\u0e31\\u0e2a\\u0e14\\u0e35 - Thailand 222
cze\\u015b\\u0107 - Polish 238

java.lang.RuntimeException: Test failed
        at RTFReadFontCharsetTest.main(RTFReadFontCharsetTest.java:114)
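
One possible reading of the output above, offered as an assumption rather than a diagnosis: apart from the Polish row, every code point in the "Read data" rows is at most U+00FF, which is what you get when raw RTF bytes are widened one byte per char instead of being decoded with the font's charset. The sketch below takes three of the "Read data" strings, turns them back into bytes via ISO-8859-1, and decodes them with a charset chosen from the `fcharset` number. `RedecodeSketch`, `FCHARSET_TO_CHARSET`, and `redecode` are hypothetical names and are not taken from the PR; the charset names (`Shift_JIS`, `GBK`, `windows-1253`) are my assumption for fcharset 128, 134, and 161, and they require a JDK that ships the extended charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class RedecodeSketch {
    // Hypothetical subset of an fcharset-number -> Java charset name table.
    // Assumption: these names are illustrative, not the PR's fcharsetToCP entries.
    static final Map<Integer, String> FCHARSET_TO_CHARSET = Map.of(
            128, "Shift_JIS",     // Japanese
            134, "GBK",           // Simplified Chinese
            161, "windows-1253"); // Greek

    // Recover the raw bytes from a string whose chars are all <= U+00FF,
    // then decode those bytes with the charset selected by the fcharset number.
    static String redecode(String read, int fcharset) {
        byte[] raw = read.getBytes(StandardCharsets.ISO_8859_1);
        Charset cs = Charset.forName(FCHARSET_TO_CHARSET.get(fcharset));
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // "Read data" strings copied from the test output above.
        String japanese = "\u0082\u00b1\u0082\u00f1\u0082\u00c9\u0082\u00bf\u0082\u00cd";
        String chinese = "\u00c4\u00e3\u00ba\u00c3";
        String greek = "\u00e3\u00e5\u00e9\u00e1 \u00f3\u00ef\u00f5";

        // Each comparison is against the matching "Expected data" row.
        System.out.println(redecode(japanese, 128).equals("\u3053\u3093\u306b\u3061\u306f"));
        System.out.println(redecode(chinese, 134).equals("\u4f60\u597d"));
        System.out.println(redecode(greek, 161).equals("\u03b3\u03b5\u03b9\u03b1 \u03c3\u03bf\u03c5"));
    }
}
```

If the three comparisons hold, the bytes in the RTF file are intact and only the charset selection differs between the two listings, which would suggest checking whether the patched classes are actually the ones being loaded when the test runs.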

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13553#issuecomment-1781050285

