List: xerces-j-dev
Subject: RE: Problems with ISO-8859-1 and UTF-8 encodings
From: Michael Glavassevich <mrglavas () ca ! ibm ! com>
Date: 2007-08-02 16:28:56
Message-ID: OF6D530E3A.69F0BF36-ON8525732B.0058A723-8525732B.005A89E1 () ca ! ibm ! com
Hi Inma,
xmlutf8.getBytes() doesn't return what you think. Both
ByteArrayOutputStream.toString() [1] and String.getBytes() [2] use the
default encoding (which is probably ISO-8859-1 on your system) for
converting between bytes -> chars and chars -> bytes. You can fix this by
specifying the encoding on these methods, but if I were you I'd avoid
doing the conversions altogether and just create the
StreamSource/StreamResult with a java.io.StringReader/java.io.StringWriter
instead.
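
A minimal sketch of the character-stream approach suggested above, assuming the XML is already held in a Java String. The class name CharacterStreamTransform and the helper transform() are hypothetical names used only for illustration; they are not part of the original code. Because the source is a Reader and the result is a Writer, user code performs no byte/char conversion at all. The ENCODING output property then mainly controls the encoding named in the XML declaration, so whoever finally writes the string out as bytes should use that same encoding.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class CharacterStreamTransform {

    // Hypothetical helper: identity transform over XML held in a String.
    // No getBytes()/toString() conversions, so the platform default
    // encoding never gets involved.
    static String transform(String xml, String declaredEncoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, declaredEncoding);
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("identity transform failed", e);
        }
    }

    public static void main(String[] args) {
        // \u00C1 and \u00CD are the accented capitals A and I from the sample document.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<DOCUMENTO><PERFILES>\u00C1</PERFILES>"
                + "<PERFILES>\u00CD</PERFILES></DOCUMENTO>";
        String asUtf8 = transform(xml, "utf-8");
        String backToLatin1 = transform(asUtf8, "iso-8859-1");
        System.out.println(asUtf8);
        System.out.println(backToLatin1);
    }
}
```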
Thanks.
[1] http://java.sun.com/javase/6/docs/api/java/io/ByteArrayOutputStream.html#toString()
[2] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes()
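
For the other option mentioned in the reply (naming the encoding explicitly on the conversion methods), here is a small sketch. The class ExplicitCharsets and its helper methods are hypothetical names for illustration, and java.nio.charset.StandardCharsets requires Java 7 or later, which is newer than the Java 6 API linked above. It shows that UTF-8 bytes decoded as UTF-8 give back the original character, while the same bytes decoded as ISO-8859-1 give two unrelated characters. What the no-argument toString() and getBytes() actually do depends on the platform default, so this only illustrates the mechanism.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsets {

    // Encodes text as UTF-8 bytes regardless of the platform default encoding.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes bytes known to be UTF-8 regardless of the platform default encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes the same bytes as ISO-8859-1, i.e. with the wrong charset.
    static String misdecodeAsLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00C1";            // LATIN CAPITAL LETTER A WITH ACUTE
        byte[] utf8 = encodeUtf8(original);    // two bytes: 0xC3 0x81

        System.out.println(decodeUtf8(utf8).equals(original));   // charsets match
        System.out.println(misdecodeAsLatin1(utf8).length());    // one char per byte
    }
}
```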
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
Robert Houben <Robert.Houben@fusionware.net> wrote on 08/02/2007 11:36:34 AM:
> Hi Inma,
>
> In the last line of your first block you have:
> return baos.toString();
> Note that when you call "toString()" on the byte array it will return
> a string in Java internal form, not UTF8. I'm guessing that in your
> next block of code, xmlutf8 is the result of the first block. This
> means that when you getBytes() from it, you are getting bytes that
> are no longer in UTF8 form.
>
> HTH,
>
> From: Inma Marín López [mailto:inma@dif.um.es]
> Sent: Thursday, August 02, 2007 12:53 AM
> To: j-users@xerces.apache.org
> Subject: Problems with ISO-8859-1 and UTF-8 encodings
>
> Hi all,
>
> I have some problems with ISO-8859-1 and UTF-8 encodings in XML
> documents. Specifically, I have this ISO-8859-1-encoded XML document:
>
> <?xml version="1.0" encoding="ISO-8859-1"?>
> <DOCUMENTO>
> <PERFILES>Á</PERFILES>
> <PERFILES>É</PERFILES>
> <PERFILES>Í</PERFILES>
> <PERFILES>Ó</PERFILES>
> <PERFILES>Ú</PERFILES>
> </DOCUMENTO>
>
> Then I UTF-8 - encode it, by means of the following piece of code:
>
> Transformer transformer = TransformerFactory.newInstance().newTransformer();
> StreamSource ds = new StreamSource(
>         new ByteArrayInputStream(xmliso88191.getBytes()));
> transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> transformer.transform(ds, new StreamResult(baos));
> return baos.toString();
>
> to obtain this XML document:
>
> <?xml version="1.0" encoding="utf-8"?>
> <DOCUMENTO>
> <PERFILES>Ã?</PERFILES>
> <PERFILES>Ã?</PERFILES>
> <PERFILES>Ã?</PERFILES>
> <PERFILES>Ã?</PERFILES>
> <PERFILES>Ã?</PERFILES>
> </DOCUMENTO>
>
> Next, I ISO-8859-1- encode this document (UTF-8 encoded):
>
> Transformer transformer = TransformerFactory.newInstance().newTransformer();
> StreamSource ds = new StreamSource(
>         new ByteArrayInputStream(xmlutf8.getBytes()));
> transformer.setOutputProperty(OutputKeys.ENCODING, "iso-8859-1");
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> transformer.transform(ds, new StreamResult(baos));
> return baos.toString();
>
> But this does not work. Instead, I get the following exception:
>
> [Fatal Error] :8:11: Invalid byte 2 of 2-byte UTF-8 sequence.
> javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
>         at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:449)
>         at codificacion.PruebasCodificacion.encodeISO88891(PruebasCodificacion.java:302)
>         at codificacion.PruebasCodificacion.prueba(PruebasCodificacion.java:73)
>         at codificacion.PruebasCodificacion.main(PruebasCodificacion.java:356)
> Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:432)
>
>
> Is this process correct? Supposing that it is, it seems the
> exception is due to the "Ã?" characters (the UTF-8 encoding of "Á"
> and "Í"), so I would like to know how I could UTF-8-encode the "Á"
> and "Í" characters and then convert them back to ISO-8859-1 encoding.
>
> Could anybody be so kind as to help me, please?
>
> Thank you very much in advance.
> Inma.
>
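
The parser error quoted above can be reproduced in isolation with a strict UTF-8 decoder. This is only a sketch under an assumption the thread does not confirm: that the second byte of the two-byte sequence for "Á" (0xC3 0x81) was replaced by "?" (0x3F) during a lossy default-encoding conversion, as the "Ã?" in the displayed output suggests. The class name Utf8Check and the helper isValidUtf8() are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

    // Returns true only if the bytes form well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting
    // a replacement character for a malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] intact = {(byte) 0xC3, (byte) 0x81};   // UTF-8 encoding of U+00C1
        byte[] damaged = {(byte) 0xC3, (byte) 0x3F};  // assumed: second byte became '?'

        System.out.println(isValidUtf8(intact));
        System.out.println(isValidUtf8(damaged));
    }
}
```

In the damaged case, 0xC3 announces a two-byte sequence but 0x3F is not a valid continuation byte. That is the same kind of condition the parser reports as an invalid byte 2 of a 2-byte UTF-8 sequence.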