'RE: Problems with ISO-8859-1 and UTF-8 encodings'

List:       xerces-j-dev
Subject:    RE: Problems with ISO-8859-1 and UTF-8 encodings
From:       Michael Glavassevich <mrglavas () ca ! ibm ! com>
Date:       2007-08-02 16:28:56
Message-ID: OF6D530E3A.69F0BF36-ON8525732B.0058A723-8525732B.005A89E1 () ca ! ibm ! com

Hi Inma,

xmlutf8.getBytes() doesn't return what you think. Both 
ByteArrayOutputStream.toString() [1] and String.getBytes() [2] use the 
platform's default encoding (which is probably ISO-8859-1 on your system) 
when converting bytes to chars and chars to bytes, so the UTF-8 bytes 
written by the transformer are not decoded as UTF-8. You can fix this by 
specifying the encoding on these methods, but if I were you I'd avoid 
doing the conversions altogether and just create the 
StreamSource/StreamResult with a java.io.StringReader/java.io.StringWriter 
instead.
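
A minimal sketch of the StringReader/StringWriter approach, assuming the 
document is already held in a Java String. The class name, the reserialize 
helper and the one-element sample document are hypothetical and only for 
illustration. Because the parser reads characters rather than bytes, no 
byte/char conversion takes place; the ENCODING property then mainly 
controls the encoding name written into the XML declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ReencodeWithStrings {

    // Hypothetical helper: runs an identity transform from a String to a String.
    // No byte arrays are involved, so the platform default encoding never matters.
    static String reserialize(String xml, String declaredEncoding)
            throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Sets the encoding name that appears in the output's XML declaration.
        transformer.setOutputProperty(OutputKeys.ENCODING, declaredEncoding);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // \u00C1 is the character A-acute from the sample document.
        String xmlIso = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<DOCUMENTO><PERFILES>\u00C1</PERFILES></DOCUMENTO>";
        String xmlUtf8 = reserialize(xmlIso, "utf-8");
        String xmlIsoAgain = reserialize(xmlUtf8, "iso-8859-1");
        System.out.println(xmlUtf8);
        System.out.println(xmlIsoAgain);
    }
}
```

Note that the returned String is not "in" any encoding; the encoding only 
becomes real when the String is finally written out as bytes, and it has to 
match the one named in the declaration at that point.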

Thanks.

[1] 
http://java.sun.com/javase/6/docs/api/java/io/ByteArrayOutputStream.html#toString()
[2] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes()
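
For the other option, specifying the encoding on the methods documented at 
[1] and [2], a rough sketch could look like the following. It relies on the 
overloads ByteArrayOutputStream.toString(String) and 
String.getBytes(String); the class name, the convert helper and the sample 
document are hypothetical. It assumes the encoding passed as fromEncoding 
is the same one named in the document's XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ReencodeWithExplicitCharsets {

    // Hypothetical helper: same shape as the original code, but every
    // byte/char conversion names its encoding instead of using the default.
    static String convert(String xml, String fromEncoding, String toEncoding)
            throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, toEncoding);
        // chars -> bytes: must match the encoding in the XML declaration of xml.
        StreamSource source = new StreamSource(
                new ByteArrayInputStream(xml.getBytes(fromEncoding)));
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        transformer.transform(source, new StreamResult(baos));
        // bytes -> chars: must match the encoding the transformer just wrote.
        return baos.toString(toEncoding);
    }

    public static void main(String[] args) throws Exception {
        String xmlIso = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<DOCUMENTO><PERFILES>\u00C1</PERFILES></DOCUMENTO>";
        String xmlUtf8 = convert(xmlIso, "ISO-8859-1", "UTF-8");
        String xmlIsoAgain = convert(xmlUtf8, "UTF-8", "ISO-8859-1");
        System.out.println(xmlUtf8);
        System.out.println(xmlIsoAgain);
    }
}
```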

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Robert Houben <Robert.Houben@fusionware.net> wrote on 08/02/2007 11:36:34 AM:

> Hi Inma,
> 
> In the last line of your first block you have:
> return baos.toString();
> Note that when you call "toString()" on the byte array it will return 
> a string in Java internal form, not UTF-8.  I'm guessing that in your
> next block of code, xmlutf8 is the result of the first block.  This 
> means that when you call getBytes() on it, you are getting bytes that 
> are no longer in UTF-8 form.
> 
> HTH,
> 
> From: Inma Marín López [mailto:inma@dif.um.es] 
> Sent: Thursday, August 02, 2007 12:53 AM
> To: j-users@xerces.apache.org
> Subject: Problems with ISO-8859-1 and UTF-8 encodings
> 
> Hi all,
> 
>  I have some problems with ISO-8859-1 and UTF-8 encodings in XML 
> documents. Concretely, I have this ISO-8859-1-encoded XML document:
> 
> <?xml version="1.0" encoding="ISO-8859-1"?>
> <DOCUMENTO>
> <PERFILES>Á</PERFILES>
> <PERFILES>É</PERFILES>
> <PERFILES>Í</PERFILES>
> <PERFILES>Ó</PERFILES>
> <PERFILES>Ú</PERFILES>
> </DOCUMENTO> 
> 
> Then I encode it as UTF-8 by means of the following piece of code:
> 
>             Transformer transformer = TransformerFactory.newInstance().newTransformer();
>             StreamSource ds = new StreamSource(new ByteArrayInputStream(xmliso88191.getBytes()));
>             transformer.setOutputProperty(OutputKeys.ENCODING,"utf-8");
>             ByteArrayOutputStream baos = new ByteArrayOutputStream();
>             transformer.transform(ds,new StreamResult(baos));
>             return baos.toString();
> 
> to obtain this XML document:
> 
> <?xml version="1.0" encoding="utf-8"?>
> <DOCUMENTO>
> <PERFILES>Ă?</PERFILES>
> <PERFILES>Ă?</PERFILES>
> <PERFILES>Ă?</PERFILES>
> <PERFILES>Ă?</PERFILES>
> <PERFILES>Ă?</PERFILES>
> </DOCUMENTO>
> 
> Next, I try to encode this UTF-8-encoded document as ISO-8859-1 again:
> 
>             Transformer transformer = TransformerFactory.newInstance().newTransformer();
>             StreamSource ds = new StreamSource(new ByteArrayInputStream(xmlutf8.getBytes()));
>             transformer.setOutputProperty(OutputKeys.ENCODING,"iso-8859-1");
>             ByteArrayOutputStream baos = new ByteArrayOutputStream();
>             transformer.transform(ds,new StreamResult(baos));
>             return baos.toString();
> 
> But this does not work. Instead, I get the following exception:
> 
> [Fatal Error] :8:11: Invalid byte 2 of 2-byte UTF-8 sequence.
> javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
>         at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:449)
>         at codificacion.PruebasCodificacion.encodeISO88891(PruebasCodificacion.java:302)
>         at codificacion.PruebasCodificacion.prueba(PruebasCodificacion.java:73)
>         at codificacion.PruebasCodificacion.main(PruebasCodificacion.java:356)
> Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:432)
> 
> 
> Is this process correct? Supposing that it is, it seems the 
> exception is due to the "Ă?" characters (the UTF-8 encoding of "Á" and 
> "Í"), so I would like to know how I could encode the "Á" and "Í" 
> characters as UTF-8 and then convert them back to ISO-8859-1.
> 
> Could anybody be so kind as to help me, please?
> 
> Thank you very much in advance.
> Inma.
> 
