List: xerces-j-user
Subject: Re: Any plans to support UTF-32 BOM?
From: Gary Gregory <garydgregory () gmail ! com>
Date: 2012-08-13 21:30:48
Message-ID: CACZkXPzD+NbN5sVYbEkkjjrzxKaRQfdQ0c64SgwbUF_cxAtF1g () mail ! gmail ! com
Thank you for all the info Michael, very helpful.
Gary
On Mon, Aug 13, 2012 at 4:51 PM, Michael Glavassevich
<mrglavas@ca.ibm.com> wrote:
> Hi Gary,
>
> Gary Gregory <garydgregory@gmail.com> wrote on 13/08/2012 02:27:33 PM:
>
>
> > Hi Michael,
> >
> > I've not caught one in the savannah either! I've not had a customer
> > request for it either; either that, or the request did not make it through
> > our sales engineers, professional services, or tech support all the way
> > to me.
> >
> > Our products are XML and buzzword compliant and I am checking my Ps
> > and Qs. So, at this point, the point is rather academic as you mention.
>
> XML parsers are only required to support UTF-8 and UTF-16. Support for any
> other encodings is icing on the cake.
>
> > I am aware of the inefficiencies involved, but our customers can
> > decide how efficient they want to be for themselves; sometimes they
> > have no control over the format of the documents they have to
> > process with our software. For those who can control the format, I
> > do not know if someone has tried UTF-32, watched it blow up and then
> > switched to something else.
> >
> > Now, out of curiosity, I do notice a
> > org.apache.xerces.impl.io.UCSReader class in Xerces which is used
> > from a couple of places.
> >
> > Is that not hooked up in all the right spots?
>
> It is, but if presented with a UTF-32 BOM, Xerces won't hit the code path
> where the UCSReader would be used since its encoding auto-detector doesn't
> recognize UTF-32 BOM byte sequences. It's probably just defaulting to UTF-8
> (since it has no better guess) and then bombs out.
>
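To make the detection gap above concrete, here is a minimal Java sketch of a BOM sniffer that also recognizes the two UTF-32 byte order marks (00 00 FE FF and FF FE 00 00, as listed in Appendix F of the XML 1.0 specification). This is not the Xerces auto-detector; the class name BomSniffer and the method detect are hypothetical and exist only for illustration.

```java
/** Hypothetical helper, not part of Xerces: guesses an encoding from a BOM. */
final class BomSniffer {
    private BomSniffer() {}

    /**
     * Returns the encoding implied by the byte order mark at the start of
     * head, or null if no known BOM is present.
     */
    static String detect(byte[] head) {
        int n = head.length;
        // Java bytes are signed, so mask to 0..255; -1 marks "no such byte".
        int b0 = n > 0 ? head[0] & 0xFF : -1;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        int b2 = n > 2 ? head[2] & 0xFF : -1;
        int b3 = n > 3 ? head[3] & 0xFF : -1;

        // The UTF-32 checks must come first: the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE).
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        return null; // no BOM: the caller has to fall back to another guess
    }
}
```

The ordering matters only because of the shared FF FE prefix. Reading FF FE 00 00 as a UTF-16LE BOM followed by U+0000 is not a useful alternative for XML, since U+0000 is not a legal XML character.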
> Assuming Xerces did support UTF-32, the UCSReader might not be the right
> reader to use anyway. A compliant UTF-32 reader might require more error
> checking (e.g. to reject values that are not valid characters, such as the
> code points U+D800 to U+DFFF that are reserved for surrogates in UTF-16).
>
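As a rough illustration of the extra checking mentioned above, the following Java sketch decodes one big-endian UTF-32 code unit and rejects values in the surrogate range or beyond U+10FFFF. It is a sketch under stated assumptions and not a description of what UCSReader does; Utf32Check and decodeBE are hypothetical names.

```java
/** Hypothetical strict decoder for a single big-endian UTF-32 code unit. */
final class Utf32Check {
    private Utf32Check() {}

    /**
     * Reads four bytes starting at offset and returns the code point.
     * Throws IllegalArgumentException for truncated input, for surrogate
     * code points (U+D800..U+DFFF) and for values above U+10FFFF.
     */
    static int decodeBE(byte[] buf, int offset) {
        if (offset < 0 || buf.length - offset < 4) {
            throw new IllegalArgumentException("truncated UTF-32 code unit");
        }
        int cp = ((buf[offset] & 0xFF) << 24)
               | ((buf[offset + 1] & 0xFF) << 16)
               | ((buf[offset + 2] & 0xFF) << 8)
               |  (buf[offset + 3] & 0xFF);
        // A set top bit makes cp negative, which is also out of range.
        if (cp < 0 || cp > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("value beyond U+10FFFF");
        }
        if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
            throw new IllegalArgumentException("surrogate code point in UTF-32");
        }
        return cp;
    }
}
```

A real XML reader would additionally have to apply the XML character rules (for example, most control characters are not allowed); that part is left out here.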
> > Gary
>
> > On Mon, Aug 13, 2012 at 2:07 PM, Michael Glavassevich
> > <mrglavas@ca.ibm.com> wrote:
> > Hi Gary,
> >
> > There haven't been any plans for UTF-32 support. It seems you're the
> > first [1] (and only) one who has asked about it on the project lists.
> >
> > Is this just an academic question or do you have an actual need for it?
> >
> > I must say I've never seen a UTF-32 encoded document in the wild. In
> > my opinion it's a very inefficient encoding: it always uses 32 bits to
> > represent a character when the largest Unicode code point only
> > requires 21 bits. UTF-8 and UTF-16 only ever use that much space for
> > supplementary characters (i.e. code points greater than U+FFFF).
> >
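The size comparison above can be checked with a few lines of Java. This is only a sketch: EncodedSize and byteCount are hypothetical names, and it assumes the JDK in use ships a "UTF-32BE" charset (common builds do, but it is not one of the charsets every Java platform is required to provide).

```java
import java.nio.charset.Charset;

/** Hypothetical helper: how many bytes one code point takes in a charset. */
final class EncodedSize {
    private EncodedSize() {}

    static int byteCount(int codePoint, Charset charset) {
        // Character.toChars yields one char for BMP code points and a
        // surrogate pair for supplementary ones.
        String s = new String(Character.toChars(codePoint));
        // Big-endian charset names are used in the usage note so that no BOM
        // is added to the output and counted by mistake.
        return s.getBytes(charset).length;
    }
}
```

With UTF-8, UTF-16BE and UTF-32BE, 'A' (U+0041) takes 1, 2 and 4 bytes respectively, and the euro sign (U+20AC) takes 3, 2 and 4. A supplementary character such as U+1F600 takes 4 bytes in all three, which is the only case where UTF-8 and UTF-16 need as much space as UTF-32.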
> > Thanks.
> >
> > [1] http://xerces-j.markmail.org/search/?q=UTF-32
> >
> > Michael Glavassevich
> > XML Technologies and WAS Development
> > IBM Toronto Lab
> > E-mail: mrglavas@ca.ibm.com
> > E-mail: mrglavas@apache.org
> >
> > Gary Gregory <garydgregory@gmail.com> wrote on 13/08/2012 01:49:46 PM:
> >
> >
> > > Hi All:
> > >
> > > Any plans to support UTF-32 BOM?
> > >
> > > Currently, if I parse a UTF-32 document I get a 'content not expected
> > > in prolog' error.
> > >
> > > Thank you,
> > > Gary
> > >
> > > --
> > > E-Mail: garydgregory@gmail.com | ggregory@apache.org
> > > JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
> > > Spring Batch in Action: http://bit.ly/bqpbCK
> > > Blog: http://garygregory.wordpress.com
> > > Home: http://garygregory.com/
> > > Tweet! http://twitter.com/GaryGregory
> >
> >
> >
>
> Michael Glavassevich
> XML Technologies and WAS Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
>
--
E-Mail: garydgregory@gmail.com | ggregory@apache.org
JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
Spring Batch in Action: http://bit.ly/bqpbCK
Blog: http://garygregory.wordpress.com
Home: http://garygregory.com/
Tweet! http://twitter.com/GaryGregory