List:       xerces-j-user
Subject:    Re: Any plans to support UTF-32 BOM?
From:       Gary Gregory <garydgregory () gmail ! com>
Date:       2012-08-13 21:30:48
Message-ID: CACZkXPzD+NbN5sVYbEkkjjrzxKaRQfdQ0c64SgwbUF_cxAtF1g () mail ! gmail ! com

Thank you for all the info Michael, very helpful.

Gary

On Mon, Aug 13, 2012 at 4:51 PM, Michael Glavassevich
<mrglavas@ca.ibm.com> wrote:

> Hi Gary,
>
> Gary Gregory <garydgregory@gmail.com> wrote on 13/08/2012 02:27:33 PM:
>
>
> > Hi Michael,
> >
> > I've not caught one in the savannah either! Nor have I had a customer
> > request for it; either that, or the request did not make it through
> > our sales engineers, professional services, or tech support all the way
> > to me.
> >
> > Our products are XML and buzzword compliant and I am checking my Ps
> > and Qs. So, at this point, the point is rather academic as you mention.
>
> XML parsers are only required to support UTF-8 and UTF-16. Support for any
> other encodings is icing on the cake.
>
> > I am aware of the inefficiencies involved, but our customers can
> > decide for themselves how efficient they want to be. Sometimes they
> > have no control over the format of the documents they have to
> > process with our software. For those who can control the format, I
> > do not know whether anyone has tried UTF-32, watched it blow up, and then
> > switched to something else.
> >
> > Now, out of curiosity, I do notice an
> > org.apache.xerces.impl.io.UCSReader class in Xerces which is used
> > from a couple of places.
> >
> > Is that not hooked up in all the right spots?
>
> It is, but if presented with a UTF-32 BOM, Xerces won't hit the code path
> where the UCSReader would be used since its encoding auto-detector doesn't
> recognize UTF-32 BOM byte sequences. It's probably just defaulting to UTF-8
> (since it has no better guess) and then bombs out.
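
The detection gap described above can be illustrated with a small, hedged sketch. This is not Xerces code: the class name BomSniffer and its method are hypothetical, and the byte patterns are the ones listed for BOM-based detection in Appendix F of the XML 1.0 specification. The one subtle point is ordering: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the four-byte patterns have to be tested first.

```java
// Hypothetical helper, not part of Xerces: maps a leading BOM to an encoding name.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the encoding implied by a leading BOM, or null if no BOM is recognised. */
    static String detect(byte[] head) {
        // Four-byte UTF-32 BOMs come first: FF FE 00 00 also starts with the UTF-16LE BOM.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // the caller decides on a default, e.g. UTF-8
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller would pass the first four bytes of the stream, for example BomSniffer.detect(firstFourBytes), and fall back to its usual default when the result is null. Treating FF FE 00 00 as UTF-32LE is safe for XML because a UTF-16LE document cannot start with U+0000.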
>
> Assuming Xerces did support UTF-32, the UCSReader might not be the right
> reader to use anyway. A compliant UTF-32 Reader might require more error
> checking (e.g. to reject values that are not valid characters, like the
> code points reserved for surrogates in UTF-16).
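
As a rough sketch of the extra checking mentioned above, the following decoder rejects the two kinds of values a strict UTF-32 reader would be expected to refuse: anything above U+10FFFF and anything in the surrogate range U+D800 to U+DFFF. The class name StrictUtf32 is hypothetical, the sketch handles big-endian input without a BOM only, and it says nothing about what UCSReader itself does.

```java
// Hypothetical strict decoder, not part of Xerces: big-endian UTF-32, no BOM handling.
final class StrictUtf32 {
    private StrictUtf32() {}

    static String decodeBigEndian(byte[] bytes) {
        if (bytes.length % 4 != 0) {
            throw new IllegalArgumentException("length is not a multiple of 4");
        }
        StringBuilder out = new StringBuilder(bytes.length / 4);
        for (int i = 0; i < bytes.length; i += 4) {
            // Assemble one 32-bit code unit, most significant byte first.
            int cp = ((bytes[i] & 0xFF) << 24)
                   | ((bytes[i + 1] & 0xFF) << 16)
                   | ((bytes[i + 2] & 0xFF) << 8)
                   | (bytes[i + 3] & 0xFF);
            // cp is negative when the top bit is set, which is also out of range.
            if (cp < 0 || cp > 0x10FFFF) {
                throw new IllegalArgumentException("value beyond U+10FFFF at byte offset " + i);
            }
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                throw new IllegalArgumentException("surrogate code point at byte offset " + i);
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }
}
```

An XML parser would additionally apply the XML Char production on top of this (for example rejecting most control characters), so the checks here are a lower bound rather than the full set.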
>
> > Gary
>
> > On Mon, Aug 13, 2012 at 2:07 PM, Michael Glavassevich
> > <mrglavas@ca.ibm.com> wrote:
> > Hi Gary,
> >
> > There haven't been any plans for UTF-32 support. It seems you're the
> > first [1] (and only) one who has asked about it on the project lists.
> >
> > Is this just an academic question or do you have an actual need for it?
> >
> > I must say I've never seen a UTF-32 encoded document in the wild. In
> > my opinion it's a very inefficient encoding. It always uses 32 bits to
> > represent a character when the largest Unicode code point only
> > requires 21 bits. UTF-8 and UTF-16 only ever use that much space for
> > supplementary characters (i.e. code points greater than U+FFFF).
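
The size argument above can be checked with a few lines of Java. This is only an illustrative sketch: the class name EncodingSizes is hypothetical, the UTF-32 size is computed as four bytes per code point rather than by calling a UTF-32 charset (which the Java platform does not guarantee to provide), and BOMs are not counted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encoded size in bytes of a well-formed string, excluding any BOM.
final class EncodingSizes {
    private EncodingSizes() {}

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // UTF_16BE writes no BOM, so this is exactly 2 bytes per UTF-16 code unit.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    static int utf32Bytes(String s) {
        // UTF-32 is fixed width: 4 bytes for every code point.
        return 4 * s.codePointCount(0, s.length());
    }
}
```

For the four ASCII characters of `<a/>` this gives 4, 8 and 16 bytes for UTF-8, UTF-16 and UTF-32 respectively, while a single supplementary character such as U+1F600 takes 4 bytes in all three.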
> >
> > Thanks.
> >
> > [1] http://xerces-j.markmail.org/search/?q=UTF-32
> >
> > Michael Glavassevich
> > XML Technologies and WAS Development
> > IBM Toronto Lab
> > E-mail: mrglavas@ca.ibm.com
> > E-mail: mrglavas@apache.org
> >
> > Gary Gregory <garydgregory@gmail.com> wrote on 13/08/2012 01:49:46 PM:
> >
> >
> > > Hi All:
> > >
> > > Any plans to support UTF-32 BOM?
> > >
> > > Currently, if I parse a UTF-32 document I get a "content not expected
> > > in prolog" error.
> > >
> > > Thank you,
> > > Gary
> > >
> > > --
> > > E-Mail: garydgregory@gmail.com | ggregory@apache.org
> > > JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
> > > Spring Batch in Action: http://bit.ly/bqpbCK
> > > Blog: http://garygregory.wordpress.com
> > > Home: http://garygregory.com/
> > > Tweet! http://twitter.com/GaryGregory
> >
>
> Michael Glavassevich
> XML Technologies and WAS Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
>



-- 
E-Mail: garydgregory@gmail.com | ggregory@apache.org
JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
Spring Batch in Action: http://bit.ly/bqpbCK
Blog: http://garygregory.wordpress.com
Home: http://garygregory.com/
Tweet! http://twitter.com/GaryGregory



