'Re: Heuristic autodetection of utf-8 unicode?'


List:       kde-devel
Subject:    Re: Heuristic autodetection of utf-8 unicode?
From:       Bryce Nesbitt <bryce-mailinglists () attbi ! com>
Date:       2002-02-26 4:05:10

On Monday 25 February 2002 04:52 pm, Nicolas Goutte wrote:
> On Monday 25 February 2002 21:37, Nicolas Goutte wrote:
> > On Monday 25 February 2002 18:29, Bryce Nesbitt wrote:
> > > Kde folks;
> > >
> > > First: Is there a standard kde/qt/glibc way that is recommended for kde
> > > authors to determine if a string is utf-8 encoded?
> > >
> > > Second: Has anyone any experience with using utf-8 encoded text in jpeg
> > > COM comment blocks?
> > >
> > >               -Bryce (bryce at obviously.com)
> >
> > Just an idea: if you are writing the UTF-8 strings yourself, you can
> > always add the BOM (Byte Order Mark, Unicode 0xEFFF) at the start of your
> > text.
>
> Sorry, I meant: 0xFEFF not 0xEFFF

Maybe that makes sense for comments that actually have non-ASCII
characters in them.  For pure ASCII comments, adding the BOM would
break interoperability with readers that expect plain ASCII.

Keep in mind the bigger problem is incorrectly identifying text as utf-8.  The 
BOM only helps positively identify utf-8; it does not help weed out those 
occasional nationally encoded strings that just happen to be valid utf-8.

I hear that for texts longer than a filename, the heuristic is quite good.
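
A rough sketch of the kind of heuristic I mean, in Python rather than
anything from kdelibs or Qt.  The function name looks_like_utf8 and the
sample strings are made up for illustration.  It assumes the usual rule:
the bytes must decode as strict utf-8 and must contain at least one
non-ASCII byte, since pure ASCII proves nothing either way.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True only if data has non-ASCII bytes and is strict UTF-8."""
    if all(b < 0x80 for b in data):
        # Pure ASCII is valid UTF-8 and valid in most national encodings,
        # so it gives no evidence for or against UTF-8.
        return False
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Latin-1 "cafe" with e-acute ends in a lone 0xE9 byte: not valid UTF-8.
latin1_cafe = "caf\u00e9".encode("latin-1")

# Latin-1 A-tilde followed by the copyright sign is the byte pair C3 A9,
# which is also valid UTF-8 (for e-acute).  This is the short, nationally
# encoded string that "just happens" to be UTF-8.
latin1_unlucky = "\u00c3\u00a9".encode("latin-1")

print(looks_like_utf8(latin1_cafe))     # rejected
print(looks_like_utf8(latin1_unlucky))  # accepted, wrongly
```

The second sample is the failure case: with only two bytes to look at,
nothing distinguishes it from real utf-8.  Longer text gives the check more
chances to hit an invalid sequence.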

> > You can look at Appendix F, "Autodetection of Character Encodings", of
> > the XML Recommendation (even if you do not plan to write or read XML.)
> > http://www.w3.org/TR/REC-xml
> >
> > Appendix F says that the BOM is encoded in UTF-8 like this:
> > 0xEF 0xBB 0xBF
> >
> > Have a nice day/evening/night!
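
For reference, a small sketch of the BOM check discussed above, again in
Python.  The helper names and the sample comment are hypothetical, not part
of any jpeg or KDE API.  It only shows that U+FEFF encodes to the three
bytes quoted, and that prefixing them turns a pure ASCII comment into
something that is no longer pure ASCII.

```python
# U+FEFF encoded as UTF-8 yields the three bytes EF BB BF.
UTF8_BOM = "\ufeff".encode("utf-8")


def has_utf8_bom(data: bytes) -> bool:
    """True if data starts with the UTF-8 encoded BOM."""
    return data.startswith(UTF8_BOM)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    return data[len(UTF8_BOM):] if has_utf8_bom(data) else data


# Hypothetical comment text, purely for illustration.
ascii_comment = b"holiday photo"
marked_comment = UTF8_BOM + ascii_comment

print(UTF8_BOM.hex())                 # hex of the BOM bytes
print(has_utf8_bom(marked_comment))   # BOM detected
print(has_utf8_bom(ascii_comment))    # no BOM, encoding still unknown
```

A reader that does not know about the BOM will see three stray high bytes in
front of marked_comment, which is the interoperability cost for pure ASCII
comments.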
