List:       freetds
Subject:    [freetds] Re: Unicode
From:       Steve Langasek <vorlon () netexpress ! net>
Date:       2001-06-10 15:39:38

On Sun, 10 Jun 2001, James K. Lowden wrote:

> What about the insert side of things?  That is, when I send "insert T (U)
> values ( 'abc' )" and T.U is a Unicode column, what the heck is FreeTDS
> supposed to do?  Currently (I surmise) it "upgrades" the whole string to
> UCS2 because that's what TDS7 specifies.  Once iconv is in place, that
> upgrade will work correctly and the 'abc' string will arrive at the server
> in good UCS2 shape along with the rest of the insert string, even if 'abc'
> is in Greek, because the command buffer will be UTF-8, which sneakily
> resembles ASCII to English speakers living in the lower 127.

I realized late last night, after poking through more Slashdot noise, that you
/cannot/ losslessly convert UTF-8 to UCS2: UCS2 can only represent the first
65536 code points of Unicode, but UTF-8 is an encoding that supports the whole
Unicode character set, meaning CJK users using some of the more esoteric
characters will lose them between the client and the server.  It also means
that users writing in other character sets that produce byte streams that
/look/ like the UTF-8 representation of these characters will lose data if we
automatically treat their input as UTF-8 and try to convert it to UCS2.
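
To make both failure modes concrete, here is a minimal Python sketch.  It is
not FreeTDS code: the function names fits_in_ucs2 and utf8_to_ucs2le are made
up for illustration, and little-endian byte order on the wire is an assumption.

```python
def fits_in_ucs2(text: str) -> bool:
    """True if every code point is at or below U+FFFF, the UCS2 ceiling."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def utf8_to_ucs2le(raw: bytes) -> bytes:
    """Treat raw as UTF-8 and emit exactly 2 bytes per character.

    Hypothetical helper: raises ValueError instead of silently dropping
    characters that UCS2 has no room for.
    """
    text = raw.decode("utf-8")
    if not fits_in_ucs2(text):
        raise ValueError("code point above U+FFFF has no UCS2 representation")
    # For text limited to U+0000..U+FFFF, UTF-16 and UCS2 bytes are identical.
    return text.encode("utf-16-le")


# Case 1: a character beyond the first 65536 code points.
# U+20000 is a CJK ideograph whose UTF-8 form is the 4 bytes F0 A0 80 80.
rare_cjk = "\U00020000".encode("utf-8")

# Case 2: bytes written in another character set that happen to be valid UTF-8.
# In ISO-8859-1 these are two characters; read as UTF-8 they are one.
ambiguous = b"\xc3\xa9"
as_latin1 = ambiguous.decode("latin-1")
as_utf8 = ambiguous.decode("utf-8")
```

Plain ASCII and Greek input pass through utf8_to_ucs2le unchanged in meaning,
while rare_cjk is rejected.  In the second case as_latin1 and as_utf8 are
different strings of different lengths, so guessing UTF-8 changes the data
before any conversion to UCS2 even happens.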

Another possibility would be to treat the app's (char*) input as UTF-8, and
convert it to UTF-16, a Unicode encoding that uses 2- and 4-byte
representations of characters.  UTF-8 -> UTF-16 conversion is lossless, UTF-16
and UCS2 are byte-compatible (modulo some endianness touch-up) for every
character UCS2 can represent, and using UTF-16 would give a gentle push
for modernization of the Microsoft libraries.  In the short term, this has the
effect that although FreeTDS apps will have no problems retrieving data from
the SQL server, apps using the Microsoft or Sybase libraries may have
difficulty interpreting some of this same data.  However, I think a little bit
of garbling is easier for everyone to deal with than losing data outright.
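
A small Python sketch of that trade-off follows.  Again this is illustration
only, not FreeTDS code: utf8_to_utf16le and read_as_ucs2le are hypothetical
names, and little-endian byte order is assumed.

```python
import struct


def utf8_to_utf16le(raw: bytes) -> bytes:
    """Convert UTF-8 input to UTF-16 (little-endian); nothing is dropped."""
    return raw.decode("utf-8").encode("utf-16-le")


def read_as_ucs2le(wire: bytes) -> list:
    """What a UCS2-only reader sees: one 16-bit value per 2 bytes."""
    count = len(wire) // 2
    return list(struct.unpack("<%dH" % count, wire[: count * 2]))


# A Greek alpha (U+03B1) followed by a CJK ideograph beyond U+FFFF (U+20000).
raw = "\u03b1\U00020000".encode("utf-8")
wire = utf8_to_utf16le(raw)

# Round trip back to UTF-8 recovers the original bytes exactly.
round_trip = wire.decode("utf-16-le").encode("utf-8")

# A UCS2-only reader gets the alpha right, then two surrogate values
# (0xD840, 0xDC00) that it cannot interpret as characters.
units = read_as_ucs2le(wire)
```

Here round_trip equals raw, which is the lossless part, and units comes out
as three 16-bit values for two characters, which is the "little bit of
garbling" a UCS2-only library would see.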

> I've been following the Gnome development pretty closely, Dia in particular,
> where there is also a Unicode/UTF-8 debate.  The sense of the Senate there
> is that everyone's adopting UTF-8 for internal representation.  The strength
> of UTF-8 seems to be that it can do anything UCS2 can do, without the
> hassles of endianism, and without another upgrade as UCS4 comes online.
> Based on that information, I'd be surprised if Mr. Peppler or our friends at
> Sqsh will be showing up asking for clear Unicode support anytime soon.

UTF-8 can do more than UCS2 can do; if we were at liberty to change the line
protocol to UTF-8 or UCS4, either would be better than UCS2.  But Microsoft
hasn't announced any plans to upgrade from UCS2 in the near future, so we're
stuck trying to be compatible with Microsoft's partial Unicode implementation.

> My prescription, respectfully submitted, comes down to this:
> 1.    Ctlib.    Follow Sybase.
> 2.    Dblib.    All internal representation in UTF-8.
> 3.    ODBC.     Unicode pass-thru.

Doesn't the FreeTDS ODBC driver exist as a wrapper around one of the
lower-level APIs?  For varying definitions of 'pass-thru', then, this sounds
right to me.

> I'll stop here, before Brian tells me that the very next suggestion I make
> had better be expressed in 'C'.

Perhaps I should start looking deeper into the FreeTDS code, for both of our
sakes? :)

Steve Langasek
postmodern programmer

