List:       msql-mysql-modules
Subject:    Re: blessing db data as utf8
From:       Gaal Yahas <gaal () forum2 ! org>
Date:       2004-06-10 18:49:57
Message-ID: 20040610184957.GO17923 () sike ! forum2 ! org

[I hope nobody minds that I'm moving this thread to the DBD::mysql list,
because it seems like the best place for it. Please drop cdbi-talk
from replies.]


On Thu, Jun 10, 2004 at 07:01:30PM +0100, Tim Bunce wrote:
> On Thu, Jun 10, 2004 at 12:18:42PM +0300, Gaal Yahas wrote:
> > On Thu, Jun 10, 2004 at 09:51:06AM +0100, Tim Bunce wrote:
> > > This isn't a good way to check for utf8:
> > > 
> > > +int is_high_bit_set(char *val) {
> > > +    for (; *val; val++)
> > > +      if (*val & 0x80) return 1;
> > > +    return 0;
> > > +}
> > > 
> > > because it makes it hard for any latin-1 data to coexist.
> > > The perl guts probably has a function to check for well-formed utf8
> > > and that should be used instead.
> > 
> > This function is only used as an optimization. The actual decision is here:
> > 
> > +        if (imp_dbh->enable_utf8 &&
> > +            is_high_bit_set(col) && is_utf8_string(col, len))
> > +          SvUTF8_on(sv);
> 
> Ah, okay.
> 
> > That said, bad things are going to happen sooner or later if a table has
> > both latin-1 and utf8 data.
> 
> I'm thinking more about different fields having either latin-1 or utf8 data.
> 
> > But now that I think of it, I'm not sure the call to is_high_bit_set is
> > a good idea there, since SvUTF8_on() on a pure (7 bit) ASCII string
> > shouldn't do any harm
> 
> It does add overhead (and is actually harmful on 5.6.x where many
> utf8 bugs lurk) so the check is worthwhile.
> 
> > and may even be more correct if the string is later concatenated
> > with utf8 data.
> 
> No, perl will do-the-right-thing.
 
So all in all it sounds like this patch is simple, but correct? Steve
Hay mentioned another similar patch had been written but didn't reach CPAN;
I'd like to encourage the maintainers to release either version :-)
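
For anyone who wants to play with the decision outside the driver, here
is a minimal, self-contained sketch in plain C. It is not the patch
itself: has_high_bit(), looks_like_utf8() and should_flag_utf8() are
names made up for this illustration, and looks_like_utf8() is only a
simplified stand-in for perl's is_utf8_string() (it checks lead and
continuation byte structure, but does not reject overlong forms or
surrogates).

```c
#include <stddef.h>

/* Return 1 if any byte in buf[0..len) has its high bit set, i.e. the
 * data is not pure 7-bit ASCII. Works on an explicit length, so it
 * never reads past the column data. */
static int has_high_bit(const char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if ((unsigned char)buf[i] & 0x80)
            return 1;
    return 0;
}

/* Hypothetical, simplified stand-in for perl's is_utf8_string():
 * accepts only sequences whose lead byte announces 0..3 continuation
 * bytes and whose continuation bytes all look like 10xxxxxx. */
static int looks_like_utf8(const char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = (unsigned char)buf[i];
        size_t extra;
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return 0;               /* stray continuation or bad lead */
        if (extra > len - i - 1)
            return 0;                /* sequence truncated */
        i++;
        while (extra--) {
            if (((unsigned char)buf[i] & 0xC0) != 0x80)
                return 0;
            i++;
        }
    }
    return 1;
}

/* Mirrors the shape of the quoted condition: flag only when the
 * feature is enabled, the data is not plain ASCII, and it is
 * well-formed UTF-8. */
static int should_flag_utf8(int enable_utf8, const char *col, size_t len)
{
    return enable_utf8 && has_high_bit(col, len)
                       && looks_like_utf8(col, len);
}
```

With this sketch, the two UTF-8 bytes "\xc3\xa9" are flagged, while
plain "abc" and the latin-1 bytes "caf\xe9" are left alone, which is
the coexistence behaviour discussed above.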

> > I'm not sure what the cleanest way would be to go about this in the
> > long run (whose responsibility it is to say what is and what isn't
> > utf8) but the patch addresses an immediate need for people with
> > utf8-only data. Maybe this problem would go away in mysql 4.1; I'd
> > prefer not to wait.
> 
> Something along these lines is needed. But it does require careful thought.

Perhaps the application, or Class::DBI::mysql (which already has some
provisions for similar things), should be responsible for keeping track
of which fields use which charset, with no policy (except a default one)
being enforced at the DBD level. In this scheme the current approach
becomes part of the default handling, so it still makes sense to put it
in now.
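
A rough sketch of what I mean, in plain C and purely illustrative:
none of these names (charset_policy, column_policy, policy_for,
flag_as_utf8) exist in DBD::mysql or Class::DBI::mysql. The caller
supplies a per-column table, and any column it does not mention falls
back to the default heuristic.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-column policies a caller could declare. */
enum charset_policy {
    CS_DEFAULT,  /* no declaration: use the driver's heuristic      */
    CS_LATIN1,   /* never flag, even if the bytes parse as UTF-8    */
    CS_UTF8      /* flag whenever the data is well-formed UTF-8     */
};

struct column_policy {
    const char *column;
    enum charset_policy policy;
};

/* Look up the declared policy for a column name; columns that are
 * not listed get CS_DEFAULT. */
static enum charset_policy
policy_for(const struct column_policy *tab, size_t n, const char *column)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (strcmp(tab[i].column, column) == 0)
            return tab[i].policy;
    return CS_DEFAULT;
}

/* Decide whether a value should be flagged as UTF-8.
 * high_bit:   1 if the value contains any byte >= 0x80
 * valid_utf8: 1 if the value is well-formed UTF-8 */
static int
flag_as_utf8(enum charset_policy policy, int high_bit, int valid_utf8)
{
    switch (policy) {
    case CS_LATIN1: return 0;
    case CS_UTF8:   return valid_utf8;
    default:        return high_bit && valid_utf8;
    }
}
```

The only place the declared policies differ from the default is at the
edges: a latin-1 column is never flagged even when its bytes happen to
parse as UTF-8, and a declared UTF-8 column is flagged even when the
value is pure ASCII.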

-- 
Gaal Yahas <gaal@forum2.org>
http://gaal.livejournal.com/

-- 
MySQL Perl Mailing List
For list archives: http://lists.mysql.com/perl
To unsubscribe:    http://lists.mysql.com/perl?unsub=msql-mysql-modules@progressive-comp.com
