'Re: perl, the data, and the tf8 flag'

List:       perl5-porters
Subject:    Re: perl, the data, and the tf8 flag
From:       Juerd Waalboer <juerd () convolution ! nl>
Date:       2007-03-31 22:34:48
Message-ID: 20070331223447.GL31277 () c4 ! convolution ! nl

Glenn Linderman wrote 2007-03-31 14:35 (-0700):
> Juerd says two, but describes 3 types of data.

Well observed! :)

And in general, your post is a good summary of how things work.

> Juerd says two, but describes 3 types of data.

Binary and latin1 are hybrid in Perl. Only the programmer knows (can
know) the difference, Perl doesn't (can't). That's why to Perl, there
are only 2.

> As a developer that hopes to be implementing Unicode character support 
> in a perl application soon (tuits, always tuits), I have the following 
> questions and comments.  

I invite you to read perlunitut and perlunifaq. You'll have to look for
them (e.g. Google) or get them from bleadperl.

> 1) What operations can safely be used on bytes stored in a string 
> without causing implicit upgrades to multi-bytes?

All operations are safe, except:

1. operations that add characters greater than 255
2. joining text strings with byte strings (because the text string may
already internally be prepared for handling characters greater than 255,
and forces the byte string to be prepared in a similar way, breaking its
binaryness), and using text strings with byte operations (because those
cannot handle characters greater than 255).

If you read carefully, you'll notice that "1" is just a different way of
having "2", and come to the conclusion that there's only one simple, yet
important, guideline: keep binary and text separate, and only bridge
between the two by means of decoding and encoding.
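
A minimal sketch of that guideline, assuming the Encode module and UTF-8
as the external encoding (the sample data and variable names are only
illustrative): decode once when bytes come in, work on text, and encode
once when bytes go out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket:
# the UTF-8 encoding of "e with acute accent", followed by "!".
my $bytes_in = "\xc3\xa9!";

# Decode at the boundary: bytes in, text string out.
my $text = decode('UTF-8', $bytes_in);

# Text operations count characters, not bytes.
my $chars = length $text;               # 2

# Adding a character greater than 255 is fine for a text string.
$text .= "\x{263a}";

# Encode at the boundary: text in, bytes out.
my $bytes_out = encode('UTF-8', $text);
my $octets    = length $bytes_out;      # 6
```

Between the decode and the encode, the program never has to care how
the text is stored internally.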

> My perception from following all this discussion is that you can do any 
> operation, as long as all the data involved is bytes data that has never 
> been upgraded, except for decode, which always assumes a bytes parameter.

Your perception is correct.

> 2) What operations create multi-bytes data?

1. operations that add characters greater than 255
2. joining text strings with byte strings or byte operations, if the
text string is internally prepared for handling characters greater than
255 ("is internally encoded as UTF8", "has the UTF8 flag set").

This is, in essence, the same list as before :)
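
The effect of case 2 can be observed with utf8::is_utf8, which reports
the internal flag. This is only a sketch for looking at internals;
ordinary code should not need to test the flag. The sample values are
made up for illustration.

```perl
use strict;
use warnings;

my $binary = "\xff\xd8";    # two bytes of binary data
my $text   = "\x{100}";     # one character greater than 255

my $flag_before = utf8::is_utf8($binary) ? 1 : 0;   # 0

# Joining the two yields a string that is internally UTF8.
my $joined     = $binary . $text;
my $flag_after = utf8::is_utf8($joined) ? 1 : 0;    # 1

# The code points are unchanged: still three characters.
my $char_count = length $joined;                    # 3

# Peek at the internal buffer by encoding a copy: the two original
# bytes now take two bytes each, so they are no longer stored as-is.
my $internal = $joined;
utf8::encode($internal);
my $internal_length = length $internal;             # 6
```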

> 3) What operations create bytes data?

In general, everything that creates a new string value with only
characters in the 0..255 range, with the possible exception of
operations that are designed to create text strings only (like decode).

Some examples:

    "\xdf\xff\xa0\x00\xa1"            # binary (but can be used as text)
    "\x{df}\x{ff}\x{a0}\x{00}\x{a1}"  # binary (but can be used as text)
    "\x{100}..."                      # text (should not be used as bin)
    chr 255                           # binary (but can be used as text)
    chr 256                           # text (should not be used as bin)
    readline $fh                      # either, depends on io layers
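
The readline case can be sketched with in-memory filehandles, assuming
the data is UTF-8 encoded. The same bytes come back as a byte string or
as a text string depending on the layer given to open.

```perl
use strict;
use warnings;

# UTF-8 bytes for "e with acute accent", plus a newline.
my $data = "\xc3\xa9\n";

# With :raw, readline returns the bytes untouched.
open my $raw, '<:raw', \$data or die "open: $!";
my $line_bytes = readline $raw;
close $raw;

# With :encoding(UTF-8), readline returns decoded text.
open my $dec, '<:encoding(UTF-8)', \$data or die "open: $!";
my $line_text = readline $dec;
close $dec;

my $byte_count = length $line_bytes;   # 3
my $char_count = length $line_text;    # 2
```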

Note that "use encoding" makes almost every operation create text strings
only, which makes working with binary data very hard. My advice is to
avoid the module altogether.

> 4) What operations implicitly upgrade data from binary, assuming that 
> because of context it must be ISO-8859-1 encoded data?

In the core, only concatenating with an internally-UTF8 string. All
other operations that require UTF8 only upgrade temporarily.
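
A small sketch of the difference, again using utf8::is_utf8 purely to
inspect internals (the strings are illustrative): comparing a byte
string with an internally-UTF8 string leaves the byte string as it was,
while concatenating the two produces an upgraded result.

```perl
use strict;
use warnings;

my $bytes    = "caf\xe9";
my $upgraded = "caf\xe9";
utf8::upgrade($upgraded);    # same value, internally UTF8

# Comparison works across representations and upgrades nothing
# permanently.
my $same               = ($bytes eq $upgraded) ? 1 : 0;     # 1
my $flag_after_compare = utf8::is_utf8($bytes) ? 1 : 0;     # 0

# Concatenation makes the other operand part of the new string,
# so the result is internally UTF8.
my $joined      = $bytes . $upgraded;
my $flag_joined = utf8::is_utf8($joined) ? 1 : 0;           # 1
```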

There are modules that carelessly upgrade strings; this causes no
problem if your string is a text string and you decode/encode properly
and keep the string properly separated from byte strings. If you don't,
it might.

> My perception is _any operation_ that also includes another operand that 
> is UTF8 already.

When the "other operand" becomes part of the target string at some
point, yes.

> 5) It seems that there should be documented lists of operations and core 
> modules that a) never upgrade b) never downgrade c) always upgrade d) 
> always downgrade e) may upgrade f) may downgrade and the conditions 
> under which it may happen.

Instead of compiling lists, isn't adding this information to the
existing documentation a better idea?

Lists like this would suggest that you'd have to learn them by heart in
order to write safe code, while in fact you only need to keep text
separate from byte values and byte operations. The latter is a lot easier to
learn and universally applicable.

> Juerd's forthcoming perlunitut document seems to imply that the rules 
> are indeed common to all operations, but this discussion seems to 
> indicate that there might be a few exceptions to that... 

It does seem to indicate so, but I've yet to see proof of these other
supposed implicit upgrades...

> A) pack -- it is not clear to me how this operation could produce 
> anything except bytes for the packed buffer parameter, regardless of 
> other parameters supplied.

pack "U" works like chr. I'd strongly advise against using U with other
letters, because U makes text strings, and the other letters are byte
operations (so using them together would break the text/byte rule).

pack "U*", LIST is useful because it is more convenient than writing
join "", map chr, LIST.
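
As a quick sketch of that equivalence (the code points are arbitrary
examples):

```perl
use strict;
use warnings;

my @codepoints = (0x48, 0xe9, 0x263a);

# Both build the same three-character text string.
my $packed = pack "U*", @codepoints;
my $joined = join "", map { chr($_) } @codepoints;

my $equal = ($packed eq $joined) ? 1 : 0;   # 1
my $chars = length $packed;                 # 3
```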

> B) unpack -- it is not clear to me how this operation could successfully 
> process a multi-bytes buffer parameter, except by first downgrading it, 
> if it contains no values > 255, since all the operations on it are 
> defined in terms of unpacking bytes.

Indeed. Even downgrading is questionable, because its operand should
never have been upgraded in the first place.

Binary strings don't have any encoding, as far as Perl is concerned.
When it gets a string of which it is certain that it does have an
encoding, it can't possibly be binary.

non-U unpacking might as well return undef or random values, when it
gets a string that has the UTF8 flag set :)

Again, here's the special case for U, that works like "ord", but again
supports lists in a nicer way. And with unpack too, I think it's wrong
to mix U with other letters.
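
A sketch of the unpack counterpart, on a text string that contains a
character greater than 255 (the sample string is illustrative):

```perl
use strict;
use warnings;

my $text = "H\x{e9}\x{263a}";

# Both yield the list of code points of the text string.
my @via_unpack = unpack "U*", $text;
my @via_ord    = map { ord($_) } split //, $text;

my $unpack_list = "@via_unpack";   # "72 233 9786"
my $ord_list    = "@via_ord";      # "72 233 9786"
```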

> C) use bytes; -- clearly this impacts lots of other operations.

I advise against both "use bytes" and "use encoding".

> D) Data::Dumper -- someone made the claim that Data::Dumper simply 
> ignores the UTF-8 flag, and functions properly.  Could someone elucidate 
> how that happens?  

It's because the text/byte distinction exists in a programmer's mind,
not in the string value. There is a latin1/utf8 distinction in the
string value, internally, but the representation of the codepoints
doesn't change the value of the codepoints, so effectively, even with a
different internal encoding, you maintain the same string value.

If you use $Data::Dumper::Useqq, Dumper uses \ notation for non-ASCII,
so dumped binary strings even survive encoding layers. (Okay, the
encodings should be ASCII compatible, but most are.)

Without Useqq, D::D works fine on unicode strings and binary strings,
but your binary strings might be re-encoded in an inconvenient way.
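
A sketch of the Useqq behaviour, assuming $Data::Dumper::Terse is also
set so that the dump is a plain expression that can be eval'ed back.
The exact escapes in the dump may differ between Data::Dumper versions,
so the example only relies on the dump being ASCII and on the value
surviving the round trip, whatever the internal representation was.

```perl
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Useqq = 1;   # use \ notation for non-ASCII
local $Data::Dumper::Terse = 1;   # dump a bare expression, no $VAR1

my $binary   = "\xff\x00\xa1";
my $upgraded = $binary;
utf8::upgrade($upgraded);         # same value, internally UTF8

my $dump_plain    = Dumper($binary);
my $dump_upgraded = Dumper($upgraded);

# The dumps contain only ASCII, so an ASCII-compatible encoding
# layer cannot mangle them.
my $ascii_only = ($dump_plain . $dump_upgraded) !~ /[^\x00-\x7f]/ ? 1 : 0;

# Evaluating either dump gives back the original string value.
my $back_plain    = eval $dump_plain;
my $back_upgraded = eval $dump_upgraded;
```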

> G) regular expressions -- lots of reference is made to regular 
> expressions being broken, or at least different, for multi-byte stuff.  
> I fail to see why regular expressions are so hard to deal with. 

I guess that it is not particularly hard to deal with the bug, because
both kinds of semantics are already present. It's dealing with all the
code out in the wild that depends on the current buggy semantics that is
hard. To remain backwards compatible, new syntax has to be introduced
(but may be implied with "use 5.10", for example). See my thread "fixing
the regex engine wrt unicode".

> Firstly, regular expressions deal in "characters", not bytes, or 
> multi-byte sequences.  

Some people like to use regular expressions on their binary data, and I
think they should be able to keep doing it. Of course, things like /i or
the predefined character classes don't make sense there, and those are
the broken things.
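
A sketch of the predefined character class problem, assuming a perl
where the default matching rules are in effect (no "use feature
'unicode_strings'", no "use locale", and no /u, /a or /l modifier, which
are later additions). Two strings that compare equal give different
results for \w, depending only on their internal representation:

```perl
use strict;
use warnings;

my $plain    = "\xe9";     # LATIN SMALL LETTER E WITH ACUTE
my $upgraded = "\xe9";
utf8::upgrade($upgraded);  # same value, internally UTF8

my $same = ($plain eq $upgraded) ? 1 : 0;               # 1

# Under the default rules, \w only knows ASCII word characters
# when the string is not internally UTF8.
my $plain_is_word    = ($plain    =~ /\w/) ? 1 : 0;     # 0
my $upgraded_is_word = ($upgraded =~ /\w/) ? 1 : 0;     # 1
```
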
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.