'Re: [Python-ideas] Bytestrings in Python 2'

List:       python-ideas
Subject:    Re: [Python-ideas] Bytestrings in Python 2
From:       Andrew Barnert <abarnert () yahoo ! com ! dmarc ! invalid>
Date:       2015-04-26 9:22:05
Message-ID: E9D01FCD-BB22-4230-A133-109EF6F01F4A () yahoo ! com

On Apr 25, 2015, at 20:02, Nick Coghlan <ncoghlan@gmail.com> wrote:
> 
> On 26 April 2015 at 05:27, Markus Unterwaditzer
> <markus@unterwaditzer.net> wrote:
> > On Sat, Apr 25, 2015 at 07:14:57PM +0300, Serhiy Storchaka wrote:
> > > Here is an idea that perhaps will help to prepare Python 2 code for
> > > converting to Python 3.
> > > 
> > > Currently bytes is just an alias of str in Python 2, and the "b" prefix of
> > > string literals is ignored. There are no differences between natural strings
> > > and bytes. I propose to add a special bit to str instances and set it for
> > > bytes literals and strings created from binary sources (read from binary
> > > files, received from sockets, the result of unicode.encode() and
> > > struct.pack(), etc). With the -3 flag, operations on binary strings that
> > > aren't allowed for bytes in Python 3 (e.g. encoding or coercing to unicode)
> > > will emit a warning. Unfortunately we can't change the bytes constructor in
> > > a minor version; it should remain an alias of str in 2.7. So the result of
> > > bytes() will not be tagged as a binary string.
> > > 
> > > Maybe it is too late for this.
> > 
> > You can get similar kinds of warnings with unicode-nazi
> > (https://github.com/mitsuhiko/unicode-nazi), so I'm not sure if this would be
> > that helpful.
> 
> Mentioning that utility in the porting guide could potentially be
> useful, but I don't think it's a substitute for Serhiy's suggestion
> here.
> 
> Serhiy's suggestion covers a slightly different situation, which is
> that we can't warn about the following code snippet in Python 2, even
> though we know bytes objects don't have an encode method in Python 3:
> 
> "str".encode(encoding)
> 
> The reason is that we can't easily tell the difference between
> something that is correct in both Python 2 & 3 like:
> 
> "text".encode("utf-8")
> 
> (str->str encoding in Python 2, str->bytes encoding in Python 3)
> 
> and something that will break in Python 3 like:
> 
> "data".encode("hex")

I don't think that's a problem. The former is legal in both 2.x and 3.x, but
it means something different: in 2.x it means "decode with the default
encoding (ASCII, unless someone has changed it), then re-encode as UTF-8".
Unless it's called on a pure-ASCII literal, there's no reason to expect that
code to work without changes in 3.x.
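
To make the 2.x meaning concrete, here is a rough model of it written so that
it runs on 3.x. The helper name py2_style_encode is made up for illustration,
and the "ascii" default is an assumption that matches an unmodified
sys.getdefaultencoding() in 2.x. It only models text encodings such as UTF-8.

```python
def py2_style_encode(data, encoding, default_encoding="ascii"):
    """Model what 2.x str.encode() does for a text encoding.

    `data` is a byte string, standing in for a 2.x str.
    """
    # Step 1: 2.x implicitly decodes the byte string with the default
    # encoding (assumed to be ASCII here).
    text = data.decode(default_encoding)
    # Step 2: only then is the requested encoding applied.
    return text.encode(encoding)


# A pure-ASCII value survives the round trip unchanged.
print(py2_style_encode(b"text", "utf-8"))

# UTF-8 bytes for a non-ASCII string fail in the hidden decode step.
try:
    py2_style_encode(b"caf\xc3\xa9", "utf-8")
except UnicodeDecodeError as exc:
    print("implicit decode failed:", exc)
```

The second call is the case that tends to show up only in production: the
failure comes from the decode nobody wrote, not from the encode they did.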

(And that's even ignoring the fact that the vast majority of calls to
str.encode are bugs, either called on a variable you think is a unicode but
is actually a str, or just introduced by a novice throwing in random calls to
encode, decode, and str until some exception goes away.)

Whether the encoding is a literal for a unicode->bytes encoding, a literal
for a non-3.x-compatible encoding, or a variable whose value can't be guessed
statically doesn't really matter; in every case, it's something you need to
look at for porting to 3.x. So a warning still seems both doable and worth
doing, whether you're talking about a static linter tool or a -3 mode that
tracks probably-bytes.
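
As a sketch of the static-linter side, something along these lines would
flag every .encode() call and say which of the three cases it is. The
function name, the category labels and the BINARY_CODECS set are all
assumptions of mine, not an existing tool, and it assumes the source parses
under whatever interpreter runs the check. It looks only at the first
positional argument and ignores encoding= keywords.

```python
import ast

# Assumption: a hand-picked, incomplete set of bytes-to-bytes codec names.
BINARY_CODECS = {"hex", "base64", "zlib", "bz2", "uu", "quopri"}


def find_encode_calls(source):
    """Return sorted (lineno, category) pairs for each `.encode(...)` call."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        is_encode_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "encode"
        )
        if not is_encode_call:
            continue
        arg = node.args[0] if node.args else None
        if arg is None:
            category = "default encoding"
        elif isinstance(arg, ast.Constant) and isinstance(arg.value, str):
            if arg.value.lower() in BINARY_CODECS:
                category = "binary codec"
            else:
                category = "text encoding"
        else:
            # A variable or expression: cannot be classified statically.
            category = "dynamic encoding"
        findings.append((node.lineno, category))
    return sorted(findings)


sample = (
    'a = "text".encode("utf-8")\n'
    'b = "data".encode("hex")\n'
    'c = s.encode(enc)\n'
)
for lineno, category in find_encode_calls(sample):
    print("line %d: .encode() with %s, review for 3.x" % (lineno, category))
```

Every category gets reported; the label only tells you what kind of review
the call needs.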

> The single source version of the latter is actually
> 'codecs.encode(b"data", "hex")', but it's quite hard for an analyser
> or test suite to pick that up and recommend the change, as it's hard
> to tell the difference between "str-as-text-object" and
> "str-as-binary-data-object" in Python 2.
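
For reference, a minimal sketch of that single-source spelling. The wrapper
name hex_encode is made up for illustration; codecs.encode and codecs.decode
are the real stdlib calls.

```python
import codecs


def hex_encode(data):
    """Hex-encode a byte string; the same spelling runs on 2.7 and 3.x."""
    return codecs.encode(data, "hex")


encoded = hex_encode(b"data")
print(encoded)

# The round trip through the matching decoder returns the original bytes.
assert codecs.decode(encoded, "hex") == b"data"

# On 3.x the method spelling is rejected because "hex" is not a text
# encoding; on 2.x the same call succeeds, so nothing is printed there.
try:
    "data".encode("hex")
except LookupError as exc:
    print("str.encode refused:", exc)
```
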
> 
> Looking at the way string objects are stored in Python 2, it's
> possible that the ob_sstate field (which tracks the interning status
> of string instances) could potentially be co-opted to hold this
> additional flag. If the new flag was *only* set when "-3" was
> specified, then there'd only be a potential compatibility risk in that
> context (the PyString_CHECK_INTERNED macro currently assumes that a
> non-zero value in ob_sstate always indicates an interned string).
> 
> There'd be a more general performance risk however, as
> PyString_CHECK_INTERNED would also need to be updated to either mask
> out the new "this is probably binary data" state flag unconditionally,
> or else to check the Py3k warning flag and mask out the new flag
> conditionally. Either way, we'd be making "is this interned or not?"
> checks slightly more expensive and the interpreter does a *lot* of
> those.
> 
> Cheers,
> Nick.
> 
> -- 
> Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia
> _______________________________________________
> Python-ideas mailing list
> Python-ideas@python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/