'[capi-sig]PyUnicode C API'


List:       python-capi-sig
Subject:    [capi-sig]PyUnicode C API
From:       Victor Stinner <vstinner () redhat ! com>
Date:       2018-09-07 9:57:16
Message-ID: CA+3bQGFiVhwgw_q-obvviDpQ2iHkRCNBDXdwyjZjqJaUuw7DfQ () mail ! gmail ! com

On 07.09.2018 10:22, Victor Stinner wrote:
> > I'm in discussion with PyPy developers, and they reported different
> > APIs which cause them troubles:
> > (...)
> > * almost all PyUnicode API functions have to go according to them.
> > PyPy3 uses UTF-8 internally, CPython uses "compact string" (array of
> > Py_UCS1, Py_UCS2 or Py_UCS4 depending on the string content).
> >   https://pythoncapi.readthedocs.io/bad_api.html#pypy-requests

On Fri, Sep 7, 2018 at 10:33, M.-A. Lemburg <mal@egenix.com> wrote:
> I'm -1 on removing the PyUnicode APIs. We deliberately created a
> useful and very complete C API for Unicode.
>
> The fact that PyPy chose to use a different internal representation
> is not a good reason to remove APIs and have CPython extensions take
> the hit as a result. It would be better for PyPy to rethink the
> internal representation or create a shim API which translates
> between the two worlds.
>
> Note that UTF-8 is not a good internal representation for Unicode
> if you want fast indexing and slicing. This is why we are using
> fixed code units to represent the Unicode strings.
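
To make the indexing argument quoted above concrete, here is a rough
sketch in plain C. The function names are made up for illustration;
they are not part of the CPython or PyPy API, and the UTF-8 input is
assumed to be valid. Finding the N-th code point in UTF-8 needs a scan,
whereas fixed-width code units allow a direct array access:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a CPython or PyPy function.
 * Return the byte offset of code point number `index` in a valid UTF-8
 * buffer of `len` bytes, or (size_t)-1 if `index` is out of range.
 * The scan is O(n): every byte before the target is inspected. */
size_t utf8_offset_of(const unsigned char *s, size_t len, size_t index)
{
    size_t seen = 0;
    for (size_t i = 0; i < len; i++) {
        /* Bytes of the form 10xxxxxx continue a code point: skip them. */
        if ((s[i] & 0xC0) == 0x80)
            continue;
        if (seen == index)
            return i;
        seen++;
    }
    return (size_t)-1;
}

/* Hypothetical helper: with fixed-width code units, indexing is a
 * single array access, so it is O(1). */
uint32_t ucs4_char_at(const uint32_t *s, size_t len, size_t index)
{
    assert(index < len);
    return s[index];
}
```

An implementation based on UTF-8 can reduce this cost with an index
cache, but that is extra machinery the fixed-width layout does not need.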

The PyUnicode C API is not only an issue for PyPy, it's also an issue
for CPython. When PEP 393 was implemented, most of the PyUnicode API
was suddenly deprecated: all functions using the now-legacy
Py_UNICODE* type...

Python 3.7 still has to support both the legacy Py_UNICODE* API and
the new "compact string" API. It makes the CPython code base way more
complex than it should be: any function accepting a string is supposed
to call PyUnicode_Ready() and handle errors properly. I would prefer to
be able to remove the legacy PyUnicodeObject type, to only use compact
strings everywhere.

Let me elaborate on which PyUnicode functions are good and which are bad.

Examples of bad APIs:

* PyUnicode_IS_COMPACT(): this API really relies on the *current* implementation
* PyUnicode_2BYTE_DATA(): should only be used internally, there is no
need to export it
* PyUnicode_READ()
* Py_UNICODE_strcmp(): uses Py_UNICODE which is an implementation detail
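
To show why accessors of this kind are tied to the current
implementation, here is a minimal model of "compact string" storage in
plain C. All names are hypothetical and only mirror the idea of
Py_UCS1/Py_UCS2/Py_UCS4 storage; this is not CPython's actual code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical storage widths, in bytes per character. */
enum width { WIDTH_1 = 1, WIDTH_2 = 2, WIDTH_4 = 4 };

/* Pick the narrowest code unit able to hold every code point. */
enum width min_width(const uint32_t *cp, size_t n)
{
    uint32_t max = 0;
    for (size_t i = 0; i < n; i++) {
        if (cp[i] > max)
            max = cp[i];
    }
    if (max < 0x100)
        return WIDTH_1;
    if (max < 0x10000)
        return WIDTH_2;
    return WIDTH_4;
}

/* Read character `i`: the caller has to pass the storage width, so
 * the internal layout leaks into every call site. */
uint32_t read_char(enum width w, const void *data, size_t i)
{
    assert(data != NULL);
    switch (w) {
    case WIDTH_1: return ((const uint8_t *)data)[i];
    case WIDTH_2: return ((const uint16_t *)data)[i];
    default:      return ((const uint32_t *)data)[i];
    }
}
```

An implementation which stores strings as UTF-8 has no such width to
report, so it can only emulate these calls by converting the string
first.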

Good API:

* PyUnicode_Concat(): C API for str + str
* PyUnicode_Split()
* PyUnicode_FindChar()

Borderline:

* PyUnicode_IS_ASCII(op): it's an O(1) operation on CPython, but it can
be O(n) on other implementations (like PyPy which uses UTF-8). But we
also added str.isascii() in Python 3.7....
* PyUnicode_READ_CHAR()
* PyUnicode_CompareWithASCIIString(): the function name announces
ASCII but decodes the byte string from Latin1 :-)
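
A rough sketch of the first and last points, in plain C with made-up
function names (nothing here is a real CPython or PyPy function). The
first function is the O(n) scan that an implementation using UTF-8
needs when it has no cached flag. The second one compares a string with
a byte string by treating each byte as a Latin1 code point, which is
the behavior described for PyUnicode_CompareWithASCIIString():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: O(n) ASCII test over UTF-8 bytes. Any byte with
 * the high bit set belongs to a non-ASCII code point. */
int utf8_is_ascii(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: compare `n` code points with a NUL-terminated
 * byte string, reading each byte as Latin1 (byte value == code point).
 * Return -1, 0 or 1. */
int compare_with_latin1(const uint32_t *cp, size_t n, const char *bytes)
{
    assert(bytes != NULL);
    const unsigned char *b = (const unsigned char *)bytes;
    size_t i = 0;
    for (; i < n && b[i] != 0; i++) {
        if (cp[i] != b[i])
            return cp[i] < b[i] ? -1 : 1;
    }
    if (i < n)
        return 1;   /* the code point string is longer */
    if (b[i] != 0)
        return -1;  /* the byte string is longer */
    return 0;
}
```

With this model, the code point U+00E9 compares equal to the single
byte 0xE9, even though that byte is not ASCII.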

Victor
