'Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings'


List:       python-dev
Subject:    Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
From:       "M.-A. Lemburg" <mal () egenix ! com>
Date:       2015-11-15 20:38:45
Message-ID: 5648ED55.4020906 () egenix ! com

On 14.11.2015 23:56, Victor Stinner wrote:
> These encodings are rarely used. I don't think that any text editor uses
> them. Editors use ascii, latin1, utf8 and... all locale encodings. But I
> don't know of any OS using UTF-16 as a locale encoding. UTF-32 wastes disk
> space.

UTF-16 is used a lot for Windows text files, e.g. Unicode
CSV files (the save as "Unicode text file" option writes
UTF-16).

However, nowadays all text editors also support UTF-8, and
many of them recognize the UTF-8 BOM as a marker for detecting
Unicode text files.
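
To illustrate the BOM point, here is a minimal sketch of how a tool could
sniff a BOM before decoding a text file. The helper name sniff_bom() and the
table _BOMS are hypothetical, made up for this example; this is not what any
particular editor or CPython does, only one plausible approach using the BOM
constants from the standard codecs module.

```python
import codecs

# Longer BOMs are checked first: the UTF-32-LE BOM (FF FE 00 00)
# starts with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name suggested by a leading BOM, or None.

    Hypothetical helper: the returned codec names consume the BOM
    when used for decoding.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# A "Unicode text" style export: UTF-16 with a BOM.
raw = "name,city\nZoë,Köln\n".encode("utf-16")
encoding = sniff_bom(raw) or "utf-8"  # the fallback is an assumption
text = raw.decode(encoding)
```

A file without any BOM gives None here, so the caller still has to pick a
default; the sketch falls back to UTF-8.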

> Ok, even if it exists, Python already accepts a very wide range of
> encodings. It is not worth making the parser much more complex just to
> support encodings which are also never used (for .py files).

Agreed. In Python 2 we decided to only allow ASCII supersets
for Python source files, which ruled out multi-byte encodings
such as UTF-16 and UTF-32. I don't think we need to make the parser
more complex just to support them. UTF-8 works fine as a Python
source code encoding.
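
A rough way to see why UTF-16 and UTF-32 fall outside the ASCII superset
rule: an ASCII-only line such as a coding cookie keeps the same bytes under
UTF-8 or Latin-1, but not under UTF-16 or UTF-32. The helper
is_ascii_superset() below is hypothetical and only a sample-based check
written for this example, not the test the parser performs.

```python
import string


def is_ascii_superset(encoding):
    """Heuristic: do ASCII characters encode to the same bytes as in ASCII?

    Hypothetical helper; it only checks the characters in string.printable.
    """
    sample = string.printable  # ASCII-only characters
    try:
        return sample.encode(encoding) == sample.encode("ascii")
    except (UnicodeError, LookupError):
        return False


cookie = "# -*- coding: utf-8 -*-"

# Under UTF-8 the cookie is byte-for-byte ASCII, so it can be found
# before the encoding is known.
cookie_utf8 = cookie.encode("utf-8")

# Under UTF-16 every ASCII character gets a NUL byte next to it, so a
# byte-level search for "coding" no longer matches.
cookie_utf16 = cookie.encode("utf-16-le")
found_in_utf8 = b"coding" in cookie_utf8
found_in_utf16 = b"coding" in cookie_utf16
```

Here is_ascii_superset("utf-8") and is_ascii_superset("latin-1") are true,
while "utf-16" and "utf-32" are not.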

> Victor
> On 14 Nov 2015 20:20, "Serhiy Storchaka" <storchaka@gmail.com> wrote:
>
>> For now UTF-16 and UTF-32 source encodings are not supported. There is a
>> comment in Parser/tokenizer.c:
>>
>>     /* Disable support for UTF-16 BOMs until a decision
>>        is made whether this needs to be supported.  */
>>
>> Can we make a decision whether this support will be added in the foreseeable
>> future (say, within the next 10 years), or not?
>>
>> Removing the commented-out and related code will help to refactor the
>> tokenizer, and that can help to fix some existing bugs (e.g. issue14811,
>> issue18961, issue20115 and maybe others). The current tokenizing code is too
>> tangled.
>>
>> If support for UTF-16 and UTF-32 is planned, I'll take this into
>> account during refactoring. But in many places besides the tokenizer, an
>> ASCII-compatible encoding of source files is expected.
>>
>> _______________________________________________
>> Python-Dev mailing list
>> Python-Dev@python.org
>> https://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe:
>> https://mail.python.org/mailman/options/python-dev/victor.stinner%40gmail.com
>>

-- 
Marc-Andre Lemburg
eGenix.com

