List:       gentoo-dev
Subject:    [gentoo-dev] GLEP ??: Character Sets for the Portage Tree Items
From:       Ciaran McCreesh <ciaranm () gentoo ! org>
Date:       2004-10-27 23:08:16
Message-ID: 20041028000816.62ada3f8 () snowdrop ! home

This is a reaaaalllly quick one... I don't have an official GLEP number
yet, since Grant wasn't around when I threw this together, but from the
looks of things it'll probably be 31. It's all kloeri and langthang's
fault, they got me started on this...

Only reason that this is a GLEP is because there's no way anyone's going
to get around to committing themselves to anything without one... The
issue's been discussed several times before, always with the same
conclusion, but it seems that certain developers don't like hearing
"yeah, we agreed on $foo a while back" without something to back it
up... If anyone has any comments, please feel free to discuss. Hopefully
this one's not gonna be a problem though...

[ See attached thingie for details on what this whole thing's about, or
if you'd prefer a one word executive summary: "UTF-8". Grant, please do
your magic. ]

-- 
Ciaran McCreesh : Gentoo Developer (Vim, Fluxbox, Sparc, Mips)
Mail            : ciaranm at gentoo.org
Web             : http://dev.gentoo.org/~ciaranm


["utf8-glep.txt" (text/plain)]

GLEP: XX
Title: Character Sets for Portage Tree Items
Version: $Revision: $
Author: Ciaran McCreesh <ciaranm@gentoo.org>
Last-Modified: $Date: $
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 27-October-2004
Post-Date: 28-October-2004

Abstract
========

The portage tree needs a set of rules specifying which characters are
permissible and how they should be encoded.

Motivation
==========

At present we have several developers and many more users whose names
require characters (for example, accents) which are not part of the
standard 'safe' 0..127 ASCII range. There is no current standard on how
these should be represented, leading to inconsistency across the tree.

Although the issues involved have been discussed many times informally, no
official decision has been made.

Specification
=============

ChangeLog and Metadata Character Sets
-------------------------------------

It is proposed that UTF-8 ([1]_) is used for encoding ChangeLog and
metadata.xml files inside the portage tree.

UTF-8 allows the full range of Unicode ([2]_) characters to be expressed,
which is necessary given the diversity of the Gentoo developer- and
user-base.  It is character-compatible with ASCII for the 0..127
characters and does not significantly increase the storage requirements
for files which consist mainly of American English characters. It is
widely supported, widely used and an official standard.
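
As a rough illustration of the ASCII-compatibility and storage claims
above, the following Python sketch compares character and byte counts. The
helper name ``utf8_overhead`` and the sample strings are made up for this
example and are not part of any Gentoo tool.

```python
def utf8_overhead(text):
    """Return how many extra bytes UTF-8 needs beyond one byte per character."""
    return len(text.encode("utf-8")) - len(text)


# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_line = "app-editors/vim: version bump"
assert ascii_line.encode("utf-8") == ascii_line.encode("ascii")

# One accented character in an otherwise ASCII name costs one extra byte.
accented_name = "Jos\u00e9"  # hypothetical developer name
print(utf8_overhead(ascii_line), utf8_overhead(accented_name))
```

For a ChangeLog that is almost entirely ASCII, the overhead is therefore
limited to the handful of non-ASCII characters it contains.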

The ISO-8859-* character sets ([3]_) would *not* be appropriate since they
cannot express the full range of required characters.
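
A small Python sketch of this limitation, assuming two sample characters
picked for illustration: each one fits a different ISO-8859 part, so no
single part can hold a line containing both, whereas UTF-8 can. The helper
``encodable`` is a made-up name.

```python
def encodable(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+00F1 is in ISO-8859-1 but not ISO-8859-2; U+0142 is the reverse.
mixed = "\u00f1\u0142"
for encoding in ("iso-8859-1", "iso-8859-2", "utf-8"):
    print(encoding, encodable(mixed, encoding))
```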

Ebuild and Eclass Character Sets
--------------------------------

For the same reasons as above, it is proposed that UTF-8 is used as the
official encoding for ebuild and eclass files.

However, developers should be warned that any code which is parsed by
bash (in other words, anything other than comments), and any output which
is echoed to the screen (for example, einfo messages), must not use
anything outside the regular ASCII 0..127 range for compatibility
purposes.
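
A check along these lines could be sketched in Python as follows. This is
only a heuristic under stated assumptions: it treats a line as a comment
when its first non-blank character is ``#``, it does not recognise
trailing comments on code lines, and the function name
``non_ascii_code_lines`` is invented for this example (it is not an
existing QA tool).

```python
def non_ascii_code_lines(ebuild_text):
    """Return (line_number, line) pairs for non-comment lines with non-ASCII."""
    flagged = []
    for number, line in enumerate(ebuild_text.splitlines(), start=1):
        # Assumption: whole-line comments may contain any UTF-8 character.
        if line.lstrip().startswith("#"):
            continue
        if not line.isascii():
            flagged.append((number, line))
    return flagged


# Hypothetical ebuild fragment: the comment is fine, the variable is not.
sample = (
    "# Maintainer: Jos\u00e9\n"
    'DESCRIPTION="A caf\u00e9 tool"\n'
    'einfo "done"\n'
)
for number, line in non_ascii_code_lines(sample):
    print(number, line)
```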

files/ Entries Character Sets
-----------------------------

Clearly, patches must be in the same character set as the files they
patch. For other files/ entries (for example, GNOME desktop files),
consistency with the upstream-recommended character set is most sensible.

Suitable Characters for File and Directory Names
------------------------------------------------

Characters outside the ASCII 0..127 range cannot safely be used for file
or directory names. (Of course, not all characters inside the ASCII 0..127
range can be used safely either.)

Backwards Compatibility
=======================

The existing tree uses a mixture of encodings. It would be straightforward
to fix existing ChangeLogs and metadata files to use UTF-8.
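
One possible shape for such a fix, sketched in Python: leave data that is
already valid UTF-8 alone and re-encode everything else from a fallback
encoding. The fallback of ISO-8859-1 is purely an assumption made for this
example; since the tree uses a mixture of encodings, the right source
encoding would have to be checked per file. The name ``to_utf8`` is
hypothetical.

```python
def to_utf8(raw, fallback="iso-8859-1"):
    """Return raw bytes as UTF-8, re-encoding from fallback if necessary."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (this includes pure ASCII)
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in the fallback
        # encoding. A wrong guess here silently yields wrong characters.
        return raw.decode(fallback).encode("utf-8")


legacy_entry = b"Jos\xe9 <jose@example.org>"  # hypothetical ISO-8859-1 bytes
print(to_utf8(legacy_entry))
```

Because valid UTF-8 input is returned unchanged, running the conversion
twice over the same file does no further harm.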

The ``echangelog`` tool is character-set agnostic. In order to properly
enter UTF-8, developers would have to switch to a UTF-8 shell session.
This only applies if the developer is entering new text which uses 'fancy'
(non-ASCII) characters -- existing characters are not mangled.

Certain text editors are incapable of handling UTF-8 cleanly. However,
since the ``echangelog`` tool is generally the correct way to generate
ChangeLog entries, this should not be a major problem. Generating
metadata.xml files correctly in these editors could become problematic.
(The ``vim`` and ``emacs`` editors, which appear to be most widely used,
are both capable of handling UTF-8 cleanly.)

References
==========

.. [1] RFC 3629: UTF-8, a transformation format of ISO 10646
       http://www.ietf.org/rfc/rfc3629.txt
.. [2] ISO/IEC 10646 (Universal Multiple-Octet Coded Character Set)
.. [3] ISO/IEC 8859 (8-bit single-byte coded graphic character sets)

Copyright
=========

This document has been placed in the public domain.

.. vim: set tw=74 fileencoding=utf-8 :

