List: squeak-dev
Subject: Re: [squeak-dev] The Trunk: Collections-topa.807.mcz
From: Tobias Pape <Das.Linux () gmx ! de>
Date: 2018-09-14 8:15:58
Message-ID: B35D32AA-BB07-486E-BFE8-AC19F8B2696E () gmx ! de
Hi,
I reverted my change.
I understand Levente's point: as long as we don't consider Unicode's separator
categories proper (https://www.fileformat.info/info/unicode/category/Zs/list.htm,
and maybe https://www.fileformat.info/info/unicode/category/Zl/list.htm and
https://www.fileformat.info/info/unicode/category/Zp/list.htm),
it is preposterous to make an exception for NBSP.
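To make concrete what "considering the separator categories proper" would involve, here is a minimal sketch. It is in Python rather than Squeak, purely for illustration, and the names `separators_by_category` and `ASCII_SEPARATORS` are made up for this example. It only assumes that Python's `unicodedata` module reports the Unicode general category of each code point.

```python
import sys
import unicodedata

# The five ASCII separators answered by Character class>>separators.
ASCII_SEPARATORS = [32, 13, 9, 10, 12]

def separators_by_category(categories=("Zs", "Zl", "Zp")):
    """Answer all code points whose general category is one of `categories`."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) in categories]

z_separators = separators_by_category()

# Which ASCII separators are actually in a Unicode Z category?
ascii_in_z = [cp for cp in ASCII_SEPARATORS if cp in z_separators]

for cp in z_separators:
    print(hex(cp), unicodedata.category(chr(cp)), unicodedata.name(chr(cp)))
```

Note that of the ASCII separators only space is in a Z category; tab, line feed, form feed and cr are control characters (Cc). So a "proper" notion of separator would have to be defined as ASCII separators plus the Z categories, not by the Z categories alone.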
Ron raised a good point, and I thought the fix would be swift; I was wrong, tho.
(the following does NOT apply to the 5.2 release)
Regarding what others have written, e.g., about UTF-8 and such, here is my reasoning.
1. Encoding conversion should not be done from string to string, but rather only
Encoding: String => ByteArray
Decoding: ByteArray => String
(In theory, we could make a class, e.g., UTF8, that inherits from ByteArray to make
some things clear.)
2. UTF-8 is a very good idea; the site http://utf8everywhere.org/ raises very good
points. It is not important for Squeak to internally encode Strings as UTF-8, I think,
tho it wouldn't hurt. The current Byte/Wide distinction, with the property that all
values in a string correspond to Unicode code points, is nice and even clever.
However, sometimes that bites, e.g., when you write things on a Stream[1].
3. Regarding the often mentioned importance of constant time access to
characters and easy computation of string length:
This depends heavily on the notion of what a character is.
This is an easy thing for ASCII chars, so there's that.
Also, one could say that "a character is any instance of Character", which is
technically correct. However, the questions you can ask with that, namely
- Where is the instance qurxs of Character in this string and
- How many instances of Character are in this string
_are_ easy to answer with a 'direct' encoding (e.g., ByteString for ASCII or Latin-1,
UTF-32/WideString for Unicode etc.) but actually less meaningful than one might think.
The UTF-8 Everywhere page hints in that direction:
'A programmer might count characters as code units, code points, or grapheme
clusters, according to the level of the programmer's Unicode expertise.'
A more in-depth discussion can be found at:
https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
(please read it, and if you have time, the follow-up
https://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/)
I think we need a distinction in String between
(a) size, aka the number of storage entities (e.g., number of bytes/words), and
(b) displaySize/length, aka the number of Extended Grapheme Clusters (EGC) [2], what
users will see when they print the string.
Also, we need to have a distinction between
(a) what value is at memory position x in the string and
(b) what is the x-th grapheme cluster.
Only (a) can be answered in constant time, anyhow.
So embracing this, we could also go UTF-8 internally.
4. Yes, we need better font support.
But, yes, not for 5.2
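
The size/length distinction from point 3 can be sketched as follows (again in Python, for illustration only; `storage_size`, `code_point_count` and `approx_cluster_count` are hypothetical names). Beware that the cluster count here is a deliberate simplification: it only folds combining marks into their base character and is NOT the full Extended Grapheme Cluster algorithm, which also handles things like ZWJ sequences and Hangul.

```python
import unicodedata

def storage_size(s, encoding="utf-8"):
    """(a) Number of storage entities (here: bytes) in the given encoding."""
    return len(s.encode(encoding))

def code_point_count(s):
    """Number of Unicode code points; what a 'direct' encoding makes cheap."""
    return len(s)

def approx_cluster_count(s):
    """(b) Rough stand-in for the number of grapheme clusters.

    Assumption: a cluster starts at every code point that is not a
    combining mark. Real EGC segmentation has many more rules.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
s = "e\u0301"
print(storage_size(s))               # 3 bytes in UTF-8
print(storage_size(s, "utf-32-le"))  # 8 bytes in UTF-32
print(code_point_count(s))           # 2 code points
print(approx_cluster_count(s))       # 1 (approximate) cluster
```

The point being: the cluster count needs a scan over the string regardless of the storage format, so constant-time indexing of code points buys less than it seems.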
Best regards
-Tobias
PS: how many characters is this?: ﷽
(fun fact: one. one code point, one grapheme cluster…)
PPS: Boy this got long. Sorry.
[1]: I was bitten here: I wrote the same string on a Socket stream and on a File
stream. The former retained the internal encoding, which for byte strings happens
to be Latin-1, a subset of Unicode; the latter encoded to UTF-8, and I wondered
why the network endpoint rejected my string as not-UTF-8.
[2]: Swift and Perl 6 apparently use EGCs
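
What happened in [1], and why point 1 asks for String => ByteArray only, can be reproduced with a small sketch (Python for illustration; the sample string and the helper `is_valid_utf8` are made up for this example). Python's `str.encode`/`bytes.decode` happen to follow exactly that string-to-bytes and bytes-to-string shape.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Answer whether `data` decodes as UTF-8, like a strict endpoint would check."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "Grüße"

# Encoding: String => bytes. The same string yields different bytes.
latin1_bytes = text.encode("latin-1")  # what the Socket stream effectively sent
utf8_bytes = text.encode("utf-8")      # what the File stream wrote

print(len(latin1_bytes), len(utf8_bytes))
print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))

# Decoding: bytes => String. Only works with the matching encoding.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text
```

For pure ASCII strings both encodings produce the same bytes, which is why this kind of mismatch tends to go unnoticed until the first non-ASCII character shows up.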
> On 14.09.2018, at 09:38, commits@source.squeak.org wrote:
>
> Tobias Pape uploaded a new version of Collections to project The Trunk:
> http://source.squeak.org/trunk/Collections-topa.807.mcz
>
> ==================== Summary ====================
>
> Name: Collections-topa.807
> Author: topa
> Time: 14 September 2018, 9:37:43.484317 am
> UUID: fae1c8b3-8396-4790-a491-4e51b047bc49
> Ancestors: Collections-topa.806
>
> Revert for consistency and, subsequently, speed.
>
> The correct fix is not as trivial and not fit in the beta phase.
>
> Sorry, Ron.
>
> =============== Diff against Collections-topa.806 ===============
>
> Item was changed:
> ----- Method: Character class>>separators (in category 'instance creation') -----
> separators
> + "Answer a collection of the standard ASCII separator characters."
> - "Answer a collection of space-like separator characters.
> - Note that we do not consider spaces in >8bit code points yet.
> - "
>
> + ^ #(32 "space"
> - ^ #(9 "tab"
> - 10 "line feed"
> - 12 "form feed"
> 13 "cr"
> + 9 "tab"
> + 10 "line feed"
> + 12 "form feed")
> + collect: [:v | Character value: v] as: String!
> - 32 "space"
> - 160 "non-breaking space, see Unicode Z general category")
> - collect: [:v | Character value: v] as: String
> - " To be considered:
> - 16r1680 OGHAM SPACE MARK
> - 16r2000 EN QUAD
> - 16r2001 EM QUAD
> - 16r2002 EN SPACE
> - 16r2003 EM SPACE
> - 16r2004 THREE-PER-EM SPACE
> - 16r2005 FOUR-PER-EM SPACE
> - 16r2006 SIX-PER-EM SPACE
> - 16r2007 FIGURE SPACE
> - 16r2008 PUNCTUATION SPACE
> - 16r2009 THIN SPACE
> - 16r200A HAIR SPACE
> - 16r2028 LINE SEPARATOR
> - 16r2029 PARAGRAPH SEPARATOR
> - 16r202F NARROW NO-BREAK SPACE
> - 16r205F MEDIUM MATHEMATICAL SPACE
> - 16r3000 IDEOGRAPHIC SPACE
> - "!
>
>