List:       squeak-dev
Subject:    Re: [squeak-dev] The Trunk: Collections-topa.807.mcz
From:       Tobias Pape <Das.Linux () gmx ! de>
Date:       2018-09-14 8:15:58
Message-ID: B35D32AA-BB07-486E-BFE8-AC19F8B2696E () gmx ! de

Hi,

I reverted my change.
I understand Levente's point: as long as we don't handle Unicode's separator
categories properly (Zs, https://www.fileformat.info/info/unicode/category/Zs/list.htm,
and maybe
	Zl, https://www.fileformat.info/info/unicode/category/Zl/list.htm
	Zp, https://www.fileformat.info/info/unicode/category/Zp/list.htm)
it is preposterous to make an exception for NBSP.
Ron raised a good point, and I thought the fix was swift; I was wrong tho.
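
For illustration, here is a minimal sketch (in Python rather than Squeak, using
only the standard unicodedata module) of how the separator categories look for
the code points named in the quoted diff below. The function name
separator_category is hypothetical. It only shows that NBSP sits in category Zs
next to many other spaces, so singling it out would be inconsistent.

```python
import unicodedata

# Code points taken from the quoted diff: the ASCII separators, NBSP, and
# the "To be considered" list.
ASCII_SEPARATORS = [0x20, 0x0D, 0x09, 0x0A, 0x0C]
OTHER_CANDIDATES = [
    0x00A0, 0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x2028, 0x2029, 0x202F,
    0x205F, 0x3000,
]


def separator_category(code_point):
    """Return the Unicode general category (e.g. 'Zs') of a code point."""
    return unicodedata.category(chr(code_point))


if __name__ == "__main__":
    for cp in ASCII_SEPARATORS + OTHER_CANDIDATES:
        name = unicodedata.name(chr(cp), "<control>")
        print(f"16r{cp:04X}  {separator_category(cp)}  {name}")
```

Note that tab, line feed, form feed and cr are control characters (Cc), not
members of any Z category, so "separator" in the ASCII sense and in the Unicode
sense are already different sets.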


(the following does NOT apply to the 5.2 release)

Regarding what others have written, e.g., about UTF-8 and such, here is my reasoning.

1. Encoding conversion should not be done from string to string, but rather only
	Encoding: String => ByteArray
	Decoding: ByteArray => String
   (In theory, we could make a class, e.g., UTF8, that inherits from ByteArray to
   make some things clear.)

2. UTF-8 is a very good idea; the site http://utf8everywhere.org/ raises very good
   points. It is not important for Squeak to internally encode Strings as UTF-8,
   I think, tho it wouldn't hurt. The current Byte/Wide distinction, with the
   property that all values in a string correspond to Unicode code points, is nice
   and even clever. However, sometimes that bites, e.g., when you write things on
   a Stream[1].

3. Regarding the often-mentioned importance of constant-time access to characters
   and easy computation of string length:
   This depends heavily on the notion of what a Character is.
   This is an easy thing for ASCII chars, so there's that.
   Also, one could say that "a character is any instance of Character", which is
   technically correct. However, the questions you can ask with that, namely
     - Where is the instance qurxs of Character in this string and
     - How many instances of Character are in this string
   _are_ easy to answer with a 'direct' encoding (e.g., ByteString for ASCII or
   Latin-1, UTF-32/WideString for Unicode, etc.) but actually less meaningful than
   one might think.
   The UTF-8 Everywhere page hints in that direction:
	'A programmer might count characters as code units, code points, or grapheme
	clusters, according to the level of the programmer's Unicode expertise.'
   A more in-depth discussion can be found at:
   https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
   (please read it, and if you have time, the follow-up
   https://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/)

   I think we need a distinction in String between
	(a) size, aka number of storage entities (e.g., number of bytes/words), and
	(b) displaySize/length, aka number of Extended Grapheme Clusters (EGC) [2],
	    what users will see when they print the string.
   Also, we need to have a distinction between
	(a) what value is at memory position x in the string and
	(b) what is the x-th grapheme cluster.
   Only (a) can be answered in constant time, anyhow.
   So embracing this, we could also go UTF-8 internally.


4. Yes, we need better font support.
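
To make the two distinctions in point 3 concrete, here is a rough sketch in
Python (not Squeak). The helper names are hypothetical, and approx_graphemes is
only an approximation: it glues combining marks onto the preceding code point and
is NOT a full Extended Grapheme Cluster segmentation as specified by Unicode. It
is enough to show that storage size, code point count and what the user sees can
all differ, and that finding the x-th cluster needs a scan.

```python
import unicodedata


def storage_size(s, encoding="utf-8"):
    """(a) Number of storage entities: here, bytes in the given encoding."""
    return len(s.encode(encoding))


def approx_graphemes(s):
    """(b) Approximate grapheme clusters: a base code point plus any
    combining marks that follow it. Assumption: this ignores the other
    rules of real EGC segmentation (emoji sequences, Hangul, and so on)."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


if __name__ == "__main__":
    s = "e\u0301x"                  # 'e' + COMBINING ACUTE ACCENT, then 'x'
    print(storage_size(s))          # bytes in UTF-8
    print(len(s))                   # code points
    print(len(approx_graphemes(s))) # what a user would count
    # The x-th cluster requires walking the string from the start:
    print(approx_graphemes(s)[1])
```

For the string above, the three counts are 4 bytes, 3 code points and 2
user-visible units. A single code point such as U+FDFD goes the other way: one
code point and one cluster, but three bytes in UTF-8.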


But, yes, not for 5.2

Best regards
	-Tobias

PS: how many characters is this?:  ﷽
	(fun fact: one. one code point, one grapheme cluster…)
PPS: Boy this got long. Sorry.


[1]: I was bitten here: I wrote the same string on a Socket stream and on a File
     stream. The former retained the internal encoding, which for byte strings
     happens to be Latin-1, a subset of Unicode; the latter encoded to UTF-8, and
     I wondered why the network endpoint rejected my string as not-UTF-8.
[2]: Swift and Perl 6 apparently use EGCs.
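
The effect described in footnote [1], and the String => ByteArray discipline
from point 1, can be sketched in Python (not Squeak), where str and bytes are
already separate types. The sample text and the helper is_valid_utf8 are
assumptions made for illustration; this is not what the Squeak streams do
internally, only a demonstration of why Latin-1 bytes get rejected by a UTF-8
consumer.

```python
text = "Gr\u00fc\u00dfe"            # "Gruesse" written with u-umlaut and sharp s

# Encoding: String => bytes. The result is a different type, so it cannot
# be confused with an already-decoded string.
latin1_bytes = text.encode("latin-1")   # one byte per code point below 256
utf8_bytes = text.encode("utf-8")       # two bytes for each non-ASCII char here


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(len(latin1_bytes), is_valid_utf8(latin1_bytes))
    print(len(utf8_bytes), is_valid_utf8(utf8_bytes))
    # Decoding: bytes => String, with the encoding stated explicitly.
    print(utf8_bytes.decode("utf-8") == text)
```

For pure ASCII text both encodings produce identical bytes, which is why this
kind of mismatch tends to go unnoticed until the first non-ASCII character
shows up.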


> On 14.09.2018, at 09:38, commits@source.squeak.org wrote:
> 
> Tobias Pape uploaded a new version of Collections to project The Trunk:
> http://source.squeak.org/trunk/Collections-topa.807.mcz
> 
> ==================== Summary ====================
> 
> Name: Collections-topa.807
> Author: topa
> Time: 14 September 2018, 9:37:43.484317 am
> UUID: fae1c8b3-8396-4790-a491-4e51b047bc49
> Ancestors: Collections-topa.806
> 
> Revert for consistency and, subsequently, speed.
> 
> The correct fix is not as trivial and not fit in the beta phase.
> 
> Sorry, Ron.
> 
> =============== Diff against Collections-topa.806 ===============
> 
> Item was changed:
> ----- Method: Character class>>separators (in category 'instance creation') -----
> separators
> + 	"Answer a collection of the standard ASCII separator characters."
> - 	"Answer a collection of space-like separator characters.
> - 	Note that we do not consider spaces in >8bit code points yet.
> - 	"
> 
> + 	^ #(32 "space"
> - 	^ #(9 "tab"
> - 		10 "line feed"
> - 		12 "form feed"
> 		13 "cr"
> + 		9 "tab"
> + 		10 "line feed"
> + 		12 "form feed")
> + 		collect: [:v | Character value: v] as: String!
> - 		32 "space"
> - 		160 "non-breaking space, see Unicode Z general category")
> - 		collect: [:v | Character value: v] as: String
> - " To be considered:
> - 16r1680 OGHAM SPACE MARK
> - 16r2000 EN QUAD
> - 16r2001 EM QUAD
> - 16r2002 EN SPACE
> - 16r2003 EM SPACE
> - 16r2004 THREE-PER-EM SPACE
> - 16r2005 FOUR-PER-EM SPACE
> - 16r2006 SIX-PER-EM SPACE
> - 16r2007 FIGURE SPACE
> - 16r2008 PUNCTUATION SPACE
> - 16r2009 THIN SPACE
> - 16r200A HAIR SPACE
> - 16r2028 LINE SEPARATOR
> - 16r2029 PARAGRAPH SEPARATOR
> - 16r202F NARROW NO-BREAK SPACE
> - 16r205F MEDIUM MATHEMATICAL SPACE
> - 16r3000 IDEOGRAPHIC SPACE
> - "!
> 
> 

