List:       r-devel
Subject:    Re: [Rd] why does [A-Z] include 'T' in an Estonian locale?
From:       Ben Bolker <bbolker () gmail ! com>
Date:       2023-06-16 12:56:51
Message-ID: 7d766eba-0e40-24ed-7b8e-38acdc750ace () gmail ! com

   Yes.
   FWIW I submitted a request for a documentation fix to TRE (to 
document that it actually uses Unicode order, not collation order, to 
define ranges, just like most (but not all) other regex engines ...)

https://github.com/laurikari/tre/issues/88
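
A minimal sketch of why code point order accounts for the grepl() results quoted below, assuming a range [lo-hi] simply means "code point between lo and hi". The helper in_codepoint_range() is hypothetical (not part of R or TRE); it only compares integers from utf8ToInt(), so no regex engine or locale is involved:

```r
# Hypothetical helper: TRUE if the code point of x lies between those of lo and hi.
in_codepoint_range <- function(lo, hi, x) {
  cp <- utf8ToInt(x)
  utf8ToInt(lo) <= cp && cp <= utf8ToInt(hi)
}

# Å, Æ, Ø written as escapes so the script does not depend on file encoding
AA <- "\u00C5"; AE <- "\u00C6"; OE <- "\u00D8"

# Code points of A, T, Z, Å, Æ, Ø: 65 84 90 197 198 216
cps <- sapply(c("A", "T", "Z", AA, AE, OE), utf8ToInt, USE.NAMES = FALSE)
print(cps)

in_codepoint_range("A", "Z", "T")  # TRUE, whatever the locale's collation says
in_codepoint_range("A", AA, OE)    # FALSE, like grepl("[A-Å]", "Ø")
in_codepoint_range("A", AE, AA)    # TRUE,  like grepl("[A-Æ]", "Å")
in_codepoint_range("A", AE, OE)    # FALSE, like grepl("[A-Æ]", "Ø")
in_codepoint_range("A", OE, AE)    # TRUE,  like grepl("[A-Ø]", "Æ")
```

All six Danish results and the Estonian one agree with this comparison, which is what one would expect if the range is defined by code points rather than by the collating sequence.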

On 2023-06-16 5:16 a.m., peter dalgaard wrote:
> Just for amusement: Similar messups occur with Danish and its three extra letters:
> 
> > Sys.setlocale("LC_ALL", "da_DK")
> [1] "da_DK/da_DK/da_DK/C/da_DK/en_US.UTF-8"
> > sort(c(LETTERS,"Æ","Ø","Å"))
> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
> [20] "T" "U" "V" "W" "X" "Y" "Z" "Æ" "Ø" "Å"
> 
> > grepl("[A-Å]", "Ø")
> [1] FALSE
> > grepl("[A-Å]", "Æ")
> [1] FALSE
> > grepl("[A-Æ]", "Å")
> [1] TRUE
> > grepl("[A-Æ]", "Ø")
> [1] FALSE
> > grepl("[A-Ø]", "Å")
> [1] TRUE
> > grepl("[A-Ø]", "Æ")
> [1] TRUE
> 
> So for character ranges, the order is Å,Æ,Ø (which is how they'd collate in
> Swedish, except that Swedish uses diacriticals rather than Æ and Ø).
> 
> > Sys.setlocale("LC_ALL", "sv_SE")
> [1] "sv_SE/sv_SE/sv_SE/C/sv_SE/en_US.UTF-8"
> > sort(c(LETTERS,"Æ","Ø","Å"))
> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
> [20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Æ" "Ø"
> > sort(c(LETTERS,"Ä","Ö","Å"))
> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
> [20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Ä" "Ö"
> 
> 
> 
> > On 30 May 2023, at 17:45 , Ben Bolker <bbolker@gmail.com> wrote:
> > 
> > Inspired by this old Stack Overflow question
> > 
> > https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions
> >  
> > I was wondering why this is TRUE:
> > 
> > Sys.setlocale("LC_ALL", "et_EE")
> > grepl("[A-Z]", "T")
> > 
> > TRE's documentation at <https://laurikari.net/tre/documentation/regex-syntax/>
> > says that a range "is shorthand for the full range of characters between those
> > two [endpoints] (inclusive) in the collating sequence".
> > 
> > Yet, T is *not* between A and Z in the Estonian collating sequence:
> > 
> > sort(LETTERS)
> > [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
> > [20] "Z" "T" "U" "V" "W" "X" "Y"
> > 
> > I realize that this may be a question about TRE rather than about R *per se*
> > (FWIW the grepl() result is also TRUE with `perl = TRUE`, so the question also
> > applies to PCRE), but I'm wondering if anyone has any insights ...  (and yes, I
> > know that the correct answer is "use [:alpha:] and don't worry about it")
> > 
> > (In contrast, the ICU engine underlying stringi/stringr says "[t]he characters to
> > include are determined by Unicode code point ordering" - see
> > https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163
> > for links)
> > 
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> 

-- 
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics
 > E-mail is sent at my convenience; I don't expect replies outside of 
working hours.

