
List:       ruby-core
Subject:    [ruby-core:104435] [Ruby master Bug#18012] Case-insensitive character classes can only match multipl
From:       jiri.marsik () oracle ! com
Date:       2021-06-29 8:35:07
Message-ID: redmine.issue-18012.20210629083505.46524 () ruby-lang ! org

Issue #18012 has been reported by jirkamarsik (Jirka Marsik).

----------------------------------------
Bug #18012: Case-insensitive character classes can only match multiple code points when top-level character class is not negated
https://bugs.ruby-lang.org/issues/18012

* Author: jirkamarsik (Jirka Marsik)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
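Some Unicode characters case-fold to strings of multiple code points. For example, the ligature `\ufb00` (U+FB00 LATIN SMALL LIGATURE FF) folds to `ff`, so a case-insensitive character class containing it can match the string `ff`.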

```
irb(main):001:0> /\A[\ufb00]\z/i.match("\ufb00")
=> #<MatchData "ﬀ">
irb(main):002:0> /\A[\ufb00]\z/i.match("ff")
=> #<MatchData "ff">
```
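
For reference, the same fold can be observed outside of regular expressions. The following is only a minimal sketch, assuming Ruby 2.4 or later (where `String#downcase` accepts the `:fold` option); the variable names are illustrative and it does not exercise the regexp engine itself.

```ruby
# U+FB00 LATIN SMALL LIGATURE FF is a single code point.
ligature = "\ufb00"

# Full Unicode case folding maps it to the two code points "f" "f".
folded = ligature.downcase(:fold)

puts ligature.length          # code points before folding
puts folded.length            # code points after folding
puts ligature.casecmp?("ff")  # comparison performed after case folding
```

This is the fold that lets a one-character class match a two-character string under `/i`.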

As expected, when we negate this character class, it matches neither the ligature character `\ufb00` nor the string `ff`.

```
irb(main):003:0> /\A[^\ufb00]\z/i.match("\ufb00")
=> nil
irb(main):004:0> /\A[^\ufb00]\z/i.match("ff")
=> nil
```

Then, when we add a second negation, the `\ufb00` ligature is matched again, but the string `ff` is still rejected, unlike with the original non-negated class.

```
irb(main):005:0> /\A[^[^\ufb00]]\z/i.match("\ufb00")
=> #<MatchData "ﬀ">
irb(main):006:0> /\A[^[^\ufb00]]\z/i.match("ff")
=> nil
```

This reveals that negation blocks multi-code-point matches in character classes. However, the check only looks at whether the outermost character class is negated. If we wrap the character class in another, non-negated set of brackets, the semantics change.

```
irb(main):007:0> /\A[[^[^\ufb00]]]\z/i.match("\ufb00")
=> #<MatchData "ﬀ">
irb(main):008:0> /\A[[^[^\ufb00]]]\z/i.match("ff")
=> #<MatchData "ff">
```

The cause of this discrepancy (the fact that `[^[^\ufb00]]` and `[[^[^\ufb00]]]` match different strings) is the extra `IS_NCCLASS_NOT` check in `i_apply_case_fold` (https://github.com/ruby/ruby/blob/9eae8cdefba61e9e51feb30a4b98525593169666/regparse.c#L5568).
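
To compare the four character classes side by side, a small script along the following lines may help. It is a sketch rather than a test suite: the method name `match_table` and the labels are made up for illustration, and the output it prints depends on the Ruby version being run.

```ruby
# Builds a table of { pattern source => { input name => matched? } }
# for the four character classes discussed in this report.
def match_table
  patterns = {
    '[\ufb00]'       => /\A[\ufb00]\z/i,
    '[^\ufb00]'      => /\A[^\ufb00]\z/i,
    '[^[^\ufb00]]'   => /\A[^[^\ufb00]]\z/i,
    '[[^[^\ufb00]]]' => /\A[[^[^\ufb00]]]\z/i
  }
  inputs = { 'ligature' => "\ufb00", 'ff' => 'ff' }

  patterns.transform_values do |regexp|
    inputs.transform_values { |input| regexp.match?(input) }
  end
end

# Print one row per pattern.
match_table.each do |source, row|
  puts format('%-18s ligature=%-5s ff=%s', source, row['ligature'], row['ff'])
end
```

On the Ruby 3.0.1 build used for this report, the last two rows should differ in the `ff` column, even though both patterns contain the same doubly negated class.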
