List:       jakarta-oro-dev
Subject:    Re: bug with CP1252 characters and Perl5Matcher
From:       "Takashi Okamoto" <tokamoto () rd ! nttdata ! co ! jp>
Date:       2000-12-01 1:49:38

>Special characters in the CP1252 character set (but not in ISO Latin-1)
>cause an ArrayIndexOutOfBoundsException deep within Perl5Matcher.  The
>problem occurs if I use "[^.]*\." as the pattern but not if I use
>"Test" as the pattern.  The characters are fancy forms of apostrophe
>and double-quotes (decimal 146, 147, and 148).  Use IE 5 to view the
>test files to see what they look like.  I am running Jakarta ORO 2.0,
>JDK 1.2.2, WinNT 4.0sp5.

Could you read the TODO file in Jakarta ORO 2.0.1?

It says,

o Make Perl5 character classes (e.g., [abcde...]) fully support Unicode
  input.  Currently character classes only match 8-bit characters.

I posted a patch for this problem.
You can use it as a temporary workaround.
Note that the patch may consume a lot of memory (about 8 KB).
See the attached message.
--------------
Takashi Okamoto


["[PATCH] for unicode problem over 0xff characters.eml" (message/rfc822)]

Message-ID: <002c01c04b4c$8648f5a0$5919fea9@rd.nttdata.co.jp>
From: "Takashi Okamoto" <tokamoto@rd.nttdata.co.jp>
To: <oro-dev@jakarta.apache.org>
Subject: [PATCH] for unicode problem over 0xff characters
Date: Sat, 11 Nov 2000 04:26:54 +0900
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.00.2919.6600
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6600
X-Spam-Rating: locus.apache.org 1.6.2 0/1000/N
Content-Type: text/plain;
	charset="iso-2022-jp"

Hello, jakarta-oro developers.

I made a patch for the Unicode problems in Perl5Compiler.java and
Perl5Matcher.java.
jakarta-oro currently has the following problems:

[Problem 1]

 Perl5Util perl = new Perl5Util();
 boolean result = perl.match("m![a-z]+!", "abcdX");


 'X' is a Unicode character above 0xff.

 This match throws a fatal exception!

[Problem 2]

 Perl5Util perl = new Perl5Util();
 boolean result = perl.match("m![X-Y]+!", "ABCDEF");


 'X' and 'Y' are Unicode characters above 0xff.
 'ABCDEF' also stands for Unicode characters between 'X' and 'Y'.
 This match returns false! (true is the correct result)
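
The two problems above can be illustrated without ORO itself. The
following is only a sketch, assuming a character class is stored as a
256-bit bitmap packed into 16-bit words, which is what the index
expressions `(ch >> 4)` and `(1 << (ch & 0xf))` in the patch suggest.
The class and method names are hypothetical, and the convention that a
set bit means "member" is chosen for readability and may differ from
ORO's own.

```java
// Hypothetical sketch, not ORO source code.
final class BitmapClassSketch {
    // Sixteen 16-bit words cover the code units 0..255.
    static final int WORDS = 256 >> 4;

    // Builds the bitmap for the inclusive range lo..hi. Code units of 256 and
    // above are silently dropped, similar in effect to the `ch >= 256` guard
    // that the patch removes from __setCharacterClassBits.
    static char[] compileRange(char lo, char hi) {
        char[] bits = new char[WORDS];
        for (int c = lo; c <= hi; c++) {
            if (c >= 256) {
                break;
            }
            bits[c >> 4] |= (1 << (c & 0xf));
        }
        return bits;
    }

    // Lookup with a range guard: never throws, but no character above 0xff
    // can ever be a member (the behaviour reported as Problem 2).
    static boolean matchesGuarded(char[] bits, char ch) {
        return ch < 256 && (bits[ch >> 4] & (1 << (ch & 0xf))) != 0;
    }

    // Lookup without the guard: indexes past the end of the bitmap for any
    // character above 0xff (one way an exception like Problem 1 can arise).
    static boolean matchesUnguarded(char[] bits, char ch) {
        return (bits[ch >> 4] & (1 << (ch & 0xf))) != 0;
    }
}
```

With this layout, `matchesUnguarded(compileRange('a', 'z'), '\u3042')`
throws an ArrayIndexOutOfBoundsException, while the guarded lookup
returns false for every character above 0xff, even when the class was
compiled from a range of such characters.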


These problems no longer occur after applying my patch to
Perl5Compiler.java and Perl5Matcher.java.

This patch may not be the best approach, because I don't know the
jakarta-oro code very well.

But I hope the next jakarta-oro release resolves these Unicode problems.


Regards.


PS.
This patch is against the CVS source as of 2000/11/10.

Installation notes:

[1] download jakarta-oro from CVS
[2] cd jakarta-oro/src/java/org/apache/oro/text/regexp
[3] patch -p1 < [the patch at the end of this mail]
[4] build jakarta-oro
------------------------
Takashi Okamoto



------- patch for Perl5Compiler.java and Perl5Matcher.java -------

*** Perl5Compiler.java.org Fri Nov 10 09:55:15 2000
--- Perl5Compiler.java Fri Nov 10 09:09:34 2000
***************
*** 925,934 ****
    private void __setCharacterClassBits(char[] bits, int offset, char deflt,
             char ch)
    {
!     if(__program== null || ch >= 256)
        return;
-     ch &= 0xffff;

      if(deflt == 0) {
        bits[offset + (ch >> 4)] |= (1 << (ch & 0xf));
      } else {
--- 925,935 ----
    private void __setCharacterClassBits(char[] bits, int offset, char deflt,
             char ch)
    {
!     if(__program == null)
        return;

+     __extendProgramSize( offset + (ch >> 4) );
+     ch &= 0xffff;
      if(deflt == 0) {
        bits[offset + (ch >> 4)] |= (1 << (ch & 0xf));
      } else {
***************
*** 936,942 ****
      }
    }

!
    private int __parseCharacterClass() throws MalformedPatternException {
      boolean range = false, skipTest;
      char clss, deflt, lastclss = Character.MAX_VALUE;
--- 937,949 ----
      }
    }

!   private void __extendProgramSize (int  max)
!   {
!       if( max > __programSize ) {
!    __programSize = max + 1;
!       }
!   }
!
    private int __parseCharacterClass() throws MalformedPatternException {
      boolean range = false, skipTest;
      char clss, deflt, lastclss = Character.MAX_VALUE;
***************
*** 1468,1475 ****
      if(__programSize >= Character.MAX_VALUE - 1)
        throw new MalformedPatternException("Expression is too large.");


-     __program= new char[__programSize];
      regexp = new Perl5Pattern();

      regexp._program    = __program;
--- 1475,1486 ----
      if(__programSize >= Character.MAX_VALUE - 1)
        throw new MalformedPatternException("Expression is too large.");

+     __program = new char[Character.MAX_VALUE >> 4];
+
+     for(int i = 0 ;i < Character.MAX_VALUE >> 4 ;i++){
+        __program[i] = Character.MAX_VALUE;
+     }

      regexp = new Perl5Pattern();

      regexp._program    = __program;
*** Perl5Matcher.java.org Fri Nov 10 09:55:37 2000
--- Perl5Matcher.java Fri Nov 10 09:29:51 2000
***************
*** 412,418 ****
     while(__currentOffset < endOffset) {
       ch = __input[__currentOffset];

!      if(ch < 256 &&
          (__program[offset + (ch >> 4)] & (1 << (ch & 0xf))) == 0) {
         if(tmp && __tryExpression(expression, __currentOffset)) {
    success = true;
--- 412,418 ----
     while(__currentOffset < endOffset) {
       ch = __input[__currentOffset];

!      if(
          (__program[offset + (ch >> 4)] & (1 << (ch & 0xf))) == 0) {
         if(tmp && __tryExpression(expression, __currentOffset)) {
    success = true;
***************
*** 655,661 ****
        break;

      case OpCode._ANYOF:
!       if(scan < eol && (ch = __input[scan]) < 256) {
   while((__program[operand + (ch >> 4)] & (1 << (ch & 0xf))) == 0) {
     if(++scan < eol)
       ch = __input[scan];
--- 655,662 ----
        break;

      case OpCode._ANYOF:
!       if(scan < eol ) {
!       ch = __input[scan];
   while((__program[operand + (ch >> 4)] & (1 << (ch & 0xf))) == 0) {
     if(++scan < eol)
       ch = __input[scan];
***************
*** 805,811 ****
   if(nextChar == __EOS && inputRemains)
     nextChar = __input[input];

!  if(nextChar >= 256 || (__program[current + (nextChar >> 4)] &
       (1 << (nextChar & 0xf))) != 0)
     return false;

--- 806,812 ----
   if(nextChar == __EOS && inputRemains)
     nextChar = __input[input];

!  if((__program[current + (nextChar >> 4)] &
       (1 << (nextChar & 0xf))) != 0)
     return false;




["Re_ [PATCH] for unicode problem over 0xff characters .eml" (message/rfc822)]

Message-Id: <200011121804.NAA05031@chewie.savarese.org>
X-Mailer: exmh version 2.0.3
To: oro-dev@jakarta.apache.org
Subject: Re: [PATCH] for unicode problem over 0xff characters 
In-reply-to: Your message of "Sat, 11 Nov 2000 04:26:54 +0900."
             <002c01c04b4c$8648f5a0$5919fea9@rd.nttdata.co.jp> 
Mime-Version: 1.0
Date: Sun, 12 Nov 2000 13:04:27 -0500
From: "Daniel F. Savarese" <dfs@savarese.org>
X-Spam-Rating: locus.apache.org 1.6.2 0/1000/N
Content-Type: text/plain; charset=us-ascii


>I made a patch for unicode problem at Perl5Compiler.java and
>Perl5Matcher.java.

Thanks very much for your efforts and providing a patch.  Unfortunately,
the fundamental approach is not the ultimate one that should be taken,
so I don't think we should apply the patch or some variation thereof.  The
problem is that it can cause excessive memory use (e.g., up to 8K per
character class) since it follows the same bitfield approach used for
the 8-bit ASCII character classes.  Handling 16-bit characters in a
character class requires a different approach, which is a little less
efficient in matching time, but much more efficient in the use of memory.
Implementing the "proper" solution, however, will require a good investment
of time and some significant code changes.
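
To make the trade-off concrete: a bitmap with one bit for each of the
65536 possible 16-bit code units takes 65536 / 8 = 8192 bytes, which is
the "8K per character class" figure above. The message does not say
which alternative was planned. The sketch below assumes one common
choice, a sorted list of inclusive ranges searched by binary search, and
all names in it are hypothetical. It costs a few bytes per range but
O(log n) comparisons per lookup instead of one array access.

```java
import java.util.Arrays;

// Hypothetical sketch, not ORO source code: a character class stored as
// sorted, non-overlapping inclusive ranges.
final class RangeClassSketch {
    // Bytes needed for a bitmap with one bit per 16-bit code unit (8192).
    static final int FULL_BITMAP_BYTES = (Character.MAX_VALUE + 1) / 8;

    private final char[] starts; // range starts, sorted ascending
    private final char[] ends;   // inclusive range ends, parallel to starts

    RangeClassSketch(char[] starts, char[] ends) {
        this.starts = starts.clone();
        this.ends = ends.clone();
    }

    boolean contains(char ch) {
        int i = Arrays.binarySearch(starts, ch);
        if (i >= 0) {
            return true; // ch is exactly the start of a range
        }
        // binarySearch returned -(insertionPoint) - 1, so the last range
        // starting below ch sits at insertionPoint - 1.
        int candidate = -i - 2;
        return candidate >= 0 && ch <= ends[candidate];
    }

    // Two bytes per char in each of the two arrays.
    int memoryBytes() {
        return 2 * (starts.length + ends.length);
    }
}
```

Under these assumptions, a class with two ranges such as [a-z] plus one
range above 0xff needs 8 bytes of range data rather than 8192.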

However, as a stopgap measure, we could implement the bitfield approach,
making it clear in comments and in the CHANGES file that it is temporary.
We could take an informal vote to that effect.  The problem with the patch
you posted is that it allows for incorrect matches or
ArrayIndexOutOfBoundsExceptions to be thrown if the input contains
characters beyond the upper limit of the character class range:

--- 806,812 ----
   if(nextChar == __EOS && inputRemains)
     nextChar = __input[input];

!  if((__program[current + (nextChar >> 4)] &
       (1 << (nextChar & 0xf))) != 0)
     return false;

It also doesn't make the necessary changes to Perl5Debug, which would
break after this patch was applied.  To generalize the current bitfield
implementation, you need to store the ultimate size of the bitfield and
make a comparison to ensure the input character can be used to index
into the bitfield.  At any rate, making these changes is rather
straightforward; I never did because of the desire to conserve memory.
So the question is, do people feel we should implement a temporary
stopgap measure, or just wait to "do it right"?
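
The generalization described above (store the size of the bitfield and
compare before indexing) might look roughly like the following. This is
a sketch under assumptions, not a proposed patch: the class name and
fields are hypothetical, the bitfield is kept in its own array rather
than inside the compiled program, and a set bit means "member".

```java
// Hypothetical sketch, not ORO source code: a bitfield sized to the highest
// member of the class, with a bounds comparison before every lookup.
final class SizedBitfieldSketch {
    private final char[] words; // 16 bits per word
    private final int limit;    // number of code units the bitfield covers

    // Builds the class for the inclusive range lo..hi.
    SizedBitfieldSketch(char lo, char hi) {
        limit = hi + 1;
        words = new char[(limit + 15) >> 4];
        for (int c = lo; c <= hi; c++) {
            words[c >> 4] |= (1 << (c & 0xf));
        }
    }

    // The comparison against the stored size keeps characters beyond the
    // bitfield from being used as an index.
    boolean contains(char ch) {
        return ch < limit && (words[ch >> 4] & (1 << (ch & 0xf))) != 0;
    }

    int memoryBytes() {
        return 2 * words.length;
    }
}
```

A class whose highest member is below 0x100 still fits in 32 bytes here,
and only classes that actually contain high characters pay for a larger
bitfield, up to the 8K worst case.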

daniel



