List: perl5-changes
Subject: [perl.git] branch smoke-me/davem/regex-trailing-null, created. v5.17.4-31-g8f6719e
From: "Dave Mitchell" <davem () iabyn ! com>
Date: 2012-09-21 23:56:23
Message-ID: E1TFD55-0005Q8-Ke () camel ! ams6 ! corp ! booking ! com
In perl.git, the branch smoke-me/davem/regex-trailing-null has been created
<http://perl5.git.perl.org/perl.git/commitdiff/8f6719e2c3acfbc7536d81e44386a43bd0a24aab?hp=0000000000000000000000000000000000000000>
at 8f6719e2c3acfbc7536d81e44386a43bd0a24aab (commit)
- Log -----------------------------------------------------------------
commit 8f6719e2c3acfbc7536d81e44386a43bd0a24aab
Author: David Mitchell <davem@iabyn.com>
Date: Fri Sep 21 10:29:04 2012 +0100
stop regex engine reading beyond end of string
Historically the regex engine has assumed that any string passed to it
will have a trailing null char. This isn't normally an issue in perl code,
since perl strings *are* null terminated; but it could cause problems with
strings returned by XS code, or with someone calling the regex engine
directly from XS, with strend not pointing at a null char.
The engine currently relies on there being a null char in the following
ways.
First, when at the end of string, the main loop of regmatch() still reads
in the 'next' character (i.e. the character following the end of string)
even if it doesn't make any use of it. This precludes using memory mapped
files as strings for example, since the read off the end would SEGV.
Second, the matching algorithm often requires the trailing character to be
\0 to work correctly: the test for 'EOF' is "if next char is null *and*
locinput >= PL_regeol, then stop". So a random non-null trailing char
can cause an overshoot.
Third, some match ops require the trailing char to be null to operate
correctly; for example, \b applied at the end of the string only happens
to work because the trailing char (\0) happens to match \W.
Also, some utf8 ops will try to extract the code point at the end, which
can result in multiple bytes past the end of string being read, and
possible problems if they don't correspond to well-formed utf8.
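To illustrate the \b case above, here is a minimal sketch (not the code in
regexec.c) of a word-boundary test that decides the end-of-string case from
the pointers alone, so the byte at strend is never examined. The names
sketch_is_word and sketch_at_word_boundary are hypothetical, and the
ASCII-only notion of a word character is a simplifying assumption.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: ASCII-only approximation of \w. */
static bool sketch_is_word(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True if pos sits on a word boundary within [start, end).
 * Positions outside the string count as non-word, so neither
 * start[-1] nor *end is ever dereferenced. */
static bool sketch_at_word_boundary(const unsigned char *start,
                                    const unsigned char *pos,
                                    const unsigned char *end)
{
    bool before = (pos > start) && sketch_is_word(pos[-1]);
    bool after  = (pos < end)   && sketch_is_word(*pos);
    return before != after;
}
```

With this shape the result at end of string no longer depends on whatever
byte happens to follow the buffer.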
The main fix is in S_regmatch, where the 'read next char' code has been
updated to set it to a special value, NEXTCHR_EOS instead, if we would be
reading past the end of the string.
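A minimal sketch of the idea, under stated assumptions: the sentinel value,
the SKETCH_ prefix and both function names are illustrative only and are not
taken from regexec.c. The point is that the bounds check happens before the
read, and that the sentinel is outside the 0..255 range of real bytes.

```c
#include <stddef.h>

/* Hypothetical sentinel; any negative value works because real
 * bytes read through unsigned char are always in 0..255. */
#define SKETCH_NEXTCHR_EOS (-1)

/* Return the byte at pos, or the sentinel when pos is at or past end.
 * The byte at end itself is never read. */
static int sketch_next_char(const unsigned char *pos,
                            const unsigned char *end)
{
    return (pos < end) ? (int)*pos : SKETCH_NEXTCHR_EOS;
}

/* Count the leading run of 'a' bytes in [start, end). The loop stops
 * on the sentinel, not on a trailing '\0'. */
static size_t sketch_count_leading_a(const unsigned char *start,
                                     const unsigned char *end)
{
    const unsigned char *pos = start;
    while (sketch_next_char(pos, end) == 'a')
        pos++;
    return (size_t)(pos - start);
}
```

Because termination depends only on the pointer comparison, a non-null byte
sitting just past the end cannot cause an overshoot.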
Lots of other random bits in the regex engine needed to be fixed up too.
To track these down, I temporarily hacked regexec_flags() to make a copy
of the string but without trailing \0, then ran all the t/re/*.t tests
under valgrind to flush out all buffer overruns. So I think I've removed
most of the bad code, but by no means all of it. The code within the
various functions in regexec.c is far too complex to audit visually
with any confidence.
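The temporary hack itself is not shown in this commit. As a rough sketch of
the technique only, a helper like the hypothetical sketch_copy_unterminated
below copies a string into a heap block of exactly its length, with no
terminator, so that a memory checker such as valgrind can report any read
past the last byte.

```c
#include <stdlib.h>
#include <string.h>

/* Copy len bytes of src into a heap block with no trailing '\0'.
 * Returns NULL on allocation failure; the caller frees the result.
 * Caveat: for len == 0 one byte is allocated (malloc(0) may return
 * NULL), so a one-byte over-read of an empty string goes unnoticed. */
static char *sketch_copy_unterminated(const char *src, size_t len)
{
    char *copy = malloc(len ? len : 1);
    if (copy != NULL && len > 0)
        memcpy(copy, src, len);
    return copy;
}
```

Passing copy and copy + len as the string start and end to the code under
test then makes any dependence on a terminator visible to the checker.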
M MANIFEST
M ext/XS-APItest/APItest.pm
M ext/XS-APItest/APItest.xs
A ext/XS-APItest/t/callregexec.t
M regexec.c
commit ffb83602ac7621e306ecae2bc5a8b0d224eb3d87
Author: David Mitchell <davem@iabyn.com>
Date: Sun Sep 16 17:39:06 2012 +0100
regmatch(): fix typo in TRIE commentary text
M regexec.c
commit 927ce50c99cfffa62ba5ada03562f9da75224a1c
Author: David Mitchell <davem@iabyn.com>
Date: Sun Sep 16 17:33:08 2012 +0100
regmatch() annotate ops and separate out branches
Annotate each 'case OP:' in the main switch in regmatch() to show
what regex pattern this implements. About half the ops had already been
done. Also add a blank line between each 'case' statement for readability.
(no code changes)
M regexec.c
commit b05efd3c9cc1583c4a8b1719b69077edd9c397df
Author: David Mitchell <davem@iabyn.com>
Date: Fri Sep 14 16:19:10 2012 +0100
regmatch(): do nextchr=*locinput at top of loop
Currently each branch in the main regmatch() loop is responsible for
re-initialising nextchr to UCHARAT(locinput) if locinput is modified.
By adding
    nextchr = UCHARAT(locinput);
to the head of the loop, we can remove most of the nextchr assignments
in the individual branches. We lose slightly for the zero-width assertions
like \b which will re-read the same nextchr, but this will make it
easier to handle non-null-terminated strings.
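The shape of that change can be sketched with a toy matcher. Everything here
(the sketch_ names, the three ops, the node layout) is hypothetical and far
simpler than regmatch(). The sketch also uses a bounded read so that it is
safe on its own, whereas the commit described here still used a plain
UCHARAT(locinput).

```c
#include <stddef.h>

/* Hypothetical three-op program: literal byte, any byte, end. */
enum sketch_op { SK_LITERAL, SK_ANY, SK_END };

struct sketch_node {
    enum sketch_op op;
    unsigned char ch;   /* used by SK_LITERAL only */
};

/* Return 1 if prog matches at the start of [input, strend), else 0. */
static int sketch_match(const struct sketch_node *prog,
                        const unsigned char *input,
                        const unsigned char *strend)
{
    const unsigned char *locinput = input;
    const struct sketch_node *scan;

    for (scan = prog; ; scan++) {
        /* Single read at the head of the loop: branches below only
         * move locinput and never refresh nextchr themselves. */
        int nextchr = (locinput < strend) ? (int)*locinput : -1;

        switch (scan->op) {
        case SK_LITERAL:
            if (nextchr != (int)scan->ch)
                return 0;
            locinput++;
            break;
        case SK_ANY:
            if (nextchr < 0)
                return 0;
            locinput++;
            break;
        case SK_END:
            return 1;
        }
    }
}
```

Having one read site is what makes the later end-of-string handling a local
change rather than an edit to every branch.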
M regexec.c
commit 6855194d74be66127b6d32dd40a26ddcd0785867
Author: David Mitchell <davem@iabyn.com>
Date: Fri Sep 14 15:46:47 2012 +0100
regmatch(): nextchar should always be positive
Remove the one bit of code that tests for < 0, and put in a
general assert.
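As a small illustration of why the invariant holds, a byte read through an
unsigned char pointer and widened to int cannot be negative, even for bytes
above 0x7F on platforms where plain char is signed. The function names below
are hypothetical stand-ins, not the macros used in regexec.c.

```c
#include <assert.h>

/* Hypothetical stand-in for an unsigned byte read: yields 0..255. */
static int sketch_ucharat(const char *p)
{
    return (int)*(const unsigned char *)p;
}

/* Read one byte and enforce the invariant with a general assert
 * instead of scattered "< 0" tests. */
static int sketch_read_checked(const char *p)
{
    int nextchr = sketch_ucharat(p);
    assert(nextchr >= 0);
    return nextchr;
}
```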
M regexec.c
commit 996dc38f68a45f2bd8cf33d4b2f24775fad675ff
Author: David Mitchell <davem@iabyn.com>
Date: Fri Sep 14 12:37:33 2012 +0100
regmatch(): consolidate locinput++
There are several places in the code that increment locinput by 1 char
(which may or may not be 1 byte) then update nextchr.
Consolidate these into a single code block with the others goto'ing it.
This actually reduces the code more than it appears, since the CCC_TRY*
macros expand into several branches, each of which repeats the
increment code.
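A toy version of the consolidation, assuming a simplified UTF-8 length rule
and two made-up character classes (none of these names come from regexec.c):
each branch only decides whether the current char matches, then jumps to one
shared block that advances by a whole character.

```c
#include <stddef.h>

/* Simplified UTF-8 sequence length from the lead byte. A malformed
 * lead byte is treated as a single-byte character (an assumption). */
static size_t sketch_char_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

enum sketch_class { SK_DIGIT, SK_NONDIGIT };

/* Count how many leading characters of [s, end) belong to cls. */
static size_t sketch_count_run(enum sketch_class cls,
                               const unsigned char *s,
                               const unsigned char *end,
                               int is_utf8)
{
    size_t n = 0;
    size_t len;

    while (s < end) {
        switch (cls) {
        case SK_DIGIT:
            if (*s < '0' || *s > '9')
                return n;
            goto increment;
        case SK_NONDIGIT:
            if (*s >= '0' && *s <= '9')
                return n;
            goto increment;
        }
        return n;

      increment:
        /* The only place that advances: one char, which may be more
         * than one byte, clamped so we never step past end. */
        len = is_utf8 ? sketch_char_len(*s) : 1;
        if (len > (size_t)(end - s))
            len = (size_t)(end - s);
        s += len;
        n++;
    }
    return n;
}
```

Any later change to how a character is stepped over, such as an end-of-string
check, then needs to be made in one block only.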
M regexec.c
commit 10cd6a101a65575e939faa0a2e805236aa2adf51
Author: David Mitchell <davem@iabyn.com>
Date: Fri Sep 14 11:28:08 2012 +0100
regmatch(): use nextchar where available
In a couple of places the code was using *locinput where
nextchr already equalled *locinput.
M regexec.c
-----------------------------------------------------------------------
--
Perl5 Master Repository