'Re: Regex one-liner to find several multi-line blocks of text in a single file'


List:       perl-beginners
Subject:    Re: Regex one-liner to find several multi-line blocks of text in a single file
From:       Jim Gibson <jimsgibson () gmail ! com>
Date:       2012-11-01 15:08:43
Message-ID: 5588D8B9-8E03-475D-AEC1-D9F790A2B872 () gmail ! com


On Nov 1, 2012, at 12:44 AM, Thomas Smith wrote:

> Hi,
> 
> I'm trying to search a file for several matching blocks of text. A sample
> of what I'm searching through is below.
> 
> What I want to do is match "##### START block #####" through to the next
> "##### END block #####" and repeat that throughout the file without
> matching any of the text that falls between each matched block (that is,
> the "ok: some text" lines should not be matched). Here is the one-liner I'm
> using:
> 
> perl -p -e '/^##### START block #####.*##### END block #####$/s' file.txt
> 
> I've tried a few variations of this but with the same result--a match is
> being made from the first "##### START block #####" to the last "##### END
> block #####", and everything in between... I believe that the ".*",
> combined with the "s" modifier, in the regex is causing this match to be
> made.

The '*' is what's called a "greedy" quantifier. That means it will match as
many characters in the string as possible. When the regular expression engine
encounters the pattern '.*', it immediately matches it against as many
characters as possible. Since your regular expression includes the 's'
modifier, this includes newlines as well. When the RE engine sees that there
are more characters in the pattern after the '.*', it starts giving back
characters from the end of the substring matched by the '.*', one at a time,
until the rest of the pattern also matches. If the rest never matches, this
continues until the '.*' matches no characters at all, and the attempt at
that starting position fails.

The result of all this is that for your pattern, the last '##### END block #####'
substring is the one that will be matched, and the '.*' pattern will match
everything between the first '##### START block #####' and the last
'##### END block #####'.
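
To see the greedy behavior in isolation, here is a minimal sketch. It makes
some assumptions that are not in the original one-liner: the sample text is
made up from the description in the question, the whole text is held in one
string (the -p switch reads one line at a time, so a pattern never sees more
than a single line), and the 'm' modifier is added so that '^' and '$' can
match at line boundaries inside that string.

```perl
use strict;
use warnings;

# Hypothetical sample text, modelled on the description in the question.
my $text = <<'END_SAMPLE';
ok: some text
##### START block #####
first block
##### END block #####
ok: some more text
##### START block #####
second block
##### END block #####
ok: final text
END_SAMPLE

# Greedy '.*':
#   /s lets '.' match newlines,
#   /m lets '^' and '$' match at line boundaries (an assumption, see above),
#   /g in list context returns every match found.
my @greedy = $text =~ /^##### START block #####.*##### END block #####$/smg;

# Only one match is found: it runs from the first START to the last END.
printf "greedy matches: %d\n", scalar @greedy;    # prints "greedy matches: 1"

# The single match swallows the line that sits between the two blocks.
print "match includes the text between the blocks\n"
    if $greedy[0] =~ /ok: some more text/;
```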

The way to fix this is to make the '*' quantifier "non-greedy" by putting a
'?' after it, giving '.*?'. With that pattern, the RE engine will match as few
characters as possible, so each START block will pair up with the first
subsequent END block. A 'g' modifier will tell the RE engine to start looking
after each match for the next match in the string.
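
A minimal sketch of the non-greedy version follows. As assumptions that go
beyond the original one-liner, the sample text is invented, the whole text is
held in a single string, and the 'm' modifier is added so that '^' and '$'
match at line boundaries within that string.

```perl
use strict;
use warnings;

# Hypothetical sample text, modelled on the description in the question.
my $text = <<'END_SAMPLE';
ok: some text
##### START block #####
first block
##### END block #####
ok: some more text
##### START block #####
second block
##### END block #####
ok: final text
END_SAMPLE

# Non-greedy '.*?' stops at the nearest END marker.
#   /s lets '.' match newlines,
#   /m lets '^' and '$' match at line boundaries (an assumption, see above),
#   /g in list context returns every match found.
my @blocks = $text =~ /^##### START block #####.*?##### END block #####$/smg;

printf "blocks found: %d\n", scalar @blocks;    # prints "blocks found: 2"

# Print each block followed by a separator line.
print "$_\n---\n" for @blocks;
```

On the command line, one way to give the pattern the whole file at once is
Perl's slurp mode (-0777). Something along these lines should work, though it
is a sketch and has not been run against your actual file:

    perl -0777 -ne 'print "$&\n" while /^##### START block #####.*?##### END block #####$/smg' file.txt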




