
List:       privoxy-developers
Subject:    [privoxy-devel] [ ijbswa-Bugs-972839 ] lookbehind works strange
From:       "SourceForge.net" <noreply () sourceforge ! net>
Date:       2004-06-15 12:01:50
Message-ID: E1BaCdS-00039Q-00 () sc8-sf-web1 ! sourceforge ! net

Bugs item #972839, was opened at 2004-06-14 14:39
Message generated for change (Comment added) made by dessent
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=111118&aid=972839&group_id=11118

Category: funct: filtering
Group: version 3.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: lookbehind works strange

Initial Comment:
I want to replace a minus between digits with "&mdash;". I 
don't want this to happen inside tags.

I wrote a filter:
s/(?<=>)([^<]*?\d)-(?=\d)/$1&mdash;<!-- -->/sig

It looks behind for a ">", then matches any characters that 
aren't "<", then a digit, then a minus, and looks ahead for 
another digit. I put a comment at the end of the replacement 
to provide a ">" for the next match. 

In fact the filter turns the string ">2004-06-14" into 
">2004&mdash;<!-- -->06-14". I guess that it doesn't look 
behind to see the ">" generated by the first replacement. 
Option "g" works: the next instance of ">2004-06-14" is 
converted the same way.

I use win32 binary Privoxy v3.0.3 with Opera 7.5 (guess that 
browser doesn't matter here).

Log:
Jun 15 01:26:29 Privoxy(02336) Re-Filter: Adding re_filter 
job s/(?<=>)([^<]*?\d)-(?=\d)/$1&mdash;<!-- -->/sig to 
filter tire succeeded.
.........
Jun 15 01:26:31 Privoxy(03708) Re-Filter: re_filtering 
www.livejournal.com/users/dolboeb/440396.html?nc=1 
(size 21438) with filter tire...
Jun 15 01:26:31 Privoxy(03708) Re-Filter:  ...produced 6 
hits (new size 21522).

Sorry, I couldn't log in: SourceForge sends me neither my 
password nor a hash. My e-mail is arttreg@mail.ru

----------------------------------------------------------------------

>Comment By: Brian (dessent)
Date: 2004-06-15 05:01

Message:
Logged In: YES 
user_id=585719

A useful trick for debugging these pcre substitutions is the 
following:

echo "string to match" | perl -e 'use re "debugcolor";' -pe 's/foo/bar/sig'

or in your example:

echo ">2004-06-14" | perl -e 'use re "debugcolor";' \
  -pe 's/(?<=>)([^<]*?\d)-(?=\d)/$1&mdash;<!-- -->/sig'

This will show the details of the RE matching.  Use "debug" 
instead of "debugcolor" to get a plaintext version.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-06-15 03:26

Message:
Logged In: NO 

My fault. Today I tested my regex in perl, and it looks like I 
had misunderstood how lookbehind works in pcre.

Thank you for the advice :)

----------------------------------------------------------------------

Comment By: Brian (dessent)
Date: 2004-06-15 01:01

Message:
Logged In: YES 
user_id=585719

This is not a fault of Privoxy but rather just how pcre works.  
When using "/g" all the matching is done against the 
unmodified source string.  You can't base the next 
replacement off something that was changed in the previous 
match, unless you iterate through each replacement 
individually, i.e. without using "/g", which would not be 
possible with Privoxy.
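
To make the difference concrete, here is a minimal sketch in 
Python, whose "re" module I am using only as a stand-in for 
pcre (Privoxy itself does not run Python). The function names 
are made up for illustration. The first function is the 
equivalent of "/g"; the second re-scans the modified string 
after every single replacement, which is the iteration 
described above.

```python
import re

# The filter's pattern; re.S and re.I mirror the "s" and "i" options.
PATTERN = re.compile(r"(?<=>)([^<]*?\d)-(?=\d)", re.S | re.I)
REPLACEMENT = r"\g<1>&mdash;<!-- -->"


def replace_globally(text):
    """Like /g: every match is located in the unmodified input."""
    return PATTERN.sub(REPLACEMENT, text)


def replace_iteratively(text):
    """Replace one match at a time, re-scanning the modified text."""
    while True:
        new_text, hits = PATTERN.subn(REPLACEMENT, text, count=1)
        if hits == 0:
            return text
        text = new_text


sample = ">2004-06-14"
print(replace_globally(sample))     # >2004&mdash;<!-- -->06-14
print(replace_iteratively(sample))  # >2004&mdash;<!-- -->06&mdash;<!-- -->14
```

In the global case the lookbehind before "06" sees the original 
"-", not the ">" of the inserted comment, so the second hyphen 
is left alone.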

In my opinion you're going about this wrong.  My first advice 
would be that if you have some specific forms of data that 
you're trying to match, then just code something to match 
them, such as dates.  You'll pull your hair out trying to do 
something that's completely and 100% generic and doesn't 
fail in some circumstances... This is precisely why future 
Privoxies will require a real parser, as working with tags with 
REs like this can be very hard to do right.
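
As a rough illustration of the "match a specific form" idea, 
the sketch below (Python's "re" module again standing in for 
pcre) handles only dates written as YYYY-MM-DD. The pattern and 
the function name are my own assumptions, not anything Privoxy 
ships. Because one match covers both hyphens, nothing depends 
on an earlier replacement. Note that it would still fire on a 
date inside a tag.

```python
import re

# Hypothetical pattern: a four-digit year, two-digit month and day.
DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def dash_dates(text):
    """Swap both hyphens of a YYYY-MM-DD date for &mdash; in one match."""
    return DATE.sub(r"\1&mdash;\2&mdash;\3", text)


print(dash_dates(">2004-06-14"))  # >2004&mdash;06&mdash;14
print(dash_dates("call 555-1234"))  # call 555-1234 (not a date, unchanged)
```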

My second suggestion, if you can't make your filter specific to 
some easily identifiable data, would be to make two filters.  
The first changes all occurrences of a '-' between digit groups 
to &mdash; plus a distinctive HTML comment to flag the 
replacement.  Then a second filter looks for 
"&mdash;<!-- foo -->" inside tags and changes them back to 
regular '-' characters.  This is neither as pretty nor as 
efficient, but sometimes you have to resort to doing it in more 
than one step.
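
A sketch of that two-pass idea, written in Python as a stand-in 
for two Privoxy filters. The marker text "dash" and all function 
names are hypothetical. The one subtle point is that the marker 
itself contains a ">", so the tag pattern in the second pass 
has to swallow whole markers before it looks for the closing 
">" of the tag. Existing HTML comments and unclosed tags are 
not handled here.

```python
import re

# Hypothetical marker: the entity plus a distinctive comment.
MARKER = "&mdash;<!-- dash -->"

# Pass 1: any hyphen directly between two digits.
DIGIT_HYPHEN = re.compile(r"(?<=\d)-(?=\d)")

# Pass 2: a tag (not a comment) that may contain whole markers.
TAG = re.compile(r"<(?!!--)(?:" + re.escape(MARKER) + r"|[^>])*>")


def mark_dashes(html):
    """First filter: flag every digit-hyphen-digit, tags included."""
    return DIGIT_HYPHEN.sub(MARKER, html)


def restore_in_tags(html):
    """Second filter: turn flagged dashes inside tags back into '-'."""
    return TAG.sub(lambda m: m.group(0).replace(MARKER, "-"), html)


def dashify(html):
    return restore_in_tags(mark_dashes(html))


print(dashify('<a href="/2004-06-14.html">2004-06-14</a>'))
# <a href="/2004-06-14.html">2004&mdash;<!-- dash -->06&mdash;<!-- dash -->14</a>
```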

----------------------------------------------------------------------


