List: python-list
Subject: Re: String multi-replace
From: Frederic Rentsch <anthra.norell () bluewin ! ch>
Date: 2010-11-18 21:38:09
Message-ID: 1290116289.2902.81.camel () hatchbox-one
On Wed, 2010-11-17 at 21:12 -0800, Sorin Schwimmer wrote:
> Thanks for your answers.
>
> Benjamin Kaplan: of course dict is a type... silly me! I'll blame it on the time (it's midnight here).
>
> Chris Rebert: I'll have a look.
>
> Thank you both,
> SxN
>
>
Forgive me if this is off track; I haven't followed the thread. I do
have a little module that I believe does what you attempted to do:
multiple substitutions using a regular expression that joins a bunch of
targets with '|' in between. You will have to determine whether you risk
unintended translations, as Dave Angel pointed out, where the two
characters of one of your targets come together coincidentally. If you
do, you can't use this approach. If, on the other hand, your format is
safe, it'll work just fine. Use it like this:
>>> import translator
>>> t = translator.Translator (nodia.items ())
>>> t (name) # Your example
'Rasca'
Frederic
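
The approach described in the message above (escape each target, join them with '|', and look up the substitute for whatever matched) can be sketched in a few lines. This is only an illustration, not the attached module: the helper name `multi_replace` and the contents of `nodia` are assumptions made up for the example, since the original mapping from the thread is not shown here.

```python
import re

def multi_replace(definitions, text):
    """Replace every target in `text` in one pass (hypothetical helper)."""
    table = dict(definitions)
    # Try longer targets first, so 'abc' wins over 'ab' and 'a'
    # when they start at the same position.
    targets = sorted(table, key=len, reverse=True)
    regex = re.compile('|'.join(re.escape(t) for t in targets))
    # The callback looks up the substitute for the exact text that matched.
    return regex.sub(lambda m: table[m.group(0)], text)

# Assumed stand-in for the 'nodia' mapping: a few diacritics to plain ASCII.
nodia = {'\u00e2': 'a', '\u015f': 's', '\u0163': 't', '\u0103': 'a'}

print(multi_replace(nodia.items(), 'R\u00e2\u015fca'))   # prints Rasca
```

Because the text is scanned once, a substitute is never re-translated by a later rule, which is the main difference from chaining `str.replace` calls.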
["translator.py" (translator.py)]
class Translator:

    """
    Will translate any number of targets, handling them correctly if some overlap.

    Making Translator
        T = Translator (definitions, [eat = 1])
        'definitions' is a sequence of pairs: ((target, substitute),(t2, s2), ...)
        'eat = True' will make an extraction filter that lets only the replaced
        targets pass.
        Definitions example: (('a','A'),('b','B'),('ab','ab'),('abc','xyz'),
            ('\x0c', 'page break'), ('\r\n','\n'), (' ','\t'))  # ('ab','ab') see Tricks.
        Order doesn't matter.

    Testing
        T.test (). Translates the definitions and prints the result. All targets
        must look like the substitutes as defined. If a substitute differs, it has been
        affected by the translation. (E.g. 'A'|'A' ... 'page break'|'pAge BreAk').
        If this is not intended---the effect can be useful---protect the
        affected substitute by translating it to itself. See Tricks.

    Running
        translation = T (source)

    Tricks
        Deletion:  ('target', '')
        Exception: (('\n',''), ('\n\n','\n\n'))        # Eat LF except paragraph breaks.
        Exception: (('\n', '\r\n'), ('\r\n','\r\n'))   # Unix to DOS, would leave DOS unchanged
        Translation cascade:
            # Unwrap paragraphs, Unix or DOS, restoring inter-word space if missing,
            Mark_LF = Translator ((('\n','+LF+'),('\r\n','+LF+'),('\n\n','\n\n'),('\r\n\r\n','\r\n\r\n')))
                # Pick positively identifiable mark for end of lines in either Unix or MS-DOS.
            Single_Space_Mark = Translator (((' +LF+', ' '),('+LF+', ' '),('-+LF+', '')))
            no_lf_text = Single_Space_Mark (Mark_LF (text))
        Translation cascade:
            reptiles = T_latin_english (T_german_latin (reptilien))

    Limitations
        1. The number of substitutions and the maximum size of input depend on the
           respective capabilities of the Python re module.
        2. Regular expressions will not work as such but will be handled literally.

    Author:
        Frederic Rentsch (info@anthra-norell.ch).
    """

    def __init__ (self, definitions, eat = 0):

        '''
        definitions: a sequence of pairs of strings. ((target, substitute), (t, s), ...)
        eat: False (0) means translate: unaffected data passes unaltered.
             True (1) means extract: unaffected data doesn't pass (gets eaten).
             Extraction filters typically require substitutes to end with some separator,
             else they fuse together. (E.g. ' ', '\t' or '\n')
        'eat' is an attribute that can be switched anytime.
        '''

        self.eat = eat
        self.compile_sequence_of_pairs (definitions)

    def compile_sequence_of_pairs (self, definitions):

        '''
        Argument 'definitions' is a sequence of pairs:
        (('target 1', 'substitute 1'), ('t2', 's2'), ...)
        Order doesn't matter.
        '''

        import re
        self.definitions = definitions
        targets, substitutes = zip (*definitions)
        re_targets = [re.escape (item) for item in targets]
        re_targets.sort (reverse = True)   # Longer targets precede their own prefixes.
        self.targets_set = set (targets)
        self.table = dict (definitions)
        regex_string = '|'.join (re_targets)
        self.regex = re.compile (regex_string, re.DOTALL)

    def __call__ (self, s):
        from functools import reduce   # Built-in on Python 2; must be imported on Python 3.
        hits = self.regex.findall (s)
        nohits = self.regex.split (s)
        # Ignore targets with illegal re modifiers.
        valid_hits = set (hits) & self.targets_set
        if valid_hits:
            # Make lengths equal for zip to work right
            substitutes = [self.table [item] for item in hits if item in valid_hits] + []
            if self.eat:
                return ''.join (substitutes)
            else:
                zipped = zip (nohits, substitutes)
                return ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
        else:
            if self.eat:
                return ''
            else:
                return s

    def test (self):

        '''
        Translates the definitions and prints the result. All targets
        must look like the substitutes as defined. If a substitute differs,
        it has been affected by the translation, indicating a potential
        problem, should the substitute occur in the source.
        '''

        targets_translated = [self (item [0]) for item in self.definitions]
        substitutes = [self (item [1]) for item in self.definitions]
        for item in [(repr (targets_translated [n]), repr (substitutes [n]))
                     for n in range (len (substitutes))]:
            print ('%s|%s' % item)   # Parenthesized form runs on Python 2 and 3.
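
The "Tricks" and the 'eat' switch described in the class docstring are easier to judge with concrete input. The sketch below is a compact stand-in written for this purpose, not the module itself: `make_translator` is a hypothetical name, and it uses `re.sub` and `re.finditer` instead of the module's findall/split approach.

```python
import re

def make_translator(definitions, eat=False):
    """Return a one-pass multi-replace function (illustrative stand-in)."""
    table = dict(definitions)
    ordered = sorted(table, key=len, reverse=True)   # longest target first
    regex = re.compile('|'.join(re.escape(t) for t in ordered))

    def translate(s):
        if eat:
            # Extraction filter: only the substitutes pass, the rest is eaten.
            return ''.join(table[m.group(0)] for m in regex.finditer(s))
        return regex.sub(lambda m: table[m.group(0)], s)

    return translate

# Exception trick: eat single line feeds, but map paragraph breaks to themselves.
unwrap = make_translator((('\n', ''), ('\n\n', '\n\n')))
print(repr(unwrap('one\ntwo\n\nthree')))        # 'onetwo\n\nthree'

# Exception trick: Unix to DOS, leaving existing DOS line ends unchanged.
to_dos = make_translator((('\n', '\r\n'), ('\r\n', '\r\n')))
print(repr(to_dos('a\nb\r\nc')))                # 'a\r\nb\r\nc'

# Extraction: substitutes end with a separator so they do not fuse together.
pick = make_translator((('cat', 'cat '), ('dog', 'dog ')), eat=True)
print(repr(pick('a cat chased a dog')))         # 'cat dog '
```

Mapping a longer target to itself works as an exception only because the longer alternative is tried before its shorter prefix.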
--
http://mail.python.org/mailman/listinfo/python-list