'Re: String multi-replace'


List:       python-list
Subject:    Re: String multi-replace
From:       Frederic Rentsch <anthra.norell () bluewin ! ch>
Date:       2010-11-18 21:38:09
Message-ID: 1290116289.2902.81.camel () hatchbox-one

On Wed, 2010-11-17 at 21:12 -0800, Sorin Schwimmer wrote:
> Thanks for your answers.
> 
> Benjamin Kaplan: of course dict is a type... silly me! I'll blame it on the time (it's midnight here).
> 
> Chris Rebert: I'll have a look.
> 
> Thank you both,
> SxN
> 
> 

Forgive me if this is off track; I haven't followed the thread. I do
have a little module that I believe does what you attempted to do:
multiple substitutions using a regular expression that joins a bunch of
targets with '|' in between. You will have to determine for yourself
whether you risk unintended translations, as Dave Angel pointed out,
where the two characters of one of your targets end up next to each
other by coincidence. If you do, you can't use this approach. If, on the
other hand, your format is safe, it'll work just fine. Use it like this:

>>> import translator
>>> t = translator.Translator (nodia.items ())
>>> t (name)  # Your example
'Rasca'
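
For anyone who just wants the idea without the module, the joined-'|'
approach can be sketched in a few lines. This is only an illustration in
Python 3, not the module's code: multi_replace is a hypothetical helper,
and the nodia mapping and the name below are made-up stand-ins for the
ones used earlier in the thread.

```python
import re

def multi_replace(text, table):
    """Replace every key of `table` found in `text` with its value, in one pass."""
    # Longest targets first, so that 'abc' wins over 'ab' where both could match.
    targets = sorted(table, key=len, reverse=True)
    # re.escape makes each target match literally, even if it contains
    # characters that are special in regular expressions.
    pattern = re.compile('|'.join(re.escape(t) for t in targets))
    return pattern.sub(lambda m: table[m.group(0)], text)

# Hypothetical stand-ins for the mapping and the name from the thread.
nodia = {'ă': 'a', 'â': 'a', 'î': 'i', 'ș': 's', 'ț': 't'}
name = 'Rașcă'
print(multi_replace(name, nodia))  # prints Rasca
```

Because the text is scanned once, a substitute is never translated again
by a later rule, which is what chained str.replace calls would risk.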

Frederic

["translator.py" (translator.py)]

class Translator:                                    

	"""
		Will translate any number of targets, handling them correctly if some overlap.

		Making Translator
			T = Translator (definitions, [eat = 1])
			'definitions' is a sequence of pairs: ((target, substitute),(t2, s2), ...)
			'eat = True' will make an extraction filter that lets only the replaced targets pass.
			Definitions example: (('a','A'),('b','B'),('ab','ab'),('abc','xyz'),
				('\x0c', 'page break'), ('\r\n','\n'), ('   ','\t'))   # ('ab','ab') see Tricks.
			Order doesn't matter.          

		Testing
			T.test (). Translates the definitions and prints the result. All targets 
			must look like the substitutes as defined. If a substitute differs, it has been
			affected by the translation. (E.g. 'A'|'A' ... 'page break'|'pAge BreAk').
			If this is not intended---the effect can be useful---protect the 
			affected substitute by translating it to itself. See Tricks. 

		Running
			translation = T (source)

		Tricks 
			Deletion:  ('target', '')
			Exception: (('\n',''), ('\n\n','\n\n'))     # Eat LF except paragraph breaks.
			Exception: (('\n', '\r\n'), ('\r\n','\r\n')) # Unix to DOS, would leave DOS unchanged
			Translation cascade: 
				# Unwrap paragraphs, Unix or DOS, restoring inter-word space if missing.
				# Pick positively identifiable mark for end of lines in either Unix or MS-DOS.
				Mark_LF = Translator ((('\n','+LF+'),('\r\n','+LF+'),('\n\n','\n\n'),('\r\n\r\n','\r\n\r\n')))
				Single_Space_Mark = Translator (((' +LF+', ' '),('+LF+', ' '),('-+LF+', '')))
				no_lf_text = Single_Space_Mark (Mark_LF (text))
			Translation cascade: 
				reptiles = T_latin_english (T_german_latin (reptilien))

		Limitations
			1. The number of substitutions and the maximum size of input depend on the
			   respective capabilities of the Python re module.
			2. Regular expressions will not work as such but will be handled literally.

		Author:
			Frederic Rentsch (info@anthra-norell.ch).

	"""

	def __init__ (self, definitions, eat = 0):

		'''
			definitions: a sequence of pairs of strings. ((target, substitute), (t, s), ...)
			eat: False (0) means translate: unaffected data passes unaltered.
			     True  (1) means extract:   unaffected data doesn't pass (gets eaten).
			     Extraction filters typically require substitutes to end with some separator, 
			     else they fuse together. (E.g. ' ', '\t' or '\n') 
			'eat' is an attribute that can be switched anytime.

		'''			
		self.eat = eat
		self.compile_sequence_of_pairs (definitions)

	def compile_sequence_of_pairs (self, definitions):

		'''
			Argument 'definitions' is a sequence of pairs:
			(('target 1', 'substitute 1'), ('t2', 's2'), ...)
			Order doesn't matter.         

		'''

		import re
		self.definitions = definitions
		targets, substitutes = zip (*definitions)
		re_targets = [re.escape (item) for item in targets]
		re_targets.sort (reverse = True)
		self.targets_set = set (targets)                           
		self.table = dict (definitions)
		regex_string = '|'.join (re_targets)
		self.regex = re.compile (regex_string, re.DOTALL)

	def __call__ (self, s):
		hits = self.regex.findall (s)
		nohits = self.regex.split (s)
		valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
		if valid_hits:
			substitutes = [self.table [item] for item in hits if item in valid_hits] + []  # Make lengths equal for zip to work right
			if self.eat:
				return ''.join (substitutes)
			else:            
				zipped = zip (nohits, substitutes)
				return ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
		else:
			if self.eat:
				return ''
			else:
				return s

	def test (self):

		'''
			Translates the definitions and prints the result. All targets 
			must look like the substitutes as defined. If a substitute differs,
			it has been affected by the translation, indicating a potential 
			problem, should the substitute occur in the source.

		'''

		targets_translated = [self (item [0]) for item in self.definitions]
		substitutes = [self (item [1]) for item in self.definitions]
		for item in [(repr (targets_translated [n]), repr (substitutes [n])) for n in range (len (substitutes))]:
			print '%s|%s' % (item)
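
The module above is written for Python 2 (print statement, built-in
reduce). As a rough sketch only, the same translate/extract behaviour
might look like this in Python 3. Translator3 is a hypothetical name, it
leaves out test (), and it is not a drop-in replacement for the class
above; the example inputs are made up to show two of the tricks from the
docstring.

```python
import re

class Translator3:
    """Python 3 sketch of the Translator idea (hypothetical, simplified)."""

    def __init__(self, definitions, eat=False):
        self.definitions = list(definitions)
        self.table = dict(self.definitions)
        self.eat = eat
        # Reverse sort puts 'abc' before 'ab' before 'a', so the longest
        # of several targets sharing a prefix is tried first.
        targets = sorted(self.table, reverse=True)
        self.regex = re.compile('|'.join(re.escape(t) for t in targets), re.DOTALL)

    def __call__(self, s):
        if self.eat:
            # Extraction: only the substitutes of the targets found pass.
            return ''.join(self.table[m.group(0)] for m in self.regex.finditer(s))
        # Translation: everything between the hits passes unaltered.
        return self.regex.sub(lambda m: self.table[m.group(0)], s)


# Trick from the docstring: eat single line feeds, keep paragraph breaks
# by translating '\n\n' to itself.
unwrap = Translator3((('\n', ''), ('\n\n', '\n\n')))
print(repr(unwrap('one\ntwo\n\nthree')))   # 'onetwo\n\nthree'

# Extraction filter: substitutes end with a separator so they do not fuse.
pick = Translator3((('cat', 'CAT '), ('dog', 'DOG ')), eat=True)
print(repr(pick('a cat and a dog')))       # 'CAT DOG '
```

Using sub and finditer with a lookup function avoids the findall/split/zip
bookkeeping, at the cost of departing from the original implementation.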

-- 
http://mail.python.org/mailman/listinfo/python-list
