List: python-list
Subject: Re: String multi-replace
From: "Steven D'Aprano" <steve-REMOVE-THIS () cybersource ! com ! au>
Date: 2010-11-18 5:10:03
Message-ID: 4ce4b52b$0$29981$c3e8da3$5496439d () news ! astraweb ! com
On Wed, 17 Nov 2010 20:21:06 -0800, Sorin Schwimmer wrote:
> Hi All,
>
> I have to eliminate diacritics in a fairly large file.
What's "fairly large"? Large to you is probably not large to your
computer. Anything less than a few dozen megabytes is small enough to be
read entirely into memory.
> Inspired by http://code.activestate.com/recipes/81330/, I came up with
> the following code:
If all you are doing is replacing single characters, then there's no need
for the 80lb sledgehammer of regular expressions when all you need is a
delicate tack hammer. Instead of this:
* read the file as bytes
* search for pairs of bytes like chr(195)+chr(130) using a regex
* replace them with single bytes like 'A'
do this:
* read the file as Unicode text
* search for characters like Â
* replace them with single characters like A using unicode.translate()
(or str.translate() in Python 3.x)
The only gotcha is that you need to know (or guess) the encoding to read
the file correctly.
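Something along these lines should do for the second approach in Python 3. Treat it as a rough sketch, not tested against your data: the mapping, the function names and the utf-8 default are all assumptions of mine, so fill in the characters and the encoding your file really uses. (In Python 2, unicode.translate() wants a dict keyed by ordinals instead of a str.maketrans() table.)

```python
# Hypothetical mapping from accented characters to plain ASCII.
# Extend it with whatever characters actually occur in your file.
DIACRITICS = str.maketrans({
    "\u00c2": "A",  # A with circumflex
    "\u00e2": "a",  # a with circumflex
    "\u0102": "A",  # A with breve
    "\u0103": "a",  # a with breve
    "\u00ce": "I",  # I with circumflex
    "\u00ee": "i",  # i with circumflex
})


def strip_diacritics(text):
    """Return text with every mapped character replaced by its plain form."""
    return text.translate(DIACRITICS)


def convert_file(src, dst, encoding="utf-8"):
    """Read src as text, strip diacritics, write the result to dst.

    The encoding is a guess; pass the one the file was really saved in.
    """
    with open(src, encoding=encoding) as f:
        text = f.read()  # fine for files that fit in memory
    with open(dst, "w", encoding=encoding) as f:
        f.write(strip_diacritics(text))
```

Characters that are not in the table pass through unchanged, so a missing entry shows up as a leftover accent rather than an error.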
--
Steven
--
http://mail.python.org/mailman/listinfo/python-list