List: python-list
Subject: Re: String multi-replace
From: Dave Angel <davea () ieee ! org>
Date: 2010-11-18 14:19:22
Message-ID: 4CE535EA.7040309 () ieee ! org
On 2:59 PM, Sorin Schwimmer wrote:
> Steven D'Aprano: the original file is 139MB (that's the typical size for it).
> Eliminating diacritics is just a little topping on the cake; the processing is
> something else.
> Thanks anyway for your suggestion,
> SxN
>
> PS Perhaps I should have mentioned that I'm on Python 2.7
>
>
In the message you were replying to, Steven had a much more important
suggestion to make than the size one, and you apparently didn't notice
it. Chris made a similar implication. I'll try a third time.
The file is obviously encoded, and you know the encoding. Judging from
the first entry in your table, it's in utf-8. If so, then your approach
is fragile. Treating the text as a pile of bytes and replacing byte
pairs is only safe if every key in your table is a complete, valid
utf-8 sequence and the file itself is valid utf-8 (utf-8 is
self-synchronizing, so under those conditions a key cannot straddle two
characters). In most other multi-byte encodings, it's quite possible to
get a match with the last byte of one character and the first byte of
another one. If you substitute such a match, you'll make a hash of the
whole region, and quite likely end up with a byte stream that is no
longer valid in that encoding.
Fortunately, you can solve that problem, and simplify your code greatly
in the bargain, by doing something like what was suggested by Steven.
Change your map of encoded bytes into unicode_nodia, using
ord(key.decode("utf-8")) on the keys (translate looks characters up by
code point) and u"" literals on the values.
Read in each line of the file, decode it to the unicode it represents,
and do a simple translate once it's valid unicode.
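
A minimal sketch of that map conversion, assuming a small hypothetical
byte-keyed table (the name byte_table, its four entries, and the helper
build_translate_map are illustrative and not taken from the original
script). translate() looks characters up by integer code point, so each
decoded key is wrapped in ord(). The b"" and u"" literals keep the
sketch valid on Python 2.7 as well as 3.3+.

```python
# Hypothetical excerpt of a byte-keyed table: UTF-8 bytes -> ASCII replacement.
byte_table = {
    b"\xc4\x82": u"A",  # U+0102 LATIN CAPITAL LETTER A WITH BREVE
    b"\xc4\x83": u"a",  # U+0103 LATIN SMALL LETTER A WITH BREVE
    b"\xc8\x98": u"S",  # U+0218 LATIN CAPITAL LETTER S WITH COMMA BELOW
    b"\xc8\x99": u"s",  # U+0219 LATIN SMALL LETTER S WITH COMMA BELOW
}


def build_translate_map(table):
    """Convert {utf8_bytes: replacement} into {code_point: replacement}.

    Assumes every key decodes to exactly one character; ord() raises
    TypeError otherwise, which flags a bad table entry early.
    """
    return dict((ord(key.decode("utf-8")), value)
                for key, value in table.items())


unicode_nodia = build_translate_map(byte_table)

# Decode first, then translate whole characters rather than raw bytes.
sample = b"\xc8\x98tefan \xc4\x83".decode("utf-8")
cleaned = sample.translate(unicode_nodia)
```

Characters that are not in the map pass through translate() unchanged,
so the table only needs entries for the letters you want to replace.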
Assuming the line is in utf-8, use

    uni = line.decode("utf-8")
    newuni = uni.translate(unicode_nodia)
    newutf8 = newuni.encode("utf-8")

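
For a file of that size, the same three steps can be applied while
streaming line by line. The following is only a sketch under
assumptions: the function name strip_diacritics_file, the paths passed
to it, and the two-entry map are hypothetical. It uses io.open, which
exists on both Python 2.7 and 3, to do the decode and encode steps, and
newline="" to leave the original line endings untouched.

```python
import io

# Hypothetical two-entry map: keys are code points, values are replacements.
unicode_nodia = {0x0102: u"A", 0x0103: u"a"}


def strip_diacritics_file(src_path, dst_path, table, encoding="utf-8"):
    """Copy src_path to dst_path, translating each decoded line with table."""
    with io.open(src_path, "r", encoding=encoding, newline="") as src:
        with io.open(dst_path, "w", encoding=encoding, newline="") as dst:
            for line in src:
                # line is already unicode here; io.open did the decoding.
                dst.write(line.translate(table))
```

Only one line is held in memory at a time, and a file that is not
valid in the stated encoding raises UnicodeDecodeError instead of being
silently mangled.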
Incidentally, to see what a given byte pair in your table is, you can do
something like:

    >>> import unicodedata
    >>> a = chr(196)+chr(130)
    >>> unicodedata.name(a.decode("utf-8"))
    'LATIN CAPITAL LETTER A WITH BREVE'

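
The same check can be run over every key in the table to confirm that
each one really is a single character. This is a rough sketch: the
helper name describe and the two sample keys are made up for
illustration, and it assumes each key decodes to exactly one character.

```python
import unicodedata

# Hypothetical sample of table keys, written as UTF-8 byte strings.
byte_keys = [b"\xc4\x82", b"\xc8\x99"]


def describe(key, encoding="utf-8"):
    """Return the code point and Unicode name for a one-character byte key."""
    char = key.decode(encoding)
    # ord() raises TypeError if the key decoded to more than one character.
    return u"U+%04X %s" % (ord(char), unicodedata.name(char, u"<unnamed>"))


descriptions = [describe(key) for key in byte_keys]
```

A key that raises UnicodeDecodeError or TypeError here is not a whole
character in that encoding and is worth a second look.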
DaveA
--
http://mail.python.org/mailman/listinfo/python-list