
List:       apache-modperl-cvs
Subject:    Re: cvs commit: modperl/t/net/perl util.pl
From:       Eric Cholet <cholet () logilune ! com>
Date:       2002-03-25 18:22:39

--On Sunday, March 24, 2002 21:57:54 +0000 dougm@apache.org wrote:

> dougm       02/03/24 13:57:53
>
>   Modified:    .        Changes STATUS
>                src/modules/perl Util.xs
>                t/net/perl util.pl
>   Log:
>   Submitted by:   Geoff Young <geoff@modperlcookbook.org>
>   Reviewed by:	dougm
>   properly escape highbit chars in Apache::Utils::escape_html

This is uncool for those of us using a non-ASCII encoding and sending
out lots of characters with the 8th bit set, e.g. in a French page
many accented characters will be replaced by 6-byte sequences.
If I'm sending out "Content-type: text/html; charset=ISO-8859-1",
and calling escape_html to escape '<', '>' and the like, I'm going
to be serving quite a lot more bytes than before this patch.
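
To put a rough number on that overhead, here is a minimal sketch in plain Perl. The helper escape_highbit is hypothetical: it only imitates the behaviour described in the commit log, assuming each byte above 0x7F becomes a decimal character reference, and is not the actual Util.xs code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the patched behaviour; NOT the real Util.xs code.
# Escapes the four HTML-special characters plus every byte with the 8th bit set.
sub escape_highbit {
    my ($s) = @_;
    my %named = ('<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;');
    $s =~ s/([<>&"])/$named{$1}/g;                    # markup characters
    $s =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/ge;    # high-bit bytes
    return $s;
}

# The French word for "summer" as ISO-8859-1 bytes: 3 bytes on the wire.
my $latin1  = "\xe9t\xe9";
my $escaped = escape_highbit($latin1);

printf "%d bytes before, %d bytes after: %s\n",
    length($latin1), length($escaped), $escaped;
# prints "3 bytes before, 13 bytes after: &#233;t&#233;"
```

Under that assumption, every accented character in a Latin-1 page costs six bytes instead of one.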

However, escape_html() has no clue as to what the character set is,
or whether it has been correctly specified in the Content-Type.
It has also been mentioned here that escape_html is only valid for
single-byte encodings.

So this patch does the right thing to escape the odd 8-bit char in
a mostly ASCII output, but users of other charsets should be warned
not to use it. I use HTML::Entities::encode($_[0], '<>&"') myself.
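
A minimal sketch of that alternative, assuming the CPAN module HTML::Entities is installed (it is not part of core Perl). The sample string is made up for illustration. Passing an explicit set of unsafe characters restricts escaping to those characters, so ISO-8859-1 bytes pass through unchanged.

```perl
use strict;
use warnings;
use HTML::Entities ();    # CPAN module, assumed to be installed

# Illustrative input: markup plus one ISO-8859-1 accented byte (e-acute).
my $latin1 = "<b>caf\xe9</b>";

# Second argument lists the only characters to escape.
# Called in non-void context, encode_entities returns a copy.
my $out = HTML::Entities::encode_entities($latin1, '<>&"');

# The markup characters become named entities; \xe9 is left as a single byte.
print "accented byte untouched\n" if index($out, "\xe9") >= 0;
print "markup escaped\n"          if $out !~ /[<>]/;
```

The page must still declare the matching charset in its Content-Type header for the raw bytes to render correctly.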

Therefore I propose a doc patch to clear this up:

Index: Util.pm
===================================================================
RCS file: /home/cvs/modperl/Util/Util.pm,v
retrieving revision 1.8
diff -u -r1.8 Util.pm
--- Util.pm	4 Mar 2000 20:55:47 -0000	1.8
+++ Util.pm	25 Mar 2002 18:19:37 -0000
@@ -68,6 +68,13 @@

  my $esc = Apache::Util::escape_html($html);

+This function is unaware of its argument's character set and encoding.
+It assumes a single-byte encoding and escapes all characters with the
+8th bit set. Do not use it with multi-byte encodings such as UTF-8.
+When using a single-byte non-ASCII encoding such as ISO-8859-1,
+consider specifying the character set in the Content-Type header,
+and using HTML::Entities to avoid unnecessary escaping.
+
 =item escape_uri

 This function replaces all unsafe characters in the $string with their


--
Eric Cholet
