List:       php-doc-bugs
Subject:    [DOC-BUGS] Bug->Doc #53187 [Opn]: urlencode and ~ encodes differently in the various versions of the
From:       cataphract () php ! net
Date:       2010-10-28 15:16:38
Message-ID: 20101028151638.B7A361093 () ez1 ! php ! net

Edit report at http://bugs.php.net/bug.php?id=53187&edit=1

 ID:                 53187
 Updated by:         cataphract@php.net
 Reported by:        asphp at dsgml dot com
 Summary:            urlencode and ~ encodes differently in the various
                     versions of the functions
 Status:             Open
-Type:               Bug
+Type:               Documentation Problem
 Package:            URL related
 PHP Version:        trunk-SVN-2010-10-27 (SVN)
 Block user comment: N

 New Comment:

RFC 3986, which rawurlencode is apparently meant to follow (though the
documentation still cites the obsolete RFC 1738 -- that should be
changed), says:

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.

So the behavior of rawurlencode is correct in not escaping ~.
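
To make the quoted rule concrete, here is a minimal sketch. The helper
normalize_unreserved() is hypothetical (it is not part of PHP); it only
decodes percent-encoded octets that map to RFC 3986 unreserved
characters, which is what the RFC asks of URI normalizers. The
rawurlencode() output shown assumes PHP 5.3.0 or later, and the type
declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper, for illustration only: decode %XX sequences that
// correspond to RFC 3986 unreserved characters and leave all others alone.
function normalize_unreserved(string $uri): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        function (array $m): string {
            $char = chr(hexdec($m[1]));
            // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
            return preg_match('/^[A-Za-z0-9\-._~]$/', $char) ? $char : $m[0];
        },
        $uri
    );
}

// A producer following RFC 3986 does not create %7E for the tilde.
echo rawurlencode('~user/a b'), "\n";               // ~user%2Fa%20b

// A normalizer turns %7E back into "~" but keeps the reserved %2f escaped.
echo normalize_unreserved('%7Euser%2fdocs'), "\n";  // ~user%2fdocs
```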

The urlencode documentation, on the other hand, refers to the
application/x-www-form-urlencoded encoding. This is defined in the
HTML 4.01 and XForms 1.1 specs:

* http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
* http://www.w3.org/TR/2009/REC-xforms-20091020/#serialize-urlencode

The first points to RFC 1738, under which ~ should be encoded, as it
currently is. The second mentions reserved characters "as defined by
[RFC 2396] as amended by subsequent documents in the IETF track", so it
would refer to RFC 3986, which obsoletes RFC 2396. I'd say the current
behavior is correct because it "plays safe" by following HTML 4/RFC 1738.
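
For reference, a short sketch of the difference being discussed. The
input string is arbitrary, and the rawurlencode() result assumes PHP
5.3.0 or later (older versions escaped the tilde as well).

```php
<?php
$input = 'a b~c';

// Form-style encoding: space becomes "+", tilde is escaped.
$form = urlencode($input);
echo $form, "\n";   // a+b%7Ec

// RFC 3986-style encoding: space becomes "%20", tilde is left as-is.
$raw = rawurlencode($input);
echo $raw, "\n";    // a%20b~c

// Both forms decode back to the same string with the matching decoder.
var_dump(urldecode($form) === $input && rawurldecode($raw) === $input);
```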

As for EBCDIC, well, the support is broken anyway.
urlencode/rawurlencode should receive the string in UTF-8, since
that's the only way they can give meaningful results. The current
implementation assumes that, on an EBCDIC system, it receives the
string in EBCDIC. But yes, there is an inconsistency between the ASCII
and EBCDIC versions.
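
A small sketch of why the input encoding matters: both functions
percent-encode bytes, not characters, so the same letter yields
different output depending on how it was encoded. Explicit byte escapes
are used here so the example does not depend on the encoding of the
source file; it does not attempt to reproduce the EBCDIC code path.

```php
<?php
// "é" (U+00E9) as UTF-8 is the two bytes C3 A9.
$utf8 = "\xC3\xA9";
// The same letter in ISO-8859-1 is the single byte E9.
$latin1 = "\xE9";

echo rawurlencode($utf8), "\n";    // %C3%A9
echo rawurlencode($latin1), "\n";  // %E9
```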


Previous Comments:
------------------------------------------------------------------------
[2010-10-28 01:20:12] asphp at dsgml dot com

rawurlencode was changed in revision 260750:
http://svn.php.net/viewvc?view=revision&revision=260750

It mentions RFC 3986, but after reading it, it seems to me that ~ should
also be left unencoded in urlencode, not just in rawurlencode.

------------------------------------------------------------------------
[2010-10-28 01:10:43] asphp at dsgml dot com

Description:
------------
In urlencode a ~ (tilde) is escaped.

In rawurlencode, when using ASCII a ~ is NOT escaped, but when using
EBCDIC it IS escaped.

There is no mention of ~ being special in the docs for rawurlencode or
the docs for urlencode.

And if it is special for rawurlencode, it should act the same in ASCII
and EBCDIC.



------------------------------------------------------------------------



-- 
PHP Documentation Bugs Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
