
List:       racket-users
Subject:    [racket] uri-decode and non-UTF-8 percent-encoded links
From:       jay.mccarthy () gmail ! com (Jay McCarthy)
Date:       2011-09-26 16:57:11
Message-ID: CAJYbDakB1SGTBamWq8jiPswY56-+p7TKot=Omuj_BswxQQvetQ () mail ! gmail ! com

There is no way to do that, and I think this is a major flaw throughout
a lot of the Racket net libraries. I have gone to great lengths to use
bytes throughout the Web server to avoid issues like this, but the URL
module is one place where it still hurts.

I think it should be written to use bytes internally and provide the
UTF-8 string versions for compatibility. Unfortunately, I don't have
the time to fix it now.
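
A minimal sketch of that layering, for illustration only and not the
actual net/url or net/uri-codec code. The names percent-decode/bytes,
percent-decode/utf-8, percent-decode/latin-1 and hex-digit-value are
hypothetical. The core works purely on bytes, and the string versions
are thin wrappers where the caller picks the encoding. It assumes the
input contains only ASCII characters outside the %XX escapes.

```racket
#lang racket/base

;; Hypothetical helper: numeric value of an ASCII hex digit byte, or #f.
(define (hex-digit-value b)
  (cond
    [(<= 48 b 57)  (- b 48)]    ; 0-9
    [(<= 65 b 70)  (- b 55)]    ; A-F
    [(<= 97 b 102) (- b 87)]    ; a-f
    [else #f]))

;; Hypothetical bytes-level core: replaces each well-formed %XX with the
;; raw byte and makes no assumption about the character encoding.
;; Malformed escapes are copied through unchanged.
(define (percent-decode/bytes bs)
  (define len (bytes-length bs))
  (define out (open-output-bytes))
  (let loop ([i 0])
    (when (< i len)
      (define hi (and (= (bytes-ref bs i) 37)     ; 37 is #\%
                      (< (+ i 2) len)
                      (hex-digit-value (bytes-ref bs (+ i 1)))))
      (define lo (and hi (hex-digit-value (bytes-ref bs (+ i 2)))))
      (cond
        [lo (write-byte (+ (* 16 hi) lo) out)
            (loop (+ i 3))]
        [else (write-byte (bytes-ref bs i) out)
              (loop (+ i 1))])))
  (get-output-bytes out))

;; String-level wrappers layered on the bytes core.
(define (percent-decode/utf-8 str)
  (bytes->string/utf-8 (percent-decode/bytes (string->bytes/utf-8 str))))

(define (percent-decode/latin-1 str)
  (bytes->string/latin-1 (percent-decode/bytes (string->bytes/latin-1 str))))

(percent-decode/bytes #"Acad%EAmicos")      ; #"Acad\352micos"
(percent-decode/latin-1 "Acad%EAmicos")     ; "Acadêmicos"
(percent-decode/utf-8 "Acad%C3%AAmicos")    ; "Acadêmicos"
```

With this split, only the wrapper that claims UTF-8 can raise the
bytes->string/utf-8 error quoted below. The bytes core accepts any
percent-encoded input.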

Jay

On Sat, Sep 24, 2011 at 3:56 PM, Rodolfo Carvalho <rhcarvalho at gmail.com> wrote:
> Hello,
> I'm running a (simple) web scraper on a page written in iso-8859-1
> (declared in the source using a meta tag).
> The page contains links like this:
> "http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4"
> At one point the code calls:
> (combine-url/relative current-url resource)
> Where current-url is a Racket URL and resource is the aforementioned string.
> I then get the error:
> bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
> #"Acad\352micos"
>
> This seems to be a problem with uri-decode.
> (uri-decode resource)
> bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
> #"http://www.ufrj.br/editais.php?tp=Acad\352micos&no=Cursos&idtp=4"
>
> I looked at the source code of uri-decode and saw that, after decoding the
> percent-encoded string, a call to bytes->string/utf-8 expects the bytes to
> be UTF-8 encoded... but there's no way to tell uri-decode to use a different
> encoding.
> I copied the relevant portion of code from uri-codec-unit.rkt in
> collects/net, and verified that I can change bytes->string/utf-8
> => bytes->string/latin-1 and get it to work... but that's like cheating :)
> AFAICT Chrome and Firefox handle the
> URL "http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4" as
> well as its UTF-8 %-encoded
> equivalent "http://www.ufrj.br/editais.php?tp=Acad%C3%AAmicos&no=Cursos&idtp=4",
> with the difference that the second appears as
> "http://www.ufrj.br/editais.php?tp=Acadêmicos&no=Cursos&idtp=4" (but when
> copied->pasted is still %C3%AA instead of ê).
>
> How could I make uri-decode understand an encoding other than UTF-8?
>
> Thanks,
> Rodolfo Carvalho



-- 
Jay McCarthy <jay at cs.byu.edu>
Assistant Professor / Brigham Young University
http://faculty.cs.byu.edu/~jay

"The glory of God is Intelligence" - D&C 93


