
List:       xml-cocoon-dev
Subject:    Re: [RT] About charsets (character encoding) and servlet API
From:       Bruno Dumon <bruno () outerthought ! org>
Date:       2004-05-30 17:09:17
Message-ID: 1085936956.8093.65.camel () yum

On Sun, 2004-05-30 at 03:08, Pier Fumagalli wrote:
> On 29 May 2004, at 16:11, Antonio Gallardo wrote:
> >
> > I think most of us are using servlet containers with servlet specs 2.3
> > or superior. In that way, I think it is time to move to a higher
> > servlet API specs? I think just this little things are enough.
> 
> I've been doing i18n work on Servlets for a _very_ long time and, dude, 
> I've never seen a problem with the API ever...
> 
> Let's split the problem in three parts: headers and body and URLs:

<snip/>

> 
> -----
> BODY:
> -----
> 
> RFC-2616 is _very_ clear at this point, if you don't specify the 
> charset token in the "Content-Type" header, and you specify (or imply) 
> that the body is "text/something" you SHOULD assume that you're 
> receiving / sending text encoded in ISO-8859-1...
> 
> Again, I seriously don't think that servlet containers check for the 
> encoding of the request body when the content type is 
> "application/x-www-form-urlencoded", because I _suppose_ that given 
> that it doesn't start with "text/..." they ignore the whole shabang...
> 
> So, I believe that in some cases, the encoding of parameters returned 
> by servlet containers MIGHT be wrong (but I ain't sure, haven't checked 
> that lately).
> 
> When you send, on the other hand, the servlet API doesn't have much 
> functionalities until 2.4 to set the charset encoding of the response, 
> but that _really_ affected only stupid JSPs which were never though 
> right anyway...
> 
> In Cocoon (I hope) we should never rely on the "getWriter()" returned 
> by the servlet container but ALWAYS use a "getOutputStream()" and set 
> ALWAYS the content type with the proper "charset" token...
> 
> If we don't we're kinda violating 3.4.1 of RFC-2616 as it says that one 
> SHOULD always put the charset in there (if relevant, of course).

AFAIK we currently don't set it, and that's causing some problems with
certain Tomcat versions, which by default set it to ISO-8859-1.
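
To illustrate the advice quoted above (write bytes yourself and always
put the charset in the content type), here is a minimal JDK-only sketch.
The class and method names are made up for illustration, and a
ByteArrayOutputStream stands in for the stream a servlet response would
hand out; this is not Cocoon's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Cocoon or the servlet API.
final class ExplicitCharsetOutput {

    // Builds a Content-Type value that carries an explicit charset token.
    static String contentType(String mimeType, Charset charset) {
        return mimeType + "; charset=" + charset.name();
    }

    // Encodes the body ourselves instead of relying on a container-chosen Writer.
    static byte[] encodeBody(String body, Charset charset) {
        return body.getBytes(charset);
    }

    public static void main(String[] args) throws IOException {
        Charset charset = StandardCharsets.UTF_8;
        String body = "caf\u00e9"; // "cafe" with an e-acute as the last character

        // Stand-in for the response's output stream (assumption for this sketch).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(encodeBody(body, charset));

        System.out.println("Content-Type: " + contentType("text/html", charset));
        System.out.println("Body length in bytes: " + out.size());
    }
}
```

The point is only that the header and the bytes are produced from the
same charset value, so they cannot disagree.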

> 
> So, the problem is only in reading parameters, and that should be fixed 
> at the servlet container level.

Servlet containers can't do much about it, since browsers don't say
which encoding they send their data in (I know they should, but it seems
they don't). All browsers seem to follow the convention of sending form
data in the same encoding as the page containing the form.
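
To make the problem concrete, here is a small JDK-only sketch of what
happens when the same submitted bytes are decoded with two different
charsets. The class name and the sample value are made up; the code uses
java.net.URLDecoder rather than any container's real parameter parsing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; a real container has its own parameter parser.
final class FormDecodingDemo {

    // Decodes one percent-encoded form value using the given charset name.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // An e-acute submitted from a UTF-8 page arrives as the bytes C3 A9.
        String raw = "caf%C3%A9";

        String asUtf8 = decode(raw, "UTF-8");        // 4 chars, last one is U+00E9
        String asLatin1 = decode(raw, "ISO-8859-1"); // 5 chars, ends in U+00C3 U+00A9

        System.out.println(asUtf8.length());
        System.out.println(asLatin1.length());
    }
}
```

Nothing in the request body says which of the two readings is right,
which is why the container has to be told.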

As Antonio mentioned, with servlet 2.3 it's possible to do something
like

req.setCharacterEncoding("UTF-8");

to make the container decode the parameters in the encoding you want.
The only catch is that this has to be called before any request
parameter is read, which means somewhere at the beginning of the Cocoon
servlet.

This also means that if different parts of your app use different
encodings (because they target applications/devices that don't
understand e.g. UTF-8), there's no easy way to switch the encoding per
part, except by using the decode/encode trick. So we'd need to keep that
system in place anyhow.
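
For reference, the decode/encode trick boils down to something like the
sketch below: turn the wrongly decoded string back into the bytes the
container saw, then decode those bytes with the charset the page really
used. The class and method names are made up, and the charsets in main
are just an example (container assumed ISO-8859-1, page was UTF-8).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the recode idea; not Cocoon's actual code.
final class RecodeDemo {

    // Re-interprets a value that the container decoded with the wrong charset.
    static String recode(String value, Charset containerCharset, Charset actualCharset) {
        byte[] originalBytes = value.getBytes(containerCharset);
        return new String(originalBytes, actualCharset);
    }

    public static void main(String[] args) {
        // What the container hands back when UTF-8 bytes C3 A9 are read as ISO-8859-1.
        String misdecoded = "caf\u00c3\u00a9";

        String fixed = recode(misdecoded,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(fixed.equals("caf\u00e9")); // prints "true"
    }
}
```

This round trip is lossless when the container used ISO-8859-1, since
that charset maps every byte value to exactly one character. The cost is
an extra byte array and an extra String per parameter.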

So basically it comes down to:

* what we have in place today works just fine

* by using req.setCharacterEncoding, we would save some CPU cycles and
some memory by avoiding the need for the recode trick (in most cases).

* but this requires us to make the servlet 2.3 spec the minimum
requirement

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org                          bruno@apache.org
