
List:       pgsql-bugs
Subject:    Re: [BUGS] BUG #5800: "corrupted" error messages (encoding problem
From:       Craig Ringer <ringerc@ringerc.id.au>
Date:       2011-09-29 7:44:37
Message-ID: 4E8421E5.7090407@ringerc.id.au

First, sorry for the slow reply.

Response inline.

On 09/17/2011 08:34 AM, Tom Lane wrote:
> Craig Ringer<ringerc@ringerc.id.au>  writes:
>> On 09/17/2011 05:10 AM, Carlo Curatolo wrote:
>>> Just tried with PG 9.1...same problem...
>
>> Yep. There appears to be no interest in fixing this bug. All the
>> alternatives I proposed were rejected, and there doesn't seem to be any
>> concern about the issue.
>
> The problem is to find a cure that's not worse than the disease.
> I'm not exactly convinced that forcing all log messages into a common
> encoding is a better behavior than allowing backends to log in their
> native database encoding.
>
> If you do want a common encoding, there's a very easy way to get it, ie,
> standardize on one encoding for all your databases.

Even then, the postmaster may still emit messages in a different 
encoding if the system encoding differs from the database encoding you 
standardize on.

> People who aren't
> doing that already probably have good reasons why they want to stay with
> the encoding choices they've made; forcing their logs into some other
> encoding isn't necessarily going to improve their lives.

I'm not convinced.

Mixing messages in several encodings in one log file makes it 
*impossible* for most people to read at all. A file with (say) UTF-8, 
latin-1 and Shift-JIS lines interleaved is hopelessly corrupted as far 
as most readers and tools are concerned: no single encoding decodes 
it, so some lines always come out as decode errors or mojibake. Try it 
and see what I mean. As such, I disagree: forcing all their logs into 
one encoding WILL improve their lives over the current situation, and 
it won't affect people whose databases are all already in the system 
encoding.
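
To see how thoroughly a mixed-encoding file breaks, here's a small 
Python sketch (illustrative only, not PostgreSQL code) that builds a 
fake log whose lines were written by backends using three different 
database encodings, then tries to read it back with a single encoding, 
as any log consumer must:

```python
# Simulate a log file whose lines were emitted by backends
# in three different database encodings.
msgs = [
    ("FEHLER: ungültige Eingabe", "utf-8"),    # UTF-8 backend
    ("ERREUR: entrée invalide", "latin-1"),    # latin-1 backend
    ("エラー: 無効な入力", "shift-jis"),        # Shift-JIS backend
]
log = b"\n".join(text.encode(enc) for text, enc in msgs)

# A log reader has to pick ONE encoding for the whole file; whichever
# it picks, at least one line fails to decode or comes out as mojibake.
for enc in ("utf-8", "latin-1", "shift-jis"):
    try:
        decoded = log.decode(enc)
        print(enc, "-> decodes, but garbles the non-matching lines:")
        print("   ", decoded.splitlines())
    except UnicodeDecodeError as e:
        print(enc, "-> fails outright:", e.reason)
```

Reading it as UTF-8 raises a decode error on the latin-1 bytes; 
reading it as latin-1 "succeeds" (latin-1 maps every byte) but turns 
the UTF-8 and Shift-JIS lines into gibberish.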

In any case, if the system uses a utf8 encoding and the databases are 
latin-1 (for example) the admin might actually prefer to have utf8 logs 
for easy reading and processing by system tools, no matter what encoding 
the databases are in.

The database encoding is an internal detail; the log encoding is an 
external interface. Writing messages to stdout/stderr in an encoding 
other than the one specified by LC_CTYPE and LC_MESSAGES is wrong 
because it shows garbage on a terminal; IMO logging to a file in a 
different encoding is wrong for the same reason.

Because there's no standard way to flag a file as being in a certain 
encoding, I contend that the correct default is to write log files in 
the system's default encoding: that is what programs consuming the 
logs will expect. The only other correct alternative would be to write 
UTF-8 logs with a BOM that lets programs unambiguously identify the 
encoding. That said, users should probably be able to override the 
default encoding Pg writes in, and/or redirect a particular database's 
logs to a separate file in a user-defined encoding.


>> ... The only valid fixes are to log them to different files (with some
>> way to identify which encoding is used)
>
> I don't recall having heard any serious discussion of such a design, but
> perhaps doing that would satisfy some use-cases.  One idea that comes to
> mind is to provide a %-escape for log_filename that expands to the name
> of the database encoding (or more likely, some suitable abbrevation).
> The logging collector protocol would have to be expanded to include that
> information, but that seems do-able.

That'd work, though it doesn't solve the problem for people logging to 
syslog or to a single file.

I think Pg should also be able to convert all messages into a common 
encoding for logging to a single file and should default to using the 
system encoding as that encoding.

The user could configure a different encoding - for example, they might 
want to force utf-8 logging because their databases may have all sorts 
of different encodings, but they're logging to syslog so they can't 
split logs out to different files.

A special log destination encoding name, say "log_encoding = database" 
could be used to bypass all encoding conversion, retaining the current 
behaviour of logging in whatever encoding the database happens to use.
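
To make the proposal concrete, here's a hypothetical postgresql.conf 
sketch. Neither "log_encoding" nor the "%e" escape exists today; both 
names are illustrative only:

```ini
# Convert all log messages to one encoding before writing them out.
# The default would be the system (LC_CTYPE) encoding.
log_encoding = 'UTF8'

# Special value bypassing all conversion, i.e. today's behaviour:
#log_encoding = 'database'

# With the %-escape Tom suggested (here assumed to be %e, expanding
# to the database encoding name), per-encoding log files would be:
#log_filename = 'postgresql-%Y-%m-%d_%e.log'
```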

I'm willing to implement this setup (or try, at least) if you think it's 
a reasonable thing to do. I don't know how I'll go with multi-file 
logging in log_filename, but I'm pretty sure I can handle the log 
message encoding conversion and associated configuration directives.

There's some overhead to encoding conversion, but it's pretty minimal, 
and it can be avoided entirely by making the log destination encoding 
match the database encoding: under this scheme, set "log_encoding = 
database" and either stick to one database encoding or use multi-file 
logging.
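
The conversion step itself is simple. A rough Python sketch of what 
the server-side logic would do (illustrative only; the function name, 
the "database" bypass value, and the use of replacement characters for 
unmappable bytes are all my assumptions, not existing Pg behaviour):

```python
def recode_for_log(raw: bytes, db_encoding: str, log_encoding: str) -> bytes:
    """Recode a backend's message from its database encoding into the
    configured log encoding, substituting a replacement character
    where no mapping exists."""
    if log_encoding == "database":      # hypothetical bypass value:
        return raw                      # no conversion, current behaviour
    text = raw.decode(db_encoding, errors="replace")
    return text.encode(log_encoding, errors="replace")

# A latin-1 backend message recoded for a UTF-8 log file:
msg = "ERREUR: entrée invalide".encode("latin-1")
print(recode_for_log(msg, "latin-1", "utf-8"))      # recoded to UTF-8
print(recode_for_log(msg, "latin-1", "database"))   # raw bytes unchanged
```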

Reasonable plan?

--
Craig Ringer

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs