List:       postfix-users
Subject:    Re: Message encoding by guessing
From:       Viktor Dukhovni <postfix-users () dukhovni ! org>
Date:       2020-02-09 11:19:22
Message-ID: 20200209111922.GL49778 () straasha ! imrryr ! org

On Sun, Feb 09, 2020 at 01:45:21PM +0300, wesley@199903.xyz wrote:

> How can I guess the message body's character encoding if the message
> doesn't have a MIME charset set?  The message may be encoded with utf8,
> gb2312, gbk or something else, but it doesn't have a charset header.

You could run the text through "iconv -f <take-a-guess>", and
see what comes out.
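
A minimal sketch of the same trial-and-error idea in Python, assuming the
candidate list below (the function name, the candidates and their order are
my own choices for illustration, not anything Postfix or iconv provides).
Strict UTF-8 is tried first because it rejects most non-UTF-8 input;
gb18030 is used because it is a superset of gbk and gb2312.

```python
def guess_decode(raw: bytes, candidates=("utf-8", "gb18030", "cp1252")):
    """Return (encoding_name, text) for the first candidate that decodes
    without error, or (None, None) if none of them does.

    A successful decode only means the bytes are *valid* in that encoding,
    not that the guess is right, so the result still needs a sanity check.
    """
    for name in candidates:
        try:
            return name, raw.decode(name)  # strict error handling by default
        except UnicodeDecodeError:
            continue
    return None, None


if __name__ == "__main__":
    sample = "中文".encode("gbk")  # bytes with no charset label attached
    print(guess_decode(sample))
```

The order matters: a permissive single-byte encoding accepts almost any
input, so it belongs at the end of the list.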

For valid (correctly minimally encoded) utf-8:

    https://en.wikipedia.org/wiki/UTF-8#Description

every non-ascii character sequence starts with an initial byte that is
in the range: 

        0b11000010 (0xc2 hex, 194 decimal) through
        0b11110100 (0xf4 hex, 244 decimal)

and continues with more bytes that are all in the range

    0b10xxxxxx (0x80 through 0xbf hex, 128 through 191 decimal)

the number of bytes in each group (including the initial byte) is
equal to the number of consecutive 1 bits starting with the MSB
in the first byte.
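
The byte-pattern rule above can be sketched as a small Python check.  The
function name is hypothetical, and this tests only the structure described
here (lead byte range, continuation byte range, group length).  It does
not catch every invalid sequence, for example encoded surrogates or
3-byte overlong forms; a strict raw.decode("utf-8") covers those as well.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural UTF-8 check: lead bytes 0xC2..0xF4, continuation bytes
    0x80..0xBF, group length given by the lead byte's leading 1 bits."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                 # ASCII, a group of one byte
            i += 1
            continue
        if not 0xC2 <= b <= 0xF4:    # not a valid initial byte
            return False
        if b >= 0xF0:                # 11110xxx: four bytes in the group
            length = 4
        elif b >= 0xE0:              # 1110xxxx: three bytes
            length = 3
        else:                        # 110xxxxx: two bytes
            length = 2
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:  # input ends in the middle of a group
            return False
        if any(not 0x80 <= c <= 0xBF for c in tail):
            return False
        i += length
    return True


if __name__ == "__main__":
    print(looks_like_utf8("中文".encode("utf-8")))  # UTF-8 input
    print(looks_like_utf8("中文".encode("gbk")))    # GBK input
```

Pure ASCII also passes, which is fine since ASCII is a subset of UTF-8.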

For some random other code page, good luck!  But Windows-1252 is pretty
common for things mostly in the Latin alphabet.

-- 
    Viktor.