List: postfix-users
Subject: Re: Message encoding by guessing
From: Viktor Dukhovni <postfix-users () dukhovni ! org>
Date: 2020-02-09 11:19:22
Message-ID: 20200209111922.GL49778 () straasha ! imrryr ! org
On Sun, Feb 09, 2020 at 01:45:21PM +0300, wesley@199903.xyz wrote:
> How to guess the message body's language encoding if the message
> doesn't have a MIME charset set? The message may be encoded with utf8,
> gb2312, gbk or something else, but it doesn't have a charset header.
You could run the text through "iconv -f <take-a-guess>", and
see what comes out.
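
The same trial-and-error idea can be sketched in Python instead of iconv. This is only an illustration: the helper name try_decodings and the candidate list (taken from the encodings named in the question) are assumptions, and more than one candidate may decode without error, so a human or a further heuristic still has to pick the result that reads sensibly.

```python
def try_decodings(raw: bytes, candidates=("utf-8", "gb2312", "gbk")):
    """Return {encoding: decoded text} for each candidate that decodes cleanly.

    Hypothetical helper: the candidate list is a guess, not a complete set.
    """
    results = {}
    for name in candidates:
        try:
            # Strict mode (the default) raises on byte sequences that are
            # invalid in the candidate encoding.
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This guess does not fit the bytes; move on to the next one.
            pass
    return results


if __name__ == "__main__":
    body = "你好".encode("gbk")  # stand-in for an unlabeled message body
    for encoding, text in try_decodings(body).items():
        print(encoding, repr(text))
```

A clean decode only means the bytes are legal in that encoding, not that the guess is right.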

For valid (correctly minimally encoded) UTF-8:

    https://en.wikipedia.org/wiki/UTF-8#Description

every non-ASCII character sequence starts with an initial byte that is
in the range:

    0b11000010 -- 0xc2 hex or 194 decimal, through:
    0b11110100 -- 0xf4 hex or 244 decimal

and continues with more bytes that are all in the range:

    0b10xxxxxx -- 0x80--0xbf hex or 128--191 decimal

The number of bytes in each group (including the initial byte) is
equal to the number of consecutive 1 bits starting with the MSB
in the first byte.

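
A minimal sketch of the byte-level checks just described, assuming the message body is available as a Python bytes object. The function name looks_like_utf8 is made up for illustration. It implements only the rules listed above (lead byte range, continuation byte range, group length), so it is slightly more permissive than a full decoder: a strict raw.decode("utf-8") also rejects a few remaining cases such as encoded surrogates.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Check lead-byte range, continuation-byte range and group length."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte, always acceptable.
            i += 1
            continue
        if not 0xC2 <= b <= 0xF4:
            # Not a valid initial byte (stray continuation byte, 0xc0/0xc1,
            # or anything above 0xf4).
            return False
        # Group length equals the number of leading 1 bits in the lead byte:
        # 110xxxxx -> 2 bytes, 1110xxxx -> 3 bytes, 11110xxx -> 4 bytes.
        length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            # Group is truncated at the end of the data.
            return False
        if any(not 0x80 <= c <= 0xBF for c in tail):
            # Every following byte must look like 0b10xxxxxx.
            return False
        i += length
    return True


if __name__ == "__main__":
    print(looks_like_utf8("héllo 你好".encode("utf-8")))
    print(looks_like_utf8("你好".encode("gbk")))
```

Pure ASCII input passes trivially, so a True result is only informative when the body actually contains bytes above 0x7f.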
For some random other code page, good luck! But Windows-1252 is pretty
common for things mostly in the Latin alphabet.

--
Viktor.