List: intlwiki-l
Subject: Re: [Intlwiki-l] Per-page statistics
From: Ray Saintonge <saintonge () telus ! net>
Date: 2003-12-27 0:02:01
Ramanan Selvaratnam wrote:
>>We can, however, measure the relative sizes of two text files.
>>
>If I understood this correctly, UTF-8 would prove to be a challenge and
>would have to be accounted for.
>
>http://www.unicode.org/faq/unicode_web.html#14
>
>I understand that some have arrived at a figure of 2.5:1 for UTF-8
>encoded file size as opposed to the 8-bit Latin-1 like encodings Tamil
>has. More tests on Wikipedia content and arriving at a more appropriate
>ratio should not be too much work to help this route in the case of
>Tamil.
>
That was an excellent reference. Though neither English nor Tamil
appeared in the list, the principles remain the same. I further support
the idea that all the Wikis, including the English one, should be UTF-8
encoded, but I recognize that there may be some difficulty in
accomplishing that.

The data from your example, with French at 10506 and simplified
Chinese at 8882, give a ratio of 8882/10506 = 0.8454. Thus a French
text of 5000 bytes should be statistically expected to have an
equivalent simplified Chinese text of 5000 * 0.8454 = 4227 bytes. We
would naturally make allowances for measures of statistical significance.
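
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `utf8_size` and `expected_size` are made up for this example, and the two reference figures are simply the numbers quoted above for French and simplified Chinese.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def expected_size(source_bytes: int, source_ref: int, target_ref: int) -> int:
    """Scale a byte count by the ratio of two reference sizes.

    source_ref and target_ref are the sizes of the same reference text
    in the source and target languages (hypothetical parameter names).
    """
    ratio = target_ref / source_ref
    return round(source_bytes * ratio)


# Reference figures quoted in the message above.
FRENCH_REF = 10506
CHINESE_REF = 8882

ratio = CHINESE_REF / FRENCH_REF
print(f"ratio = {ratio:.4f}")                          # ratio = 0.8454
print(expected_size(5000, FRENCH_REF, CHINESE_REF))    # 4227
```

The same function would work for any pair of languages once a reference ratio for that pair has been measured; `utf8_size` is one way to obtain the byte counts to feed into it.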
Ec
_______________________________________________
Intlwiki-l mailing list
Intlwiki-l@wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/intlwiki-l