
List:       intlwiki-l
Subject:    Re: [Intlwiki-l] Per-page statistics
From:       Ray Saintonge <saintonge () telus ! net>
Date:       2003-12-27 0:02:01

Ramanan Selvaratnam wrote:

>>We can, however, measure the relative sizes of two text files. 
>>
>If I understood this correctly, UTF-8 would prove to be a challenge and
>would have to be accounted for.
>
>http://www.unicode.org/faq/unicode_web.html#14
>
>I understand that some have arrived at a figure of 2.5:1 for UTF-8
>encoded file size as opposed to the 8-bit Latin-1-like encodings Tamil
>has. Running more tests on Wikipedia content to arrive at a more
>appropriate ratio should not be too much work, and would help this
>route in the case of Tamil.
>
That was an excellent reference. Though neither English nor Tamil 
appeared in the list, the principles remain the same.  I further support 
the idea that all the Wikis, including the English one, should be UTF-8 
encoded, but I recognize that there may be some difficulty in 
accomplishing that.
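
As a rough way to test the 2.5:1 figure quoted above, something like 
the following Python sketch could be run over sample text.  It assumes 
(and this is only an approximation) that the 8-bit encoding stores one 
byte per Unicode code point; a real Tamil 8-bit encoding may not map 
that neatly.  The function name and the sample string are mine, purely 
for illustration.

```python
def utf8_to_8bit_ratio(text):
    """Return UTF-8 byte count divided by an assumed 8-bit byte count."""
    if not text:
        raise ValueError("text must not be empty")
    utf8_bytes = len(text.encode("utf-8"))
    # ASSUMPTION: one byte per code point in the 8-bit encoding.
    eight_bit_bytes = len(text)
    return utf8_bytes / eight_bit_bytes


# Hypothetical sample: five Tamil code points, a space, four ASCII letters.
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd wiki"
print(utf8_to_8bit_ratio(sample))
```

Pure Tamil text comes out at 3:1 under this assumption, since each 
Tamil code point takes three bytes in UTF-8; spaces, digits and ASCII 
punctuation take one byte each and pull the figure down.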

Using the data from your example, French at 10506 and simplified 
Chinese at 8882 give a ratio of 8882/10506 = 0.8454.  Thus a French 
text of 5000 bytes would statistically be expected to correspond to a 
simplified Chinese text of about 5000 * 0.8454 = 4227 bytes.  We would 
naturally make allowances for measures of statistical significance.
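
The arithmetic above can be sketched in a few lines of Python.  This 
is only an illustration of the scaling; the function and parameter 
names are my own, and the two reference sizes are the French and 
simplified Chinese figures from the cited example.

```python
def expected_size(source_bytes, source_ref, target_ref):
    """Scale a source-language size by the ratio of two reference sizes."""
    ratio = target_ref / source_ref
    return ratio, round(source_bytes * ratio)


# French reference 10506, simplified Chinese reference 8882,
# applied to a French text of 5000 bytes.
ratio, size = expected_size(5000, 10506, 8882)
print(f"ratio = {ratio:.4f}, expected size = {size} bytes")
```

The result is a point estimate only; it says nothing yet about how far 
an individual article might reasonably stray from it.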

Ec




_______________________________________________
Intlwiki-l mailing list
Intlwiki-l@wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/intlwiki-l