
List:       intlwiki-l
Subject:    Re: [Intlwiki-l] Stats for 2 Apr 2002
From:       Tomasz Wegrzanowski <taw () users ! sourceforge ! net>
Date:       2002-04-02 19:12:06

On Tue, Apr 02, 2002 at 09:02:34PM +0200, Lars Aronsson wrote:
> Hi Tomasz,
> 
> > Could you think of some better measure of Wikipedia size than comma
> > count? I think that compressing tarballs with some one-big-block BWT
> > or high-order PPM might be a good measure (as it will find all
> > templates, duplicated data, differences between languages etc.).
> > For now, mere bzip2 would be enough.
> 
> Compressed tarball sizes sound like a good measure to me.  Would that
> include previous versions of each page too?
> Can you get a tarball from the Sevilla wiki?

Certainly you can (http://www.forpas.us.es/enciclopedia/tar/wiki.tar.gz)

A compressed tarball is a pretty good measure, but:
* it does contain old pages
* the Polish tarball is twice as big as it should be due to "technical issues"
* current compression techniques fall far short of the true entropy,
  and the gap might be significantly bigger for some languages than
  for others. But I'm not sure how important this issue is.
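
To make the measure concrete, here is a rough Python sketch of comparing
compressed sizes for the same data. It is only an illustration under some
assumptions: the function name compressed_sizes is made up for this example,
standard bzip2 works on blocks of at most 900 kB (so it is not the
one-big-block BWT mentioned above), and xz/LZMA is used merely as a
stronger stand-in, not as a PPM coder. All of these are upper bounds on the
true entropy, not the entropy itself.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of `data` under a few stdlib compressors.

    Each figure is only an upper bound on the information content;
    none of these compressors reaches the true entropy of the text.
    """
    return {
        "raw": len(data),
        # DEFLATE, the algorithm behind gzip (without the gzip file header)
        "deflate": len(zlib.compress(data, 9)),
        # bzip2: BWT-based, but limited to blocks of at most 900 kB
        "bzip2": len(bz2.compress(data, 9)),
        # xz/LZMA: stand-in for a stronger compressor (not PPM)
        "xz": len(lzma.compress(data, preset=9)),
    }


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the uncompressed tarball bytes.
    sample = b"== Article ==\nSome template text, repeated.\n" * 5000
    for name, size in compressed_sizes(sample).items():
        print(f"{name:8s} {size}")
```

To apply this to a real dump, the .tar.gz would first have to be
decompressed, so that every wiki is measured from the same raw tar format.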
