'Re: [Q]:MG usable for HUGE search engine?'


List:       managing-gigabytes
Subject:    Re: [Q]:MG usable for HUGE search engine?
From:       William Webber <wew () triton ! kbs ! citri ! edu ! au>
Date:       1997-12-28 0:13:06


Although I do programming for it, I've never had to use mg to
actually manage a textbase of any size, so these are more my
opinions than hard facts...

   How well suited is the MG library for such an engine, excluding hardware and
   traffic issues?

A relevant issue is that mg really isn't a library, but a suite
of programs.  It does not provide a clean API that is easy to use
or modify.  You either have to do a substantial amount of
program rewriting to fit mg into your particular application, or
else accept what the mg programs provide (basically, flat text
output) and rework it.  The latter could have performance
implications. 

   Can it do a timely search on millions of possible documents?

Pretty much.  Mg routinely handles collections of a few gigabytes
(of uncompressed text) in size, and provides responses to queries 'in
real time' (depending somewhat on the complexity of the query).
(What does 'in real time' mean?  I don't know, I just heard
someone use the expression once and it seemed felicitous.  Here I
mean a few seconds or less.)

   I know MG is being used to search lots of data but I am curious
   as to the physical limits as to database size

Well, I guess the only hard limit is that mg uses 32 bits for
measurements, counters, file offsets and such-like.  I think
you'll first hit this limit in building the inverted file, in
which bit (not byte) offsets into the file need to be kept.  This
means that the inverted file can't be longer than 2^32 bits, or 512MB,
and since the inverted file takes up something like 5% of the
uncompressed text size, this places approximately a 10GB limit on
the uncompressed size of the text to be indexed (very
approximately).  If I understand my own code correctly
(doubtful), then I believe this won't be the case with mg-1.2.1
on 64-bit machines (mg-1.2 doesn't work at all on 64-bit machines).
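
The arithmetic behind that estimate can be checked with a short sketch.
It assumes, as described above, that offsets are 32 bits wide and count
bits rather than bytes, and that the inverted file is roughly 5% of the
uncompressed text.  The function and constant names are made up for this
illustration and are not taken from the mg source.

```python
# Back-of-the-envelope check of the 32-bit limit described above.
# OFFSET_BITS and INDEX_PERCENT are the rough figures from the message,
# not values read out of the mg source.

OFFSET_BITS = 32     # width of a bit offset into the inverted file
INDEX_PERCENT = 5    # inverted file size as a percentage of the raw text


def max_inverted_file_bytes(offset_bits: int = OFFSET_BITS) -> int:
    """Largest file addressable when offsets count bits, not bytes."""
    return (2 ** offset_bits) // 8


def max_text_bytes(offset_bits: int = OFFSET_BITS,
                   index_percent: int = INDEX_PERCENT) -> int:
    """Raw text size whose inverted file just fits under the offset limit."""
    return max_inverted_file_bytes(offset_bits) * 100 // index_percent


if __name__ == "__main__":
    mib = 2 ** 20
    gib = 2 ** 30
    print(max_inverted_file_bytes() / mib)  # 512.0
    print(max_text_bytes() / gib)           # 10.0
```

Since the 5% ratio varies from collection to collection, the 10GB
figure should be read as an order of magnitude, not a hard boundary.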

   Any insight that can be provided would be wonderful...
   I am interested in hardware configurations and such as well...

To do the indexing, a machine with lots of RAM.  For fast
querying, fast disks.  

   Thanks for any input you can provide!

Another issue is that mg is really designed to be used with
static, rather than dynamic, text collections, i.e. collections
that you index once then don't have to add new documents to.  The
mgmerge program provides some of this functionality, but
fundamentally the limitation lies in the compression techniques
used on the indexes (we can achieve better compression by
parameterizing our codes on the collection as a whole, but once
we add to an existing collection these parameters change and we
have to recode the indexes).  For research purposes (what mg is
used for at MDS), this isn't an issue; for real-world uses, it
may be.  
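
To illustrate why collection-wide parameters make appending awkward,
here is a minimal sketch using a Golomb code for document gaps.  The
choice of Golomb coding and the rule of thumb b ~ ln(2) * N / f_t for
its parameter are assumptions made for this example; the message does
not say which codes mg uses, and the function names are hypothetical.

```python
import math


def golomb_parameter(num_docs: int, doc_freq: int) -> int:
    """Pick b from collection-wide statistics (assumed rule of thumb)."""
    return max(1, math.ceil(math.log(2) * num_docs / doc_freq))


def golomb_encode(gap: int, b: int) -> str:
    """Encode a gap >= 1 as a bit string: unary quotient, then remainder."""
    if gap < 1 or b < 1:
        raise ValueError("gap and b must both be >= 1")
    q, r = divmod(gap - 1, b)
    bits = "1" * q + "0"            # quotient in unary
    if b == 1:
        return bits                 # remainder is always 0, no bits needed
    k = (b - 1).bit_length()        # ceil(log2(b)) for b >= 2
    u = (1 << k) - b                # number of short (k-1 bit) remainders
    if r < u:
        bits += format(r, f"0{k - 1}b")
    else:
        bits += format(r + u, f"0{k}b")
    return bits


if __name__ == "__main__":
    # Same term (in 100 documents), same gap, but the collection doubles.
    b_before = golomb_parameter(num_docs=1000, doc_freq=100)
    b_after = golomb_parameter(num_docs=2000, doc_freq=100)
    print(b_before, golomb_encode(5, b_before))  # 7 0101
    print(b_after, golomb_encode(5, b_after))    # 14 00110
```

The same gap is stored as a different bit string once the collection
size changes, so postings written under the old parameter can no longer
be decoded with the new one without recoding.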

HTH

William Webber
---
William Webber    Multimedia Database Systems, RMIT, Melbourne, Australia
wew@triton.mds.rmit.edu.au                   Food.  Shelter.  Source code.
  "'This tree is certainly good for nothing,' said Tzu Chi. 'This
   is why it has grown so large.  Ah-ha!  This is the sort of
   uselessness that sages live by.'" --- _The_Book_of_Chuang_Tzu_
