'Re: Full Text Indexing for kmail'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kmail-devel
Subject:    Re: Full Text Indexing for kmail
From:       Don Sanders <sanders () kde ! org>
Date:       2005-03-17 2:34:24
Message-ID: 200503171234.25116.sanders () kde ! org
[Download RAW message or body]

On Thursday 17 March 2005 01:57, Luís Pedro Coelho wrote:
> On Wednesday 16 March 2005 07:29, Don Sanders wrote:
> > > Longer answer: It can cheat on this. I have a mode where the
> > > indexer saves a complete copy of the text. Then if you search
> > > for "hot dog", it retrieves all the documents where hot and dog
> > > appear and then searches in them for "hot dog." Not really a
> > > bad approach, actually, but a tighter integration with kmail
> > > might remove the need to keep a copy.
> >
> > Perhaps in the index you store the id (serial number) of the
> > document (email) each term (word) is in. If so another
> > alternative is to store the document id and the index of the term
> > (word) within the document.
> >
> > Does that make sense?
>
> Yes, it does.
>
> I just store the id of each document the word is in. I do not store
> the index inside the document since I believe that would more or
> less double the index size.
> Keeping a copy of the text is actually a very workable solution. It
> is fast enough and the space requirements aren't that big of a
> problem in email (the whole index is _so much smaller_ than the
> original files).

My concern is that searches for a phrase like 'red hat' could be slow 
this way.

I think glimpse uses the approach you are recommending but most 
everyone else uses word indexing. (But I haven't verified this).

Basically what matters in the end though is real world performance. 
That's really what I care about, it it works ok then I don't care so 
much about how it's implemented.

> > > > How about, aah, wildcard searches
> > > > like "index*"
> > >
> > > Yup. It now always does search* to allow as-you-type searching
> > > (it's almost fast enough)
> > >
> > > > how about "*index"?
> > >
> > > Nope. This is much harder with the current index.
> >
> > I see. That is a bit disappointing also I think. I guess Sam's
> > indexer uses up memory because it loads all the terms into memory
> > so that it can support this kind of search.
>
> There is a way to do *index, but not *index* (this is much, much
> harder: I don't think any known index based on terms is capable of
> this. Suffix trees can, but they take a lot more space).
>
> Right now all the terms are in an array. There is another array
> which indexes this one according to the lexicographical order,
> supporting binary search. To find "index*", one needs to look at
> all terms between "index" and "indexZZZ". To support *index one
> could index the strings as if they were reversed (so, first all the
> string that _end_ in A, then the strings that _end_ in B ...) and
> use the same technique.

Understood. I concur.

Hopefully if Ingo and the other developers are reading they will now 
be aware of this limitation of full text indices. *term* searches are 
intractable. If someone else wishes to comment (and e.g. say 'this 
doesn't seem like a big deal to me, don't worry about it') please do 
so.

If there are no objections in the next few days I think it's safe to 
accept this limitation and move on.

Perhaps when a single term is entered, *term or term* could be 
searched for, that would be somewhat close to being consistent with 
what we have now.

I don't want to get hung up on details like this, I think what is 
important is that we are aware that the limitation that exist.

> > Would it be possible for someone to modify your code so that it
> > can iterate through all terms that have been indexed and create a
> > list of matching terms?
>
> I don't really understand this.

I guess to handle *term* searches one could iterate through every term 
in the term array, and match them one by one, (yes I know it's 
yucky), accumulating a list of documents that match, if more than say 
10,000 documents match some kind of exceptional 'too many documents 
match' result could be returned.

I mean *a* is normally going to be a waste of time to search for, 
because too many documents are going to match. But *ccumulat* might 
be a somewhat sensible thing to search for, though admittedly it 
could be very slow to search over all the terms in the term array 
matching them individually.

Any comment? (Is it feasible, is it worth doing?)

> > Ok, if you would like to post some code that would be nice.
>
> I will try to have something working by the weekend and will post
> it somewhere.

Ok. Whatever suits you. We are just at the beginning of a new 
development cycle so your timing is pretty much perfect.

Don.

(BTW just to reiterate I'm much more concerned with the amount of 
main/core/RAM memory used than how much disk space is used). 
_______________________________________________
KMail developers mailing list
KMail-devel@kde.org
https://mail.kde.org/mailman/listinfo/kmail-devel
[prev in list] [next in list] [prev in thread] [next in thread]