
List:       kinosearch
Subject:    [KinoSearch] After search, which field/s scored highest
From:       peter () peknet ! com (Peter Karman)
Date:       2006-11-15 11:21:48
Message-ID: 455B65E4.3090308 () peknet ! com



Marvin Humphrey scribbled on 11/14/06 10:33 PM:

> 
> Thanks, this clears most everything up.  I have one further question.  
> Are all like properties grouped together, i.e. are all values for "date" 
> stored consecutively on disk?  Or are individual documents stored as 
> units so that the values of the "date" property are scattered?
> 

By file. They're written sequentially during indexing. (I had to ask Bill about 
that since I didn't know.)


> Turns out, Sort::External actually would help you for scaling up the 
> pre-sorting of properties to an arbitrary number of documents.  I see 
> that you're using qsort() right now, so you're limited by available RAM 
> divided by the average size of each property value.  Sort::External 
> would remove that constraint.

cool. I'll look at it more closely.
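
For reference, the general technique behind an external sort can be sketched as
below. This is a minimal illustration in Python, not Sort::External's API: the
function name external_sort and the max_in_memory parameter are made up for the
example, and it assumes the property values are strings with no embedded
newlines. Sorted runs are spilled to temp files and then merged, so memory use
is bounded by the run size rather than by the total number of values.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory=1000):
    """Yield the strings in `values` in sorted order while holding at most
    `max_in_memory` of them in RAM at once (hypothetical helper)."""
    run_paths = []
    buffer = []

    def flush():
        # Sort the in-memory run and spill it to its own temp file.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as fh:
            for value in buffer:
                fh.write(value + "\n")
        run_paths.append(path)
        buffer.clear()

    for value in values:
        buffer.append(value)
        if len(buffer) >= max_in_memory:
            flush()
    if buffer:
        flush()

    # Merge the sorted runs lazily; only one line per run is held at a time.
    files = [open(p, encoding="utf-8", newline="\n") for p in run_paths]
    try:
        runs = [(line.rstrip("\n") for line in fh) for fh in files]
        for value in heapq.merge(*runs):
            yield value
    finally:
        for fh in files:
            fh.close()
        for path in run_paths:
            os.remove(path)
```

Calling list(external_sort(props, max_in_memory=2)) gives the same order as
sorted(props), just without needing all of props in memory during the sort.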

> 
> Without that pre-sort, it looks like you have one random disk access per 
> hit at search time to do sorting, which is a problem with large 
> collections.  With the pre-sort, it looks like you're using the cached 
> integer array, which is cool -- that's how I intend to implement the sort 
> cache in KS.

ya, that pre-sort really speeds everything up considerably.
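
A rough sketch of the pre-sort idea, in Python. It is not the actual Swish-e or
KS code; build_sort_cache and sort_hits are hypothetical names, and it assumes
doc ids are dense integers starting at 0. At index time each document gets an
integer rank for the property; at search time hits are ordered by looking up
that rank in memory, with no disk access per hit.

```python
from array import array


def build_sort_cache(values):
    """values[doc_id] is the property value for that document.
    Return an integer array where cache[doc_id] is the value's sort rank.
    Documents with equal values share a rank."""
    order = sorted(range(len(values)), key=lambda doc_id: values[doc_id])
    cache = array("l", [0]) * len(values)
    rank = -1
    previous = object()  # sentinel that is unequal to any real value
    for doc_id in order:
        if values[doc_id] != previous:
            rank += 1
            previous = values[doc_id]
        cache[doc_id] = rank
    return cache


def sort_hits(hit_doc_ids, cache):
    """Order search hits by the pre-computed rank (in-memory lookups only)."""
    return sorted(hit_doc_ids, key=cache.__getitem__)


if __name__ == "__main__":
    dates = ["2006-11-15", "2006-01-02", "2006-11-14", "2006-01-02"]
    cache = build_sort_cache(dates)
    print(sort_hits([0, 2, 3], cache))
```

The expensive part is the one-time sort in build_sort_cache, which is where an
external sort would come in for large collections.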



>> I'm growing to like the Xapian API for 'posting', 'term', 'data' and 
>> 'value'.
>> http://xapian.org/docs/apidoc/html/classXapian_1_1Document.html
> 
> Term is usefully and precisely defined in the KS/Lucene universe.
> 
> Posting is one Term indexing one document one time.  That's how Xapian 
> and everybody I know of in IR defines it.

hm. maybe I've misunderstood term and posting. In the Xapian world, I thought 
the only difference was that a posting had positional info. Or is that what 
you're saying?

> 
> In between those two, there is a concept missing a label: one term 
> indexing one document several times (several postings).  I sometimes 
> call that a TermDoc.

In Xapian I add_posting() multiple times for the same token, using different 
prefixes (MetaNames). Is that similar to what you're talking about?
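
To make the three levels in Marvin's description concrete, here is a toy
positional inverted index in Python. It is only an illustration of the
vocabulary, not how KS, Lucene, or Xapian store things; build_index and the
whitespace tokenizer are assumptions for the example. Each key is a term, each
(term, doc, position) triple is a posting, and the list of positions for one
term in one document is the grouping Marvin calls a TermDoc.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps doc_id -> text.
    Return {term: {doc_id: [positions]}} using naive whitespace tokenizing."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            # One posting: this term indexing this document one time.
            index[term][doc_id].append(position)
    return index


if __name__ == "__main__":
    docs = {1: "the cat saw the dog", 2: "the end"}
    index = build_index(docs)
    # The "TermDoc" for term "the" in document 1 holds two postings.
    print(index["the"][1])
    # Total postings for the term "the" across all documents.
    print(sum(len(positions) for positions in index["the"].values()))
```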


> 
> "data" and "value"... Hmm, those are nebulosities.  They need better 
> names.  It looks like "data" in Xapian is the unadulterated string 
> prior to tokenizing...
> 
> [ ... time passes ... ]
> 
> Jeez, I've been browsing the Xapian docs for 15 minutes and I still 
> can't figure out wtf Xapian thinks a "value" is.
> 

you'll notice I said "growing to like". my devel notes tell me that I think a 
'value' is like a Swish Property: you can sort by value, and each value needs a 
unique int id.

'data' is similar, but can't be sorted on.

You can store whatever you want in either field, I believe. But yes, 
unadulterated strings (prior to tokenizing) are the most likely contents, I 
would think.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
