List: kinosearch
Subject: [KinoSearch] After search, which field/s scored highest
From: peter () peknet ! com (Peter Karman)
Date: 2006-11-15 11:21:48
Message-ID: 455B65E4.3090308 () peknet ! com
Marvin Humphrey scribbled on 11/14/06 10:33 PM:
>
> Thanks, this clears most everything up. I have one further question.
> Are all like properties grouped together, i.e. are all values for "date"
> stored consecutively on disk? Or are individual documents stored as
> units so that the values of the "date" property are scattered?
>
By file. They're written sequentially during indexing. (I had to ask Bill about
that since I didn't know.)
> Turns out, Sort::External actually would help you for scaling up the
> pre-sorting of properties to an arbitrary number of documents. I see
> that you're using qsort() right now, so you're limited by available ram
> divided by the average size of each property value. Sort::External
> would remove that constraint.
cool. I'll look at it more closely.
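To illustrate the constraint Marvin describes: an in-memory qsort() needs every property value in RAM at once, while an external sort keeps only a bounded buffer in memory and spills sorted runs to disk. The following is a minimal Python sketch of that general technique, not Sort::External's actual implementation. The function name `external_sort` and the `run_size` parameter are made up for illustration, and values are assumed to be strings without embedded newlines.

```python
import heapq
import os
import tempfile


def external_sort(values, run_size=1000):
    """Yield the strings in `values` in sorted order using bounded memory.

    Hypothetical helper: at most `run_size` items are buffered at a time.
    Each full buffer is sorted and written to a temporary "run" file, and
    the runs are then merged lazily with a heap.
    """
    run_paths = []
    buffer = []

    def flush():
        # Sort the current buffer and spill it to its own temp file.
        if not buffer:
            return
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for item in buffer:
                fh.write(item + "\n")
        run_paths.append(path)
        buffer.clear()

    for value in values:
        buffer.append(value)
        if len(buffer) >= run_size:
            flush()
    flush()

    files = [open(p, "r", encoding="utf-8") for p in run_paths]
    try:
        # Each run is already sorted, so a k-way merge yields global order
        # while holding only one line per run in memory.
        streams = [(line.rstrip("\n") for line in fh) for fh in files]
        for item in heapq.merge(*streams):
            yield item
    finally:
        for fh in files:
            fh.close()
        for p in run_paths:
            os.remove(p)
```

With `run_size=2`, five values are split into three runs on disk and merged back in order, so memory use depends on `run_size` rather than on the number of documents.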
>
> Without that pre-sort, it looks like you have one random disk access per
> hit at search time to do sorting, which is a problem with large
> collections. With the pre-sort, it looks like you're using the cached
> integer array, which is cool -- that's how I intend to implement the
> sort cache in KS.
ya, that pre-sort really helps speed everything up considerably.
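The pre-sort idea can be sketched roughly like this: at index time, sort all documents by the property once and record each document's rank in an integer array; at search time, order the hits by looking up those ranks in memory instead of reading each property value from disk. This is a minimal Python sketch under those assumptions, not Swish-e's or KinoSearch's actual code. The names `build_sort_cache` and `sort_hits` are hypothetical, and document ids are assumed to be dense integers starting at 0.

```python
from array import array


def build_sort_cache(values_by_doc):
    """Return an integer array where cache[doc_id] is that doc's sort rank.

    `values_by_doc[doc_id]` is the property value (e.g. a date string)
    for that document. Built once, at index time.
    """
    order = sorted(range(len(values_by_doc)), key=lambda d: values_by_doc[d])
    cache = array("I", [0] * len(values_by_doc))
    for rank, doc_id in enumerate(order):
        cache[doc_id] = rank
    return cache


def sort_hits(hit_doc_ids, cache):
    """Order search hits by the cached rank; no property values are read."""
    return sorted(hit_doc_ids, key=lambda d: cache[d])


dates = ["2006-11-15", "2006-01-02", "2006-07-04", "2005-12-25"]
cache = build_sort_cache(dates)
hits = sort_hits([0, 2, 3], cache)  # doc ids ordered by ascending date
```

Comparing small integers is also cheaper than comparing the property strings themselves, which is part of why the cached array helps.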
>> I'm growing to like the Xapian API for 'posting', 'term', 'data' and
>> 'value'.
>> http://xapian.org/docs/apidoc/html/classXapian_1_1Document.html
>
> Term is usefully and precisely defined in the KS/Lucene universe.
>
> Posting is one Term indexing one document one time. That's how Xapian
> and everybody I know of in IR defines it.
hm. maybe I've misunderstood term and posting. In the Xapian world, I thought
the only difference was that a posting had positional info. Or is that what
you're saying?
>
> In between those two, there is a concept missing a label: one term
> indexing one document several times (several postings). I sometimes
> call that a TermDoc.
In Xapian I add_posting() multiple times for the same token, using different
prefixes (MetaNames). Is that similar to what you're talking about?
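One way to picture the distinction Marvin is drawing, as a minimal Python sketch: a term is modeled here as a (field, text) pair, which is how Lucene defines it; a posting is one occurrence of a term in one document; and the list of all postings for one term in one document is what he calls a TermDoc. The function `build_index` and the whitespace tokenizer are assumptions made for illustration, not KinoSearch or Xapian API.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index: term -> doc_id -> list of positions.

    `docs` maps doc_id -> {field_name: text}. Each appended position is
    one posting; each per-document position list is one "TermDoc".
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for position, token in enumerate(text.lower().split()):
                term = (field, token)  # term = field name + token text
                index[term][doc_id].append(position)
    return index


docs = {
    1: {"title": "search engine",
        "body": "a search engine ranks search results"},
    2: {"body": "engine room"},
}
index = build_index(docs)

# Term ("body", "search") indexes doc 1 twice: two postings, one TermDoc.
body_search_doc1 = index[("body", "search")][1]
# The same token in another field is a different term in this model.
title_search_doc1 = index[("title", "search")][1]
```

Under this reading, adding the same token under different prefixes would produce several distinct terms rather than several postings of one term, though whether that matches what Marvin means is exactly the open question here.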
>
> "data" and "value"... Hmm, those are nebulosities. They need better
> names. It looks like "data" in Xapian is the unadulterated string
> prior to tokenizing...
>
> [ ... time passes ... ]
>
> Jeez, I've been browsing the Xapian docs for 15 minutes and I still
> can't figure out wtf Xapian thinks a "value" is.
>
you'll notice I said "growing to like". my devel notes tell me that I think a
'value' is like a Swish Property: you can sort by value, and each value needs a
unique int id.
'data' is similar, but can't be sorted on.
You can store whatever you want in either field, I believe. But yes,
unadulterated pre-tokenized strings are most likely I would think.
--
Peter Karman . http://peknet.com/ . peter@peknet.com