'Re: [postgis-devel] An Experiment'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgis-devel
Subject:    Re: [postgis-devel] An Experiment
From:       Kevin Neufeld <kneufeld () refractions ! net>
Date:       2009-09-22 16:58:45
Message-ID: 4AB90245.1030000 () refractions ! net
[Download RAW message or body]

Yeah, it was a while ago we last looked at this (great write-up Chris).  I do recall \
looking at many different ways to  order the input of geometries before creating a \
gist index (order by random() being one of them).

For those that don't know what we are talking about, the creating of a gist index \
involves making a tree of bounding  boxes.  At the lowest level, the individual \
bounding boxes correspond to the individual geometries.  At higher levels,  these \
bboxes are grouped together into some sort of groupings.  At the highest level, there \
is a handful of bboxes that  cover your entire dataset.  When you invoke the index in \
a spatial query, it compares your supplied bbox with the tree  nodes to [quickly] \
walk down the tree to identify the target collection of geometries.  (Others can \
correct me if my  overly simplistic definition is off base)

The main function of interest for us is the picksplit method defined by the GiST api. \
When the index is create and  geometries are added to the tree, eventually some \
subtree needs to get split to create two smaller trees.  Now,  logically, it makes \
sense that the best performance might be obtained when the bounding boxes at the \
subtree nodes are  as square as possible and non-overlapping.

It would seem that the ordering of the geometries before creating a gist index plays \
a huge role in what the index looks  like.  As Chris posted in his blog, a typical \
gist index in PostGIS "looks" horrible "looks" like it should perform  badly.  He \
posted an image that shows one level of an index where the tree nodes are very long \
and skinny and very much  overlapping.  This means that at any given point, you are \
likely to select many subtrees when trying to get to your  target geometries at the \
lowest levels.

I found that the best "looking" gist trees were created by ordering the geometries \
spatially, like using a grid specific  to the dataset (I would think Paul's geohash \
would make nice looking trees).  The subtrees created tended to be more  spatially \
disjoint.  This is good since this means that close geometries are grouped together \
and are also "close" on  disk (which means potentially fewer disk accesses).

Another approach I found that worked well was repeatedly creating a spatial index, \
cluster on the index (which orders  the table based on the spatial index), then drop \
and recreate the index.  After a few iterations, the tree looks like it  should \
perform well.

Having said all that, I'm kinda with Paul on this ... this may just be a waste of \
time.  We don't have a solid enough  handle on where the bottleneck is.  With a gist \
implementation, I'm not sure we can do much better than simply  clustering a table \
based on a spatial index.

Cheers,
Kevin

Paul Ramsey wrote:
> This book
> 
> http://www.amazon.com/Foundations-Multidimensional-Structures-Kaufmann-Computer/dp/0123694469
>  
> has a great little section on r-trees which gives capsule summaries of
> various papers, and even better summaries of some of the empirical
> testing that was done on different r-tree variants. Among other
> things, they found that the Ang/Tan linear algorithm actually produced
> a better index than the Guttman quadratic time algorithm, which is
> sort of counter-intuitive, but c'est la vie.
> 
> And it also shows some visualizations which sure make one want to have
> an R*Tree. However, the GiST infrastructure just doesn't provide that
> option. So we're left with really very few knobs to twist (penalty and
> picksplit).
> 
> However, perhaps we are wasting our time :) there was a very very
> interesting talk at GeoWeb this year by some guys selling a RDF
> triple-store, about how they did their spatial index. And they
> basically did a one-level grid system, but even simpler, just encoding
> things in strips, and it performed, because the strips could be read
> in the fastest possible way off the disk: linearly in one direction.
> 
> In a similar vein, the QIX structure used by Mapserver has some really
> really dumb behavior for points, creating an absurdly deep tree by
> default, and yet, doesn't really suffer from it. Our trouble is, we
> don't have a solid enough handle on where our performance bottlenecks
> are, so we could very well be focussing on the wrong (but very
> interesting) thing by looking at, for example, tree structure instead
> of disk access.
> 
> Another interesting performance test would be r-tree versus
> b-tree-over-geohash. The latter is more naive and un-balanced, but has
> the advantage (when clustered) of being nicely spatially
> autocorrelated, thereby providing potentially better disk access
> profiles for our classic bounding box query.
> 
> P.
> 
> On Mon, Sep 21, 2009 at 4:19 PM, Chris Hodgson <chodgson@refractions.net> wrote:
> > Hey Paul, it was actually almost a year ago that I looked into this stuff. I
> > started a blog to publish my findings but I realized that a solution was
> > probably outside the scope of postgresql/postgis and so I didn't end up
> > going any further with it. Also (as I'm sure many other wannabe bloggers
> > have discovered) I haven't done anything interesting enough to blog about
> > since :)
> > 
> > http://cognitivelychris.blogspot.com/
> > 
> > (I originally had it on chris.ca but I ended up selling that domain.)
> > 
> > Anyways, I explain in there why the real problem is progressive insertion of
> > *variable density data*. It doesn't matter whether you order it randomly or
> > semi-geo-correllated - unless it is just perfect (ie. you pre-process it
> > with a magic process that knows about how the index will be built) you end
> > up with a relatively poor index because the dense areas effectively take
> > over the index and mess it up.
> > 
> > I came up with some interesting algorithms to make what I envisioned to be
> > nearly perfectly balanced r-tree indexes but they require sorting the
> > bounding boxes for all of the data in the table at once, which isn't
> > necessarily feasible with large data sets and is definitely outside of the
> > postgresql workflow of inserting one record at a time. I'll try to make
> > another post with some of my prettier pictures.
> > 
> > Chris
> > 
> > Paul Ramsey wrote:
> > > Kevin,
> > > 
> > > Many many moons ago, you and Chris showed me the visual representation
> > > of an index, and lo, it had a damn funny layout compared to what we
> > > expected.
> > > 
> > > Having now gone through the picksplit code with a fine comb, I'm
> > > pretty sure I know why. I don't have any general means of preventing
> > > that (an R*Tree would be awesome, but GiST lacks the infrastructure to
> > > build one, at this time) I do have an observation it would be fun to
> > > see tested.
> > > 
> > > First, note that spatial data tends to be delivered to databases in
> > > shape files, and that the data in those files has a tendency to be
> > > spatially autocorrelated. (Watch the draw in JUMP or ArcView.)
> > > Spatially autocorrelated data is pretty much guaranteed to generate
> > > screwed up R-Trees, since the top nodes are going to be built around
> > > one small area, then expanded in odd ways to fit new data as is fills
> > > in areas outside the original node boundary.
> > > 
> > > A better R-Tree is going to be build against randomized inputs. So,
> > > try two tables, one loaded and built the old fashioned way, another
> > > built by creating a second table with "create table random_mytable as
> > > select * from mytable order by random()". Obviously you'll need enough
> > > records in the tables to make it interesting (bearing in mind there
> > > are hundreds of entries in each index node).
> > > 
> > > If you get a chance to run those, I'd love to hear how it goes :)
> > > 
> > > Paul
> > > 
> > > PS - Of course, once the random-based index is built, clustering the
> > > table against the index to bring the tuples back into some kind of
> > > spatial correlation will be necessary for performance purposes.
> > > Probably all these minor fine-tuning things make ****-all difference
> > > in the big picture. But it's fun to fiddle.
> > > _______________________________________________
> > > postgis-devel mailing list
> > > postgis-devel@postgis.refractions.net
> > > http://postgis.refractions.net/mailman/listinfo/postgis-devel
> > > 
> > _______________________________________________
> > postgis-devel mailing list
> > postgis-devel@postgis.refractions.net
> > http://postgis.refractions.net/mailman/listinfo/postgis-devel
> > 
> _______________________________________________
> postgis-devel mailing list
> postgis-devel@postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel
_______________________________________________
postgis-devel mailing list
postgis-devel@postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-devel

[prev in list] [next in list] [prev in thread] [next in thread]