'Re: [postgis-users] Point as Index'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgis-users
Subject:    Re: [postgis-users] Point as Index
From:       Ioannis Anagnostopoulos <ioannis () anatec ! com>
Date:       2011-03-30 19:20:59
Message-ID: 4D93829B.1080908 () anatec ! com
[Download RAW message or body]

[Attachment #2 (multipart/alternative)]

Hello, I think I have some results that I deem inconclusive although I 
am more and more convinced that a spatial index on points is not that 
great. I have attached an open office spreadsheet with the details of 
the different queries. I also try to summarise in the following table:

Query
	Execution time
	Predicted Rows
	Rows Returned
select
     obj_id,
     msg_date_rec,
     pos_point
from
     feed_all.common_pos_messages inner join
     feed_all.messages on (common_pos_messages.msg_id = messages.msg_id)
where
     pos_lat between 30721085 and 31012505 and
     pos_lon between 2601 and 867037 and
     msg_date_rec BETWEEN '2011-03-21' AND '2011-03-22'
ORDER BY
     obj_id, msg_date_rec
	133146.196 ms
	1751339
	663769
select
     obj_id,
     msg_date_rec,
     pos_point
from
     feed_all.common_pos_messages inner join
     feed_all.messages on (common_pos_messages.msg_id = messages.msg_id)
where
     st_contains(ST_GeomFromText('POLYGON((0.00433541700872238 
51.6875086022247,1.4450615615687 51.6875086022247,1.4450615615687 
51.2018083846302,0.00433541700872238 
51.2018083846302,0.00433541700872238 51.6875086022247))', 4326), 
pos_point) and
     msg_date_rec BETWEEN '2011-03-21' AND '2011-03-22'
ORDER BY
     obj_id, msg_date_rec
	 271619.013 ms
	911564
	663769
select
     obj_id,
     msg_date_rec,
     pos_point
from
     feed_all.common_pos_messages inner join
     feed_all.messages on (common_pos_messages.msg_id = messages.msg_id)
where
     ST_GeomFromText('POLYGON((0.00433541700872238 
51.6875086022247,1.4450615615687 51.6875086022247,1.4450615615687 
51.2018083846302,0.00433541700872238 
51.2018083846302,0.00433541700872238 51.6875086022247))', 4326) && 
pos_point AND
     msg_date_rec BETWEEN '2011-03-21' AND '2011-03-22'
ORDER BY
     obj_id, msg_date_rec
	249301.858 ms
	2734692
	663769

All in all, the second query takes longer but has the closest prediction 
when explain - analyze is performed while the quickest one is the first 
with the custom btree index. What really concerns me is that regardless 
of the Vacuum Analyze I perform times and predictions remain unchanged. 
Thus I will try to persevere  with the first query for this development 
cycle aiming to implement in the future a gist index not on points but 
on clusters of points.

Nicklas, I had a reading for your suggestion but if I got it right, 
st_contains and st_within both run a && operator on the bounding box 
before, so I guess that the improved execution time of the 3rd query may 
be attributed to the missing cycle of running the st_contains.

As always any suggestion is more than welcomed.
Thanks and kind regards
Yiannis

On 30/03/2011 15:15, Nicklas Avén wrote:
> Ok, I continue my spamming :-)
> 
> Your compare isn't fair. "Your" index is just doing a bounding box
> comparasion not a recheck to see what points is actually inside your
> geometry.
> 
> It looks like your polygon is a box, but PostGIS don't know that so it
> will do a recheck on all rows fetched by the index and do a "real"
> calculation.
> 
> To only do the bounding box test you can use&&  as operator.
> 
> And using count(*) instead of retrieving all the rows is good to avoid
> the I/O bottleneck
> 
> Do a vacuum analyze on the table and try:
> 
> select
> count(*)
> from
> feed_all.common_pos_messages
> where
> ST_GeomFromText('POLYGON((0.00433541700872238
> 51.6875086022247,1.4450615615687 51.6875086022247,1.4450615615687
> 51.2018083846302,0.00433541700872238
> 51.2018083846302,0.00433541700872238 51.6875086022247))', 4326
> &&
> pos_point;
> 
> 
> Little more thought through I hope :-)
> 
> /Nicklas
> 
> 
> 
> 
> 
> 
> On Wed, 2011-03-30 at 16:01 +0200, Nicklas Avén wrote:
> > Sorry, I was too fast on the keys.
> > 
> > before I saw the comparasion with your own index
> > 
> > /Nicklas
> > 
> > On Wed, 2011-03-30 at 14:40 +0100, Ioannis Anagnostopoulos wrote:
> > > The precise numbers are as follows:
> > > 
> > > Total Rows in the table: 45922415
> > > 
> > > if I use:
> > > select
> > > pos_lat,
> > > pos_lon
> > > from
> > > feed_all.common_pos_messages
> > > where
> > > st_contains(ST_GeomFromText('POLYGON((0.00433541700872238
> > > 51.6875086022247,1.4450615615687 51.6875086022247,1.4450615615687
> > > 51.2018083846302,0.00433541700872238
> > > 51.2018083846302,0.00433541700872238 51.6875086022247))', 4326),
> > > pos_point)
> > > 
> > > I get back: 4493678 in 4.77 mins
> > > 
> > > Doing explain analyze I get:
> > > "Bitmap Heap Scan on common_pos_messages  (cost=82262.99..522647.01
> > > rows=771600 width=8) (actual time=127735.424..198141.843 rows=4493678
> > > loops=1)"
> > > "  Recheck Cond:
> > > ('0103000020E61000000100000005000000D29145A403C2713F0B23294800D84940E04E3ADFF81E \
> > > F73F0B23294800D84940E04E3ADFF81EF73FCC056EDBD4994940D29145A403C2713FCC056EDBD4994940D29145A403C2713F0B23294800D84940'::geometry&& \
> > > pos_point)" "  Filter:
> > > _st_contains('0103000020E61000000100000005000000D29145A403C2713F0B23294800D84940 \
> > > E04E3ADFF81EF73F0B23294800D84940E04E3ADFF81EF73FCC056EDBD4994940D29145A403C2713FCC056EDBD4994940D29145A403C2713F0B23294800D84940'::geometry, \
> > > pos_point)" "  ->   Bitmap Index Scan on idx_pos  (cost=0.00..82070.09 \
> > > rows=2314801 width=0) (actual time=127732.000..127732.000 rows=4493679 \
> > > loops=1)" "        Index Cond:
> > > ('0103000020E61000000100000005000000D29145A403C2713F0B23294800D84940E04E3ADFF81E \
> > > F73F0B23294800D84940E04E3ADFF81EF73FCC056EDBD4994940D29145A403C2713FCC056EDBD4994940D29145A403C2713F0B23294800D84940'::geometry&& \
> > > pos_point)" "Total runtime: 199206.428 ms"
> > > 
> > > Obviously the times and the rows do not seem to much with the actual
> > > results. However after creating my own index based on lat/lon values
> > > (integers) on the same table, executing this:
> > > select
> > > pos_lat,
> > > pos_lon
> > > from
> > > feed_all.common_pos_messages
> > > where
> > > pos_lat between 30721085 and 31012505 and
> > > pos_lon between 2601 and 867037
> > > 
> > > I get back 4493680 in 2.8 mins
> > > 
> > > Doing explain analyze I get:
> > > "Bitmap Heap Scan on common_pos_messages  (cost=161748.26..601144.64
> > > rows=1686719 width=8) (actual time=10064.427..60738.808 rows=4493680
> > > loops=1)"
> > > "  Recheck Cond: ((pos_lat>= 30721085) AND (pos_lat<= 31012505) AND
> > > (pos_lon>= 2601) AND (pos_lon<= 867037))"
> > > "  ->   Bitmap Index Scan on idx_lat_lon  (cost=0.00..161326.58
> > > rows=1686719 width=0) (actual time=10061.108..10061.108 rows=4493680
> > > loops=1)"
> > > "        Index Cond: ((pos_lat>= 30721085) AND (pos_lat<= 31012505)
> > > AND (pos_lon>= 2601) AND (pos_lon<= 867037))"
> > > "Total runtime: 61850.720 ms"
> > > 
> > > The predictions are still "out" from the actual but the btree index
> > > seems to behave better.
> > > 
> > > Any suggestions? Probably I need to bring up to date my statistics.
> > > 
> > > Kind Regards
> > > Yiannis
> > > 
> > > 
> > > On 30/03/2011 13:30, Sandro Santilli wrote:
> > > > On Wed, Mar 30, 2011 at 10:58:57AM +0100, Ioannis Anagnostopoulos wrote:
> > > > 
> > > > > I am involved in a heavy database design initiative where the only kind
> > > > > of geometries I am dealing with are points. I have recently hit a
> > > > > 50million rows long table with those points and my default gist index on
> > > > > the points does not seem to be working very fast (if not at all to be
> > > > > honest). I have started now thinking that probably for "points" an index
> > > > > may not be the best option since in a 50million rows long table most of
> > > > > the points are unique so the index may just duplicate the actual table,
> > > > > of course I may be wrong and I may just missing a very important part of
> > > > > the concept.
> > > > How many points from the 50M set does your tipical query hits ?
> > > > Does PostgreSQL selectivity estimator make a good guess about that ?
> > > > Use EXPLAIN ANALYZE<your query here>  to see.
> > > > 
> > > > --strk;
> > > > 
> > > > ()   Free GIS&  Flash consultant/developer
> > > > /\   http://strk.keybit.net/services.html
> > > _______________________________________________
> > > postgis-users mailing list
> > > postgis-users@postgis.refractions.net
> > > http://postgis.refractions.net/mailman/listinfo/postgis-users
> > 
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users@postgis.refractions.net
> > http://postgis.refractions.net/mailman/listinfo/postgis-users
> > 
> 
> _______________________________________________
> postgis-users mailing list
> postgis-users@postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users

[Attachment #5 (text/html)]

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#ffffff" text="#000000">
    Hello, I think I have some results that I deem inconclusive although
    I am more and more convinced that a spatial index on points is not
    that great. I have attached an open office spreadsheet with the
    details of the different queries. I also try to summarise in the
    following table:<br>
    <br>
    <table border="1" cellpadding="2" cellspacing="2" height="62"
      width="1288">
      <tbody>
        <tr>
          <td valign="top" width="40%">Query<br>
          </td>
          <td valign="top">Execution time<br>
          </td>
          <td valign="top">Predicted Rows<br>
          </td>
          <td valign="top">Rows Returned<br>
          </td>
        </tr>
        <tr>
          <td valign="top"><small><tt>select <br>
                    obj_id,<br>
                    msg_date_rec,<br>
                    pos_point<br>
                from<br>
                    feed_all.common_pos_messages inner join<br>
                    feed_all.messages on (common_pos_messages.msg_id =
                messages.msg_id)<br>
                where <br>
                    pos_lat between 30721085 and 31012505 and<br>
                    pos_lon between 2601 and 867037 and<br>
                    msg_date_rec BETWEEN '2011-03-21' AND '2011-03-22'<br>
                ORDER BY<br>
                    obj_id, msg_date_rec</tt></small><br>
          </td>
          <td valign="top">133146.196 ms<br>
          </td>
          <td valign="top">1751339<br>
          </td>
          <td valign="top">663769<br>
          </td>
        </tr>
        <tr>
          <td valign="top"><tt><small>select <br>
                    obj_id,<br>
                    msg_date_rec,<br>
                    pos_point<br>
                from<br>
                    feed_all.common_pos_messages inner join<br>
                    feed_all.messages on (common_pos_messages.msg_id =
                messages.msg_id)<br>
                where<br>

                st_contains(ST_GeomFromText('POLYGON((0.00433541700872238
                51.6875086022247,1.4450615615687
                51.6875086022247,1.4450615615687
                51.2018083846302,0.00433541700872238
                51.2018083846302,0.00433541700872238
                51.6875086022247))', 4326), pos_point) and<br>
                    msg_date_rec BETWEEN '2011-03-21' AND '2011-03-22'<br>
                ORDER BY<br>
                    obj_id, msg_date_rec</small></tt><br>
          </td>
          <td valign="top"> 271619.013 ms<br>
          </td>
          <td valign="top">911564<br>
          </td>
          <td valign="top">663769<br>
          </td>
        </tr>
        <tr>
          <td valign="top"><tt><small>select <br>
                    obj_id,<br>
                    msg_date_rec,<br>
                    pos_point<br>
                from<br>
                    feed_all.common_pos_messages inner join<br>
                    feed_all.messages on (common_pos_messages.msg_id =
                messages.msg_id)<br>
                where<br>
                    ST_GeomFromText('POLYGON((0.00433541700872238
                51.6875086022247,1.4450615615687
                51.6875086022247,1.4450615615687
                51.2018083846302,0.00433541700872238
                51.2018083846302,0.00433541700872238
                51.6875086022247))', 4326) &amp;&amp; pos_point AND<br>
                    msg_date_rec BETWEEN '2011-03-21' AND '2011-03-22'<br>
                ORDER BY<br>
                    obj_id, msg_date_rec</small></tt><br>
          </td>
          <td valign="top">249301.858 ms<br>
          </td>
          <td valign="top">2734692<br>
          </td>
          <td valign="top">663769<br>
          </td>
        </tr>
      </tbody>
    </table>
    <br>
    All in all, the second query takes longer but has the closest
    prediction when explain - analyze is performed while the quickest
    one is the first with the custom btree index. What really concerns
    me is that regardless of the Vacuum Analyze I perform times and
    predictions remain unchanged. Thus I will try to persevere  with the
    first query for this development cycle aiming to implement in the
    future a gist index not on points but on clusters of points.<br>
    <br>
    Nicklas, I had a reading for your suggestion but if I got it right,
    st_contains and st_within both run a &amp;&amp; operator on the
    bounding box before, so I guess that the improved execution time of
    the 3rd query may be attributed to the missing cycle of running the
    st_contains. <br>
    <br>
    As always any suggestion is more than welcomed.<br>
    Thanks and kind regards<br>
    Yiannis <br>
    <br>
    On 30/03/2011 15:15, Nicklas Av�n wrote:
    <blockquote cite="mid:1301494503.2175.251.camel@ubuntunav"
      type="cite">
      <pre wrap="">Ok, I continue my spamming :-)

Your compare isn't fair. "Your" index is just doing a bounding box
comparasion not a recheck to see what points is actually inside your
geometry. 

It looks like your polygon is a box, but PostGIS don't know that so it
will do a recheck on all rows fetched by the index and do a "real"
calculation.

To only do the bounding box test you can use &amp;&amp; as operator.

And using count(*) instead of retrieving all the rows is good to avoid
the I/O bottleneck

Do a vacuum analyze on the table and try:

select 
count(*)
 from
     feed_all.common_pos_messages
 where
ST_GeomFromText('POLYGON((0.00433541700872238
 51.6875086022247,1.4450615615687 51.6875086022247,1.4450615615687
 51.2018083846302,0.00433541700872238
 51.2018083846302,0.00433541700872238 51.6875086022247))', 4326
&amp;&amp;
 pos_point;

Little more thought through I hope :-)

/Nicklas

On Wed, 2011-03-30 at 16:01 +0200, Nicklas Av�n wrote:
</pre>
      <blockquote type="cite">
        <pre wrap="">Sorry, I was too fast on the keys.

before I saw the comparasion with your own index

/Nicklas

On Wed, 2011-03-30 at 14:40 +0100, Ioannis Anagnostopoulos wrote:
</pre>
        <blockquote type="cite">
          <pre wrap="">The precise numbers are as follows:

Total Rows in the table: 45922415 

if I use:
select 
    pos_lat,
    pos_lon
from
    feed_all.common_pos_messages
where
st_contains(ST_GeomFromText('POLYGON((0.00433541700872238
51.6875086022247,1.4450615615687 51.6875086022247,1.4450615615687
51.2018083846302,0.00433541700872238
51.2018083846302,0.00433541700872238 51.6875086022247))', 4326),
pos_point)

I get back: 4493678 in 4.77 mins

Doing explain analyze I get:
"Bitmap Heap Scan on common_pos_messages  (cost=82262.99..522647.01
rows=771600 width=8) (actual time=127735.424..198141.843 rows=4493678
loops=1)"
"  Recheck Cond:
('0103000020E61000000100000005000000D29145A403C2713F0B23294800D84940E04E3ADFF81EF73F0B \
23294800D84940E04E3ADFF81EF73FCC056EDBD4994940D29145A403C2713FCC056EDBD4994940D29145A403C2713F0B23294800D84940'::geometry \
&amp;&amp; pos_point)" "  Filter:
_st_contains('0103000020E61000000100000005000000D29145A403C2713F0B23294800D84940E04E3A \
DFF81EF73F0B23294800D84940E04E3ADFF81EF73FCC056EDBD4994940D29145A403C2713FCC056EDBD4994940D29145A403C2713F0B23294800D84940'::geometry, \
pos_point)" "  -&gt;  Bitmap Index Scan on idx_pos  (cost=0.00..82070.09 rows=2314801
width=0) (actual time=127732.000..127732.000 rows=4493679 loops=1)"
"        Index Cond:
('0103000020E61000000100000005000000D29145A403C2713F0B23294800D84940E04E3ADFF81EF73F0B \
23294800D84940E04E3ADFF81EF73FCC056EDBD4994940D29145A403C2713FCC056EDBD4994940D29145A403C2713F0B23294800D84940'::geometry \
&amp;&amp; pos_point)" "Total runtime: 199206.428 ms"

Obviously the times and the rows do not seem to much with the actual
results. However after creating my own index based on lat/lon values
(integers) on the same table, executing this:
select 
    pos_lat,
    pos_lon
from
    feed_all.common_pos_messages
where 
    pos_lat between 30721085 and 31012505 and
    pos_lon between 2601 and 867037

I get back 4493680 in 2.8 mins

Doing explain analyze I get:
"Bitmap Heap Scan on common_pos_messages  (cost=161748.26..601144.64
rows=1686719 width=8) (actual time=10064.427..60738.808 rows=4493680
loops=1)"
"  Recheck Cond: ((pos_lat &gt;= 30721085) AND (pos_lat &lt;= 31012505) AND
(pos_lon &gt;= 2601) AND (pos_lon &lt;= 867037))"
"  -&gt;  Bitmap Index Scan on idx_lat_lon  (cost=0.00..161326.58
rows=1686719 width=0) (actual time=10061.108..10061.108 rows=4493680
loops=1)"
"        Index Cond: ((pos_lat &gt;= 30721085) AND (pos_lat &lt;= 31012505)
AND (pos_lon &gt;= 2601) AND (pos_lon &lt;= 867037))"
"Total runtime: 61850.720 ms"

The predictions are still "out" from the actual but the btree index
seems to behave better.

Any suggestions? Probably I need to bring up to date my statistics. 

Kind Regards
Yiannis

On 30/03/2011 13:30, Sandro Santilli wrote: 
</pre>
          <blockquote type="cite">
            <pre wrap="">On Wed, Mar 30, 2011 at 10:58:57AM +0100, Ioannis \
Anagnostopoulos wrote:

</pre>
            <blockquote type="cite">
              <pre wrap="">I am involved in a heavy database design initiative where \
the only kind  of geometries I am dealing with are points. I have recently hit a 
50million rows long table with those points and my default gist index on 
the points does not seem to be working very fast (if not at all to be 
honest). I have started now thinking that probably for "points" an index 
may not be the best option since in a 50million rows long table most of 
the points are unique so the index may just duplicate the actual table, 
of course I may be wrong and I may just missing a very important part of 
the concept. 
</pre>
            </blockquote>
            <pre wrap="">How many points from the 50M set does your tipical query \
hits ? Does PostgreSQL selectivity estimator make a good guess about that ?
Use EXPLAIN ANALYZE &lt;your query here&gt; to see.

--strk; 

  ()   Free GIS &amp; Flash consultant/developer
  /\   <a class="moz-txt-link-freetext" \
href="http://strk.keybit.net/services.html">http://strk.keybit.net/services.html</a> \
</pre>  </blockquote>
          <pre wrap="">
_______________________________________________
postgis-users mailing list
<a class="moz-txt-link-abbreviated" \
href="mailto:postgis-users@postgis.refractions.net">postgis-users@postgis.refractions.net</a>
 <a class="moz-txt-link-freetext" \
href="http://postgis.refractions.net/mailman/listinfo/postgis-users">http://postgis.refractions.net/mailman/listinfo/postgis-users</a>
 </pre>
        </blockquote>
        <pre wrap="">

_______________________________________________
postgis-users mailing list
<a class="moz-txt-link-abbreviated" \
href="mailto:postgis-users@postgis.refractions.net">postgis-users@postgis.refractions.net</a>
 <a class="moz-txt-link-freetext" \
href="http://postgis.refractions.net/mailman/listinfo/postgis-users">http://postgis.refractions.net/mailman/listinfo/postgis-users</a>

</pre>
      </blockquote>
      <pre wrap="">

_______________________________________________
postgis-users mailing list
<a class="moz-txt-link-abbreviated" \
href="mailto:postgis-users@postgis.refractions.net">postgis-users@postgis.refractions.net</a>
 <a class="moz-txt-link-freetext" \
href="http://postgis.refractions.net/mailman/listinfo/postgis-users">http://postgis.refractions.net/mailman/listinfo/postgis-users</a>
 </pre>
    </blockquote>
    <br>
  </body>
</html>

["pla_queries.ods" (application/vnd.oasis.opendocument.spreadsheet)]

_______________________________________________
postgis-users mailing list
postgis-users@postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users

[prev in list] [next in list] [prev in thread] [next in thread]