'[postgis-users] Possibly incorrect row count estimate for bounding box operators'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgis-users
Subject:    [postgis-users] Possibly incorrect row count estimate for bounding box operators
From:       Wojciech Strzalka <wstrzalka () gmail ! com>
Date:       2023-08-29 12:56:09
Message-ID: CAJf-CGsAUQQnmay+GDZuGQ7P=mdQBCMF0QGWXp0P4jTNLXtQZg () mail ! gmail ! com
[Download RAW message or body]

[Attachment #2 (multipart/alternative)]

I have performance related question as I noticed strange PostGIS behaviour
I can not explain, that degrades query performance.

I do have a table with North America continent split into 1mln irregular
sectors based on population density. Some of the sectors are 100mx100m
(cities) and some are 100kmx100km (north Canada) in size.

Table definition:

db=> \d+ table
                                                   Table "table"
     Column      |            Type             | Collation | Nullable |
Default | Storage  | Compression | Stats target | Description
-----------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
  id              | character(36)               |           | not null |
    | extended |             |              |
 polygon         | geometry(MultiPolygon,4326) |           |          |
    | main     |             | 10000        |
…
Indexes:
    "table_pkey" PRIMARY KEY, btree (id)
    "table_gist_idx" gist (polygon)
     …
Access method: heap

db=> select version();
                                                   version
--------------------------------------------------------------------------------------------------------------
  PostgreSQL 14.7 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 7.3.1
20180712 (Red Hat 7.3.1-6), 64-bit
(1 row)

db=> select postgis_full_version();

               postgis_full_version
-------------------------------------------------------------------------------------- \
----------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------- \
-----------------------------------------------------------------------------------  \
POSTGIS="3.1.7 aafe1ff" [EXTENSION] PGSQL="120" (procs need upgrade for use with \
PostgreSQL "140") GEOS="3.9.1-CAPI-1.14.2" PROJ="8.0.1" GDAL="GDAL 3.4.3, released \
                2022/04/22" LIBXML="2
.9.1" LIBJSON="0.15" LIBPROTOBUF="1.3.2" WAGYU="0.5.0 (Internal)" (core
procs from "3.1.5 c60e4e3" need upgrade) RASTER (raster procs from "3.1.5
c60e4e3" need upgrade)
(1 row)

db=> select count(*) from table;
  count
---------
 1083887
(1 row)

db=> select oid::regclass,  pg_size_pretty(pg_relation_size(oid)) from
pg_class WHERE oid::regclass::text like ‘table%';
                     oid                     | pg_size_pretty
---------------------------------------------+----------------
 table_pkey                          | 61 MB
 table_gist_idx                      | 63 MB
 table                               | 774 MB
(6 rows)

And now the problem - the first symptoms are visible with ST_Contains
function - the index is used correctly but the row count estimate (1075) is
way too high. I think it's causing parallelism to kick off breaking query
performance:

db=> EXPLAIN ANALYZE SELECT id FROM table WHERE ST_Contains(polygon,
ST_PointFromText('POINT(-84.68998 39.1622)', 4326));
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
  Gather  (cost=1044.35..20805.92 rows=1 width=36) (actual
time=0.479..52.232 rows=1 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   ->  Parallel Bitmap Heap Scan on table  (cost=44.35..19805.82 rows=1
width=36) (actual time=0.154..0.154 rows=0 loops=2)
         Filter: st_contains(polygon,
'0101000020E610000039B9DFA1282C55C0A2B437F8C2944340'::geometry)
         Rows Removed by Filter: 0
         Heap Blocks: exact=1
         ->  Bitmap Index Scan on table_gist_idx  (cost=0.00..44.35
rows=1075 width=0) (actual time=0.066..0.066 rows=2 loops=1)
               Index Cond: (polygon ~
'0101000020E610000039B9DFA1282C55C0A2B437F8C2944340'::geometry)
 Planning Time: 0.287 ms
 Execution Time: 52.282 ms

The root cause of the problem with ST_Contains row estimation sources from
bounding box ~ operator estimate (1075 as before). The parallelism doesn't
kick in here however as the BB operation is much cheaper then ST_Contains
function:

db=> EXPLAIN ANALYZE SELECT id FROM table WHERE polygon ~
ST_PointFromText('POINT(-84.68998 39.1622)', 4326);
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on table  (cost=44.62..4004.37 rows=1075 width=36)
(actual time=0.068..0.069 rows=2 loops=1)
   Recheck Cond: (polygon ~
'0101000020E610000039B9DFA1282C55C0A2B437F8C2944340'::geometry)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on table_gist_idx  (cost=0.00..44.35 rows=1075
width=0) (actual time=0.063..0.063 rows=2 loops=1)
         Index Cond: (polygon ~
'0101000020E610000039B9DFA1282C55C0A2B437F8C2944340'::geometry)
 Planning Time: 0.210 ms
 Execution Time: 0.106 ms
(7 rows)

And here comes the surprise - theoretically less restrictive && operator
does it kind a correct predicting single row:

db=> EXPLAIN ANALYZE SELECT id FROM table WHERE polygon &&
ST_PointFromText('POINT(-84.68998 39.1622)', 4326);
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
  Index Scan using table_gist_idx on table  (cost=0.29..8.30 rows=1
width=36) (actual time=0.068..0.069 rows=2 loops=1)
   Index Cond: (polygon &&
'0101000020E610000039B9DFA1282C55C0A2B437F8C2944340'::geometry)
 Planning Time: 0.371 ms
 Execution Time: 0.117 ms
(4 rows)

As a workaround combining && with ST_Contains does the job for me but
imaybe there is room for improvement?

db=> EXPLAIN ANALYZE SELECT id FROM table WHERE polygon &&
ST_PointFromText('POINT(-84.68998 39.1622)', 4326)
                                            AND ST_Contains(polygon,
ST_PointFromText('POINT(-84.68998 39.1622)', 4326));

     QUERY PLAN
-------------------------------------------------------------------------------------- \
--------------------------------------------------------------------------------------
  Index Scan using table_gist_idx on table  (cost=0.29..33.30 rows=1
width=36) (actual time=0.077..0.078 rows=1 loops=1)
   Index Cond: ((polygon &&
'0101000020E610000039B9DFA1282C55C0A2B437F8C2944340'::geometry) AND
(polygon ~ '0101000020E610000039B9DFA1282C55C0A2B437F8C2944340'::geometry))
   Filter: st_contains(polygon,
'0101000020E610000039B9DFA1282C55C0A2B437F8C2944340'::geometry)
   Rows Removed by Filter: 1
 Planning Time: 0.451 ms
 Execution Time: 0.141 ms
(6 rows)

So the question is - why less restrictive intersect && operator gives lower
row count prediction than more restrictive contains ~ operator? Shouldn't
the row count estimate for ~ operator be the same or lower as for &&?  Is
it a bug in PostGIS (or PostgreSQL core)?

Best regards
   Wojtek Strzałka

[Attachment #5 (text/html)]

<div dir="ltr">I have performance related question as I noticed strange PostGIS \
behaviour I can not explain, that degrades query performance.<br><br>I do have a \
table with North America continent split into 1mln irregular sectors based on \
population density. Some of the sectors are 100mx100m (cities) and some are \
100kmx100km (north Canada) in size.<br><br>Table definition:<br><br>db=&gt; \d+ \
table<br>                                                                             \
Table &quot;table"<br>        Column         |                  Type                  \
| Collation | Nullable | Default | Storage   | Compression | Stats target | \
Description<br>-----------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+-------------<br> \
id                     | character(36)                      |                | not \
null |             | extended |                   |                     |<br>  \
polygon             | geometry(MultiPolygon,4326) |                |               |  \
| main       |                   | 10000            |<br>…<br>Indexes:<br>      \
&quot;table_pkey&quot; PRIMARY KEY, btree (id)<br>      &quot;table_gist_idx&quot; \
gist (polygon)<br>        …<br>Access method: heap<br><br>db=&gt; select \
version();<br>                                                                        \
version<br>--------------------------------------------------------------------------------------------------------------<br> \
PostgreSQL 14.7 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 \
(Red Hat 7.3.1-6), 64-bit<br>(1 row)<br><br>db=&gt; select \
postgis_full_version();<br>                                                           \
postgis_full_version<br>-------------------------------------------------------------- \
-------------------------------------------------------------------------------------- \
--------------------------------------<br>-------------------------------------------- \
-----------------------------------------------------------------------------------------------------------------------------<br> \
POSTGIS=&quot;3.1.7 aafe1ff&quot; [EXTENSION] PGSQL=&quot;120&quot; (procs need \
upgrade for use with PostgreSQL &quot;140&quot;) GEOS=&quot;3.9.1-CAPI-1.14.2&quot; \
PROJ=&quot;8.0.1&quot; GDAL=&quot;GDAL 3.4.3, released 2022/04/22&quot; \
LIBXML=&quot;2<br>.9.1&quot; LIBJSON=&quot;0.15&quot; LIBPROTOBUF=&quot;1.3.2&quot; \
WAGYU=&quot;0.5.0 (Internal)&quot; (core procs from &quot;3.1.5 c60e4e3&quot; need \
upgrade) RASTER (raster procs from &quot;3.1.5 c60e4e3&quot; need upgrade)<br>(1 \
row)<br><br>db=&gt; select count(*) from table;<br>   count<br>---------<br>  \
1083887<br>(1 row)<br><br>db=&gt; select oid::regclass,   \
pg_size_pretty(pg_relation_size(oid)) from pg_class WHERE oid::regclass::text like \
‘table%';<br>                                oid                               | \
pg_size_pretty<br>---------------------------------------------+----------------<br>  \
table_pkey                                       | 61 MB<br>  table_gist_idx          \
| 63 MB<br>  table                                              | 774 MB<br>(6 \
rows)<br><br>And now the problem - the first symptoms are visible with ST_Contains \
function - the index is used correctly but the row count estimate (1075) is way too \
high. I think it's causing parallelism to kick off breaking query \
performance:<br><br>db=&gt; EXPLAIN ANALYZE SELECT id FROM table WHERE \
ST_Contains(polygon, ST_PointFromText(&#39;POINT(-84.68998 39.1622)&#39;, 4326));<br> \
QUERY PLAN<br>-----------------------------------------------------------------------------------------------------------------------------------------<br> \
Gather   (cost=1044.35..20805.92 rows=1 width=36) (actual time=0.479..52.232 rows=1 \
loops=1)<br>     Workers Planned: 1<br>     Workers Launched: 1<br>     -&gt;   \
Parallel Bitmap Heap Scan on table   (cost=44.35..19805.82 rows=1 width=36) (actual \
time=0.154..0.154 rows=0 loops=2)<br>              Filter: st_contains(polygon, \
&#39;0101000020E610000039B9DFA1282C55C0A2B437F8C2944340&#39;::geometry)<br>           \
Rows Removed by Filter: 0<br>              Heap Blocks: exact=1<br>              \
-&gt;   Bitmap Index Scan on table_gist_idx   (cost=0.00..44.35 rows=1075 width=0) \
(actual time=0.066..0.066 rows=2 loops=1)<br>                       Index Cond: \
(polygon ~ &#39;0101000020E610000039B9DFA1282C55C0A2B437F8C2944340&#39;::geometry)<br> \
Planning Time: 0.287 ms<br>  Execution Time: 52.282 ms<br><br><br>The root cause of \
the problem with ST_Contains row estimation sources from bounding box ~ operator \
estimate (1075 as before). The parallelism doesn't kick in here however as the BB \
operation is much cheaper then ST_Contains function:<br><br>db=&gt; EXPLAIN ANALYZE \
SELECT id FROM table WHERE polygon ~ ST_PointFromText(&#39;POINT(-84.68998 \
39.1622)&#39;, 4326);<br>                                                             \
QUERY PLAN<br>-----------------------------------------------------------------------------------------------------------------------------------<br> \
Bitmap Heap Scan on table   (cost=44.62..4004.37 rows=1075 width=36) (actual \
time=0.068..0.069 rows=2 loops=1)<br>     Recheck Cond: (polygon ~ \
&#39;0101000020E610000039B9DFA1282C55C0A2B437F8C2944340&#39;::geometry)<br>     Heap \
Blocks: exact=1<br>     -&gt;   Bitmap Index Scan on table_gist_idx   \
(cost=0.00..44.35 rows=1075 width=0) (actual time=0.063..0.063 rows=2 loops=1)<br>    \
Index Cond: (polygon ~ \
&#39;0101000020E610000039B9DFA1282C55C0A2B437F8C2944340&#39;::geometry)<br>  Planning \
Time: 0.210 ms<br>  Execution Time: 0.106 ms<br>(7 rows)<br><br>And here comes the \
surprise - theoretically less restrictive &amp;&amp; operator does it kind a correct \
predicting single row:<br><br>db=&gt; EXPLAIN ANALYZE SELECT id FROM table WHERE \
polygon &amp;&amp; ST_PointFromText(&#39;POINT(-84.68998 39.1622)&#39;, 4326);<br>    \
QUERY PLAN<br>-----------------------------------------------------------------------------------------------------------------------------------------<br> \
Index Scan using table_gist_idx on table   (cost=0.29..8.30 rows=1 width=36) (actual \
time=0.068..0.069 rows=2 loops=1)<br>     Index Cond: (polygon &amp;&amp; \
&#39;0101000020E610000039B9DFA1282C55C0A2B437F8C2944340&#39;::geometry)<br>  Planning \
Time: 0.371 ms<br>  Execution Time: 0.117 ms<br>(4 rows)<br><br>As a workaround \
combining &amp;&amp; with ST_Contains does the job for me but imaybe there is room \
for improvement?<br><br>db=&gt; EXPLAIN ANALYZE SELECT id FROM table WHERE polygon \
&amp;&amp; ST_PointFromText(&#39;POINT(-84.68998 39.1622)&#39;, 4326)<br>             \
AND ST_Contains(polygon, ST_PointFromText(&#39;POINT(-84.68998 39.1622)&#39;, \
4326));<br>                                                                           \
QUERY PLAN<br>------------------------------------------------------------------------ \
----------------------------------------------------------------------------------------------------<br> \
Index Scan using table_gist_idx on table   (cost=0.29..33.30 rows=1 width=36) (actual \
time=0.077..0.078 rows=1 loops=1)<br>     Index Cond: ((polygon &amp;&amp; \
&#39;0101000020E610000039B9DFA1282C55C0A2B437F8C2944340&#39;::geometry) AND (polygon \
~ &#39;0101000020E610000039B9DFA1282C55C0A2B437F8C2944340&#39;::geometry))<br>     \
Filter: st_contains(polygon, \
&#39;0101000020E610000039B9DFA1282C55C0A2B437F8C2944340&#39;::geometry)<br>     Rows \
Removed by Filter: 1<br>  Planning Time: 0.451 ms<br>  Execution Time: 0.141 ms<br>(6 \
rows)<br><br>So the question is - why less restrictive intersect &amp;&amp; operator \
gives lower row count prediction than more restrictive contains ~ operator? Shouldn't \
the row count estimate for ~ operator be the same or lower as for &amp;&amp;?   Is it \
a bug in PostGIS (or PostgreSQL core)?<br><br>Best regards<br>     Wojtek \
Strzałka<br></div>

_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users

[prev in list] [next in list] [prev in thread] [next in thread]