'Re: [postgis-users] Distance based clustering (like in Openlayers orLeaflet)'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgis-users
Subject:    Re: [postgis-users] Distance based clustering (like in Openlayers orLeaflet)
From:       Rémi_Cura <remi.cura () gmail ! com>
Date:       2014-04-25 7:29:27
Message-ID: CAJvUf_uhOaCGmvrLeLS=Qj52cNwZKPOFE1EQUQSWkEVz9TrbLQ () mail ! gmail ! com
[Download RAW message or body]

[Attachment #2 (multipart/alternative)]

Hey,
I did the same thing using PLR and a minimal spanning tree.
It was fast, easy, and clustering method could be changed at will
(see all classical unsupervised learning).

Cheers,
Rémi-C

2014-04-24 12:53 GMT+02:00 Paragon Corporation <lr@pcorp.us>:

>  Quick glance at your plpgsql function.  It won't be able to use a
> spatial index.  To utilize a spatial index, you'll need to change your
> construct
>
>   ST_Distance(a.geom, b.geom) <= 4891.96981025 * 40
>
> to:
>   ST_DWithin(a.geom, b.geom, 4891.96981025 * 40)
>
> Also make sure you have spatial indexes on your points table.
>
> Hope that helps,
> Regina
> http://www.postgis.us
> http://postgis.net
>
>
>
>  ------------------------------
> *From:* postgis-users-bounces@lists.osgeo.org [mailto:
> postgis-users-bounces@lists.osgeo.org] *On Behalf Of *Matthias Ludwig
> *Sent:* Wednesday, April 23, 2014 10:53 AM
> *To:* postgis-users@lists.osgeo.org
> *Subject:* [postgis-users] Distance based clustering (like in Openlayers
> orLeaflet)
>
>   Hi,
> I try to implement a simple distance based clustering in PostgreSQL.
> It's working the following way:
> 1. loop as long as there are non clustered points
>    2. took one point that is not clustered
>    3. define this as a new cluster
>    4. get all surrounding points in a defined distance and add them into
> the cluster
> 5. calculate the centroid for each cluster and count cluster points
>
> The calculation speed depends mostly on the count of points and the scale
> used. In the moment i'am using a python script and it took around 1sec for
> 1000 points for TMS zoomlevel 5 (distance = 4891.96981025m * 40 pixel) and
> around 3sec at zoomlevel 15 (distance = 4.78m * 40 pixel). Implementing
> it with pgsql doesn't speeds it up.
> Is it possible to do the work completely with SQL and maybe with just one
> query? I'am not sure if I understood the WITH RECURSIVE concept
> completely...but shouldn't it by possible to replace the iteration(2.-4.)
> with a WITH RECURSIVE statement? Are there any other suggestion to speed it
> up or a better solution?
>
> Greetings
>
>
> CREATE TABLE points
> (
>   id_point serial PRIMARY KEY,
>   geom geometry(Point,3857),
>   cluster_nr integer
> );
> INSERT INTO points (geom)
> SELECT ST_SetSrid(ST_MakePoint(12.5 + random()*5, 51 + random()*5), 3857)
> FROM generate_series(0, 1000);
>
> CREATE TABLE pg_clusters
> (
>     gid integer,
>     cluster_feature_ids integer[],
>     count integer,
>     geom geometry
> );
>
> -- select first point, which isn't clustered yet
> SELECT  id_point
> FROM    points
> WHERE   cluster_nr IS NULL
> LIMIT   1;
> -- use first point and get all surrounding points in defined distance
> including the point itself
> -- use them as first cluster
> WITH first_cluster AS (
>         SELECT  a.id_point AS cluster_nr,
>                 b.id_point,
>                 b.geom AS geom
>         FROM    points a,
>                 points b
>         WHERE   ST_Distance(a.geom, b.geom) <= 4891.96981025 * 40
>                 AND a.id_point = 185
>                 AND a.cluster_nr IS NULL
> )
> UPDATE  points AS a
> SET     cluster_nr = b.cluster_nr
> FROM    first_cluster AS b
> WHERE   a.id_point = b.id_point;
> --> choose next id
> --> calculate next cluster
> etc.
> -- calculate the centroid of a cluster and count cluster points
> DELETE FROM pg_clusters;
> WITH clusters AS (
>         SELECT  cluster_nr, array_agg(id_point) AS feature_ids,
> count(id_point), ST_Centroid(ST_Collect(geom)) AS geom
>         FROM    points
>         GROUP BY cluster_nr
> )
> INSERT INTO pg_clusters(gid, cluster_feature_ids, count, geom)
> SELECT * FROM clusters;
>
> _______________________________________________
> postgis-users mailing list
> postgis-users@lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>

[Attachment #5 (text/html)]

<div dir="ltr"><div><div><div>Hey,<br>I did the same thing using PLR and a minimal \
spanning tree.<br></div>It was fast, easy, and clustering method could be changed at \
will <br></div>(see all classical unsupervised learning).<br> \
<br></div>Cheers,<br>Rémi-C<br></div><div class="gmail_extra"><br><br><div \
class="gmail_quote">2014-04-24 12:53 GMT+02:00 Paragon Corporation <span \
dir="ltr">&lt;<a href="mailto:lr@pcorp.us" \
target="_blank">lr@pcorp.us</a>&gt;</span>:<br> <blockquote class="gmail_quote" \
style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><u></u>

<div>
<div dir="ltr" align="left"><span><font color="#0000ff" face="Arial">Quick glance at \
your plpgsql function.   It won&#39;t be able  to use a spatial index.   To utilize a \
spatial index, you&#39;ll need to change  your construct</font></span></div><div \
class=""> <div dir="ltr" align="left"><span><font color="#0000ff" \
face="Arial"></font></span>  </div> <div dir="ltr" align="left"><span>   
ST_Distance(a.geom, b.geom) &lt;= 4891.96981025 * 40<br></span></div>
<div dir="ltr" align="left"><span><font color="#0000ff" face="Arial"></font></span>  \
</div> </div><div dir="ltr" align="left"><span><font color="#0000ff" \
face="Arial">to:</font></span></div> <div dir="ltr" align="left"><span>   \
ST_DWithin(a.geom,  b.geom, 4891.96981025 * 40)</span></div>
<div><span></span>  </div>
<div><span><font color="#0000ff" face="Arial">Also 
make sure you have spatial indexes on your points table.</font></span></div>
<div><span><font color="#0000ff" face="Arial"></font></span>  </div>
<div><span><font color="#0000ff" face="Arial">Hope 
that helps,</font></span></div>
<div><span><font color="#0000ff" face="Arial">Regina</font></span></div>
<div><span><font color="#0000ff" face="Arial"><a href="http://www.postgis.us" \
target="_blank">http://www.postgis.us</a></font></span></div> <div><span><font \
color="#0000ff" face="Arial"><a href="http://postgis.net" \
target="_blank">http://postgis.net</a></font></span></div> <div><span><font \
color="#0000ff" face="Arial"></font>  </span></div> <div dir="ltr" \
align="left"><br></div><br> <div dir="ltr" align="left" lang="en-us">
<hr>
<font face="Tahoma"><b>From:</b> <a \
href="mailto:postgis-users-bounces@lists.osgeo.org" \
target="_blank">postgis-users-bounces@lists.osgeo.org</a>  [mailto:<a \
href="mailto:postgis-users-bounces@lists.osgeo.org" \
target="_blank">postgis-users-bounces@lists.osgeo.org</a>] <b>On Behalf Of \
</b>Matthias  Ludwig<br><b>Sent:</b> Wednesday, April 23, 2014 10:53 AM<br><b>To:</b> \
 <a href="mailto:postgis-users@lists.osgeo.org" \
target="_blank">postgis-users@lists.osgeo.org</a><br><b>Subject:</b> [postgis-users] \
Distance based  clustering (like in Openlayers \
orLeaflet)<br></font><br></div><div><div class="h5"> <div></div>
<div style="FONT-SIZE:12px;FONT-FAMILY:Verdana">
<div>
<div>Hi,</div>
<div>I try to implement a simple distance based clustering in PostgreSQL.</div>
<div>It&#39;s working the following way:</div>
<div>1. loop as long as there are non clustered points</div>
<div>     2. took one point that is not clustered</div>
<div>     3. define this as a new cluster</div>
<div>     4. get all surrounding points in a defined distance and add 
them into the cluster</div>
<div>5. calculate the centroid for each cluster and count cluster points</div>
<div>  </div>
<div>The calculation speed depends mostly on the count of points and the scale 
used. In the moment i&#39;am using a python script and it took around 1sec for 1000 
points for TMS zoomlevel 5 (distance = 4891.96981025m * 40 pixel) and around 
3sec at zoomlevel 15 (distance = <span \
style="WHITE-SPACE:normal;TEXT-TRANSFORM:none;WORD-SPACING:0px;FLOAT:none;COLOR:rgb(42,42,42);FONT:13px/18px \
&#39;Segoe UI&#39;,&#39;Lucida \
Grande&#39;,Verdana,Arial,Helvetica,sans-serif;DISPLAY:inline!important;LETTER-SPACING:normal;TEXT-INDENT:0px">4.78m</span> \

* 40 pixel). Implementing it with pgsql doesn&#39;t speeds it up.</div>
<div>Is it possible to do the work completely with SQL and maybe with just one 
query? I&#39;am not sure if I understood the WITH RECURSIVE concept completely...but 
shouldn&#39;t it by possible to replace the iteration(2.-4.) with a WITH RECURSIVE 
statement? Are there any other suggestion to speed it up or a better 
solution?</div>
<div>  </div>
<div>Greetings</div>
<div>  </div>
<div>  </div>
<div>CREATE TABLE points<br>(<br>   id_point serial PRIMARY KEY,<br>   
geom geometry(Point,3857),<br>   cluster_nr integer<br>);<br>INSERT INTO 
points (geom)<br>SELECT ST_SetSrid(ST_MakePoint(12.5 + random()*5, 51 + 
random()*5), 3857)<br>FROM generate_series(0, 1000);</div>
<div>  </div>
<div>CREATE TABLE pg_clusters<br>(<br>       gid 
integer,<br>       cluster_feature_ids 
integer[],<br>       count integer,<br>       geom 
geometry<br>);</div>
<div>  </div>
<div>-- select first point, which isn&#39;t clustered yet<br>SELECT   
id_point<br>FROM       points<br>WHERE     cluster_nr IS 
NULL<br>LIMIT     1;</div>
<div>-- use first point and get all surrounding points in defined distance 
including the point itself</div>
<div>-- use them as first cluster<br>WITH first_cluster AS 
(<br>               SELECT   a.id_point AS 
cluster_nr,<br>                               
b.id_point,<br>                               
b.geom AS geom<br>               
FROM       points 
a,<br>                               
points b<br>               WHERE     
ST_Distance(a.geom, b.geom) &lt;= 4891.96981025 * 
40<br>                               
AND a.id_point = 
185<br>                               
AND a.cluster_nr IS NULL<br>)<br>UPDATE   points AS 
a<br>SET         cluster_nr = 
b.cluster_nr<br>FROM       first_cluster AS b<br>WHERE     
a.id_point = b.id_point;</div>
<div>--&gt; choose next id<br>--&gt; calculate next cluster<br>etc.</div>
<div>-- calculate the centroid of a cluster and count cluster points<br>DELETE 
FROM pg_clusters;<br>WITH clusters AS 
(<br>               SELECT   cluster_nr, 
array_agg(id_point) AS feature_ids, count(id_point), 
ST_Centroid(ST_Collect(geom)) AS 
geom<br>               FROM       
points<br>               GROUP BY 
cluster_nr<br>)<br>INSERT INTO pg_clusters(gid, cluster_feature_ids, count, 
geom)<br>SELECT * FROM clusters;</div></div></div></div></div></div>
<br>_______________________________________________<br>
postgis-users mailing list<br>
<a href="mailto:postgis-users@lists.osgeo.org">postgis-users@lists.osgeo.org</a><br>
<a href="http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users" \
target="_blank">http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users</a><br></blockquote></div><br></div>

_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users

[prev in list] [next in list] [prev in thread] [next in thread]