'Re: [SQL] finding gaps in temporal data'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgresql-sql
Subject:    Re: [SQL] finding gaps in temporal data
From:       Samuel Gendler <sgendler () ideasculptor ! com>
Date:       2011-06-15 20:52:27
Message-ID: BANLkTi=vD=MgfDjCaN7TrThbE3=1-KHXEw () mail ! gmail ! com
[Download RAW message or body]

On Wed, Jun 15, 2011 at 10:23 AM, Samuel Gendler
<sgendler@ideasculptor.com>wrote:

> I have a fact table that looks like this:
>
> dim1_fk bigint,
> time_fk bigint,
> tstamp timestamp without timezone
> value numeric(16,2)
>
> The tstamp column is completely redundant to a colume in the time_utc
> table, but I like to leave it there for convenience when writing ad-hoc
> queries in psql - it allows me to skip the join to a time dimension table.
>  The fact table is actually partitioned into 1-month child tables, with
> check constraints on both time_fk and the tstamp column, since there is a
> 1:1 relationship between those columns.
>
> dim1_fk refers to a dim1 table which has two varchar columns we care about:
>
> dim1_pk bigint,
> label1 varchar,
> label2 varchar
>
> the time_utc table contains the usual time dimension columns, and I've
> structured the primary key to be an integer in the form YYYYMMDDHH24mm - so
> 2011-06-15 15:35 has a primary key of 201106151535 and there is a row every
> 5 minutes. All data in the fact table is assigned to a given 5 minute
> window. There is a row in the time_utc table for every possible time value,
> regardless of whether there is data in a fact table for that 5 minute
> interval.  For our purposes, we only need care about 2 columns
>
> time_pk bigint,
> tstamp timstamp without time zone
>
> I'm looking to run a report which will show me any gaps in the data for any
> label1/label2 pair that appears in the dim1 table - there are 0 or more rows
> for each label1/label2 combination in each 5 minute window and I don't
> actually care about the duplicates (all my queries aggregate multiple rows
> for a given timestamp via the avg() function).
>

OK, I figured this one out on my own.  It looks like this (plus a union all
to a query to show entries from dim1 for which there is no data at all):

SELECT d2.label1,
       d2.label2,
       tstamp - gap as start_time,
       tstamp - '5 minute'::interval as end_time
FROM (
    SELECT d.dim1_pk,
           t.tstamp,
           t.tstamp - lag(t.tstamp,1, '2011-05-15 00:00:00') OVER w AS gap
    FROM
        dim1 d
        JOIN facts.fact_tbl f
             ON f.dim1_fk = d.dim1_pk
        JOIN time_utc t
             ON t.time_pk = f.time_fk
    WHERE f.tstamp between '2011-05-15 00:00:00' and '2011-06-03 23:55:00'
    GROUP BY 1,2
    WINDOW w AS (PARTITION BY d.dim1_pk
                 ORDER BY d.dim1_pk)
    ORDER BY 1,2
) AS q JOIN dim1 d2 ON d2.dim1_pk = q.dim1_pk
WHERE q.gap > '5 minute'::interval
;

That subtracts the tstamp of the previous row from the tstamp of the current
row, within a window defined on individual dim1_pk values.  The outer query
then selects only rows where the gap is greater than 5 minutes, since
sequential values will show a 5 minute interval.  It also joins to dim1
again in order to pull out the label1 and label2 values, since including
those in the window instead of dim1_pk resulted in a much slower query,
presumably because I don't have indexes on the label columns.

This avoids all of the cross join and left join craziness I was doing, which
is useful when attempting to plug gaps with default values, but a pain in
the neck when just attempting to determine where the gaps are and how large
they are.

window functions are a seriously useful tool!

--sam

[Attachment #3 (text/html)]

<br><br><div class="gmail_quote">On Wed, Jun 15, 2011 at 10:23 AM, Samuel Gendler \
<span dir="ltr">&lt;<a \
href="mailto:sgendler@ideasculptor.com">sgendler@ideasculptor.com</a>&gt;</span> \
wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px \
#ccc solid;padding-left:1ex;"> I have a fact table that looks like \
this:<div><br></div><div>dim1_fk bigint,</div><div>time_fk bigint,</div><div>tstamp \
timestamp without timezone</div><div>value numeric(16,2)</div><div><br></div><div>The \
tstamp column is completely redundant to a colume in the time_utc table, but I like \
to leave it there for convenience when writing ad-hoc queries in psql - it allows me \
to skip the join to a time dimension table.  The fact table is actually partitioned \
into 1-month child tables, with check constraints on both time_fk and the tstamp \
column, since there is a 1:1 relationship between those columns.</div>

<div><br></div><div>dim1_fk refers to a dim1 table which has two varchar columns we \
care about:</div><div><br></div><div>dim1_pk bigint,</div><div>label1 \
varchar,</div><div>label2 varchar</div><div><br></div><div>the time_utc table \
contains the usual time dimension columns, and I&#39;ve structured the primary key to \
be an integer in the form YYYYMMDDHH24mm - so 2011-06-15 15:35 has a primary key of \
201106151535 and there is a row every 5 minutes. All data in the fact table is \
assigned to a given 5 minute window. There is a row in the time_utc table for every \
possible time value, regardless of whether there is data in a fact table for that 5 \
minute interval.  For our purposes, we only need care about 2 columns</div>

<div><br></div><div>time_pk bigint,</div><div>tstamp timstamp without time \
zone</div><div><br></div><div>I&#39;m looking to run a report which will show me any \
gaps in the data for any label1/label2 pair that appears in the dim1 table - there \
are 0 or more rows for each label1/label2 combination in each 5 minute window and I \
don&#39;t actually care about the duplicates (all my queries aggregate multiple rows \
for a given timestamp via the avg() function). </div> \
</blockquote><div><br></div><div>OK, I figured this one out on my own.  It looks like \
this (plus a union all to a query to show entries from dim1 for which there is no \
data at all):</div><div><br></div><div><span class="Apple-style-span" \
style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; \
"><div> <font face="&#39;courier new&#39;, monospace">SELECT \
d2.label1,</font></div><div><font face="&#39;courier new&#39;, monospace">       \
d2.label2,</font></div><div><font face="&#39;courier new&#39;, monospace">       \
tstamp - gap as start_time,</font></div> <div><font face="&#39;courier new&#39;, \
monospace">       tstamp - &#39;5 minute&#39;::interval as end_time \
</font></div><div><font face="&#39;courier new&#39;, monospace">FROM \
(</font></div><div><font face="&#39;courier new&#39;, monospace">    SELECT \
d.dim1_pk,</font></div> <div><font face="&#39;courier new&#39;, monospace">           \
t.tstamp,</font></div><div><font face="&#39;courier new&#39;, monospace">           \
t.tstamp - lag(t.tstamp,1, &#39;2011-05-15 00:00:00&#39;) OVER w AS gap</font></div> \
<div><font face="&#39;courier new&#39;, monospace">    FROM</font></div><div><font \
face="&#39;courier new&#39;, monospace">        dim1 d </font></div><div><font \
face="&#39;courier new&#39;, monospace"></font><span style="font-family: &#39;courier \
new&#39;, monospace; ">        JOIN facts.fact_tbl f</span></div> <div><font \
face="&#39;courier new&#39;, monospace">             ON f.dim1_fk = d.dim1_pk    \
</font></div><div><span class="Apple-style-span" style="font-family: &#39;courier \
new&#39;, monospace; ">        JOIN time_utc t </span></div> <div><font \
face="&#39;courier new&#39;, monospace">             ON t.time_pk = \
f.time_fk</font></div><div><font face="&#39;courier new&#39;, monospace">    WHERE \
f.tstamp between &#39;2011-05-15 00:00:00&#39; and &#39;2011-06-03 \
23:55:00&#39;</font></div> <div><font face="&#39;courier new&#39;, monospace">    \
GROUP BY 1,2</font></div><div><font face="&#39;courier new&#39;, monospace">    \
WINDOW w AS (PARTITION BY d.dim1_pk</font></div><div><font face="&#39;courier \
new&#39;, monospace">                 ORDER BY d.dim1_pk)</font></div> <div><font \
face="&#39;courier new&#39;, monospace">    ORDER BY 1,2</font></div><div><font \
face="&#39;courier new&#39;, monospace">) AS q JOIN dim1 d2 ON d2.dim1_pk = \
q.dim1_pk</font></div><div><font face="&#39;courier new&#39;, monospace">WHERE q.gap \
&gt; &#39;5 minute&#39;::interval</font></div> <div><font face="&#39;courier \
new&#39;, monospace">;</font></div></span></div><div><br></div><div>That subtracts \
the tstamp of the previous row from the tstamp of the current row, within a window \
defined on individual dim1_pk values.  The outer query then selects only rows where \
the gap is greater than 5 minutes, since sequential values will show a 5 minute \
interval.  It also joins to dim1 again in order to pull out the label1 and label2 \
values, since including those in the window instead of dim1_pk resulted in a much \
slower query, presumably because I don&#39;t have indexes on the label columns.</div> \
<div><br></div><div>This avoids all of the cross join and left join craziness I was \
doing, which is useful when attempting to plug gaps with default values, but a pain \
in the neck when just attempting to determine where the gaps are and how large they \
are.</div> <div> </div><div>window functions are a seriously useful \
tool!</div><div><br></div><div>--sam</div></div>



[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic