
List:       postgresql-general
Subject:    Re: [HACKERS] "pivot aggregation" with a patched intarray
From:       Marc Mamin <M.Mamin@intershop.de>
Date:       2014-05-30 16:21:16
Message-ID: B6F6FD62F2624C4C9916AC0175D56D8828A80FC4@jenmbs01.ad.intershop.net

(reposted, this time with attachment. sorry)


Hello,


(sorry for this long post)

I have patched intarray with 3 additional functions in order to count [distinct]
event IDs into arrays, where the array positions correspond to the integer values
(mimicking column-oriented storage).

I need to use an array for the event IDs as I don't know how many of them exist,
and the list may grow slowly over time.

The first results are quite promising. 

Although this approach is only usable for a narrow set of use cases (*),
I wonder if I should look at other places in the source code to achieve a better
implementation, and whether there are discussions or plans to enhance Postgres
with some support for this kind of storage (vectors of ints).


(*) The array sizes should ideally not exceed a few tens, and the number of
    "events" should be unknown; otherwise using one column per event would be faster.


use case
========

c: category
s: session
e: event int4 (small range, mostly less than 20)

c   s   e
-   -   -
1   1   1
1   1   1
1   1   3
1   2   1
2   2   3
3   1   5
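
For reference, the sample data above could be loaded with something like this
(the table name "foo" is taken from the query below; purely illustrative):

CREATE TEMP TABLE foo (c int, s int, e int);
INSERT INTO foo (c, s, e) VALUES
  (1,1,1), (1,1,1), (1,1,3),
  (1,2,1), (2,2,3), (3,1,5);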

WITH SES AS (
  SELECT s, c,
         icount_to_array(e) ecount,
         iarray_flag(e)     edistinct
  FROM foo
  GROUP BY s, c)
SELECT c,
       iarray_sum(ecount)    ecount,
       iarray_sum(edistinct) edistinct
FROM SES GROUP BY c;

        
c        ecount          edistinct
-        -------         ---------
1        [3,0,1]         [2,0,1]
2        [0,0,1]         [0,0,1]
3        [0,0,0,0,1]     [0,0,0,0,1]

e.g.: event 1 was seen 3 times in category 1, in 2 distinct sessions.
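
For comparison, the "one column per event" alternative mentioned in note (*) would
look roughly like this (only workable when the set of events is fixed and known in
advance, here 1, 3 and 5):

SELECT c,
       sum(CASE WHEN e = 1 THEN 1 ELSE 0 END) AS e1,
       sum(CASE WHEN e = 3 THEN 1 ELSE 0 END) AS e3,
       sum(CASE WHEN e = 5 THEN 1 ELSE 0 END) AS e5
FROM foo
GROUP BY c;

This needs one column per known event, which is exactly what the array approach avoids.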


source code
===========
The 3 aggregates use the following functions:

isumv:          isumv([1,1,1],[0,0,1,3]) = [1,1,2,3]

iarray_addat:   iarray_addat([3,3,3],2) = [3,4,3]

iarray_setat:   iarray_setat([0,0],2) = [0,1]
                iarray_setat([0,1],2) = [0,1]


I've added them at the end of _int_op.c from intarray (attached).

Missing in the code: checks for integer overflow and that the array lower bounds
are all 1.
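
The SQL-level declarations are only in the attachment; inferred from the aggregate
definitions further down, they should look roughly like this (IMMUTABLE/STRICT are
my assumption, not taken from the attached code):

CREATE FUNCTION isumv(int4[], int4[]) RETURNS int4[]
  AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION iarray_addat(int4[], int4) RETURNS int4[]
  AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION iarray_setat(int4[], int4) RETURNS int4[]
  AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;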

These are my first C lines ever, and I have never formally learned the language,
so I'm open to criticism ;-)
I've started with this great posting by Craig Ringer:
 http://stackoverflow.com/questions/16992339/why-is-postgresql-array-access-so-much-faster-in-c-than-in-pl-pgsql?answertab=votes#tab-top
 The rest is mostly copy and paste from other parts of intarray.
 

iarray_setat just sets a "bitmap position" to 1, possibly enlarging the array when
required. It is a trivial modification of iarray_addat.
It should be possible to implement this more efficiently with bit[] or varbit, but
that lies beyond my C capabilities.

performance & advantages of icount_to_array
===========================================
As this aggregate reduces the cardinality of the GROUP BY,
it often performs better than a vertical approach.

With the horizontal storage, the result can be stored in smaller tables with far
fewer rows, which brings further advantages when it comes to indexes.
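
For instance, the pivoted result can be materialized into a small per-category
table ("foo_summary" is a hypothetical name):

CREATE TABLE foo_summary AS
  SELECT c, icount_to_array(e) AS ecount
  FROM foo
  GROUP BY c;
-- one row per category instead of one row per (category, event) pair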



e.g.:

select icount_to_array(user_id) from foo

{256655,0,0,1075,0,1,91154,36,0 (...)

Aggregate  (cost=36930.44..36930.45 rows=1 width=4) (actual time=333.378..333.378 rows=1 loops=1)
  ->  Seq Scan on foo  (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..131.117 rows=1860179 loops=1)
Total runtime: 333.420 ms


select user_id, count(*) from foo group by user_id order by 1

Sort  (cost=41582.75..41582.87 rows=48 width=4) (actual time=492.638..492.638 rows=79 loops=1)
  Sort Key: user_id
  Sort Method: quicksort  Memory: 28kB
  ->  HashAggregate  (cost=41580.93..41581.41 rows=48 width=4) (actual time=492.606..492.615 rows=79 loops=1)
        ->  Seq Scan on foo  (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..128.800 rows=1860179 loops=1)
Total runtime: 492.699 ms

1;256656
4;1075
7;91157
8;36
17;1455583
(...)

It will swap (spill to disk) later on
-------------------------------------
... which results in a significant difference in some cases.

create temp table ev AS
SELECT * FROM (
  generate_series(1,1000) s cross join
  generate_series(1,500)  c cross join
  generate_series(1,15)   e cross join
  generate_series(1,5)    repeat
);

explain analyze select s, c,  icount_to_array(e)  from ev group by s,c

HashAggregate  (cost=830575.54..830975.54 rows=40000 width=12) (actual time=19188.384..19379.154 rows=500000 loops=1)
  ->  Seq Scan on ev  (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.046..4849.977 rows=37500000 loops=1)
Total runtime: 19399.151 ms

explain analyze select s, c, e, count(*) from ev group by s,c,e

GroupAggregate  (cost=5589186.88..6073545.71 rows=3587844 width=12) (actual time=63290.168..89336.265 rows=7500000 loops=1)
  ->  Sort  (cost=5589186.88..5678882.96 rows=35878431 width=12) (actual time=63290.156..83981.772 rows=37500000 loops=1)
        Sort Key: s, c, e
        Sort Method: external merge  Disk: 806424kB
        ->  Seq Scan on ev  (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.039..4957.481 rows=37500000 loops=1)
Total runtime: 89680.844 ms


Aggregate definitions:
======================


   CREATE AGGREGATE icount_to_array(integer) (
     SFUNC=iarray_addat,
     STYPE=int4[],
     INITCOND='{0}'
   );

   CREATE AGGREGATE iarray_flag(integer) (
     SFUNC=iarray_setat,
     STYPE=int4[],
     INITCOND='{0}'
   );

   CREATE AGGREGATE iarray_sum(int[]) (
     SFUNC=isumv,
     STYPE=int[],
     INITCOND='{0}'
   );
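
Assuming the aggregates behave as described in the use case above, a quick sanity
check with inlined values:

SELECT icount_to_array(e) FROM (VALUES (1),(1),(3)) t(e);
-- expected: {2,0,1}   (the events of session 1 / category 1)

SELECT iarray_sum(a) FROM (VALUES ('{3,0,1}'::int4[]), ('{0,0,0,0,1}'::int4[])) t(a);
-- expected: {3,0,1,0,1}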




regards,

Marc Mamin


["_int_op.c.gz" (application/x-gzip)]


