
List:       pgsql-performance
Subject:    Re: [PERFORM] many to many performance
From:       "Chavdar Kopoev" <chavdar () kopoev ! com>
Date:       2008-11-27 16:39:51
Message-ID: 200811271839493750356 () kopoev ! com

Hi Craig,

Thank you for your answer.

Here are my test table and index definitions:

-- document id to category id
create table doc_to_cat (
  doc_id integer not null,
  cat_id integer not null
) with (oids=false);

-- Total 1m rows. 500k unique document ids. 20k unique category ids.
-- Each doc_id refers to exactly two cat_id.
insert into doc_to_cat
select i/50*25 + i%50 as doc_id, i/50 as cat_id from generate_series(0, 1000000) i;

create index doc_to_cat__doc_id on doc_to_cat using btree (doc_id);
create index doc_to_cat__cat_id on doc_to_cat using btree (cat_id);

-- document id to array of category ids
create table doc_to_cat_arr (
  doc_id integer not null,
  cat_arr integer[] not null
) with (oids=false);

insert into doc_to_cat_arr
select doc_id, int_array_aggregate(cat_id) as cat_arr
from doc_to_cat
group by doc_id;

create index doc_to_cat_arr__doc_id on doc_to_cat_arr using btree (doc_id);

-- category id to array of document ids
create table cat_to_doc_arr (
  cat_id integer not null,
  doc_arr integer[] not null
) with (oids=false);

insert into cat_to_doc_arr
select cat_id, int_array_aggregate(doc_id) as doc_arr
from doc_to_cat
group by cat_id;

create index cat_to_doc_arr__doc_arr__gin on cat_to_doc_arr using gin (doc_arr gin__int_ops);

-- Query Ids
create table query_ids (
  doc_id integer not null
) with (oids=false);

-- 200k test ids to query with
insert into query_ids
select generate_series(100000, 300000);

create index query_ids__doc_id on query_ids using btree (doc_id);

Now the query plans. We are looking for the cat_id values related to 200k doc ids:

explain analyze
select distinct cat_id from doc_to_cat join query_ids using (doc_id);

Unique  (cost=19303.84..19602.03 rows=20544 width=4) (actual time=1006.745..1190.472 rows=8002 loops=1)
  ->  Sort  (cost=19303.84..19452.93 rows=372735 width=4) (actual time=1006.743..1094.908 rows=400002 loops=1)
        Sort Key: doc_to_cat.cat_id
        Sort Method:  quicksort  Memory: 31039kB
        ->  Merge Join  (cost=2972.22..13785.04 rows=372735 width=4) (actual time=167.591..750.177 rows=400002 loops=1)
              Merge Cond: (query_ids.doc_id = doc_to_cat.doc_id)
              ->  Index Scan using query_ids__doc_id on query_ids  (cost=0.00..2912.05 rows=200001 width=4) (actual time=0.021..81.291 rows=200001 loops=1)
              ->  Index Scan using doc_to_cat__doc_id on doc_to_cat  (cost=0.00..14543.09 rows=1000001 width=8) (actual time=0.017..281.769 rows=599978 loops=1)
Total runtime: 1195.397 ms
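
A variant of the same query that I have not measured here, sketched only as an idea: replacing DISTINCT with GROUP BY leaves the planner free to choose a HashAggregate instead of the sort + unique step:

```sql
-- Sketch, not benchmarked: same result set as the DISTINCT query,
-- but the planner may aggregate with a hash instead of sorting 400k rows.
explain analyze
select cat_id from doc_to_cat join query_ids using (doc_id)
group by cat_id;
```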

explain analyze
select distinct int_array_enum(cat_arr) from doc_to_cat_arr join query_ids using (doc_id);

Unique  (cost=13676.57..13836.57 rows=19732 width=29) (actual time=1061.490..1246.595 rows=8002 loops=1)
  ->  Sort  (cost=13676.57..13756.57 rows=200001 width=29) (actual time=1061.488..1150.451 rows=400002 loops=1)
        Sort Key: (int_array_enum(doc_to_cat_arr.cat_arr))
        Sort Method:  quicksort  Memory: 31039kB
        ->  Merge Join  (cost=2318.98..10859.01 rows=200001 width=29) (actual time=163.840..816.697 rows=400002 loops=1)
              Merge Cond: (doc_to_cat_arr.doc_id = query_ids.doc_id)
              ->  Index Scan using doc_to_cat_arr__doc_id on doc_to_cat_arr  (cost=0.00..11311.10 rows=500025 width=33) (actual time=0.022..359.673 rows=300002 loops=1)
              ->  Index Scan using query_ids__doc_id on query_ids  (cost=0.00..2912.05 rows=200001 width=4) (actual time=0.016..81.370 rows=200001 loops=1)
Total runtime: 1251.613 ms

explain analyze
select cat_id from cat_to_doc_arr
where doc_arr && (select int_array_aggregate(q.doc_id)
                  from (select doc_id from query_ids limit 20000) as q);

This query should never be run in practice: it is too slow even with the input limited to 20k ids.

So the best result so far is more than 1 second (on a fast machine) for a test dataset five times smaller than the real one, which is far from the performance I need. I have to ask again: has anyone faced a similar situation, and is there any way to get closer to optimal performance using PostgreSQL's functionality and extensibility?

Chavdar Kopoev

-----Original Message-----
From: Craig Ringer [mailto:craig@postnewspapers.com.au]
Sent: 2008-11-26, 19:40:47
To: Chavdar Kopoev [mailto:chavdar@kopoev.com]
Subject: Re: [PERFORM] many to many performance

Chavdar Kopoev wrote:

> I want to use postgresql as a data storage. I tried several data structures, testing
> btree, gin, and gist indices over them, but the best performance achieved for a dataset
> 10 times smaller (10k cat ids, 100k doc ids, 1m relations) is still more than 5 times
> too slow.

Can you post your queries and table definitions so people trying to help
you know what you did / tried to do? A downloadable self-contained
example might also be useful.

Please also post the output of `EXPLAIN' on your queries, eg:

EXPLAIN SELECT blah, ... FROM blah;

> I read about postgresql bitmap indices and "data lookup" when scanning indices to
> get a value for the current transaction. I downloaded the latest v8.4 snapshot and
> compiled it, but as far as I can see there is no bitmap index in it. Maybe if I
> download the HEAD revision I will find them there, I don't know.

Bitmap index scans are an internal mechanism used to combine two
indexes on the fly during a query (or at least to use both of them in one
table scan). You don't create a bitmap index; you just create two ordinary
btree indexes and let Pg take care of this for you.
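
A minimal sketch of what that looks like (the table and index names here are made up for illustration):

```sql
-- Two ordinary single-column btree indexes:
create table t (a integer, b integer);
create index t_a on t using btree (a);
create index t_b on t using btree (b);

-- Given a filter on both columns, the planner can build an in-memory
-- bitmap from each index and AND them, producing a plan shaped like:
--   Bitmap Heap Scan on t
--     ->  BitmapAnd
--           ->  Bitmap Index Scan on t_a
--           ->  Bitmap Index Scan on t_b
explain select * from t where a = 1 and b = 2;
```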

If you query on the columns of interest a lot, you might want to use a
multi-column index instead.
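
For a doc-to-category mapping like the one discussed in this thread, that could be sketched as follows (assuming a doc_to_cat(doc_id, cat_id) table):

```sql
-- One composite btree index: the join can probe on doc_id, and the
-- matching cat_id is available from the same index entry.
create index doc_to_cat__doc_id_cat_id on doc_to_cat using btree (doc_id, cat_id);
```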

--
Craig Ringer

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


