
List:       pgsql-performance
Subject:    Re: [PERFORM] Insert performance for large transaction with multiple COPY FROM
From:       Horst Dehmer <horst.dehmer@gmail.com>
Date:       2013-01-18 23:15:15
Message-ID: 273CF8C2-159B-44FF-B382-61A76541F435@gmail.com

Hey Jeff (and others)!

First of all: Thanks for your detailed explanations and guidelines.


On 17.01.2013, at 18:12, Jeff Janes <jeff.janes@gmail.com> wrote:

> So the theory is that the presence of idx_4 is causing the trigger to
> pick a poor plan (i.e. one using idx_4) while its absence removes that
> temptation?

Yes. And auto_explain confirms this for the first record (obj_item_loc_obj_item_id_idx = idx_4 from my last mail):

2013-01-18 22:50:21 CET LOG:  duration: 0.021 ms  plan:
	Query Text: SELECT * FROM obj_item_loc WHERE obj_item_loc.obj_item_id = NEW.obj_item_id AND obj_item_loc.loc_id = NEW.loc_id AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix
	Index Scan using obj_item_loc_obj_item_id_idx on obj_item_loc  (cost=0.00..8.27 rows=1 width=382)
	  Index Cond: (obj_item_id = $15)
	  Filter: ((loc_id = $16) AND (obj_item_loc_ix = $17))
2013-01-18 22:50:21 CET CONTEXT:  SQL statement "SELECT * FROM obj_item_loc WHERE obj_item_loc.obj_item_id = NEW.obj_item_id AND obj_item_loc.loc_id = NEW.loc_id AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix"
	PL/pgSQL function obj_item_loc_before_insert() line 5 at SQL statement
	COPY obj_item_loc, line 1: "10903011224100007276	10903010224100015110	10900024100000029720	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N	\..."


and for one of the last records:

2013-01-18 22:53:20 CET LOG:  duration: 16.088 ms  plan:
	Query Text: SELECT * FROM obj_item_loc WHERE obj_item_loc.obj_item_id = NEW.obj_item_id AND obj_item_loc.loc_id = NEW.loc_id AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix
	Index Scan using obj_item_loc_obj_item_id_idx on obj_item_loc  (cost=0.00..8.27 rows=1 width=382)
	  Index Cond: (obj_item_id = $15)
	  Filter: ((loc_id = $16) AND (obj_item_loc_ix = $17))

I see a linear increase in duration from 0.0x ms to over 16 ms (apart from a few nasty outliers of about 22 ms). Even at the end there are still a few durations below 0.03 ms, but most are 15 ms and above.
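
For reference, the trigger does roughly the following. The SELECT is the statement from the CONTEXT lines above; the surrounding PL/pgSQL is a simplified sketch, and the duplicate check merely stands in for the actual per-row logic:

    CREATE OR REPLACE FUNCTION obj_item_loc_before_insert() RETURNS trigger AS $$
    DECLARE
        existing obj_item_loc%ROWTYPE;
    BEGIN
        -- The lookup captured by auto_explain above:
        SELECT * INTO existing
          FROM obj_item_loc
         WHERE obj_item_loc.obj_item_id = NEW.obj_item_id
           AND obj_item_loc.loc_id = NEW.loc_id
           AND obj_item_loc.obj_item_loc_ix = NEW.obj_item_loc_ix;
        IF FOUND THEN
            -- placeholder: the real trigger does something else here
            RAISE EXCEPTION 'obj_item_loc row already exists';
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;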

> True disk reads are much more expensive, but given how few reads you
> have relative to hits, I now think that in aggregate the hits are more
> of a concern than the reads are.  In other words, you seem to be CPU
> bound, not IO bound.

Yes, definitely CPU bound, as top shows 99+% CPU utilization.
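
(In case anyone wants to check the same thing: the per-table hit/read counters can be read from pg_statio_user_tables, e.g.:

    SELECT relname,
           heap_blks_read, heap_blks_hit,  -- table blocks from disk/OS vs. shared buffers
           idx_blks_read, idx_blks_hit     -- index blocks from disk/OS vs. shared buffers
      FROM pg_statio_user_tables
     WHERE relname = 'obj_item_loc';
)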

> Even more so I think (but not with much confidence) that most of your
> "reads" are actually coming from the OS cache and not from the disk.
> PG cannot distinguish true disk reads from OS cache reads.
> 
> When was the last time you reset the stats?  That is, are your
> reported numbers accumulated over several loads, with some having idx4
> and some not?

I set up a fresh database before each test run. So the stats should be clean.
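
(If rebuilding the database were too expensive, the counters could also be cleared in place; pg_stat_reset() resets the statistics counters for the current database:

    SELECT pg_stat_reset();
)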

> More formally, use auto_explain and set
> auto_explain.log_nested_statements to true.  I haven't verified this
> works with triggers, just going by the description I think it should.

Nice tip! Works for triggers as well.
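
For the archives, enabling it for a single session looks roughly like this (log_min_duration = 0 logs the plan of every statement; pick a higher threshold for busier systems):

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = 0;
    SET auto_explain.log_nested_statements = true;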

> Your use case is a little unusual.  If you are bulk loading into an
> initially empty table, usually you would remove the trigger and add it
> after the load (with some kind of bulk operation to make up for
> whatever it was the trigger would have been doing).  On the other
> hand, if you are bulk loading into a "live" table and so can't drop
> the trigger, then the live table should have good-enough preexisting
> statistics to make the trigger choose a good plan.

My case is indeed unusual in that, across the whole model of 276 tables, there will never be an update or a delete on any row. The model is rather short-lived, from a few hours to a few months. COPY FROM/TO are the only ways to get data into the database and back out, and in between there is lots of graph traversal and calculation of convex hulls. But the lengthy transactions are by far not the common case.
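
If I ever try the remove-the-trigger route you suggest, I suppose it would look roughly like this (the trigger name and file path here are made up):

    BEGIN;
    ALTER TABLE obj_item_loc DISABLE TRIGGER obj_item_loc_before_insert_trg;
    COPY obj_item_loc FROM '/path/to/obj_item_loc.dat';
    -- bulk equivalent of the trigger's per-row work would go here
    ALTER TABLE obj_item_loc ENABLE TRIGGER obj_item_loc_before_insert_trg;
    COMMIT;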

Having said that, I'm no longer sure an RDBMS is the right tool for the backend. Maybe indexing and storing with a plain full-text search engine is. Dunno...

Thanks again!

--
Horst

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

