
List:       postgresql-general
Subject:    Re: is single row update improved with function
From:       David Rowley <david.rowley@2ndquadrant.com>
Date:       2017-12-31 23:31:10
Message-ID: CAKJS1f8ArajJ1fEWB2+YpxEM2D=SqCCS1Jzx=n99DxCkRYbw1w@mail.gmail.com

On 1 January 2018 at 12:06, Rob Sargent <robjsargent@gmail.com> wrote:
> I must update 3M of 100M records, with tuple specific modifications.  I can
> generate the necessary sql, but I'm wondering if files of simple update statements
> affecting a single row is more effective than files of a function call doing the
> same update given the necessary values, including where clause restrictions?  The
> query plan set by the first should be decent for the remainder.
> Alternatively, would a bulk load into a table of replacement values and join info
> be the fastest way?

It's probably also worth thinking about this table's usage pattern. If
the table is otherwise static, then you may want to consider the small
amount of bloat that doing the UPDATEs in a single transaction would
cause.
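If you want to see how much churn the UPDATEs actually caused, the
statistics views give a rough idea afterwards. A minimal sketch,
assuming a hypothetical table named "segments":

SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname = 'segments';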

> Either way I can break the updates into roughly 393 transactions (7500 rows
> affected per tx) or 8646 transactions (350 rows per tx) if less is more in this
> world.

If you were to perform the UPDATEs in batches, that would allow you to
run a VACUUM between the batches. However, it might not be that
important: 3 million rows out of 100 million is just 3%, so assuming
all your rows are the same size, even doing this as a single
transaction would only cause 3% churn on the table. Possibly some of
the UPDATEs would reuse existing free space within the table, but even
if they don't, it would only mean an extra 3% bloat.
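A rough sketch of what one batch might look like, assuming a made-up
table "segments" keyed by a bigint "id" column and an illustrative
column to change (your real statements will set tuple-specific values):

BEGIN;
UPDATE segments
SET    probability = 0.5          -- whatever the per-row change is
WHERE  id BETWEEN 1 AND 7500;
COMMIT;

VACUUM segments;   -- VACUUM can't run inside the transaction, so do it between batches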

As for which is faster: it's most likely going to depend on the query
plan for the UPDATE statements. If you need to perform 3 million seq
scans on the table, by doing 3 million individual statements, that's
not going to perform well. 3 million statements is unlikely to be a
good option in any case, as it means parsing and planning 3 million
UPDATE statements. Even your 393 statements might not be very good if
each of those UPDATEs must perform a seq scan on the 100 million row
table, but if each of those 393 statements can make use of an index to
easily find its 7500 rows, then that might be a better option than
the single UPDATE join method you mentioned.
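If you do try the bulk load plus join method, a minimal sketch might
look like the following, again with made-up table and column names, and
\copy being psql's client-side COPY:

CREATE TEMP TABLE replacements (id bigint PRIMARY KEY, new_value float8);
\copy replacements FROM 'replacements.csv' WITH (FORMAT csv)
ANALYZE replacements;

UPDATE segments s
SET    probability = r.new_value
FROM   replacements r
WHERE  s.id = r.id;

That's a single statement to plan, and the planner can pick an index
lookup or a hash join over the 3 million replacement rows rather than
repeating work per row.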

It does sound like something you could take offline and benchmark if
performance is that critical. It's not really possible for us to tell
which is faster without seeing the schema, UPDATE statements and query
plans chosen.
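For the benchmarking, EXPLAIN will show you which plan each approach
gets, e.g. for a hypothetical single-row statement:

EXPLAIN (ANALYZE, BUFFERS)
UPDATE segments SET probability = 0.5 WHERE id = 12345;

Best run against a copy of the data, since EXPLAIN ANALYZE actually
executes the UPDATE.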

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

