
List:       slony1-general
Subject:    [Slony1-general] Slony 2.2.3 extended lag recovery
From:       Glyn Astill <glynastill () yahoo ! co ! uk>
Date:       2015-01-29 17:01:38
Message-ID: 685898092.2705713.1422550898490.JavaMail.yahoo () mail ! yahoo ! com

Hi All,

We're currently running Slony 2.2.3 with 4 PostgreSQL 9.0 nodes.  Occasionally, since we upgraded from 2.1, I've been seeing "humps" where subscribers lag and take an extended period of time to recover.

I've never been able to reproduce it and I've come to a dead end.  I'm going to waffle a bit below, but I'm hoping someone can spot something I'm missing.

These humps don't really seem to correlate with increased activity on the origin, and I've been struggling to put my finger on anything aggravating the issue.  Today, however, I've seen the same symptoms, and the start of the lag lines up with an exclusive lock taken on a subscriber's replicated table whilst a VACUUM FULL was run.

Whilst I'd expect that to cause some lag and a bit of a backlog, the VACUUM FULL took only 2 minutes, yet the lag builds up gradually afterwards.  Replication does eventually catch up after a long time, but the recovery is out of proportion to our transaction rate, and restarting the slon on the subscriber makes it catch up very swiftly.  I've attached a graph of sl_status from the origin showing the time and event lag building up, with a pretty sharp drop at the end of each hump where I restart the slons.
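
(The graph is just the lag columns from sl_status sampled periodically on the origin; something like the query below is roughly what's behind it, with st_lag_num_events and st_lag_time as the two plotted series.)

-- Per-subscriber event and time lag, as seen from the origin
SELECT st_received, st_lag_num_events, st_lag_time
FROM _main_replication.sl_status
ORDER BY st_received;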

The graph (attached) shows nodes 4 & 5 starting to lag first, as they were the first to have the VACUUM FULL run; node 7 then starts to lag when the same VACUUM FULL is run there (at that point the lag on the other two nodes hadn't yet been noticed).  This excerpt from the postgres log on one of the subscribers shows the COPY being blocked:

2015-01-29 10:09:54 GMT [13246]: [39-1] app=slon.local_cleanup,user=slony,db=X,host=somehost NOTICE:  Slony-I: Logswitch to sl_log_1 initiated
2015-01-29 10:09:54 GMT [13246]: [40-1] app=slon.local_cleanup,user=slony,db=X,host=somehost CONTEXT:  SQL statement "SELECT "_main_replication".logswitch_start()"
2015-01-29 10:12:04 GMT [13243]: [9-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost LOG:  duration: 5089.684 ms  statement: COPY "_main_replication"."sl_log_1" ( log_origin, log_txid,log_tableid,log_actionseq,log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols,log_cmdargs) FROM STDIN
2015-01-29 10:19:05 GMT [13243]: [10-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost LOG:  process 13243 still waiting for RowExclusiveLock on relation 279233 of database 274556 after 1000.038 ms at character 13
2015-01-29 10:19:05 GMT [13243]: [11-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost QUERY:  UPDATE ONLY "myschema"."table_being_full_vacuumed" SET "text" = $1 WHERE "address" = $2;
2015-01-29 10:19:05 GMT [13243]: [12-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost CONTEXT:  COPY sl_log_1, line 37: "8    1108084090    2    1219750937    myschema    table_being_full_vacuumed    U    1    {text,"",address,some_address_data}"
2015-01-29 10:19:05 GMT [13243]: [13-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost STATEMENT:  COPY "_main_replication"."sl_log_1" ( log_origin, log_txid,log_tableid,log_actionseq,log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols,log_cmdargs) FROM STDIN
2015-01-29 10:20:43 GMT [13243]: [14-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost LOG:  process 13243 acquired RowExclusiveLock on relation 279233 of database 274556 after 98754.902 ms at character 13
2015-01-29 10:20:43 GMT [13243]: [15-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost QUERY:  UPDATE ONLY "myschema"."table_being_full_vacuumed" SET "text" = $1 WHERE "address" = $2;
2015-01-29 10:20:43 GMT [13243]: [16-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost CONTEXT:  COPY sl_log_1, line 37: "8    1108084090    2    1219750937    myschema    table_being_full_vacuumed    U    1    {text,"",address,some_address_data}"
2015-01-29 10:20:43 GMT [13243]: [17-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost STATEMENT:  COPY "_main_replication"."sl_log_1" ( log_origin, log_txid,log_tableid,log_actionseq,log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols,log_cmdargs) FROM STDIN
2015-01-29 10:20:43 GMT [13243]: [18-1] app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost LOG:  duration: 98915.154 ms  statement: COPY "_main_replication"."sl_log_1" ( log_origin, log_txid,log_tableid,log_actionseq,log_tablenspname, log_tablerelname, log_cmdtype, log_cmdupdncols,log_cmdargs) FROM STDIN
2015-01-29 10:22:00 GMT [13246]: [41-1] app=slon.local_cleanup,user=slony,db=X,host=somehost NOTICE:  Slony-I: log switch to sl_log_1 complete - truncate sl_log_2
2015-01-29 10:22:00 GMT [13246]: [42-1] app=slon.local_cleanup,user=slony,db=X,host=somehost CONTEXT:  PL/pgSQL function "cleanupevent" line 94 at assignment
2015-01-29 10:34:01 GMT [13246]: [43-1] app=slon.local_cleanup,user=slony,db=X,host=somehost NOTICE:  Slony-I: Logswitch to sl_log_2 initiated
2015-01-29 10:34:01 GMT [13246]: [44-1] app=slon.local_cleanup,user=slony,db=X,host=somehost CONTEXT:  SQL statement "SELECT "_main_replication".logswitch_start()"
2015-01-29 10:46:08 GMT [13246]: [45-1] app=slon.local_cleanup,user=slony,db=X,host=somehost NOTICE:  Slony-I: could not lock sl_log_1 - sl_log_1 not truncated
2015-01-29 10:46:08 GMT [13246]: [46-1] app=slon.local_cleanup,user=slony,db=X,host=somehost CONTEXT:  PL/pgSQL function "cleanupevent" line 94 at assignment

After this the COPY statements go through cycles of increasing and decreasing duration, which I'm guessing is normal (perhaps SYNCs being grouped?).  I'm also seeing the message "could not lock sl_log_1 - sl_log_1 not truncated" a couple of times before the log switch completes; again I'm guessing this is just blocking caused by inserts capturing changes, and is normal?  Autovacuum hasn't touched the sl_log tables at all during this period.
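
(If anyone wants to check the same thing on their own node, the last manual/auto vacuum times for the log tables are visible in pg_stat_all_tables, e.g.:)

-- Last manual/auto vacuum of the Slony log tables on the subscriber
SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_all_tables
WHERE schemaname = '_main_replication'
  AND relname IN ('sl_log_1', 'sl_log_2');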

Does anyone have any ideas?  I have debug logs from the slons, and postgres logs, which I can send off-list if they'd help.

Thanks
Glyn


["lag.gif" (image/gif)]

_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general

