'Re: terminating walsender process due to replication timeout'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgresql-general
Subject:    Re: terminating walsender process due to replication timeout
From:       AYahorau () ibagroup ! eu
Date:       2019-05-30 9:07:35
Message-ID: OF021E7D57.6028B6B4-ON4325840A.002EA348-4325840A.0032221F () iba ! by
[Download RAW message or body]

This is a multipart message in MIME format.
--=_alternative 0032221E4325840A_=
Content-Type: text/plain; charset="US-ASCII"

Hello,

I took a look in postgresql source code. As far as I understood walsender 
can send some data to walreceiver regarding some changes and so-called 
keepalive messages.
Exactly these keepalive messages walsender sends periodically once per  
wal_sender_timeout seconds (once per 1 second in my case) and expects to 
get  responses from the walreceiver that everything goes ok. 

I switched on trace and repeated my test.  I found out that walreceiver 
starts processing of the keepalive message in 3 seconds after its sending.
As far as I understand it happened because walreceiver was busy by 
accepting all the changes from the master and writing it to its standby 
WAL logs.
So the standby postgres log was overloaded overfilled by the entries 
regarding writing its own WALs:
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 
latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9C0 for 
Heap/LOCK: off 53: xid 559: flags 0 LOCK_ONLY EXCL_LOCK
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 
latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9F8 for 
Heap/UPDATE: off 53 xmax 559 ; new off 129 xmax 0
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 
latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA48 for 
Heap/LOCK: off 54: xid 559: flags 0 LOCK_ONLY EXCL_LOCK
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 
latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA80 for 
Heap/UPDATE: off 54 xmax 559 ; new off 130 xmax 0
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 
latestObservedXid 559

So  because of writing large volume of data it was not able to handle the 
next messages quickly. It seems not to be related to network bandwidth or 
CPU saturation.
 
Thereby, I  see some kind of a contradiction with the official description 
of wal_sender_timeout  parameter:
Terminate replication connections that are inactive longer than the 
specified number of milliseconds.
This is useful for the sending server to detect a standby crash or network 
outage.

During my test the connection between master and standby  was active and 
there was no any network outage. So according to the description there was 
no need  to terminate 
replication connection.

So, I have some questions:
Is there any way (e. g. via configuration of other) to make the response 
time to keepalive messages independent of the amount of data that the 
walreceiver has to process?
If there is no such a way is it possible to update wal_sender_timeout 
documentation so it reflects reality?

Best regards, 
Andrei



From:   Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
To:     AYahorau@ibagroup.eu, 
Cc:     pgsql-general@postgresql.org, rene.romero.b@gmail.com
Date:   24/05/2019 09:42
Subject:        Re: terminating walsender process due to replication 
timeout



Hello.

At Fri, 17 May 2019 11:04:58 +0300, AYahorau@ibagroup.eu wrote in 
<OFE11EC62A.504EB2B3-ON432583FD.002BE231-432583FD.002C666E@iba.by>
> Can frequent database operations cause getting a standby server behind? 
Is 
> there a way to avoid this situation?
> I checked that walsender works well in my test  if I set 
> wal_sender_timeout at least to 5 second.

It depends on the transaction (WAL) traffic and the bandwidth of
your network. The performacne difference between master and
standby also affects.

The possibilities I can guess are:

- The bandwidth is narrow to the traffic.

- The standby performs poorer than the master.

- Your network is having a sort of trouble. Virtual network
  (local network in a virtual environment) tends to suffer
  network troubles caused by CPU saturation or something else.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



--=_alternative 0032221E4325840A_=
Content-Type: text/html; charset="US-ASCII"

<font size=2 face="sans-serif">Hello,<br>
<br>
I took a look in postgresql source code. As far as I understood walsender
can send some data to walreceiver regarding some changes and so-called
keepalive messages.<br>
Exactly these keepalive messages walsender sends periodically once per
</font><font size=2 color=#2f2f2f face="Arial">&nbsp;</font><tt><font size=2 \
color=#2f2f2f>wal_sender_timeout</font></tt><font size=2 color=#2f2f2f face="Arial"> \
seconds (once per 1 second in my case)</font><font size=2> </font><font size=2 \
color=#2f2f2f face="Arial">and expects to get&nbsp; responses from \
the&nbsp;<i>walreceiver</i>&nbsp;that everything goes ok.</font><font size=2> \
</font><font size=2 face="sans-serif"><br> <br>
I switched on trace and repeated my test. &nbsp;I found out that walreceiver
starts processing of the keepalive message in 3 seconds after its sending.<br>
As far as I understand it happened because </font><font size=2 color=#2f2f2f \
face="Arial"><i>walreceiver</i>&nbsp;was busy by accepting all the changes from the \
master and writing it to its standby WAL logs.</font><font size=2><br>
</font><font size=2 color=#2f2f2f face="Arial">So the standby postgres
log was overloaded overfilled by the entries regarding writing its own
WALs:</font><font size=2><br>
</font><tt><font size=2 color=#2f2f2f>18642 2019-05-28 05:08:31 EDT 00000
DEBUG: record known xact 559 latestObservedXid 559<br>
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9C0 \
</font></tt><tt><font size=2 color=#a1009f>for</font></tt><tt><font size=2 \
                color=#2f2f2f>
Heap/LOCK: off 53: xid 559: flags 0 LOCK_ONLY EXCL_LOCK<br>
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid
559<br>
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9F8 \
</font></tt><tt><font size=2 color=#a1009f>for</font></tt><tt><font size=2 \
                color=#2f2f2f>
Heap/UPDATE: off 53 xmax 559 ; </font></tt><tt><font size=2 \
color=#a1009f>new</font></tt><tt><font size=2 color=#2f2f2f> off 129 xmax 0<br>
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid
559<br>
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA48 \
</font></tt><tt><font size=2 color=#a1009f>for</font></tt><tt><font size=2 \
                color=#2f2f2f>
Heap/LOCK: off 54: xid 559: flags 0 LOCK_ONLY EXCL_LOCK<br>
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid
559<br>
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA80 \
</font></tt><tt><font size=2 color=#a1009f>for</font></tt><tt><font size=2 \
                color=#2f2f2f>
Heap/UPDATE: off 54 xmax 559 ; </font></tt><tt><font size=2 \
color=#a1009f>new</font></tt><tt><font size=2 color=#2f2f2f> off 130 xmax 0<br>
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid
559</font></tt>
<br><font size=2 face="sans-serif"><br>
</font><font size=2 color=#2f2f2f face="Arial">So&nbsp; because of writing
large volume of data it was not able to handle the next messages quickly.
It seems not to be related to network bandwidth or CPU saturation.<br>
</font><font size=2> </font><font size=2 face="sans-serif"><br>
Thereby, I &nbsp;see some kind of a contradiction with the official description
of <i>wal_sender_timeout </i>&nbsp;parameter:<br>
</font><tt><font size=2 color=#2f2f2f>Terminate replication connections
that are inactive longer than the specified number of milliseconds</font></tt><font \
size=2 face="sans-serif">.<br> </font><tt><font size=2 color=#2f2f2f>This is useful \
</font></tt><tt><font size=2 color=#a1009f>for</font></tt><tt><font size=2 \
color=#2f2f2f> the sending server to detect a standby crash or network \
outage.</font></tt><font size=2 face="sans-serif"><br> <br>
During my test the connection between master and standby &nbsp;was active
and there was no any network outage. So according to the description there
was no need &nbsp;to terminate <br>
replication connection.<br>
<br>
So, I have some questions:<br>
Is there any way (e. g. via configuration of other) to make the response
time to keepalive messages independent of the amount of data that the walreceiver
has to process?<br>
If there is no such a way is it possible to update </font><font size=2 color=#2f2f2f \
face="Arial">wal_sender_timeout documentation so it reflects reality?<br>
</font>
<br><font size=2 face="sans-serif">Best regards, </font>
<br><font size=2 face="sans-serif">Andrei</font>
<br>
<br>
<br>
<br><font size=1 color=#5f5f5f face="sans-serif">From: &nbsp; &nbsp; &nbsp;
&nbsp;</font><font size=1 face="sans-serif">Kyotaro HORIGUCHI \
&lt;horiguchi.kyotaro@lab.ntt.co.jp&gt;</font> <br><font size=1 color=#5f5f5f \
face="sans-serif">To: &nbsp; &nbsp; &nbsp; &nbsp;</font><font size=1 \
face="sans-serif">AYahorau@ibagroup.eu, </font>
<br><font size=1 color=#5f5f5f face="sans-serif">Cc: &nbsp; &nbsp; &nbsp;
&nbsp;</font><font size=1 face="sans-serif">pgsql-general@postgresql.org,
rene.romero.b@gmail.com</font>
<br><font size=1 color=#5f5f5f face="sans-serif">Date: &nbsp; &nbsp; &nbsp;
&nbsp;</font><font size=1 face="sans-serif">24/05/2019 09:42</font>
<br><font size=1 color=#5f5f5f face="sans-serif">Subject: &nbsp; &nbsp;
&nbsp; &nbsp;</font><font size=1 face="sans-serif">Re: terminating
walsender process due to replication timeout</font>
<br>
<hr noshade>
<br>
<br>
<br><tt><font size=2>Hello.<br>
<br>
At Fri, 17 May 2019 11:04:58 +0300, AYahorau@ibagroup.eu wrote in \
&lt;OFE11EC62A.504EB2B3-ON432583FD.002BE231-432583FD.002C666E@iba.by&gt;<br> &gt; Can \
frequent database operations cause getting a standby server behind? Is <br>
&gt; there a way to avoid this situation?<br>
&gt; I checked that walsender works well in my test &nbsp;if I set <br>
&gt; wal_sender_timeout at least to 5 second.<br>
<br>
It depends on the transaction (WAL) traffic and the bandwidth of<br>
your network. The performacne difference between master and<br>
standby also affects.<br>
<br>
The possibilities I can guess are:<br>
<br>
- The bandwidth is narrow to the traffic.<br>
<br>
- The standby performs poorer than the master.<br>
<br>
- Your network is having a sort of trouble. Virtual network<br>
 &nbsp;(local network in a virtual environment) tends to suffer<br>
 &nbsp;network troubles caused by CPU saturation or something else.<br>
<br>
regards.<br>
<br>
-- <br>
Kyotaro Horiguchi<br>
NTT Open Source Software Center<br>
<br>
</font></tt>
<br>
--=_alternative 0032221E4325840A_=--


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic