'[GENERAL] Error stopping postgresql service on a standby server.'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgresql-general
Subject:    [GENERAL] Error stopping postgresql service on a standby server.
From:       Dipti Bharvirkar <dbharvirkar () gmail ! com>
Date:       2012-08-31 7:59:38
Message-ID: CAL097-FMKWhBdMPWVgTmZBJYOVR_iUWx-sx_84oJGsF3csBe5g () mail ! gmail ! com
[Download RAW message or body]

Hi,

In our project, we use Postgres 9.1.3 version and asynchronous streaming
replication.
In recent times, on couple of our setups, we saw issues stopping Postgres
service on the standby server after streaming replication was setup.

The command "service postgresql stop" returned with a failure message. We
use "pg_ctl stop -D '$PGDATA' -s -m fast" in the Postgres service script to
stop the server.
To see if there were some active client connections that were causing a
failure in stopping Postgres service, I ran the query "SELECT * FROM
pg_stat_activity;".
It failed with the following error: psql: FATAL:  the database system is
shutting down

"ps -ef | grep postgres" returned the following:
postgres 14033     1  0 Aug28 ?        00:00:01
/usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 14044 14033  0 Aug28 ?        00:00:00 postgres: logger process
postgres 14046 14033  0 Aug28 ?        00:00:00 postgres: writer process
postgres 14047 14033  0 Aug28 ?        00:00:00 postgres: stats collector
process
postgres 14912 14033  0 Aug28 ?        00:00:00 postgres: wal receiver
process
root     31519  3003  0 06:18 pts/2    00:00:00 grep postgres

"netstat -anp | grep 5432" returns the following:
tcp        0      0 0.0.0.0:5432                0.0.0.0:*
LISTEN      14033/postmaster
tcp        0      0 127.0.0.1:5432              127.0.0.1:60597
TIME_WAIT   -
tcp        0      0 127.0.0.1:5432              127.0.0.1:60589
TIME_WAIT   -
tcp    67288      0 1.1.1.1:61500 <http://47.11.49.176:61500>          2
.2.2.2:5432 <http://47.11.49.190:5432>           ESTABLISHED
14912/postgres: wal

I had a few queries based on some of the observations -

   1. On one of the setups where similar issue was observed, we stopped
   Postgres service on the master server. As a result of this, the sender
   process on the master server and consequently the receiver process on
   standby stopped. After this, Postgres service could successfully be stopped
   on the standby server. This fact coupled with the output of the two
   commands mentioned above makes me believe that it is the "wal receiver"
   process that is not getting terminated because of which the Postgres
   service on standby server does not stop. Is this assumption right?
   2. If yes, what could be the possible cause for the receiver process to
   not terminate? Shouldn't it stop gracefully when a shutdown command is
   received? When the issue occurred, we had minimal activity on the master
   server. There were no long running transactions being committed to the
   master and streamed to the standby when the issue occurred. Even if there
   were, could it cause the receiver process to not terminate?
   3. How can we avoid running into this issue? Could we be missing some
   step that is essential for a graceful shutdown of the service on a standby?
   4. On one setup where the issue was seen, since "-m fast" option with
   "pg_ctl stop" did not help in stopping the service, I used the "-m
   immediate" option. The service stopped (I understand that this option
   aborts the processes without a clean shutdown and so is not a safe option).
   The service would not start back up. We saw the "invalid record length"
   error during the startup (I guess this was expected since it wasn't a clean
   shutdown). A pg_resetxlog helped recover from this issue. However, that
   seems risky too since there is a chance of data inconsistency. What is the
   best way to recover from this error if it occurs again?

Thanks,
Dipti

[Attachment #3 (text/html)]

Hi,<div><br></div><div>In our project, we use Postgres 9.1.3 version and asynchronous \
streaming replication.</div><div>In recent times, on couple of our setups, we saw \
issues stopping Postgres service on the standby server after streaming replication \
was setup.</div>

<div><br></div><div>The command &quot;service postgresql stop&quot; returned with a \
failure message. We use &quot;pg_ctl stop -D &#39;$PGDATA&#39; -s -m fast&quot; in \
the Postgres service script to stop the server.</div> <div>To see if there were some \
active client connections that were causing a failure in stopping Postgres service, I \
ran the query &quot;SELECT * FROM pg_stat_activity;&quot;.</div> <div>It failed with \
the following error: psql: FATAL:  the database system is shutting \
down</div><div><br></div><div>&quot;ps -ef | grep postgres&quot; returned the \
following:</div><div>postgres 14033     1  0 Aug28 ?        00:00:01 \
/usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/data</div>

<div>postgres <a href="tel:14044%2014033%20%C2%A00" value="+14044140330" \
target="_blank">14044 14033  0</a> Aug28 ?        00:00:00 postgres: logger \
process</div><div>postgres <a href="tel:14046%2014033%20%C2%A00" value="+14046140330" \
target="_blank">14046 14033  0</a> Aug28 ?        00:00:00 postgres: writer \
process</div> <div>postgres <a href="tel:14047%2014033%20%C2%A00" \
value="+14047140330" target="_blank">14047 14033  0</a> Aug28 ?        00:00:00 \
postgres: stats collector process</div> <div>postgres 14912 14033  0 Aug28 ?        \
00:00:00 postgres: wal receiver process</div><div>root     31519  3003  0 06:18 pts/2 \
00:00:00 grep postgres</div><div><br></div><div>&quot;netstat -anp | grep 5432&quot; \
returns the following: </div>

<div>tcp        0      0 <a href="http://0.0.0.0:5432" \
target="_blank">0.0.0.0:5432</a>                0.0.0.0:*                   LISTEN    \
14033/postmaster</div><div>tcp        0      0 <a href="http://127.0.0.1:5432" \
target="_blank">127.0.0.1:5432</a>              <a href="http://127.0.0.1:60597" \
target="_blank">127.0.0.1:60597</a>             TIME_WAIT   -</div>

<div>tcp        0      0 <a href="http://127.0.0.1:5432" \
target="_blank">127.0.0.1:5432</a>              <a href="http://127.0.0.1:60589" \
target="_blank">127.0.0.1:60589</a>             TIME_WAIT   -</div><div>tcp    67288  \
0 1<a href="http://47.11.49.176:61500" target="_blank">.1.1.1:61500</a>          2<a \
href="http://47.11.49.190:5432" target="_blank">.2.2.2:5432</a>           ESTABLISHED \
14912/postgres: wal   </div>

<div><br></div><div>I had a few queries based on some of the observations \
-</div><div><ol><li>On one of the setups where similar issue was observed, we stopped \
Postgres service on the master server. As a result of this, the sender process on the \
master server and consequently the receiver process on standby stopped. After this, \
Postgres service could successfully be stopped on the standby server. This fact \
coupled with the output of the two commands mentioned above makes me believe that it \
is the &quot;wal receiver&quot; process that is not getting terminated because of \
which the Postgres service on standby server does not stop. Is this assumption \
right?</li>

<li>If yes, what could be the possible cause for the receiver process to not \
terminate? Shouldn&#39;t it stop gracefully when a shutdown command is received? When \
the issue occurred, we had minimal activity on the master server. There were no long \
running transactions being committed to the master and streamed to the standby when \
the issue occurred. Even if there were, could it cause the receiver process to not \
terminate?</li>

<li>How can we avoid running into this issue? Could we be missing some step that is \
essential for a graceful shutdown of the service on a standby?</li><li>On one setup \
where the issue was seen, since &quot;-m fast&quot; option with &quot;pg_ctl \
stop&quot; did not help in stopping the service, I used the &quot;-m immediate&quot; \
option. The service stopped (I understand that this option aborts the processes \
without a clean shutdown and so is not a safe option). The service would not start \
back up. We saw the &quot;invalid record length&quot; error during the startup (I \
guess this was expected since it wasn&#39;t a clean shutdown). A pg_resetxlog helped \
recover from this issue. However, that seems risky too since there is a chance of \
data inconsistency. What is the best way to recover from this error if it occurs \
again?</li>

</ol>Thanks,</div><div>Dipti</div>



[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic