[prev in list] [next in list] [prev in thread] [next in thread]
List: postgresql-general
Subject: Many Backends stuck in wait event IPC/ParallelFinish
From: Steven Winfield <Steven.Winfield () cantabcapital ! com>
Date: 2018-01-30 17:01:30
Message-ID: E9FA92C2921F31408041863B74EE4C2001A479E590 () CCPMAILDAG03 ! cantab ! local
[Download RAW message or body]
Hi,
We just had an incident on one of our non-production databases where 14 unr=
elated queries were all hung in wait event IPC / ParallelFinish. We had sys=
tematically called pg_cancel/terminate_backend on all other backends except=
these (and the autovacuum process mentioned below) to make sure there wasn=
't some other resource that they were deadlocked on.
We attached gdb to a number of the backends, and found their backtraces to =
look like this:
#0 0x00007f9ea3e77903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x000000000077cb5e in WaitEventSetWait ()
#2 0x000000000077d149 in WaitLatch ()
#3 0x00000000004f1d75 in WaitForParallelWorkersToFinish ()
#4 0x00000000006294e7 in ExecParallelFinish ()
#5 0x000000000063a57d in ExecShutdownGather ()
...
#6 0x0000000000629978 in ExecShutdownNode () <-- Then zero or more o=
f
#7 0x0000000000676c01 in planstate_tree_walker () <-- this pair
...
#10 0x0000000000629925 in ExecShutdownNode ()
#11 0x000000000062494e in standard_ExecutorRun ()
#12 0x00007f9e99d73f5d in pgss_ExecutorRun () from /remote/install/sw/exter=
nal/20180117-4-64/lib/pg_stat_statements.so
#13 0x00000000007a5c24 in PortalRunSelect ()
#14 0x00000000007a7316 in PortalRun ()
#15 0x00000000007a2b49 in exec_simple_query ()
#16 0x00000000007a4157 in PostgresMain ()
#17 0x000000000047926f in ServerLoop ()
#18 0x00000000007200cc in PostmasterMain ()
#19 0x000000000047af97 in main ()
We also sent one of the backends a SIGABRT, so we have a core dump to play =
with. The only other backend running at the time was an autovacuum process,=
which may also have been hung - it didn't have a wait event in pg_stat_act=
ivity, but I didn't get a chance to strace it or attach gdb as the database=
restarted itself after we sent the SIGABRT.
The host is running Postgres v10.1 on RHEL7.4.
Any ideas what could have caused this, or what we could do to investigate t=
his further?
Thanks,
Steve.
[Attachment #3 (text/html)]
<html xmlns:v="urn:schemas-microsoft-com:vml" \
xmlns:o="urn:schemas-microsoft-com:office:office" \
xmlns:w="urn:schemas-microsoft-com:office:word" \
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" \
xmlns="http://www.w3.org/TR/REC-html40"> <head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"Lucida Console";
panose-1:2 11 6 9 4 5 4 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri","sans-serif";
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">We just had an incident on one of our non-production databases \
where 14 unrelated queries were all hung in wait event IPC / ParallelFinish. We had \
systematically called pg_cancel/terminate_backend on all other backends except these \
(and the autovacuum process mentioned below) to make sure there wasn’t some \
other resource that they were deadlocked on.<o:p></o:p></p> <p \
class="MsoNormal"><o:p> </o:p></p> <p class="MsoNormal">We attached gdb to a \
number of the backends, and found their backtraces to look like this:<o:p></o:p></p> \
<p class="MsoNormal"><o:p> </o:p></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#0 0x00007f9ea3e77903 in \
__epoll_wait_nocancel () from /lib64/libc.so.6<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:"Lucida Console"">#1 \
0x000000000077cb5e in WaitEventSetWait ()<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:"Lucida Console"">#2 \
0x000000000077d149 in WaitLatch ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#3 0x00000000004f1d75 in \
WaitForParallelWorkersToFinish ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#4 0x00000000006294e7 in \
ExecParallelFinish ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#5 0x000000000063a57d in \
ExecShutdownGather ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida \
Console"">… &nbs \
p; \
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Lucida Console"">#6 \
0x0000000000629978 in ExecShutdownNode () <-- \
Then zero or more of<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#7 0x0000000000676c01 in \
planstate_tree_walker () <-- this pair<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:"Lucida \
Console"">…<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#10 0x0000000000629925 in \
ExecShutdownNode ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#11 0x000000000062494e in \
standard_ExecutorRun ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#12 0x00007f9e99d73f5d in \
pgss_ExecutorRun () from \
/remote/install/sw/external/20180117-4-64/lib/pg_stat_statements.so<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:"Lucida Console"">#13 \
0x00000000007a5c24 in PortalRunSelect ()<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:"Lucida Console"">#14 \
0x00000000007a7316 in PortalRun ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#15 0x00000000007a2b49 in \
exec_simple_query ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console"">#16 0x00000000007a4157 in PostgresMain \
()<o:p></o:p></span></p> <p class="MsoNormal"><span style="font-family:"Lucida \
Console"">#17 0x000000000047926f in ServerLoop ()<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:"Lucida Console"">#18 \
0x00000000007200cc in PostmasterMain ()<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:"Lucida Console"">#19 \
0x000000000047af97 in main ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:"Lucida Console""><o:p> </o:p></span></p> <p \
class="MsoNormal">We also sent one of the backends a SIGABRT, so we have a core dump \
to play with. The only other backend running at the time was an autovacuum process, \
which may also have been hung - it didn’t have a wait event in \
pg_stat_activity, but I didn’t get a chance to strace it or attach gdb as the \
database restarted itself after we sent the SIGABRT.<o:p></o:p></p> <p \
class="MsoNormal"><o:p> </o:p></p> <p class="MsoNormal">The host is running \
Postgres v10.1 on RHEL7.4.<o:p></o:p></p> <p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Any ideas what could have caused this, or what we could do to \
investigate this further?<o:p></o:p></p> <p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal">Steve.<o:p></o:p></p>
</div>
</body>
</html>
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic