[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgresql-general
Subject:    Many Backends stuck in wait event IPC/ParallelFinish
From:       Steven Winfield <Steven.Winfield () cantabcapital ! com>
Date:       2018-01-30 17:01:30
Message-ID: E9FA92C2921F31408041863B74EE4C2001A479E590 () CCPMAILDAG03 ! cantab ! local
[Download RAW message or body]

Hi,

We just had an incident on one of our non-production databases where 14 unr=
elated queries were all hung in wait event IPC / ParallelFinish. We had sys=
tematically called pg_cancel/terminate_backend on all other backends except=
 these (and the autovacuum process mentioned below) to make sure there wasn=
't some other resource that they were deadlocked on.

We attached gdb to a number of the backends, and found their backtraces to =
look like this:

#0  0x00007f9ea3e77903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x000000000077cb5e in WaitEventSetWait ()
#2  0x000000000077d149 in WaitLatch ()
#3  0x00000000004f1d75 in WaitForParallelWorkersToFinish ()
#4  0x00000000006294e7 in ExecParallelFinish ()
#5  0x000000000063a57d in ExecShutdownGather ()
...
#6  0x0000000000629978 in ExecShutdownNode ()       <-- Then zero or more o=
f
#7  0x0000000000676c01 in planstate_tree_walker ()  <-- this pair
...
#10 0x0000000000629925 in ExecShutdownNode ()
#11 0x000000000062494e in standard_ExecutorRun ()
#12 0x00007f9e99d73f5d in pgss_ExecutorRun () from /remote/install/sw/exter=
nal/20180117-4-64/lib/pg_stat_statements.so
#13 0x00000000007a5c24 in PortalRunSelect ()
#14 0x00000000007a7316 in PortalRun ()
#15 0x00000000007a2b49 in exec_simple_query ()
#16 0x00000000007a4157 in PostgresMain ()
#17 0x000000000047926f in ServerLoop ()
#18 0x00000000007200cc in PostmasterMain ()
#19 0x000000000047af97 in main ()

We also sent one of the backends a SIGABRT, so we have a core dump to play =
with. The only other backend running at the time was an autovacuum process,=
 which may also have been hung - it didn't have a wait event in pg_stat_act=
ivity, but I didn't get a chance to strace it or attach gdb as the database=
 restarted itself after we sent the SIGABRT.

The host is running Postgres v10.1 on RHEL7.4.

Any ideas what could have caused this, or what we could do to investigate t=
his further?

Thanks,
Steve.

[Attachment #3 (text/html)]

<html xmlns:v="urn:schemas-microsoft-com:vml" \
xmlns:o="urn:schemas-microsoft-com:office:office" \
xmlns:w="urn:schemas-microsoft-com:office:word" \
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" \
xmlns="http://www.w3.org/TR/REC-html40"> <head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
	{font-family:"Lucida Console";
	panose-1:2 11 6 9 4 5 4 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	margin-bottom:.0001pt;
	font-size:11.0pt;
	font-family:"Calibri","sans-serif";
	mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:blue;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:purple;
	text-decoration:underline;}
span.EmailStyle17
	{mso-style-type:personal-compose;
	font-family:"Calibri","sans-serif";
	color:windowtext;}
.MsoChpDefault
	{mso-style-type:export-only;
	font-family:"Calibri","sans-serif";
	mso-fareast-language:EN-US;}
@page WordSection1
	{size:612.0pt 792.0pt;
	margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
	{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">We just had an incident on one of our non-production databases \
where 14 unrelated queries were all hung in wait event IPC / ParallelFinish. We had \
systematically called pg_cancel/terminate_backend on all other backends except these \
(and  the autovacuum process mentioned below) to make sure there wasn&#8217;t some \
other resource that they were deadlocked on.<o:p></o:p></p> <p \
class="MsoNormal"><o:p>&nbsp;</o:p></p> <p class="MsoNormal">We attached gdb to a \
number of the backends, and found their backtraces to look like this:<o:p></o:p></p> \
<p class="MsoNormal"><o:p>&nbsp;</o:p></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#0&nbsp; 0x00007f9ea3e77903 in \
__epoll_wait_nocancel () from /lib64/libc.so.6<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:&quot;Lucida Console&quot;">#1&nbsp; \
0x000000000077cb5e in WaitEventSetWait ()<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:&quot;Lucida Console&quot;">#2&nbsp; \
0x000000000077d149 in WaitLatch ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#3&nbsp; 0x00000000004f1d75 in \
WaitForParallelWorkersToFinish ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#4&nbsp; 0x00000000006294e7 in \
ExecParallelFinish ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#5&nbsp; 0x000000000063a57d in \
ExecShutdownGather ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida \
Console&quot;">&#8230;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs \
p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:&quot;Lucida Console&quot;">#6&nbsp; \
0x0000000000629978 in ExecShutdownNode ()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;-- \
Then zero or more of<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#7&nbsp; 0x0000000000676c01 in \
planstate_tree_walker ()&nbsp; &lt;-- this pair<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:&quot;Lucida \
Console&quot;">&#8230;<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#10 0x0000000000629925 in \
ExecShutdownNode ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#11 0x000000000062494e in \
standard_ExecutorRun ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#12 0x00007f9e99d73f5d in \
pgss_ExecutorRun () from \
/remote/install/sw/external/20180117-4-64/lib/pg_stat_statements.so<o:p></o:p></span></p>
 <p class="MsoNormal"><span style="font-family:&quot;Lucida Console&quot;">#13 \
0x00000000007a5c24 in PortalRunSelect ()<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:&quot;Lucida Console&quot;">#14 \
0x00000000007a7316 in PortalRun ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#15 0x00000000007a2b49 in \
exec_simple_query ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;">#16 0x00000000007a4157 in PostgresMain \
()<o:p></o:p></span></p> <p class="MsoNormal"><span style="font-family:&quot;Lucida \
Console&quot;">#17 0x000000000047926f in ServerLoop ()<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:&quot;Lucida Console&quot;">#18 \
0x00000000007200cc in PostmasterMain ()<o:p></o:p></span></p> <p \
class="MsoNormal"><span style="font-family:&quot;Lucida Console&quot;">#19 \
0x000000000047af97 in main ()<o:p></o:p></span></p> <p class="MsoNormal"><span \
style="font-family:&quot;Lucida Console&quot;"><o:p>&nbsp;</o:p></span></p> <p \
class="MsoNormal">We also sent one of the backends a SIGABRT, so we have a core dump \
to play with. The only other backend running at the time was an autovacuum process, \
which may also have been hung - it didn&#8217;t have a wait event in \
pg_stat_activity, but  I didn&#8217;t get a chance to strace it or attach gdb as the \
database restarted itself after we sent the SIGABRT.<o:p></o:p></p> <p \
class="MsoNormal"><o:p>&nbsp;</o:p></p> <p class="MsoNormal">The host is running \
Postgres v10.1 on RHEL7.4.<o:p></o:p></p> <p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">Any ideas what could have caused this, or what we could do to \
investigate this further?<o:p></o:p></p> <p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal">Steve.<o:p></o:p></p>
</div>
</body>
</html>



[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic