[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgresql-general
Subject:    Re: [HACKERS] LWLock deadlock and gdb advice
From:       Jeff Janes <jeff.janes () gmail ! com>
Date:       2015-07-31 16:48:28
Message-ID: CAMkU=1wS8R7rz67jYEZoc4P_rXp0frbZerdP1gcCbE=w8fpMvw () mail ! gmail ! com
[Download RAW message or body]

On Thu, Jul 30, 2015 at 8:22 PM, Andres Freund <andres@anarazel.de> wrote:

> On 2015-07-30 09:03:01 -0700, Jeff Janes wrote:
> > On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund <andres@anarazel.de>
> wrote:
> > > What do you think about something roughly like the attached?
> > >
> >
> > I've not evaluated the code, but applying it does solve the problem I was
> > seeing.
>
> Cool, thanks for testing! How long did you run the test and how long did
> it, on average, previously take for the issue to occur?
>

I had run it for 24 hours, while it usually took less than 8 hours to look
up before.  I did see it within a few minutes one time when I checked out a
new HEAD and then forgot to re-apply your or Heikki's patch.

But now I've got the same situation again, after 15 hours, with your
patch.  This is probably all down to luck.  The only differences that I can
think of is that I advanced the base to e8e86fbc8b3619da54c, and turned on
JJ_vac and set log_autovacuum_min_duration=0.


checkpointer, 31516
(gdb) bt
#0  0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6
#1  0x00000000006481df in PGSemaphoreLock (sema=0x7f84ccd3c5f8) at
pg_sema.c:387
#2  0x00000000006a5754 in LWLockAcquireCommon (l=0x7f84c42edf80,
valptr=0x7f84c42edfa8, val=18446744073709551615) at lwlock.c:1041
#3  LWLockAcquireWithVar (l=0x7f84c42edf80, valptr=0x7f84c42edfa8,
val=18446744073709551615) at lwlock.c:915
#4  0x00000000004cbd86 in WALInsertLockAcquireExclusive () at xlog.c:1445
#5  0x00000000004d4195 in CreateCheckPoint (flags=64) at xlog.c:8181
#6  0x000000000064ea53 in CheckpointerMain () at checkpointer.c:493
#7  0x00000000004e328a in AuxiliaryProcessMain (argc=2,
argv=0x7fff6eed1f50) at bootstrap.c:428
#8  0x00000000006563f7 in StartChildProcess (type=CheckpointerProcess) at
postmaster.c:5167
#9  0x000000000065884a in reaper (postgres_signal_arg=<value optimized
out>) at postmaster.c:2670

(gdb) f 2
(gdb) p *(lock->owner)
$1 = {links = {prev = 0x0, next = 0x0}, procgloballist = 0x7f84ccd27c88,
sem = {semId = 1480359975, semNum = 12}, waitStatus = 0, procLatch =
{is_set = 1,
    is_shared = 1 '\001', owner_pid = 31540}, lxid = 3209, pid = 31540,
pgprocno = 92, backendId = 12, databaseId = 16384, roleId = 10,
  recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 2
'\002', lwWaitLink = {prev = 0x7f84c42edf10, next = 0x7f84ccd38e60},
  waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0,
waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0, next = 0x0},
  myProcLocks = {{prev = 0x7f84ccd38bb8, next = 0x7f84ccd38bb8}, {prev =
0x7f84ccd38bc8, next = 0x7f84ccd38bc8}, {prev = 0x7f84ccd38bd8,
      next = 0x7f84ccd38bd8}, {prev = 0x7f84ccd38be8, next =
0x7f84ccd38be8}, {prev = 0x7f84ccd38bf8, next = 0x7f84ccd38bf8}, {prev =
0x7f84ccd38c08,
      next = 0x7f84ccd38c08}, {prev = 0x7f84ccd38c18, next =
0x7f84ccd38c18}, {prev = 0x7f84ccd38c28, next = 0x7f84ccd38c28}, {prev =
0x7f84ccd38c38,
      next = 0x7f84ccd38c38}, {prev = 0x7f84ccd38c48, next =
0x7f84ccd38c48}, {prev = 0x7f84ccd38c58, next = 0x7f84ccd38c58}, {prev =
0x7f84ccd38c68,
      next = 0x7f84ccd38c68}, {prev = 0x7f84ccd38c78, next =
0x7f84ccd38c78}, {prev = 0x7f84ccd38c88, next = 0x7f84ccd38c88}, {prev =
0x7f84ccb0ceb0,
      next = 0x7f84ccb0ceb0}, {prev = 0x7f84ccd38ca8, next =
0x7f84ccd38ca8}}, subxids = {xids = {0 <repeats 64 times>}}, backendLock =
0x7f84c42e9800,
  fpLockBits = 160528697655296, fpRelId = {0, 0, 0, 0, 0, 0, 0, 0, 0, 2679,
2610, 16413, 16411, 16415, 16414, 16408}, fpVXIDLock = 1 '\001',
  fpLocalTransactionId = 3209}


backend 31540:

#0  0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6
#1  0x00000000006481df in PGSemaphoreLock (sema=0x7f84ccd38b38) at
pg_sema.c:387
#2  0x00000000006a53d4 in LWLockWaitForVar (lock=0x7f84c42edf00,
valptr=0x7f84c42edf28, oldval=0, newval=0x7fff6eeceaf8) at lwlock.c:1438
#3  0x00000000004d3276 in WaitXLogInsertionsToFinish (upto=183341449216) at
xlog.c:1576
#4  0x00000000004d3b33 in AdvanceXLInsertBuffer (upto=183345636352,
opportunistic=<value optimized out>) at xlog.c:1888
#5  0x00000000004d8df2 in GetXLogBuffer (ptr=183345636352) at xlog.c:1669
#6  0x00000000004d8fea in CopyXLogRecordToWAL (rdata=0xbd2410,
fpw_lsn=<value optimized out>) at xlog.c:1276
#7  XLogInsertRecord (rdata=0xbd2410, fpw_lsn=<value optimized out>) at
xlog.c:1008
#8  0x00000000004dc72c in XLogInsert (rmid=13 '\r', info=32 ' ') at
xloginsert.c:453
#9  0x0000000000478f41 in ginPlaceToPage (btree=0x7fff6eecee20,
stack=0x1be11e0, insertdata=<value optimized out>, updateblkno=<value
optimized out>,
    childbuf=0, buildStats=0x0) at ginbtree.c:418
#10 0x00000000004798a3 in ginInsertValue (btree=0x7fff6eecee20,
stack=0x1be11e0, insertdata=0x7fff6eeceea0, buildStats=0x0) at
ginbtree.c:748
#11 0x00000000004734aa in ginEntryInsert (ginstate=0x7fff6eecf200,
attnum=39408, key=1, category=<value optimized out>, items=0x1c81c20,
nitem=54,
    buildStats=0x0) at gininsert.c:234
#12 0x000000000047edbb in ginInsertCleanup (ginstate=0x7fff6eecf200,
vac_delay=<value optimized out>, stats=0x0) at ginfast.c:843
#13 0x000000000047fae6 in ginHeapTupleFastInsert (ginstate=0x7fff6eecf200,
collector=<value optimized out>) at ginfast.c:436
#14 0x000000000047386a in gininsert (fcinfo=<value optimized out>) at
gininsert.c:531
...

(gdb) f 2
(gdb) p *(lock->owner)
$1 = {links = {prev = 0x0, next = 0x0}, procgloballist = 0x0, sem = {semId
= 1480425513, semNum = 0}, waitStatus = 0, procLatch = {is_set = 0,
    is_shared = 1 '\001', owner_pid = 31516}, lxid = 0, pid = 31516,
pgprocno = 112, backendId = -1, databaseId = 0, roleId = 0,
  recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 0
'\000', lwWaitLink = {prev = 0x7f84ccd3cc10, next = 0x7f84ccd3a5e0},
  waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0,
waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0, next = 0x0},
  myProcLocks = {{prev = 0x7f84ccd3c678, next = 0x7f84ccd3c678}, {prev =
0x7f84ccd3c688, next = 0x7f84ccd3c688}, {prev = 0x7f84ccd3c698,
      next = 0x7f84ccd3c698}, {prev = 0x7f84ccd3c6a8, next =
0x7f84ccd3c6a8}, {prev = 0x7f84ccd3c6b8, next = 0x7f84ccd3c6b8}, {prev =
0x7f84ccd3c6c8,
      next = 0x7f84ccd3c6c8}, {prev = 0x7f84ccd3c6d8, next =
0x7f84ccd3c6d8}, {prev = 0x7f84ccd3c6e8, next = 0x7f84ccd3c6e8}, {prev =
0x7f84ccd3c6f8,
      next = 0x7f84ccd3c6f8}, {prev = 0x7f84ccd3c708, next =
0x7f84ccd3c708}, {prev = 0x7f84ccd3c718, next = 0x7f84ccd3c718}, {prev =
0x7f84ccd3c728,
      next = 0x7f84ccd3c728}, {prev = 0x7f84ccd3c738, next =
0x7f84ccd3c738}, {prev = 0x7f84ccd3c748, next = 0x7f84ccd3c748}, {prev =
0x7f84ccd3c758,
      next = 0x7f84ccd3c758}, {prev = 0x7f84ccd3c768, next =
0x7f84ccd3c768}}, subxids = {xids = {0 <repeats 64 times>}}, backendLock =
0x7f84c42e9d00,
  fpLockBits = 0, fpRelId = {0 <repeats 16 times>}, fpVXIDLock = 0 '\000',
fpLocalTransactionId = 0}


Cheers,


Jeff

[Attachment #3 (text/html)]

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Jul 30, 2015 \
at 8:22 PM, Andres Freund <span dir="ltr">&lt;<a href="mailto:andres@anarazel.de" \
target="_blank">andres@anarazel.de</a>&gt;</span> wrote:<br><blockquote \
class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc \
solid;padding-left:1ex"><span class="">On 2015-07-30 09:03:01 -0700, Jeff Janes \
wrote:<br> &gt; On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund &lt;<a \
href="mailto:andres@anarazel.de">andres@anarazel.de</a>&gt; wrote:<br> </span><span \
class="">&gt; &gt; What do you think about something roughly like the attached?<br> \
&gt; &gt;<br> &gt;<br>
&gt; I&#39;ve not evaluated the code, but applying it does solve the problem I \
was<br> &gt; seeing.<br>
<br>
</span>Cool, thanks for testing! How long did you run the test and how long did<br>
it, on average, previously take for the issue to \
occur?<br></blockquote><div><br></div><div>I had run it for 24 hours, while it \
usually took less than 8 hours to look up before.   I did see it within a few minutes \
one time when I checked out a new HEAD and then forgot to re-apply your or \
Heikki&#39;s patch.</div><div><br></div><div>But now I&#39;ve got the same situation \
again, after 15 hours, with your patch.   This is probably all down to luck.   The \
only differences that I can think of is that I advanced the base to  \
e8e86fbc8b3619da54c, and turned on JJ_vac and set  \
log_autovacuum_min_duration=0.</div><div><br></div><div><br></div><div>checkpointer, \
31516</div><div><div>(gdb) bt</div><div>#0   0x0000003dcb6eaf27 in semop () from \
/lib64/libc.so.6</div><div>#1   0x00000000006481df in PGSemaphoreLock \
(sema=0x7f84ccd3c5f8) at pg_sema.c:387</div><div>#2   0x00000000006a5754 in \
LWLockAcquireCommon (l=0x7f84c42edf80, valptr=0x7f84c42edfa8, \
val=18446744073709551615) at lwlock.c:1041</div><div>#3   LWLockAcquireWithVar \
(l=0x7f84c42edf80, valptr=0x7f84c42edfa8, val=18446744073709551615) at \
lwlock.c:915</div><div>#4   0x00000000004cbd86 in WALInsertLockAcquireExclusive () at \
xlog.c:1445</div><div>#5   0x00000000004d4195 in CreateCheckPoint (flags=64) at \
xlog.c:8181</div><div>#6   0x000000000064ea53 in CheckpointerMain () at \
checkpointer.c:493</div><div>#7   0x00000000004e328a in AuxiliaryProcessMain (argc=2, \
argv=0x7fff6eed1f50) at bootstrap.c:428</div><div>#8   0x00000000006563f7 in \
StartChildProcess (type=CheckpointerProcess) at postmaster.c:5167</div><div>#9   \
0x000000000065884a in reaper (postgres_signal_arg=&lt;value optimized out&gt;) at \
postmaster.c:2670</div></div><div><br></div><div><div>(gdb) f 2</div><div>(gdb) p \
*(lock-&gt;owner)<br></div><div>$1 = {links = {prev = 0x0, next = 0x0}, \
procgloballist = 0x7f84ccd27c88, sem = {semId = 1480359975, semNum = 12}, waitStatus \
= 0, procLatch = {is_set = 1,</div><div>      is_shared = 1 &#39;\001&#39;, owner_pid \
= 31540}, lxid = 3209, pid = 31540, pgprocno = 92, backendId = 12, databaseId = \
16384, roleId = 10,</div><div>   recoveryConflictPending = 0 &#39;\000&#39;, \
lwWaiting = 1 &#39;\001&#39;, lwWaitMode = 2 &#39;\002&#39;, lwWaitLink = {prev = \
0x7f84c42edf10, next = 0x7f84ccd38e60},</div><div>   waitLock = 0x0, waitProcLock = \
0x0, waitLockMode = 0, heldLocks = 0, waitLSN = 0, syncRepState = 0, syncRepLinks = \
{prev = 0x0, next = 0x0},</div><div>   myProcLocks = {{prev = 0x7f84ccd38bb8, next = \
0x7f84ccd38bb8}, {prev = 0x7f84ccd38bc8, next = 0x7f84ccd38bc8}, {prev = \
0x7f84ccd38bd8,</div><div>         next = 0x7f84ccd38bd8}, {prev = 0x7f84ccd38be8, \
next = 0x7f84ccd38be8}, {prev = 0x7f84ccd38bf8, next = 0x7f84ccd38bf8}, {prev = \
0x7f84ccd38c08,</div><div>         next = 0x7f84ccd38c08}, {prev = 0x7f84ccd38c18, \
next = 0x7f84ccd38c18}, {prev = 0x7f84ccd38c28, next = 0x7f84ccd38c28}, {prev = \
0x7f84ccd38c38,</div><div>         next = 0x7f84ccd38c38}, {prev = 0x7f84ccd38c48, \
next = 0x7f84ccd38c48}, {prev = 0x7f84ccd38c58, next = 0x7f84ccd38c58}, {prev = \
0x7f84ccd38c68,</div><div>         next = 0x7f84ccd38c68}, {prev = 0x7f84ccd38c78, \
next = 0x7f84ccd38c78}, {prev = 0x7f84ccd38c88, next = 0x7f84ccd38c88}, {prev = \
0x7f84ccb0ceb0,</div><div>         next = 0x7f84ccb0ceb0}, {prev = 0x7f84ccd38ca8, \
next = 0x7f84ccd38ca8}}, subxids = {xids = {0 &lt;repeats 64 times&gt;}}, backendLock \
= 0x7f84c42e9800,</div><div>   fpLockBits = 160528697655296, fpRelId = {0, 0, 0, 0, \
0, 0, 0, 0, 0, 2679, 2610, 16413, 16411, 16415, 16414, 16408}, fpVXIDLock = 1 \
&#39;\001&#39;,</div><div>   fpLocalTransactionId = \
3209}</div></div><div><br></div><div><br></div><div>backend \
31540:</div><div><br></div><div><div>#0   0x0000003dcb6eaf27 in semop () from \
/lib64/libc.so.6</div><div>#1   0x00000000006481df in PGSemaphoreLock \
(sema=0x7f84ccd38b38) at pg_sema.c:387</div><div>#2   0x00000000006a53d4 in \
LWLockWaitForVar (lock=0x7f84c42edf00, valptr=0x7f84c42edf28, oldval=0, \
newval=0x7fff6eeceaf8) at lwlock.c:1438</div><div>#3   0x00000000004d3276 in \
WaitXLogInsertionsToFinish (upto=183341449216) at xlog.c:1576</div><div>#4   \
0x00000000004d3b33 in AdvanceXLInsertBuffer (upto=183345636352, \
opportunistic=&lt;value optimized out&gt;) at xlog.c:1888</div><div>#5   \
0x00000000004d8df2 in GetXLogBuffer (ptr=183345636352) at xlog.c:1669</div><div>#6   \
0x00000000004d8fea in CopyXLogRecordToWAL (rdata=0xbd2410, fpw_lsn=&lt;value \
optimized out&gt;) at xlog.c:1276</div><div>#7   XLogInsertRecord (rdata=0xbd2410, \
fpw_lsn=&lt;value optimized out&gt;) at xlog.c:1008</div><div>#8   0x00000000004dc72c \
in XLogInsert (rmid=13 &#39;\r&#39;, info=32 &#39; &#39;) at \
xloginsert.c:453</div><div>#9   0x0000000000478f41 in ginPlaceToPage \
(btree=0x7fff6eecee20, stack=0x1be11e0, insertdata=&lt;value optimized out&gt;, \
updateblkno=&lt;value optimized out&gt;,</div><div>      childbuf=0, buildStats=0x0) \
at ginbtree.c:418</div><div>#10 0x00000000004798a3 in ginInsertValue \
(btree=0x7fff6eecee20, stack=0x1be11e0, insertdata=0x7fff6eeceea0, buildStats=0x0) at \
ginbtree.c:748</div><div>#11 0x00000000004734aa in ginEntryInsert \
(ginstate=0x7fff6eecf200, attnum=39408, key=1, category=&lt;value optimized out&gt;, \
items=0x1c81c20, nitem=54,</div><div>      buildStats=0x0) at \
gininsert.c:234</div><div>#12 0x000000000047edbb in ginInsertCleanup \
(ginstate=0x7fff6eecf200, vac_delay=&lt;value optimized out&gt;, stats=0x0) at \
ginfast.c:843</div><div>#13 0x000000000047fae6 in ginHeapTupleFastInsert \
(ginstate=0x7fff6eecf200, collector=&lt;value optimized out&gt;) at \
ginfast.c:436</div><div>#14 0x000000000047386a in gininsert (fcinfo=&lt;value \
optimized out&gt;) at \
gininsert.c:531</div></div><div>...</div><div><br></div><div><div>(gdb) f \
2</div><div>(gdb) p *(lock-&gt;owner)<br></div><div>$1 = {links = {prev = 0x0, next = \
0x0}, procgloballist = 0x0, sem = {semId = 1480425513, semNum = 0}, waitStatus = 0, \
procLatch = {is_set = 0,</div><div>      is_shared = 1 &#39;\001&#39;, owner_pid = \
31516}, lxid = 0, pid = 31516, pgprocno = 112, backendId = -1, databaseId = 0, roleId \
= 0,</div><div>   recoveryConflictPending = 0 &#39;\000&#39;, lwWaiting = 1 \
&#39;\001&#39;, lwWaitMode = 0 &#39;\000&#39;, lwWaitLink = {prev = 0x7f84ccd3cc10, \
next = 0x7f84ccd3a5e0},</div><div>   waitLock = 0x0, waitProcLock = 0x0, waitLockMode \
= 0, heldLocks = 0, waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0, next = \
0x0},</div><div>   myProcLocks = {{prev = 0x7f84ccd3c678, next = 0x7f84ccd3c678}, \
{prev = 0x7f84ccd3c688, next = 0x7f84ccd3c688}, {prev = 0x7f84ccd3c698,</div><div>    \
next = 0x7f84ccd3c698}, {prev = 0x7f84ccd3c6a8, next = 0x7f84ccd3c6a8}, {prev = \
0x7f84ccd3c6b8, next = 0x7f84ccd3c6b8}, {prev = 0x7f84ccd3c6c8,</div><div>         \
next = 0x7f84ccd3c6c8}, {prev = 0x7f84ccd3c6d8, next = 0x7f84ccd3c6d8}, {prev = \
0x7f84ccd3c6e8, next = 0x7f84ccd3c6e8}, {prev = 0x7f84ccd3c6f8,</div><div>         \
next = 0x7f84ccd3c6f8}, {prev = 0x7f84ccd3c708, next = 0x7f84ccd3c708}, {prev = \
0x7f84ccd3c718, next = 0x7f84ccd3c718}, {prev = 0x7f84ccd3c728,</div><div>         \
next = 0x7f84ccd3c728}, {prev = 0x7f84ccd3c738, next = 0x7f84ccd3c738}, {prev = \
0x7f84ccd3c748, next = 0x7f84ccd3c748}, {prev = 0x7f84ccd3c758,</div><div>         \
next = 0x7f84ccd3c758}, {prev = 0x7f84ccd3c768, next = 0x7f84ccd3c768}}, subxids = \
{xids = {0 &lt;repeats 64 times&gt;}}, backendLock = 0x7f84c42e9d00,</div><div>   \
fpLockBits = 0, fpRelId = {0 &lt;repeats 16 times&gt;}, fpVXIDLock = 0 \
&#39;\000&#39;, fpLocalTransactionId = \
0}</div></div><div><br></div><div><br></div><div>Cheers,</div><div><br></div><div><br></div><div>Jeff</div></div></div></div>




[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic