
List:       drbd-user
Subject:    [DRBD-user] Fwd:  Kernel Oops on peer when removing LVM snapshot
From:       Paul Gideon Dann <pdgiddie () gmail ! com>
Date:       2015-06-22 10:31:18
Message-ID: CALZj-VqgO28hr3u=s=qwHNosGwLTBnAo5skbXsnKedkLh_f2hg () mail ! gmail ! com



(Forwarding to list --- sorry!)

Hi Robert; thanks for answering! Yes, I considered that possibility myself.
However, as this is a single-primary resource, the DRBD block device isn't
available to the LVM layer on the secondary side. The VG is not visible
until the host becomes primary for that resource (at which point the VG and
LVs appear automatically), and the LVM layer holds the resource open until
the VG is deactivated. So I'm pretty confident that the metadata changes
are completely invisible to the LVM layer. I'm certainly not seeing
anything in the kernel logs to do with LVM during these oopses. I agree
that the fact that it's an LVM operation that triggers this does suggest
that there's something going on with LVM metadata here, but I think it must
be some kind of bug, rather than misconfiguration.
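
For reference, the promote/demote sequence that makes the VG appear and
disappear is roughly the following (resource and VG names here are just
placeholders for this setup):

```shell
# On the node taking over (placeholder names):
drbdadm primary r0          # promote the DRBD resource
vgscan                      # let LVM discover the PV on the DRBD device
vgchange -ay resource_vg    # activate the Resource VG; its LVs appear

# Before demoting again:
vgchange -an resource_vg    # deactivate the VG so LVM releases the device
drbdadm secondary r0
```

So while the node is secondary, the Resource VG is never activated, and the
DRBD device is not held open by LVM.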

I know that DRBD on LVM is not an unusual use-case, so if the changing LVM
metadata were the issue, I'd expect it to be well-documented by now. But it
seems not to be, which makes me think it's an unexpected issue. If I
were running in dual-primary mode, of course, I'd need to set up clustered
LVM to lock the metadata correctly. Maybe I need to do that anyway, but I
don't know why...
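
One thing I may try anyway is pinning down LVM's device scanning explicitly,
so that each PV is only ever seen through the DRBD layer and never via the
backing LV. Something like this in /etc/lvm/lvm.conf (the device patterns
are just examples for this layout, not a tested configuration):

```
# /etc/lvm/lvm.conf (sketch; adjust patterns to the real devices)
devices {
    # Accept the DRBD devices (Resource VG PVs) and the system disks,
    # reject everything else -- in particular the backing LVs, so LVM
    # never sees a duplicate PV signature underneath DRBD:
    global_filter = [ "a|^/dev/drbd.*|", "a|^/dev/sd.*|", "r|.*|" ]
}
```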

Paul


On 22 June 2015 at 10:41, Robert Altnoeder <robert.altnoeder@linbit.com>
wrote:

>  If I did not misunderstand what this is about, then the problem seems to
> be this:
>
> You are using a DRBD device as the physical volume for a volume group. As
> soon as something changes in that volume group, e.g. you add or remove
> volumes (such as snapshots), the metadata for that volume group on the
> physical volume changes.
> That is what you replicate to the peer (the secondary), so all that the
> LVM layer on the peer can see is data magically changing on its physical
> volume. That is where the kernel Oops comes from: data is not supposed to
> change without the local node knowing about it. This is an unsafe scenario
> unless there is some kind of synchronization in place at the LVM level
> (e.g. "Clustered LVM", aka CLVM, instead of normal LVM, which is not
> designed to operate on shared or replicated storage).
>
> br,
> Robert
>
> On 06/22/2015 11:06 AM, Paul Gideon Dann wrote:
>
>  So no ideas concerning this, then? I've seen the same thing happen on
> another resource, now. Actually, it doesn't need to be a snapshot: removing
> any logical volume causes the oops. It doesn't happen for every resource,
> though.
>
> [...snip...]
>
>  Paul
>
> On 16 June 2015 at 11:51, Paul Gideon Dann <pdgiddie@gmail.com> wrote:
>
>>  This is an interesting (though frustrating) issue that I've run into
>> with DRBD+LVM, and having finally exhausted everything I can think of or
>> find myself, I'm hoping the mailing list might be able to offer some help!
>>
>>  My setup involves DRBD resources that are backed by LVM LVs, and then
>> formatted as PVs themselves, each forming its own VG.
>>
>>  System VG -> Backing LV -> DRBD -> Resource VG -> Resource LVs
>>
>>  The problem I'm having happens only for one DRBD resource, and not for
>> any of the others. This is what I do:
>>
>>  I create a snapshot of the Resource LV (meaning that the snapshot will
>> also be replicated via DRBD), and everything is fine. However, when I
>> *remove* the snapshot, the *secondary* peer oopses immediately:
>>
>> [...snip...]
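
Concretely, the trigger is just the ordinary snapshot lifecycle on the
primary; roughly (LV and VG names here are placeholders):

```shell
# On the primary, inside the Resource VG (placeholder names):
lvcreate --snapshot --size 1G --name data-snap /dev/resource_vg/data
# ... snapshot exists and replicates fine; but then:
lvremove -f /dev/resource_vg/data-snap   # secondary oopses at this point
```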
>>
>>  Cheers,
>>  Paul
>>
>
>
> _______________________________________________
> drbd-user mailing list
> drbd-user@lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>





