'Re: [Xen-users] io hang with lvm on md raid1'

List:       xen-users
Subject:    Re: [Xen-users] io hang with lvm on md raid1
From:       Tomas Mozes <hydrapolic () gmail ! com>
Date:       2016-10-18 8:00:14
Message-ID: CAG6MAzQkyMLiVAGjQefKtdi3c=-DLHkOABQcbmBkAGYsak8GoA () mail ! gmail ! com


On Tue, Oct 18, 2016 at 8:23 AM, Sarah Newman <srn@prgmr.com> wrote:

> On 10/17/2016 11:10 PM, Tomas Mozes wrote:
> > On Mon, Oct 17, 2016 at 11:11 PM, Glenn Enright <glenn@rimuhosting.com> wrote:
> >
> >     On 10/10/16 16:06, Sarah Newman wrote:
> >
> >         On 10/09/2016 02:23 PM, Glenn Enright wrote:
> >
> >             Bump? I've now replicated this on raid10 and raid6 as well, so
> >             it's not caused by the raid level. An example of a blkback
> >             process that is stuck is below, if that offers any additional
> >             insight. In all cases I'm seeing dmeventd stuck first, though.
> >
> >
> >         Maybe related? https://bugzilla.kernel.org/show_bug.cgi?id=119841
> >
> >         Xen4CentOS uses 3.18, not 4.4. You could try the Xen4CentOS kernel
> >         and see if you get the same errors. Unfortunately EOL for 3.18 is
> >         supposed to be January 2017.
> >
> >         --Sarah
> >
> >
> >     Thanks for your followup Sarah, I have to admit I was not able to pin
> >     down the exact cause. We have since implemented a workaround for the
> >     issue.
> >
> >     As close as I can determine... for historical reasons related to sparse
> >     file support we were using cp to copy off an lvm snapshot, which clearly
> >     was not tolerant of io problems. We are now using dd with conv=sparse
> >     and since then have not seen any further recurrences of the lockup.
>
> > We have a similar problem, but it's not related to LVM snapshots. Our domU
> > running MariaDB hangs on a highly loaded server after some time (for
> > example after a mysql restore / percona xtrabackup base backup). Sometimes
> > we cannot even ssh to the server and it needs to be destroyed via xl.
> >
> > The domU runs in PV mode, all mount points are logical volumes taken from
> > the dom0, kernel 4.4 and xen 4.6.3. It's happening randomly (on two
> > servers).
>
> What observations from the dom0 make you think this is a related problem?
> My understanding is that Glenn's problems started from cp running in the
> context of the dom0, and that the blkback processes and dmeventd hung.
>
> You should probably set up a login on hvc0 for your domUs if you haven't
> already.
>
> --Sarah
>

I suppose it could be xen, xfs, lvm, kernel or hardware related. Since it
happened on different hardware, we are looking elsewhere. This is what comes
closest: lvm, kernel 4.4, xen 4.6, a hang, and nothing in dmesg. Maybe I'm
wrong. I'm trying to simulate the problem and then issue
echo w > /proc/sysrq-trigger. Any other advice is appreciated.
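
For reference, the sysrq step above could be wrapped roughly like this. It is
only a sketch: the function names filter_blocked and dump_blocked_tasks are
made up for illustration, the 200-line dmesg tail is an arbitrary choice, and
it assumes a procps-style ps and a kernel built with magic SysRq support.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not taken from the thread.

# Keep only the lines whose first field (the process state) starts with "D",
# i.e. tasks in uninterruptible sleep. Expects "STATE PID COMMAND" lines on
# stdin; a header line such as "S PID COMMAND" is dropped by the same test.
filter_blocked() {
    awk '$1 ~ /^D/'
}

# Ask the kernel to log the stacks of blocked tasks (sysrq "w"), then show
# the end of the kernel log. Must be run as root.
dump_blocked_tasks() {
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200
}

# Possible usage, as root, in the dom0 or in the domU while it still responds:
#   ps -eo state,pid,comm | filter_blocked
#   dump_blocked_tasks
```

Listing the D-state tasks first shows whether anything (blkback, dmeventd, the
database) is actually blocked before the full stack dump is requested.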

We do have hvc0 for the domUs, but it was impossible to log in through it
when the hang happened.

_______________________________________________
Xen-users mailing list
Xen-users@lists.xen.org
https://lists.xen.org/xen-users
