'Re: Issues controller takeover/givebacks'

List:       toasters
Subject:    Re: Issues controller takeover/givebacks
From:       Michael Garrison <mcgarr () umich ! edu>
Date:       2014-07-29 14:53:58
Message-ID: CACkdw7D2wqQWHcFiCSVU6FnTuqhtK9uQTrwaqNrkwyfLUt8E4A () mail ! gmail ! com

We've experienced a few problems with takeover/giveback events, but not
for the reasons you mentioned. We've had issues due to Data ONTAP bugs and
also due to controller load, which resulted in either a failed giveback or a
cluster panic. Our track record for successful takeovers/givebacks isn't
stellar, so we're always hesitant when one has to happen, even with all of
the steps in place to work around past issues.

--
Mike Garrison

On Mon, Jul 28, 2014 at 6:46 PM, Andrew Laurence <atlauren@me.com> wrote:

> Greetings all.  Do you ever find issues with takeover/giveback events?
>  We've had two relatively recently which resulted in service outages.
>
> Some months ago we had a service outage during an upgrade to 8.1.4p1 on
> our 3240 pair.  During one of the givebacks, the controller came back
> without any networking.  NetApp speculates the 10GE driver didn't
> initialize when the controller rebooted.  Either way, our storage
> service was AWOL as a result.
>
> And then Friday we had a real controller crash.  This one occurred on an
> offsite HA pair of 2040s.  Lately we have been chasing down a PCI error on
> one controller; NetApp eventually came out to replace that controller.
>  After moving operations via 'cf takeover', it turned out that the new
> controller had very old SP firmware, older than is supported for the
> running 8.1.4P1.  The tech called an engineer for guidance, but just as
> the engineer came on WebEx the running controller panicked; all services
> were offline.  (NetApp is still investigating this; we're uploading
> different files and working through the process.)
>
> While we can look at each of these and consider them anomalies for
> different reasons, it's still very worrisome that the core availability
> technology has twice resulted in service outages.
>
> Any thoughts?
>
> Thanks,
> Andrew
>
> --
> Andrew Laurence                Office of Information Technology
> atlauren@uci.edu               University of California, Irvine
>
>
>
>
> _______________________________________________
> Toasters mailing list
> Toasters@teaparty.net
> http://www.teaparty.net/mailman/listinfo/toasters
>
