'[c-nsp] Very strange ME3600 err-disabled on Te0/1 Te0/2 problem'

List:       cisco-nsp
Subject:    [c-nsp] Very strange ME3600 err-disabled on Te0/1 Te0/2 problem
From:       Joe Bender <jbender () clearrate ! com>
Date:       2015-09-30 19:45:45
Message-ID: 5E37612FCA29CC4894C9751E2445A8445749367E () CRC-Exchange ! corp ! clearrate ! net

Everyone,

We've had a very strange problem with our ME3600s, specifically the
ME-3600X-24TS-M, which we use in a ring topology, with the ten gigabit ports
running purely as "no switchport" L3 links between sites. All of them are
running 15.3(3)S5 at the moment.

We've had several incidents on multiple switches (5, to be exact) where both
Te0/1 and Te0/2 drop offline due to unknown causes. Switches seem to fail
somewhat randomly.

Symptoms are:

BFD detects a node down on both interfaces simultaneously.
30-45 seconds later, UDLD detects a unidirectional link and puts both
interfaces into error-disabled. The other side doesn't notice anything amiss
other than the interface going down because the node having issues shuts the
port down.
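
For anyone who wants to check their own logs for the same sequence, the two
symptoms above can be sketched as a small correlation check. This is only a
rough illustration, not something TAC or Cisco provided: the Event record, the
kind labels "bfd_down" and "udld_errdisable", and the 1-second window used for
"simultaneously" are all assumptions, and turning real syslog lines into these
records is left out on purpose because the exact message formats vary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical, already-parsed log record (not a Cisco format)."""
    t: float        # seconds since any fixed reference point
    interface: str  # e.g. "Te0/1"
    kind: str       # assumed labels: "bfd_down" or "udld_errdisable"


def find_dual_failures(events, interfaces=("Te0/1", "Te0/2"),
                       bfd_window=1.0, udld_min=30.0, udld_max=45.0):
    """Return start times where every interface shows BFD down within
    bfd_window seconds, followed by UDLD err-disable udld_min..udld_max
    seconds after the first BFD event."""
    bfd = sorted(
        (e for e in events
         if e.kind == "bfd_down" and e.interface in interfaces),
        key=lambda e: e.t,
    )
    hits = []
    for first in bfd:
        # Skip BFD events that belong to an incident already recorded.
        if hits and first.t - hits[-1] <= bfd_window:
            continue
        matched = True
        for ifc in interfaces:
            bfd_seen = any(
                e.interface == ifc and 0 <= e.t - first.t <= bfd_window
                for e in bfd
            )
            udld_seen = any(
                e.kind == "udld_errdisable" and e.interface == ifc
                and udld_min <= e.t - first.t <= udld_max
                for e in events
            )
            if not (bfd_seen and udld_seen):
                matched = False
                break
        if matched:
            hits.append(first.t)
    return hits
```

A single-interface UDLD event does not match; only the case where both ten
gigabit ports follow the BFD-then-UDLD sequence together is reported.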

Active interventions like removing UDLD from the ports and then doing a
shut/no shut won't bring the ports back online. They'll come out of
error-disable, but that's about it. The *only* thing that'll bring the ten
gigabit interfaces back online is a switch reload. There are no tracebacks or
anything else.

Changes that have been recently made in the network include installing two
ASR920s with current IOS XE (03.16.00.S) versions into the network to replace
a single ME3600 to remove a single-point-of-failure at a critical spot in the
ring. The nodes that have had issues are the "closest" nodes in the ring to
where the ASRs were installed, but aren't necessarily directly connected to
said ASRs. These also tend to be the nodes with the highest amount of traffic
flowing through them.

At the moment, we haven't had a problem with them in several days, and as a
result Cisco TAC has been somewhat useless, as they're insisting they can't
help us if we can't have them actively looking at one of them when it fails.
This, for us, is very difficult, because I can't wait the 35-40 minutes it
usually takes TAC to get in, WebEx, etc., as this is taking our customers
down, and they're understandably getting a bit twitchy with some of these
issues. I can't even get a script to run against the switch for us to pull
additional information other than a show tech, because they're insisting they
can't tell me what they might run to get info (which I find strange).

One TAC team thought it was a bug or some sort of ASIC/software issue but
hasn't been able to isolate anything that might be the cause, bug-id or
otherwise. Right now I'm dealing with someone who seems to be concentrating
on the fact that he's seeing UDLD firing as a result of the interfaces going
offline. I'm also getting an immense amount of pushback from the support
engineers about getting this escalated because we don't have a lot of
information about the problem, and it hasn't come back in days.

As a result, I'm reaching out to the list here, because we're at our wits'
end, afraid that we're sitting on a ticking timebomb in our network, waiting
for one of these sites to have both TenGigabit interfaces fall offline at the
same time again. If anyone has seen ANYTHING like this happen on the ME3600
ten gig ports, or can get me in touch with a Cisco resource that will
actually take me seriously and actually help us look for causes instead of
constantly blaming UDLD (seriously, both interfaces at the same time?), I'd
appreciate the info.

-Joseph Bender

_______________________________________________
cisco-nsp mailing list  cisco-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/