List: cisco-nsp
Subject: [c-nsp] Very strange ME3600 err-disabled on Te0/1 Te0/2 problem
From: Joe Bender <jbender () clearrate ! com>
Date: 2015-09-30 19:45:45
Message-ID: 5E37612FCA29CC4894C9751E2445A8445749367E () CRC-Exchange ! corp ! clearrate ! net
Everyone,
We've had a very strange problem with our ME3600s, specifically ME-3600X-24TS-M, that
we use in a ring topology, with the ten gigabit ports running purely "no switchport"
L3 links between sites. All of them are running 15.3(3)S5 at the moment.
We've had several incidents on multiple switches (5, to be exact) where both Te0/1
and Te0/2 drop offline due to unknown causes. Switches seem to fail somewhat
randomly.
Symptoms are:
BFD detects a node down on both interfaces simultaneously.
30-45 seconds later, UDLD detects a unidirectional link and puts both interfaces into
error-disabled. The other side doesn't notice anything amiss other than the interface
going down, because the node having issues shuts the port down.
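For anyone who wants to check their own logs for the same signature, here is a minimal Python sketch of the pattern described above: BFD down on both ten gig ports at nearly the same time, followed by UDLD err-disable on both. The event tuples, the kind names ("bfd_down", "udld_errdisable"), the function name, and the window sizes are all assumptions made for illustration; they are not real IOS syslog formats, and real log lines would have to be parsed into this shape first.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified event records: (timestamp, interface, kind).
# These are NOT real IOS syslog lines; parsing real logs is left out.
PORTS = {"Te0/1", "Te0/2"}


def find_dual_failures(events,
                       bfd_window=timedelta(seconds=5),
                       udld_max=timedelta(seconds=60)):
    """Return start times of incidents matching the described symptom.

    An incident is: BFD down on both ports within bfd_window, then UDLD
    err-disable on both ports within udld_max of the first BFD event.
    Both windows are assumed values, chosen to cover the 30-45 s gap.
    """
    events = sorted(events)
    bfd = [(t, i) for t, i, k in events if k == "bfd_down" and i in PORTS]
    udld = [(t, i) for t, i, k in events
            if k == "udld_errdisable" and i in PORTS]

    incidents = []
    for t0, _ in bfd:
        # Ports that lost BFD at t0 or shortly after.
        ports_down = {i for t, i in bfd if t0 <= t <= t0 + bfd_window}
        if ports_down != PORTS:
            continue
        # Ports that UDLD err-disabled afterwards, within udld_max.
        ports_err = {i for t, i in udld if t0 < t <= t0 + udld_max}
        # Skip start times that fall inside an incident already recorded.
        is_new = not incidents or t0 - incidents[-1] > udld_max
        if ports_err == PORTS and is_new:
            incidents.append(t0)
    return incidents


if __name__ == "__main__":
    t = datetime(2015, 9, 30, 12, 0, 0)
    sample = [
        (t, "Te0/1", "bfd_down"),
        (t + timedelta(seconds=1), "Te0/2", "bfd_down"),
        (t + timedelta(seconds=35), "Te0/1", "udld_errdisable"),
        (t + timedelta(seconds=40), "Te0/2", "udld_errdisable"),
    ]
    for start in find_dual_failures(sample):
        print("dual ten gig failure starting at", start.isoformat())
```

A single-port UDLD event would not match here, which is the point: the signature of interest is both ports failing together, not one bad link.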
Active interventions like removing UDLD from the ports, then shut/no shut, won't bring
the ports back online. They'll come out of error-disable, but that's about it. The
*only* thing that'll bring the ten gigabit interfaces back online is a switch reload.
There are no tracebacks or anything else.
Changes that have been recently made in the network include installing two ASR920s
with current IOS XE (03.16.00.S) versions into the network to replace a single
ME3600 to remove a single-point-of-failure at a critical spot in the ring. The nodes
that have had issues are the "closest" nodes in the ring to where the ASRs were
installed, but aren't necessarily directly connected to said ASRs. These also tend
to be the nodes with the highest amount of traffic flowing through them.
At the moment, we haven't had a problem with them in several days, and as a result
Cisco TAC has been somewhat useless, as they're insisting they can't help us if we
can't have them actively looking at one of them when it fails. This, for us, is very
difficult, because I can't wait the 35-40 minutes it usually takes TAC to get in,
set up WebEx, etc., as this is taking our customers down, and they're understandably
getting a bit twitchy with some of these issues. I can't even get a script to run
against the switch for us to pull additional information other than a show tech,
because they're insisting they can't tell me what they might run to get info (which
I find strange).
One TAC team thought it was a bug or some sort of ASIC/software issue, but they
haven't been able to isolate anything that might be the cause, bug-id or otherwise.
Right now I'm dealing with someone who seems to be concentrating on the fact that
he's seeing UDLD firing as a result of the interfaces going offline. I'm also getting
an immense amount of pushback from the support engineers about getting this escalated,
because we don't have a lot of information about the problem, and it hasn't come back
in days.
As a result, I'm reaching out to the list here, because we're at our wits' end,
afraid that we're sitting on a ticking timebomb in our network, waiting for one of
these sites to have both TenGigabit interfaces fall offline at the same time again.
If anyone has seen ANYTHING like this happen on the ME3600 ten gig ports, or can get
me in touch with a Cisco resource that will actually take me seriously and actually
help us look for causes instead of constantly blaming UDLD (seriously, both interfaces
at the same time?), I'd appreciate the info.
-Joseph Bender