FS#12258 — VAC1

Incident Report for Network & Infrastructure

Resolved

We are noticing instabilities in the VAC infrastructure.
We are investigating.

We had two interruptions on the 20th of December
and one on the 21st of December.

Update(s):

Date: 2015-01-05 09:32:44 UTC
The issue is fixed.
We are sorry that it took as long as it did to resolve the issue.

Date: 2015-01-05 09:31:12 UTC
We have reconfigured permanent mitigation.
Mitigationperm-ipv4 # ls | wc -l
17049
IPv4 has default protection again.

Date: 2015-01-05 09:28:06 UTC
The forgotten ACL:
On 30-31 October 2013, we had a problem
on the var-1-6k router in Warsaw and we
spent 2 days stabilizing it.
http://status.ovh.net/?do=details&id=5678

The LAGs were reconfigured at that time,
and 14 months later the gap was exploited:
since December 20, the MAV1 and MAV2
infrastructure itself has been the target of
DDoS attacks and has experienced 10-20 instabilities
of about 25-30 seconds each over 3 weeks.
Normally, ACLs block these DDoS attacks.

Date: 2015-01-05 09:18:34 UTC
Watching the traffic that goes to the CPU of MAV1 and MAV2,
it was apparent that there is traffic that should not
be there. Some traffic should be dropped at the input of the backbone
because it is destined for the backbone itself. Basically, it is DDoS
aimed at the routers.

After investigating on the backbone, we found that the
LAG with TPIX in Warsaw had not been reconfigured properly:
it was missing the ACL that protects the backbone.
So attacks from Poland destined for
network equipment passed through without any problem.
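
As a rough illustration of what such a protective ACL does, the sketch below drops packets arriving on an external-facing LAG when their destination is the backbone's own equipment. It is only a sketch: the prefixes are documentation ranges standing in for the real addressing, and the function names are hypothetical, not taken from the router configuration.

```python
import ipaddress

# Hypothetical infrastructure prefixes (documentation ranges, not real addressing).
BACKBONE_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8:ffff::/48"),
]


def is_backbone_destination(dst: str) -> bool:
    """Return True if dst falls inside one of the protected prefixes."""
    addr = ipaddress.ip_address(dst)
    return any(addr in net for net in BACKBONE_PREFIXES)


def ingress_action(dst: str, acl_applied: bool) -> str:
    """Decide what an external-facing port does with a packet bound for dst."""
    if acl_applied and is_backbone_destination(dst):
        return "drop"      # traffic aimed at the routers themselves
    return "forward"       # customer traffic, or no ACL on this port
```

With `acl_applied=False`, which models the LAG that was missing the ACL, every packet is forwarded, including packets aimed at the backbone itself.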

I am going to add it (5:14 p.m.)
and reconfigure the LAG.

Date: 2015-01-04 14:13:43 UTC
Deactivated. We are asking the TAC to deliver spares
for the M2 and FABs of the Cisco Nexus 7009 in order
to replace everything.

Date: 2015-01-04 14:12:41 UTC
VAC1 is back online. Waiting.

Date: 2015-01-04 14:11:49 UTC
Jan 4 09:23:30 %PLATFORM-2-MOD_REMOVE: Module 6 removed
Jan 4 09:25:23 %PLATFORM-2-MOD_REMOVE: Module 8 removed

Date: 2015-01-04 14:11:31 UTC
The problem is still there.
We are taking VAC1 offline.
We are inserting the spare card and will reconfigure the 2 F2 cards.

Date: 2015-01-04 14:10:35 UTC
We unconfigured IPv6, which can also cause problems
via OSPFv3.

Date: 2015-01-04 14:09:52 UTC
The VAC1 is rebooted. We let it stabilize.

The VAC is back online.

Date: 2015-01-04 14:08:59 UTC
We think, or rather we have a feeling, that the problem is a TCAM
programming bug that occurs when a VDC uses a mix of M2 and F2
cards. The VDC concerned does not log anything.
Only OSPF drops and then comes back.

We removed all the programmed routes on vac1-2-n7,
we removed the "soft" option of BGP to avoid consuming RAM,
and we shut down the BGP sessions with the mixed VDC.
The output of "sh resources" is correct, however.

We are rebooting VAC1 entirely.

Date: 2015-01-02 16:19:13 UTC
The cards have been delivered and we're configuring the software.
After having fully rebooted the chassis, we're not sure that it's a hardware issue. It could be that some configurations remained on the cards and we just needed to reboot them to clear them fully. This sort of thing is quite common with hot swaps.

We've therefore inserted the new F2 cards into slots 7 and 9, and if there's still an issue, we will insert and reconfigure all the ports of card 6 and 9 and then send back cards 6 and 8.

Date: 2015-01-02 13:11:17 UTC
Scheduled delivery date: 02-JAN-2015 15:36 (GMT +1)
Line: 1.1 Product: N7K-F248XP-25E= Quantity: 1
Line: 2.1 Product: N7K-F248XP-25E= Quantity: 1

Date: 2015-01-02 13:08:12 UTC
VAC1 has been rebooted, so all the VDCs have been rebooted. No old code that could have remained after the hot swap with the ISSU is left on VAC1. Everything has been reprogrammed from scratch, so everything has been explored.

VAC1 is now operating again.

At the same time, we should receive the cards shortly.

Date: 2015-01-02 13:05:06 UTC
We're reloading the VAC1 chassis

admin# reload |
!!!WARNING! there is unsaved configuration in VDC!!!
This command will reboot the system. (y/n)? [n] y

Date: 2015-01-02 13:04:23 UTC
VAC1 has been disabled. Packet forwarding stops occasionally and so OSPF is down, not the other way around. We're waiting for the spare Cisco cards.

Date: 2015-01-02 10:35:07 UTC
We've opened a Cisco TAC case to get spare cards.

In the meantime, we've added static routes to VAC1 so as to debug it more effectively.

If the OSPF goes down again, the customer's traffic will therefore not get cut off. We will then be able to debug as we wish and find the source of the issue without impacting customers.
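
To illustrate why the static routes keep traffic flowing, here is a minimal sketch of route selection with a fallback. It assumes the static routes behave as backups with a higher administrative distance than OSPF; that is an assumption for illustration, as are the `Route` type, the helper names and the example values. The real configuration on VAC1 may differ.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # destination prefix in CIDR notation
    next_hop: str
    source: str      # "ospf" or "static"
    distance: int    # administrative distance, lower wins


def best_route(routes, dst):
    """Longest-prefix match, then lowest administrative distance."""
    addr = ipaddress.ip_address(dst)
    candidates = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, r.distance),
    )


def withdraw(routes, source):
    """Simulate a protocol going down by removing every route it installed."""
    return [r for r in routes if r.source != source]
```

With an OSPF route (distance 110) and a static route (distance 250, a hypothetical value) for the same prefix, `best_route` picks the OSPF route. After `withdraw(routes, "ospf")`, it picks the static route, so forwarding continues while OSPF is down.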

Date: 2015-01-02 09:26:49 UTC
VAC1 has been disabled.

Date: 2014-12-29 09:12:00 UTC
No more instabilities. All is OK.

We will close the thread. We're also going to update VAC2 and VAC3 with the same settings (VAC3 is already partly on the VAC1 configuration), but there's still the RAM taken up by vacX-2-n7 and the version of NX-OS to deal with. We will do this at the beginning of January.

Date: 2014-12-25 22:17:24 UTC
VAC1 is UP again. We will be monitoring for 48H before confirming that it was "fixed".

Date: 2014-12-24 11:46:01 UTC
VAC1 is okay. We are verifying that the configuration is complete.
It is complete.

We will go back into production for a minute to validate that the configuration
is complete. It is going well.

We are deactivating it.

We are going to reactivate it tomorrow, Dec. 25, and see whether VAC1 still causes problems.

Date: 2014-12-24 01:11:46 UTC
done:
admin# sh redundancy status
Redundancy mode
---------------
administrative: HA
operational: HA

This supervisor (sup-2)
-----------------------
Redundancy state: Active
Supervisor state: Active
Internal state: Active with HA standby

Other supervisor (sup-1)
------------------------
Redundancy state: Standby
Supervisor state: HA standby
Internal state: HA standby

We will monitor tonight before switching VAC1 back to production.

Date: 2014-12-24 01:10:53 UTC
All EPLDs are now updated!

We did one last switchover for the road.

Date: 2014-12-24 01:10:29 UTC
Good:

Module 1 EPLD upgrade is successful.

We launch the switchover:

admin# system switchover
admin#

Date: 2014-12-24 01:10:12 UTC
The upgrade to 6.2.10 was done without problems (Module 8 took pity on us and has not crashed this time: Christmas has come early!).

We are moving on to the EPLD upgrade, first on the standby sup:

admin# install module 1 epld bootflash:n7000-s2-epld.6.2.10.img

Copy complete, now saving to disk (please wait)...

EPLD image signature verification passed

Compatibility check:
Module  Type  Upgradable  Impact      Reason
------  ----  ----------  ----------  ------
1       SUP   Yes         disruptive  Module Upgradable

Retrieving EPLD versions... Please wait.

Images will be upgraded according to following table:
Module  Type  EPLD               Running-Version  New-Version  Upg-Required
------  ----  -----------------  ---------------  -----------  ------------
1       SUP   Power Manager SPI  34.000           37.000       Yes
1       SUP   IO SPI             1.012            1.013        Yes
The above modules require upgrade.
Do you want to continue (y/n) ? [n] y

Starting Module 1 EPLD Upgrade
Module 1 : Power Manager SPI [Upgrade Started ]
Module 1 : Power Manager SPI [Erasing ] : 100.00%
Module 1 : Power Manager SPI [Programming ] : 100.00% (1464788 of 1464788 total bytes)
Module 1 : IO SPI [Upgrade Started ]

Date: 2014-12-23 22:46:43 UTC
vac1-1-n7 is UP again. We had to make multiple modifications to the configuration and then reload all of it .. loads of fun ..

The upgrade to 6.2.10 has started and we will see the task through ..

Date: 2014-12-23 20:20:00 UTC
Card 8 seems to have problems after the crash. LACP problems may be a consequence.

We will unplug the card and plug it back into the chassis.

Date: 2014-12-23 19:40:23 UTC
All the modules are on version 6.2.8a. However, we still have LACP problems.

We are launching an ISSU in order to move from version 6.2.8a to version 6.2.10.

Date: 2014-12-23 19:34:10 UTC
A VDC is not sending its LACP packets due to an ISSU failure.

We will launch a second ISSU in order to properly update module 8.

Date: 2014-12-23 18:26:19 UTC
The moment of truth: we will upgrade VAC1 via ISSU (6.2.2 to 6.2.8a, then 6.2.10).

Date: 2014-12-23 15:36:18 UTC
All the files needed for debugging have been collected. We are trying to reapply the settings and reload the configuration to return to a stable state.

Date: 2014-12-23 15:35:15 UTC
It's still crazy:

vac1-3-n7 # sh ip bgp sum
BGP table version is 240527, IPv4 Unicast config peers 4 capable peers 4

vac1-3-n7 # sh run | i bgp
vac1-3-n7 #

This is the first time that I've seen commands
disappear while everything stays up. I think
the problem comes from this. The configuration appears to be in place,
but it is probably no longer fully programmed.
The missing pieces are probably the source of
the malfunctions.

We are looking at the configuration backups and updates from the same time frame.
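
The mismatch above (peers reported by the BGP summary while the running configuration shows no BGP lines) can be detected mechanically. The following is a minimal sketch, assuming the two command outputs have already been captured as text. The function name is hypothetical, and the parsing relies only on the "config peers N" wording visible in the summary line above.

```python
import re


def bgp_config_drift(show_run_bgp: str, show_bgp_summary: str) -> bool:
    """Return True when BGP peers are reported but no BGP config line exists."""
    # Any non-blank line in the filtered running config counts as "config present".
    has_config = any(line.strip() for line in show_run_bgp.splitlines())
    # Pull the configured peer count out of the summary header, if present.
    m = re.search(r"config peers (\d+)", show_bgp_summary)
    configured_peers = int(m.group(1)) if m else 0
    return configured_peers > 0 and not has_config
```

Fed the two outputs shown in this update (an empty result for "sh run | i bgp" and a summary with 4 configured peers), the function returns True.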

Date: 2014-12-23 07:08:44 UTC
VAC1 was disabled. On vac1-3-n7, the BGP configuration is no longer in "sh run", but "sh ip bgp sum" shows that all sessions are still running.

We will update the chassis with a new version of NX-OS.
It feels like a series of bugs.

Date: 2014-12-23 06:56:27 UTC
We re-enabled VAC1.

If VAC1 still causes any issues, we will then disable it to run deeper maintenance.

Date: 2014-12-23 06:48:20 UTC
vac1-2-n7 now has enough RAM to take any configuration. It is possible that this was the origin of the problem, even if that is rather odd since the problem is located on vac1-3-n7.

Date: 2014-12-23 06:47:35 UTC
vac1-2-n7 configuration failed.. I like surprises.
Well, it is being pushed from scratch.

Date: 2014-12-23 06:46:20 UTC
We have added more resources to the VDC and rebooted it.

admin# reload vdc vac1-2-n7
Are you sure you want to reload this vdc (y/n)? [no] yes
2014 Dec 22 22:37:23 admin %$ VDC-1 %$ %VDC_MGR-2-VDC_OFFLINE: vdc 5 is now offline
2014 Dec 22 22:37:23 admin %$ VDC-1 %$ %SYSMGR-STANDBY-2-SHUTDOWN_SYSTEM_LOG: vdc 5 will shut down soon.
admin# 2014 Dec 22 22:38:11 admin %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 5 has come online

Date: 2014-12-23 06:45:35 UTC
Once again, we are disabling VAC1.

Date: 2014-12-23 06:45:14 UTC
OSPF is down on vac1-3-n7.

2014 Dec 22 21:42:27 vac1-3-n7 %OSPFV3-5-ADJCHANGE: ospfv3-16276 [18598] on port-channel1 went DOWN
2014 Dec 22 21:42:29 vac1-3-n7 %OSPF-5-ADJCHANGE: ospf-16276 [18599] on port-channel1 went DOWN
2014 Dec 22 21:43:13 vac1-3-n7 %OSPFV3-4-SYSLOG_SL_MSG_WARNING: OSPF-4-NEIGH_ERR: message repeated 84 times in last 14 sec
2014 Dec 22 21:43:19 vac1-3-n7 %OSPF-4-SYSLOG_SL_MSG_WARNING: OSPF-4-NEIGH_ERR: message repeated 16 times in last 11 sec
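
Flaps like the ones above can be counted from the logs with a small script. This is only a sketch: the regular expression assumes the line format shown in this update, and the function name is ours for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# 2014 Dec 22 21:42:29 vac1-3-n7 %OSPF-5-ADJCHANGE: ospf-16276 [18599] on port-channel1 went DOWN
ADJCHANGE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d+ [\d:]+) (?P<host>\S+) "
    r"%(?P<proto>OSPF(?:V3)?)-5-ADJCHANGE: .* on (?P<intf>\S+) went (?P<state>\S+)"
)


def count_adjacency_downs(lines):
    """Count DOWN transitions per (host, protocol, interface)."""
    downs = Counter()
    for line in lines:
        m = ADJCHANGE.match(line.strip())
        if m and m.group("state") == "DOWN":
            downs[(m.group("host"), m.group("proto"), m.group("intf"))] += 1
    return downs
```

Lines that are not adjacency changes, such as the NEIGH_ERR warnings, are ignored.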

Date: 2014-12-22 08:37:40 UTC
Well, everything seems okay now. This is probably a bug.
An upgrade to the latest version of NX-OS is required
in the coming days.

We are going to reactivate VAC1. DDoS attacks are going to be cleaned on the 3 VACs:
VAC1 (RBX/GRA), VAC2 (SBG) and VAC3 (BHS).

Date: 2014-12-21 18:34:52 UTC
They are all UP. My feeling is
that vac1-3-n7 is healthy again. We are looking into it in depth.

Date: 2014-12-21 18:33:59 UTC
All the VDCs restarted their L3.

Date: 2014-12-21 18:33:31 UTC
We are going to do a switchover of the SUP

admin# system switchover
admin#

Date: 2014-12-21 18:33:09 UTC
We restarted the vac1-3-n7 which has problems.

admin# reload vdc vac1-3-n7
Are you sure you want to reload this vdc (y/n)? [no] y
2014 Dec 21 19:11:11 admin %$ VDC-1 %$ %VDC_MGR-2-VDC_OFFLINE: vdc 3 is now offline
2014 Dec 21 19:11:11 admin %$ VDC-1 %$ %SYSMGR-STANDBY-2-SHUTDOWN_SYSTEM_LOG: vdc 3 will shut down soon.
admin# 2014 Dec 21 19:11:53 admin %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 3 has come online

It is UP and we have the logs again.

Date: 2014-12-21 18:32:29 UTC
I love mishaps :)

admin# reload module 2 ?

force-dnld Reboot a specific module to force NetBoot and image download

admin# reload module 2
Active sup reload is not supported.

Date: 2014-12-21 18:31:49 UTC
admin# sh system redundancy status
Redundancy mode
---------------
administrative: HA
operational: HA

This supervisor (sup-2)
-----------------------
Redundancy state: Active
Supervisor state: Active
Internal state: Active with HA standby

Other supervisor (sup-1)
------------------------
Redundancy state: Standby
Supervisor state: HA standby
Internal state: HA standby
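
Before a switchover, it helps to confirm that the standby supervisor is in sync. The sketch below parses the status output above into sections and checks the other supervisor's state. The function names are hypothetical, and treating "HA standby" as the ready state is our reading of this output, not a vendor-specified check.

```python
def parse_redundancy_status(output: str) -> dict:
    """Parse the redundancy status text into {section: {field: value}}."""
    sections = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"}:
            continue                      # skip blanks and underline rows
        if ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            if current is not None:
                sections[current][key] = value
        else:
            current = line                # e.g. "This supervisor (sup-2)"
            sections[current] = {}
    return sections


def standby_ready(sections: dict) -> bool:
    """True when the other supervisor reports 'HA standby'."""
    for name, fields in sections.items():
        if name.startswith("Other supervisor"):
            return fields.get("Supervisor state") == "HA standby"
    return False
```

A script driving the maintenance could refuse to issue the switchover while `standby_ready` returns False, for example while the standby is still booting.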

Date: 2014-12-21 18:31:41 UTC
admin# reload module 1
This command will reboot standby supervisor module. (y/n)? [n] y
about to reset standby sup
admin# sh module
Mod  Ports  Module-Type          Model       Status
---  -----  -------------------  ----------  ----------
1    0      Supervisor Module-2              powered-up
2    0      Supervisor Module-2  N7K-SUP2E   active *

Date: 2014-12-21 18:31:33 UTC
We are going to restart SUP1, which is in standby,
then wait for synchronization, and then restart
the active sup. This will allow us to switch over
hot to a fresh sup and get the logs back
in all the VDCs. Maybe.

Date: 2014-12-21 18:30:18 UTC
VAC1 is unstable. One of the elements is not
handling the OSPF sessions correctly and they go UP/DOWN.
Sometimes, not always.

When it happens, the traffic in VAC1 is cleaned but
it is not reinjected correctly into the internal network.

We have just taken VAC1 offline in order to find the origin
of the problem. The traffic is being cleaned by VAC2 and VAC3.

Posted Dec 21, 2014 - 10:51 UTC