FS#14176 — GSW Paris

Incident Report for Network & Infrastructure

Resolved

The Globalswitch PoP is down. We're investigating...

Update(s):

Date: 2015-07-29 16:01:48 UTC
Hello,

An incident has just occurred on the routing of one of the 2 routers in Paris: gsw-1-a9. The incident was caused by human error. One of the engineers in the network team (that’s my team) accidentally deleted the OSPF configuration on the router, despite double-checking the configuration. He confirmed that he had done it without realising (on autopilot), and as a result the gsw-1-a9 router went down.

Everything should have continued to function, however. But a bug on the 3rd reflector router, rf-3-a1, stopped it from alerting the rest of the backbone that gsw-1-a9 was down. rf-2-a1 did alert it, but rf-1-a1 was down during the outage. The backbone therefore continued to behave as if the gsw-1-a9 router was up. We saw routing loops in the traceroutes.
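As an illustration of the traceroute symptom described above, here is a minimal sketch of how a loop can be spotted in a list of hops: a hop that reappears after a different hop has answered in between. The function name and the addresses (documentation range 192.0.2.0/24) are hypothetical and are not taken from the incident.

```python
def find_routing_loop(hops):
    """Return the first hop that reappears after a different hop, else None.

    `hops` is the ordered list of responding addresses from a traceroute;
    None stands for an unanswered probe ("*").
    """
    seen = set()
    previous = None  # last hop that actually answered
    for hop in hops:
        if hop is None:
            continue  # ignore timeouts
        # The same hop answering twice in a row is not a loop;
        # the same hop coming back after another one is.
        if hop in seen and hop != previous:
            return hop
        seen.add(hop)
        previous = hop
    return None


# Hypothetical trace bouncing between two routers.
looping_trace = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.2", "192.0.2.3"]
print(find_routing_loop(looping_trace))
```

A trace that simply times out on one probe and then continues is not reported as a loop, since unanswered probes are skipped.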

We reset all BGP sessions on rf-3-a1, but since only rf-2-a1 was synchronising BGP between all the routers in Europe (rf-1-a1 and gsw-1-a9 being down), the European network was intermittent: each router was alternately reachable and unreachable for 60-120 seconds.

Everything then came back up, and we reconfigured the gsw-1-a9 router.
The backbone is up.

Please accept our sincere apologies for this outage. Human errors can happen and the backbone is meant to prevent this type of issue. We will investigate to find the bug on our RR (ASR1002). Then we will give the team a good hiding…

Best,
Octave

Date: 2015-07-29 15:07:04 UTC
Everything is up.

Date: 2015-07-29 15:06:57 UTC
Summary of the gsw-1-a9 configuration:
We cut off the BGP sessions with PNI and Transit.
We reinstated the OSPF configuration.
It's up.
We're re-establishing the BGP sessions with the peers.

Date: 2015-07-29 15:05:29 UTC
Resetting rf-3-a1 fixed the communication problem, which isolating the gsw-1-a9 router should already have fixed.

Traffic is back to normal. The connections handled by gsw-1-a9 were the main ones impacted:
- 50% Free
- 50% Orange
- 30% Telefonica (Backup)
- 50% Google Europe

Transit:
- 20G Cogent
- 40G Tata
- 20G Level3
- 10G Telia

The rest of the backbone continued to function as normal.

Date: 2015-07-29 15:03:32 UTC
rf-1-a1 is down with GSW.

We reset rf-3-a1, which apparently has a bug.
We were therefore connected to only one RR, rf-2-a1, for a few minutes.

Date: 2015-07-29 15:02:03 UTC
The th2-1-a9 and some other links are unstable. The other GSW routers are still communicating.

We're investigating...

Apparently one of the "reflector" routers (rf-3-a1) didn't tell the other routers that GSW was down, so the GSW routes remained installed.

We're cutting off the BGP session towards rf-3-a1 and th2-1-a9 to see if that works.

That fixed it. So that's where the issue is.
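The behaviour observed above can be pictured with a deliberately simplified model, in which a router keeps a path for as long as at least one reflector it has a session with still advertises it. This is only a sketch under that assumption, not a description of the actual BGP implementation; the function name and the path label are hypothetical.

```python
def installed_paths(advertised_by_reflector, sessions_up):
    """Return the paths a router still holds, keyed by reflector.

    `advertised_by_reflector` maps a reflector name to the path it
    currently advertises (None if it sent a withdrawal or is down).
    `sessions_up` is the set of reflectors the router has a session with.
    """
    return {
        reflector: path
        for reflector, path in advertised_by_reflector.items()
        if reflector in sessions_up and path is not None
    }


# State as described in the updates: rf-1-a1 is down, rf-2-a1 withdrew
# the route, and rf-3-a1 still advertises it (the path label is illustrative).
advertised = {
    "rf-1-a1": None,
    "rf-2-a1": None,
    "rf-3-a1": "next hop gsw-1-a9",
}

# With the session to rf-3-a1 up, the stale path stays installed.
print(installed_paths(advertised, {"rf-2-a1", "rf-3-a1"}))

# Cutting the session towards rf-3-a1 removes it.
print(installed_paths(advertised, {"rf-2-a1"}))
```

In this model, a single reflector that fails to send the withdrawal is enough to keep the dead path alive, which is consistent with the fix taking effect once the session towards rf-3-a1 was cut.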

We're cutting off all BGP sessions.

rf-3-a1#clear ip bgp *

Date: 2015-07-29 14:09:58 UTC
The outage was caused by human error. The deletion of the OSPF configuration cut off the GSW router.

The TH2 in Paris has taken over the traffic.

Posted Jul 29, 2015 - 14:07 UTC