The Globalswitch PoP is down. We're investigating...
Date: 2015-07-29 16:01:48 UTC Hello,
An incident has just occurred affecting routing on one of the 2 routers in Paris: gsw-1-a9. It was caused by human error. One of the engineers in the network team (that’s my team) accidentally deleted the OSPF configuration on the router, despite double-checking the configuration. He confirmed that he had done it without realising (on autopilot), and as a result the gsw-1-a9 router went down.
Everything should nevertheless have continued to function. But a bug on the 3rd route reflector, rf-3-a1, stopped that router from alerting the rest of the backbone that gsw-1-a9 was down. rf-2-a1 did announce it, but rf-1-a1 was down during the outage. The backbone therefore continued to behave as if the gsw-1-a9 router was up, and we saw routing loops in the traceroutes.
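For readers who want to spot the same symptom in their own traceroutes, here is a minimal sketch of one way to detect a loop: the path visits the same router more than once. It is not the tooling we used during the incident. The function name `find_routing_loop` and the addresses (taken from documentation ranges) are hypothetical.

```python
from typing import List, Optional


def find_routing_loop(hops: List[Optional[str]]) -> Optional[List[str]]:
    """Return the repeating cycle of hops if the path revisits a router.

    `hops` is the ordered list of responding addresses from a traceroute;
    None marks a hop that did not answer (usually displayed as '*').
    Returns None when no address appears twice.
    """
    first_seen = {}
    for index, hop in enumerate(hops):
        if hop is None:
            continue  # unanswered hops carry no information
        if hop in first_seen:
            # The cycle runs from the first sighting up to this repeat.
            start = first_seen[hop]
            return [h for h in hops[start:index] if h is not None]
        first_seen[hop] = index
    return None


# Hypothetical traces: packets bouncing between two routers, then a clean path.
looping = ["192.0.2.1", "198.51.100.7", "198.51.100.9",
           "198.51.100.7", "198.51.100.9"]
healthy = ["192.0.2.1", "198.51.100.7", "203.0.113.5"]

print(find_routing_loop(looping))  # ['198.51.100.7', '198.51.100.9']
print(find_routing_loop(healthy))  # None
```

A repeated address is only a hint: it suggests that two or more routers each believe the other is the next hop, which is what can happen when part of the backbone still holds routes towards a router that is actually down.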
We reset all BGP sessions on rf-3-a1, but since only rf-2-a1 was synchronising BGP between all the routers in Europe (rf-1-a1 and gsw-1-a9 being down), the European network was intermittent: each router was alternately reachable and unreachable for periods of 60-120 seconds.
Everything then came back up, and we then reconfigured the gsw-1-a9 router.
The backbone is up.
Please accept our sincere apologies for this outage. Human error can happen, and the backbone is meant to withstand this type of issue. We will investigate to find the bug on our RR (ASR1002). Then we will give the team a good hiding…
Date: 2015-07-29 15:07:04 UTC Everything is up.
Date: 2015-07-29 15:06:57 UTC Summary of the gsw-1-a9 configuration:
We cut off the BGP sessions with PNI and Transit.
We reinstated the OSPF configuration.
We're re-establishing the BGP sessions with the peers.
Date: 2015-07-29 15:05:29 UTC Resetting rf-3-a1 fixed the communication problem which should have been fixed by isolating the gsw-1-a9 router.
Traffic is back to normal. Mainly the connections managed by gsw-1-a9 were impacted:
- 50% Free
- 50% Orange
- 30% Telefonica (Backup)
- 50% Google Europe