We have a problem with our TH2 PoP.
We isolated th2-1-a9 at 12:00 (OSPF first, then BGP with all peers).
We did a switchover of an RSP on this same a9, but we then detected some BFD errors.
Impact may have been felt from 11:07 CEST to 11:29 CEST on all services. Some other impacts may have been felt afterwards on the telecom service.
We opened a P1 ticket with our provider, who performed a diagnostic on the router.
Our provider detected a problem with the FIB manager: it was blocked by a bug, which then caused a parity error across the whole router and triggered the incident.
With our provider, we decided to reload all locations (line cards and route processors) of the router.
We are now de-isolating the BGP sessions and then the OSPF sessions.
We are also monitoring all logs on the router.
Date: 2017-08-22 12:13:22 UTC
We have carried out some investigations into this and, to be clear, there is no SPOF in the architecture.
In Paris alone, we have all the peerings and all the connections spread across 2 PoPs which are 20km apart. The DCs are connected to 3-6 PoPs simultaneously.
In Paris, there are 2 pairs of routers that function independently from one another. As proof, you can see clearly in the travaux task that the TH2 routers were completely isolated for 4 hours. For those 4 hours, the TH2 infrastructure was not in use.
Were you cut off from the world for the 4 hours that TH2 was under maintenance with the Cisco TAC engineers? NO, cutting off a router had no impact. So no SPOF.
Each router on our backbone pushes between 300Gbps and 700Gbps. If the electricity is cut off from a router, the network needs a bit of time to recalculate. How much time, if there is a power outage at a PoP? A few seconds have to be counted for the OSPF adjacency to drop and for the network to recalculate, but above all it’s the 120 seconds of the BGP hold timer that takes time.
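A back-of-the-envelope sketch of those orders of magnitude. The 120 seconds of BGP is from the text; the 5-second OSPF figure stands in for "a few seconds" and is an assumption here:

```python
# Rough convergence timings for a hard power failure, using the
# figures quoted above (illustrative values, not measurements).

OSPF_RECALC_SECONDS = 5   # "a few seconds" for OSPF to notice and recompute (assumed)
BGP_HOLD_SECONDS = 120    # the "120 seconds of BGP" mentioned above

def worst_case_blackout(ospf_recalc=OSPF_RECALC_SECONDS, bgp_hold=BGP_HOLD_SECONDS):
    """IGP-only traffic reroutes after the OSPF recalculation; traffic that
    depends on BGP routes learned from the dead router can be black-holed
    until the BGP hold timer expires. The worst case is the slower of the two."""
    return max(ospf_recalc, bgp_hold)

print(worst_case_blackout())  # -> 120
```

This is why the BGP hold timer, not OSPF, dominates the recovery time after a clean power cut.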
What we had was a bug, and that’s more annoying because the outage isn’t as simple as an electrical outage. We can never predict how the network will behave, as this depends on the nature of the bug. In this case, it was especially complicated :(
And yet, to isolate the routers immediately, we use an even faster protocol than OSPF and BGP: BFD checks that the links are UP every 100ms, 10x per second. If a failure is detected, everything is dropped in less than 100ms, and nothing is noticeable. You can’t imagine how many link-incident tickets we handle every day without any effect on prod.
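The arithmetic behind BFD’s speed can be sketched as follows. A session is declared down after a detect multiplier of consecutive intervals with no packet received; the 100ms interval is from the text, while the multiplier of 3 is a common default and an assumption here (with it, detection takes a few hundred milliseconds - still orders of magnitude below BGP’s timers):

```python
# BFD failure detection time: the session is declared down after
# detect_mult consecutive tx intervals with no packet received.
# 100 ms interval is from the text; detect_mult = 3 is a common
# default and an assumption here.

def bfd_detection_time_ms(tx_interval_ms: int = 100, detect_mult: int = 3) -> int:
    """Worst-case time to declare a BFD session down, in milliseconds."""
    return tx_interval_ms * detect_mult

print(bfd_detection_time_ms())        # -> 300
print(bfd_detection_time_ms(100, 1))  # -> 100
```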
To make things even faster, we run the links between our DCs and the PoPs over 2 fibre-optic paths. Each 100G is sent via 2 paths of 300-800km, and on the far side the system detects whether the optical signal is still being received. If there is a break, boom, traffic drops onto the second path.
So why did the TH2 issue affect prod? As explained earlier, there was a software bug... in the BFD itself. The TH2 router’s BFD process crashed. Normally, the router should have isolated itself.
But it didn’t do that. TH2 kept receiving the BFD packets from the RBX, GRA and SBG routers, but it stopped sending its own BFD packets to those routers. As a result, RBX, GRA and SBG isolated TH2, but TH2 didn’t do the same. We had to intervene manually on the infrastructure, analyse the situation and make the right decisions immediately.
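The failure mode described above can be modeled as a one-sided transmit failure: each side declares the session down only when *it* stops receiving, so a router whose BFD process stops transmitting (but keeps receiving) gets isolated by its peers without ever isolating itself. A toy sketch - names and the detect multiplier are illustrative, this is not router code:

```python
class BfdPeer:
    """Toy model of one end of a BFD session: the peer declares the
    session down after detect_mult consecutive missed intervals."""
    def __init__(self, name: str, detect_mult: int = 3):
        self.name = name
        self.detect_mult = detect_mult
        self.missed = 0
        self.session_up = True

    def tick(self, packet_received: bool) -> None:
        if packet_received:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.detect_mult:
                self.session_up = False  # this end isolates its neighbour

# TH2's BFD process crashed: it keeps *receiving* but stops *sending*.
th2, rbx = BfdPeer("th2-1-a9"), BfdPeer("rbx-a9")
for _ in range(5):
    th2.tick(packet_received=True)   # RBX still transmits, so TH2 sees nothing wrong
    rbx.tick(packet_received=False)  # TH2 no longer transmits

print(th2.session_up, rbx.session_up)  # -> True False
```

RBX tears the session down while TH2 still believes everything is fine, which is exactly the asymmetric isolation that required manual intervention.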
In a case like this, that still takes too long - humans have an unfortunate tendency to weigh their options before acting. We cut TH2’s BGP and OSPF and isolated the router, and then everything came back up, since everything is doubled with overcapacity on GSW, where we have exactly the same router, worth several million euros, which did its job.
On the backbone, we will update the routers over the month of August, as this has been planned for a long time. The latest version fixes the parity-check issues, and we will find out more about the bug that affected us - whether it has already been fixed or will be fixed.
Date: 2017-07-26 12:39:32 UTC
All is stable now; we are still monitoring th2-1-a9.