[Worldwide] Network OVH

Incident Report for Network & Infrastructure

Resolved

Start time: 13/10/2021 07:20 UTC
Impact: Since 07:20 UTC this morning, the entire OVH network has been unavailable. We are experiencing a network incident located in the United States. All technical teams are working to resolve the incident.
Comment: Since 08:22 UTC, all services have been gradually returning following the isolation of a piece of network equipment in the US.

=========================

For several weeks we have been experiencing heavy DDoS attacks, which are being mitigated every day.

To strengthen our defense mechanisms, we have been continuously updating our configurations to raise the level of protection we provide to our customers.

A change had been prepared and validated by our Change Advisory Board (CAB), with the appropriate Method of Procedure (MOP), and peer reviewed (announced on 2021-10-12 at 16:28 CET):
http://travaux.ovh.net/?do=details&id=53785

2021-10-13 09:05 CET - The scheduled change starts as expected within its maintenance window (http://travaux.ovh.net/?do=details&id=53785)
2021-10-13 09:18 CET - The change actions are being processed as expected (BGP isolation, changes, configuration updates)
2021-10-13 09:20 CET - During the route-map modification, an issue occurred: the router did not take the last digit of the entry into account. The route-map was intended to control the redistribution of BGP IPv4 routes into OSPF. IPv6 traffic was not affected.
2021-10-13 09:21 CET - The team detects an issue with the router's behavior and escalates immediately
2021-10-13 09:25 CET - Beginning of the crisis management process, in full compliance with our procedures (the delay before the crisis was declared is due to the buffer we allow for convergence time)
2021-10-13 09:30 CET - The rollback procedure does not work, so we decide to physically shut down the affected device and request on-site assistance to do so
2021-10-13 09:45 CET - The DC team makes its way to the telco room to launch mitigation plan 2
2021-10-13 10:00 CET - The DC technician kicks off operations in the telco room (3:00 am local time)
2021-10-13 10:02 CET - The first request is to unplug the optical equipment in order to isolate the connectivity and get service back up
2021-10-13 10:10 CET - In the end, we decide to power off the faulty router
2021-10-13 10:18 CET - The faulty device is shut down (convergence takes 2 minutes)
2021-10-13 10:20 CET - First services restored
2021-10-13 10:30 CET - Stabilization of the connectivity in order to restore all the remaining services
2021-10-13 10:57 CET - End of the crisis from a technical perspective
2021-10-13 10:30 CET - Ongoing actions to finalize and sanity-check our network and to finish restoring some remaining non-blocking services (Travaux tasks will follow up on these actions)
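The key intervals in the timeline above can be worked out directly from the listed timestamps. The short Python sketch below is only an illustration: the event labels are shorthand introduced here, and the times are copied from the timeline as published.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the timeline above (local time as published).
# The keys are illustrative labels, not terms used in the report.
EVENTS = {
    "change_started": "2021-10-13 09:05",
    "issue_occurred": "2021-10-13 09:20",
    "crisis_declared": "2021-10-13 09:25",
    "device_shut_down": "2021-10-13 10:18",
    "first_services_restored": "2021-10-13 10:20",
    "end_of_crisis": "2021-10-13 10:57",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two labelled events."""
    t0 = datetime.strptime(EVENTS[start], FMT)
    t1 = datetime.strptime(EVENTS[end], FMT)
    return int((t1 - t0).total_seconds() // 60)


if __name__ == "__main__":
    for end in ("crisis_declared", "device_shut_down",
                "first_services_restored", "end_of_crisis"):
        print(f"issue_occurred -> {end}: {minutes_between('issue_occurred', end)} min")
```

Measured this way, the first services came back 60 minutes after the issue occurred, and the technical end of the crisis followed 97 minutes after it.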

OVHcloud operates a global backbone reaching all continents. To give its customers the best possible reach, the backbone is fully meshed.
• By nature, this mesh means that all the routers participating in the backbone are directly or indirectly connected to one another and constantly exchange routing information.

During the outage, the full Internet routing table was being announced in the OVHcloud IGP. The massive influx of routing information into the IGP led some routers to misbehave: the OSPF table filled up, overloading RAM and CPU. Only IPv4 routing was impacted; IPv6 traffic remained reachable.
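To illustrate how losing a single digit in a filter entry can widen what gets redistributed, here is a minimal Python sketch. It is purely hypothetical: the report does not say which entry or which digit was affected, the prefixes are documentation and private ranges chosen for the example, and `redistributed` is a toy function, not router behavior.

```python
import ipaddress


def redistributed(bgp_routes, filter_entries):
    """Return the BGP routes that fall inside at least one filter entry."""
    # strict=False lets a shortened entry such as "192.0.2.0/2" parse by
    # masking off the host bits, giving the much wider 192.0.0.0/2.
    filters = [ipaddress.ip_network(e, strict=False) for e in filter_entries]
    selected = []
    for route in bgp_routes:
        net = ipaddress.ip_network(route)
        if any(net.subnet_of(f) for f in filters):
            selected.append(route)
    return selected


# Hypothetical BGP routes (all IPv4, matching the IPv4-only impact).
BGP_ROUTES = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24", "10.10.0.0/16"]

# Hypothetical intended entry, and the same entry with its last digit dropped.
INTENDED_ENTRY = "192.0.2.0/24"
TRUNCATED_ENTRY = INTENDED_ENTRY[:-1]  # "192.0.2.0/2"

if __name__ == "__main__":
    print("intended :", redistributed(BGP_ROUTES, [INTENDED_ENTRY]))
    print("truncated:", redistributed(BGP_ROUTES, [TRUNCATED_ENTRY]))
```

In this toy case the intended entry lets one route through, while the truncated entry lets three through. Scaled up to a full Internet routing table, the same kind of widening would push far more routes into an IGP than it is sized to carry.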

Our newer routers started to use D2 VIN as the default gateway for all Internet traffic, causing that traffic to flow to the US. This led to an inability to process IPv4 traffic properly on all our sites.

We were able to regain control of the situation very quickly once we had physical access to the faulty equipment and isolated it from the network.
(Once the D2 was taken offline, the network reconverged, emptying the OSPF tables on the devices and routing traffic back to the nominal gateways.)

Our immediate action is to reassess our validation procedure for this type of device (which applies and commits each command line natively) and to reinforce the change process accordingly.
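One possible form of such a check, which the report itself does not describe, is a read-back comparison: after each line is applied, the line the device reports is compared with the line that was intended, and the change stops at the first mismatch. The Python sketch below is an assumption-laden illustration; the function names and the sample lines are hypothetical and do not correspond to any vendor's CLI syntax.

```python
def normalize(line: str) -> str:
    """Collapse runs of whitespace so only meaningful differences remain."""
    return " ".join(line.split())


def find_mismatches(intended, applied):
    """Return (intended, applied) pairs that differ after normalization.

    A line missing from `applied` is reported as an empty string.
    """
    mismatches = []
    for index, want in enumerate(intended):
        got = applied[index] if index < len(applied) else ""
        if normalize(want) != normalize(got):
            mismatches.append((want, got))
    return mismatches


if __name__ == "__main__":
    # Hypothetical lines: what the operator meant to apply ...
    intended = ["entry 10 permit 192.0.2.0/24", "entry 20 deny any"]
    # ... and what the device reports back, with one last digit missing.
    applied = ["entry 10 permit 192.0.2.0/2", "entry 20  deny any"]

    problems = find_mismatches(intended, applied)
    if problems:
        print("Abort change, mismatched lines:", problems)
    else:
        print("All lines applied as intended")
```

On a device that commits each line as it is entered, a check like this only helps if it runs line by line, before the next command is sent.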

As this incident impacted our customers using the IPv4 protocol, our teams across the globe have been following the situation as closely as possible to help them recover and keep them up to date.

We sincerely apologize for the inconvenience.

Update(s):

Date: 2021-10-13 09:49:56 UTC
On the morning of October 13th at 9:12 am (CET/Paris time), we carried out interventions on a router in our Vint Hill datacenter in the United States, which led to disruptions across our entire network. These interventions were aimed at reinforcing our protections against DDoS attacks, which have been particularly intense in recent weeks.
The OVHcloud teams quickly intervened to isolate the equipment at 10:15 am. Services have been restored since this intervention.
We are currently conducting a verification campaign with our customers to confirm that all their services have been restored.
We sincerely apologize to all our customers affected by this incident, and we commit to being as transparent as possible about its causes and consequences.

Date: 2021-10-13 09:20:20 UTC

Start time: 13/10/2021 07:20 UTC
End time: 13/10/2021 08:22 UTC
Impact: Between 07:20 UTC and 08:22 UTC, the entire OVH network was unavailable. We were confronted with a network incident located in the United States. Most services are now available. Each of the remaining services still experiencing disruptions will be tracked in its own Travaux task.

Posted Oct 13, 2021 - 08:30 UTC