[Global][Infrastructure] - Backbone incident notification

Incident Report for Network & Infrastructure

Postmortem

Summary

On October 30th, a faulty route announcement that began at 13:16 UTC caused symptoms ranging from increased latency to packet loss for some customers from 13:23 UTC. Several telecom providers were affected to varying degrees.

The root cause of this incident is a defective configuration in a peer network used by OVHcloud (Worldstream).

More specifically, our peer announced the full Internet routing table through its peering session instead of advertising only its own routes. As a result, a significant amount of traffic preferred the Worldstream peering, which saturated our peering links.
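The traffic shift described above can be illustrated with a minimal sketch of route selection. It assumes, purely for illustration, that routes learned from peers are given a higher local preference than routes learned from transit, which is a common practice but is not stated in this report. The `Route` class, the `best_route` function, and all prefixes and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of one candidate route."""
    prefix: str
    learned_from: str   # "peer" or "transit" (illustrative labels)
    local_pref: int     # higher is preferred
    as_path_len: int    # shorter is preferred


def best_route(candidates):
    """Simplified best-path choice: highest local preference first,
    then shortest AS path. Real BGP applies further tie-breakers."""
    return max(candidates, key=lambda r: (r.local_pref, -r.as_path_len))


# Assumed values: peer routes get local_pref 200, transit routes 100.
transit = Route("203.0.113.0/24", "transit", local_pref=100, as_path_len=2)
leaked = Route("203.0.113.0/24", "peer", local_pref=200, as_path_len=4)

before = best_route([transit])          # only the transit route is known
after = best_route([transit, leaked])   # the leaked route is now also known

print(before.learned_from, "->", after.learned_from)
```

Under these assumptions, the leaked route wins even though its AS path is longer, so traffic for that destination moves onto the peering link. Repeated across a full table, this is one plausible way a peering link can saturate.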

OVHcloud network teams quickly identified the issue and put a mitigation in place at 13:38 UTC, which fixed the issue for customers. The incident was fully closed at 13:40 UTC.

Following this event and an in-depth post-incident analysis, OVHcloud network teams identified that the maximum prefix-limit configuration was not applied on one peering link with Worldstream, which made it possible to learn the full table. The faulty configuration has been updated to prevent this issue from recurring.

To further secure the OVHcloud network, a full review of peering device configurations has been performed, and automatic mitigation has been put in place where relevant (automatic shutdown of the peer session when the number of announced routes exceeds a certain limit).
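The automatic mitigation described above can be sketched as follows. This is a minimal model of a maximum prefix limit, not OVHcloud's actual configuration or tooling: the `PeerSession` class, the `simulate` helper, the session name, and the thresholds are all hypothetical, and the generated prefixes are synthetic.

```python
from dataclasses import dataclass, field


@dataclass
class PeerSession:
    """Hypothetical model of one peering session with a prefix limit."""
    name: str
    max_prefixes: int               # illustrative threshold, not a real value
    established: bool = True
    prefixes: set = field(default_factory=set)

    def receive(self, prefix: str) -> bool:
        """Accept a route, or shut the session if the limit would be exceeded."""
        if not self.established:
            return False
        if prefix not in self.prefixes and len(self.prefixes) >= self.max_prefixes:
            # Limit exceeded: discard everything learned and shut the session.
            self.prefixes.clear()
            self.established = False
            return False
        self.prefixes.add(prefix)
        return True


def simulate(announced: int, max_prefixes: int) -> PeerSession:
    """Feed `announced` distinct synthetic prefixes into a fresh session."""
    session = PeerSession("peer-example", max_prefixes)
    for i in range(announced):
        if not session.receive(f"10.{i // 256}.{i % 256}.0/24"):
            break
    return session


normal = simulate(announced=50, max_prefixes=100)   # stays up
leak = simulate(announced=500, max_prefixes=100)    # shut at the 101st route
print(normal.established, len(normal.prefixes))
print(leak.established, len(leak.prefixes))
```

With a limit in place, a peer that suddenly announces far more routes than expected loses its session instead of attracting traffic, which is the behaviour the missing parameter would have provided on the affected link.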

The root cause of this incident is a defective configuration in a peer network (Worldstream), combined with a maximum prefix-limit configuration that was not applied on one peering link with Worldstream.

Timeline
30/10/2024 13:16 UTC - Wrong peering announcement starts at Worldstream. No impact for customers.
30/10/2024 13:23 UTC - Start of incident. Instabilities detected in the OVHcloud network. Customers start logging tickets as they experience latency or have difficulty reaching OVHcloud services.
30/10/2024 13:27 UTC - Investigation and monitoring analysis started by the OVHcloud network team. Customers are experiencing latency or having difficulty reaching OVHcloud services.
30/10/2024 13:38 UTC - Issue identified and mitigation engaged to fix it.
30/10/2024 13:40 UTC - End of incident.

Incident duration : 17 minutes
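The duration above can be checked from the timeline with a short calculation. The timestamps are taken from the timeline; the dictionary keys and the `minutes_between` helper are labels chosen here for illustration.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M"

# Timestamps (UTC) copied from the timeline; the key names are illustrative.
timeline = {
    "announcement_start": "30/10/2024 13:16",
    "incident_start": "30/10/2024 13:23",
    "investigation_start": "30/10/2024 13:27",
    "mitigation": "30/10/2024 13:38",
    "incident_end": "30/10/2024 13:40",
}
parsed = {name: datetime.strptime(ts, FMT) for name, ts in timeline.items()}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    return int((parsed[end] - parsed[start]).total_seconds() // 60)


print(minutes_between("incident_start", "incident_end"))      # customer impact
print(minutes_between("announcement_start", "mitigation"))    # leak to mitigation
```

The first value is the 17 minutes reported as the incident duration. The second, 22 minutes, is the time between the first wrong announcement and the mitigation.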

Improvement and action plan

Update of the faulty peering device configuration to add the missing threshold parameter => DONE
Double check of all other peering devices to ensure that the threshold parameter is present and correctly configured => DONE

Posted Oct 30, 2024 - 16:14 UTC

Resolved

All reported impacted network & services are operational.

Start time : 30/10/2024 13:23 UTC
End time : 30/10/2024 13:40 UTC

We thank you for your understanding and patience throughout this incident.

Posted Oct 30, 2024 - 14:52 UTC

Identified

A fix has been implemented.
Most of the impacted network & services are operational again.

We are working hard on the remaining ones.

Posted Oct 30, 2024 - 14:05 UTC

Investigating

An incident is in progress on our backbone infrastructure.

Here are the details of this incident :

Start time : 30/10/2024 13:23 UTC
Service impact : You may have experienced instabilities in our network and services.
Ongoing actions : Investigating

Our teams are fully mobilized and we will keep you informed of developments and the resolution of the incident.

We apologize for the inconvenience and thank you for your understanding.

Posted Oct 30, 2024 - 13:51 UTC

This incident affected: Infrastructure || Backbone.