On October 30th, 2024, some customers experienced symptoms ranging from increased latency to packet loss, following a faulty peering announcement that started at 13:16 UTC. The incident affected several telecom providers to varying degrees.
The root cause of this incident was a defective configuration in a peer network (Worldstream) used by OVHcloud.
More specifically, our peer announced the full Internet routing table through its peering session instead of advertising only its own routes. As a result, a significant amount of traffic preferred the Worldstream peering path, which saturated our peering links.
OVHcloud network teams quickly identified the issue and put a mitigation in place at 13:38 UTC, which fixed the issue for customers. The incident was fully closed at 13:40 UTC.
Following this event, a post-incident analysis by OVHcloud network teams found that the maximum prefix-limit configuration had not been applied on one peering link with Worldstream, which made it possible to learn the full table. The faulty configuration has been updated to prevent a recurrence.
To further secure the OVHcloud network, a full review of peering device configurations has been performed, and automatic mitigation has been put in place where relevant (automatic shutdown of a peer session when the number of routes announced exceeds a certain limit).
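The automatic mitigation described above can be illustrated with a minimal Python sketch. It is not OVHcloud's actual implementation or router configuration: the `PeerSession` class, the `simulate_leak` helper, the peer name, and the limit values are all hypothetical and only model the general idea of a maximum prefix limit, where a session is shut down as soon as the peer announces more routes than expected.

```python
from dataclasses import dataclass, field


@dataclass
class PeerSession:
    """Hypothetical model of a peering session guarded by a prefix limit."""
    name: str
    max_prefixes: int  # illustrative threshold, not a real OVHcloud value
    established: bool = True
    prefixes: set = field(default_factory=set)

    def receive(self, prefix: str) -> bool:
        """Accept a route unless the session is down or the limit is exceeded."""
        if not self.established:
            return False
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Limit exceeded: shut the session and drop all routes learned from it.
            self.established = False
            self.prefixes.clear()
            return False
        return True


def simulate_leak(limit: int, announced: int) -> PeerSession:
    """Feed `announced` distinct made-up prefixes into a session with `limit`."""
    session = PeerSession(name="example-peer", max_prefixes=limit)
    for i in range(announced):
        session.receive(f"10.{i // 256}.{i % 256}.0/24")
    return session


normal = simulate_leak(limit=100, announced=50)
leak = simulate_leak(limit=100, announced=500)
print(normal.established, len(normal.prefixes))
print(leak.established, len(leak.prefixes))
```

In this sketch, a peer that stays within its limit keeps its session, while a peer that suddenly announces far more routes than expected loses the session and all routes learned from it, so traffic can no longer be attracted to that link.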
In summary, the incident resulted from a defective configuration in a peer network (Worldstream) combined with a maximum prefix-limit configuration that was not applied on one peering link with Worldstream.
Timeline
30/10/2024 13:16 UTC - Wrong peering announcement starting at Worldstream. No impact for customers.
30/10/2024 13:23 UTC - Start of incident. Detection of instabilities in the OVHcloud network. Customers start logging tickets as they experience latency or have difficulty reaching OVHcloud services.
30/10/2024 13:27 UTC - Investigation and monitoring analysis started by the OVHcloud network team. Customers are experiencing latency or having difficulty reaching OVHcloud services.
30/10/2024 13:38 UTC - Identification of the issue and mitigation engaged to fix the issue.
30/10/2024 13:40 UTC - End of incident.
Incident duration: 17 minutes (13:23 UTC to 13:40 UTC)
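The 17-minute figure is measured from the start of customer impact (13:23 UTC), not from the first wrong announcement (13:16 UTC). The short Python sketch below recomputes the intervals from the timeline timestamps. The helper names `parse_utc` and `minutes_between` are illustrative only.

```python
from datetime import datetime, timezone

FMT = "%d/%m/%Y %H:%M"  # same day/month/year layout as the timeline


def parse_utc(stamp: str) -> datetime:
    """Parse a timeline timestamp and mark it as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


announcement = parse_utc("30/10/2024 13:16")
incident_start = parse_utc("30/10/2024 13:23")
mitigation = parse_utc("30/10/2024 13:38")
incident_end = parse_utc("30/10/2024 13:40")

incident_minutes = minutes_between(incident_start, incident_end)
time_to_mitigation = minutes_between(incident_start, mitigation)
since_announcement = minutes_between(announcement, incident_end)

print(incident_minutes)    # 17
print(time_to_mitigation)  # 15
print(since_announcement)  # 24
```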