RBX

Incident Report for Network & Infrastructure

Resolved

Thursday, 09 November 2017, 22:08

Dear customer,
This morning, we experienced an incident on the optical network which connects our Roubaix (RBX) site with 6 of the 33 points of presence (POPs) on our network: Paris (TH2 and GSW), Frankfurt (FRA), Amsterdam (AMS), London (LDN) and Brussels (BRU).
The Roubaix site is connected via 6 fibre optic cables to these 6 POPs: 2x RBX<>BRU, 2x RBX<>LDN, 2x RBX<>Paris (1x RBX<>TH2 and 1x RBX<>GSW). These 6 fibre optic cables are connected to a system of optical nodes, which allows each cable to carry 80 x 100 Gbps.
For each 100G link connected to the routers, we use two optical paths that follow geographically distinct routes. If any fibre optic link is cut, the system reconfigures in 50 ms and all the links stay up.
To connect RBX to our POPs, we have 4.4 Tbps of capacity, 44x100G: 12x100G to Paris, 8x100G to London, 2x100G to Brussels, 8x100G to Amsterdam, 10x100G to Frankfurt, 2x100G to the GRA DC and 2x100G to the SBG DC.
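
As a quick sanity check of the figures above, the per-destination link counts can be summed in a few lines of Python. This is only an illustrative sketch: the dictionary and function names are made up for this example, and the counts are copied from the paragraph above.

```python
# Number of 100G links per destination, as listed in the report.
LINKS_100G = {
    "Paris": 12,
    "London": 8,
    "Brussels": 2,
    "Amsterdam": 8,
    "Frankfurt": 10,
    "GRA": 2,
    "SBG": 2,
}
GBPS_PER_LINK = 100


def total_capacity_tbps(links, gbps_per_link=GBPS_PER_LINK):
    """Return the aggregate capacity in Tbps for a {destination: link count} map."""
    return sum(links.values()) * gbps_per_link / 1000


total_links = sum(LINKS_100G.values())
print(f"{total_links} x 100G = {total_capacity_tbps(LINKS_100G)} Tbps")
```

The sum comes to 44 links, which matches the 4.4 Tbps quoted above.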

At 8:01, all the 100G links, 44x100G, were lost in one go. Given that we have a redundancy system in place, the root of the problem could not be the physical cut of all 6 optical fibres simultaneously. We could not run a remote diagnostic of the chassis because the management interfaces were not working. We had to intervene directly in the routing rooms themselves to work on the chassis: disconnect the cables between the chassis, restart the system and finally run the diagnostics with the equipment manufacturer. Attempts to reboot the system took a long time because each chassis needs 10 to 12 minutes to boot. This is the main reason that the incident lasted such a long time.

Diagnostic: all the interface cards that we use, ncs2k-400g-lk9 and ncs2k-200g-cklc, went into "standby" mode. This could have been due to a loss of configuration. We therefore recovered the backup and reset the configuration, which allowed the system to reconfigure all the interface cards. The 100G links on the routers came back up on their own and the RBX connection to the 6 POPs was restored at 10:34.
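
The outage window given in this letter (links lost at 8:01, connection restored at 10:34) can be turned into a duration with the Python standard library. This is a minimal sketch: the helper name `downtime` is hypothetical, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def downtime(start, end, fmt="%H:%M"):
    """Return the time elapsed between two same-day clock times."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


delta = downtime("8:01", "10:34")
hours, remainder = divmod(int(delta.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours} h {minutes} min")
```

The result agrees with the 2 hours and 33 minutes of downtime stated in this letter.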

There is clearly a software bug on the optical equipment. The database with the configuration is saved 3 times and copied to 2 monitoring cards. Despite all these security measures, the database disappeared. We will work with the OEM to find the source of the problem and help fix the bug. We do not doubt the equipment manufacturer, even if this type of bug is particularly critical. Uptime is a matter of design that must consider every eventuality, including when nothing else works. OVH must make sure to be even more paranoid than it already is in every system that it designs.

Bugs can exist, but incidents impacting our customers are not acceptable. This is an OVH issue, because despite all investments in the network, in the fibre, in the technologies, we still experienced 2 hours of downtime on all our infrastructure in Roubaix.

One solution could be to create 2 optical node systems instead of one. 2 systems mean 2 databases, so if we lose the configuration, only one system goes down. If 50% of the links went through one of the systems, today we would have lost 50% of the capacity rather than 100% of the links. This is one of the projects we started 1 month ago; the chassis have been ordered and will arrive in the coming days. We can start the configuration and migration work in 2 weeks. Given today's incident, this project has become a priority for all of our infrastructure: all DCs, all POPs.

In the cloud infrastructure business, only those who are vigilant to the point of paranoia will last. The quality of service comes down to two elements: the incidents which can be anticipated "by design", and the incidents where we learn from our mistakes. This incident has forced us to raise the bar even higher to get closer to "zero risk".

We are sincerely sorry for the 2 hours and 33 minutes of downtime on the RBX site. In the coming days, all customers affected by this incident will receive an email explaining how to activate their SLA.

Sincerely
Octave

Friday, 10 November 2017, 20:56
Incident Summary:
8:00: All 100G links to the Roubaix DC are down.
8:15: Unable to connect to the node.
8:40: We power-cycle the master frame.
9:00: The node is still unreachable.
9:15: We disconnect the cables from the management node.
9:30: We regain control of Roubaix.
9:40: We can see all the frames, but no alarm is raised on them and the circuit configuration has disappeared.
10:00: We inject the last database backup into the node.
10:15: The circuits start to come back up.
10:30: Most of the circuits are up; 8 are still down.
11:00: Some transponders cannot be detected by the system and an amplifier is out of order; the RMA for the amplifier is launched.
11:30: We reset all the unrecognized transponders; all circuits are up.
14:15: Replacement of the amplifier is completed.
14:30: All circuits are up, protections are functional and the last alarms have been dealt with.

Explanation:
According to the logs gathered from all the frames of the Roubaix node (20), it appears that we had 3 separate events cascading on the Roubaix node:

1. Node Controller CPU overload (master frame)
Each optical node has a master frame that allows exchanging information between nodes and swapping with its slave frames. On this master frame, the database is saved on 2 controller cards as well as the LCD.

From 7:50 a.m., we noticed that Roubaix started to have communication problems with the nodes directly connected to it and showed a CPU overload on the master frame. As of today, we are unsure what caused this CPU overload. Despite the SBG outage earlier, we are looking at all the potential causes. The manufacturer's teams are still investigating. We have scheduled a call on Saturday, November 11th, to find out more about the root cause.

2. Cascade switchover
Following the CPU overload of the node, the master frame performed a switchover of its controller boards. After this first controller switchover and the CPU overload, we came across a known Cisco software bug. This bug happens on large nodes and results in a controller switchover occurring every 30 seconds. Normally this switchover stabilizes by itself. This bug will be fully fixed by the 10.8 release to be available on November 31st.

3. Loss of the database
At 8:00 am, following the cascade switchover event, we came across another software bug, which de-synchronizes the timing between the 2 controller cards of the master frame. This bug caused a command to be sent to the controller cards ordering them to set the database to 0. The master frame controllers sent this new information to the slave frames and we lost all 100G links from Roubaix. This bug is fixed in release 10.7, which is now available.

Action Plan:

Here is the action plan that will be implemented with the manufacturer's advice:

-Two weeks ago, we launched the replacement of the Roubaix and Gravelines controllers with TNCS (instead of TNCE), bringing double the CPU power and double the RAM. We received the first 2 yesterday for Roubaix and we will do the swap as soon as possible, after validating the process with the manufacturer. We are also going to push the replacement of the controllers on the Strasbourg and Frankfurt nodes.

-We are now pushing the software upgrade on all nodes to go to 10.8

-Today we are using version 10.5.2.6; we have to go through an intermediate version, 10.5.2.7, before we can move to 10.7 or 10.8.

-We will split the large nodes (POP/DC) to have at least 2 node controllers per POP/DC
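
The software upgrade ordering described in the action plan (10.5.2.6, then the intermediate 10.5.2.7, then 10.7 or 10.8) can be sketched as below. This is illustrative only: `parse_version` and `upgrade_path` are hypothetical helpers, and the rule that any release older than 10.5.2.7 must pass through it is an assumption generalized from the single case mentioned above, not a documented vendor rule.

```python
INTERMEDIATE = "10.5.2.7"  # intermediate release named in the action plan


def parse_version(text):
    """Turn '10.5.2.6' into (10, 5, 2, 6) so releases compare numerically."""
    return tuple(int(part) for part in text.split("."))


def upgrade_path(current, target, intermediate=INTERMEDIATE):
    """Return the ordered list of releases to install to reach `target`.

    Assumption: a node running anything older than `intermediate` must
    install it first before moving on to a newer target release.
    """
    cur, mid, tgt = (parse_version(v) for v in (current, intermediate, target))
    path = []
    if cur < mid < tgt:
        path.append(intermediate)
    if cur < tgt:
        path.append(target)
    return path


print(upgrade_path("10.5.2.6", "10.8"))
```

Comparing versions as integer tuples rather than strings avoids ordering mistakes such as "10.10" sorting before "10.8".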

Summary:
Step 1: Replacement of TNCE on RBX/GRA (ETA: Monday 13th November evening for RBX, Tuesday 14th November evening for GRA)
Step 2: Software upgrade to 10.8 (ETA possible: 4 weeks)
Step 3: Split of the large nodes (ETA: TBA. It is necessary to decide the right strategy and establish the precise protocol and then work on the roadmap)

Potential Split Strategy:

It is possible to completely split the network into 2 fully independent networks at the management level (while keeping the possibility of splitting the nodes further inside each network). With a "smart" red and blue distribution of the optical lines between the 2 networks, each DC can reach each POP over 2 distinct networks.
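
To illustrate the red and blue idea, the sketch below splits the 100G links from Roubaix evenly between two hypothetical management networks and checks what survives if one of them loses its configuration. It is a simplified model, not the actual design: the link counts are the Roubaix figures quoted earlier in this report, the even split is an assumption, and the function names are invented for the example.

```python
# 100G links from Roubaix per destination, as quoted earlier in this report.
LINKS_100G = {
    "Paris": 12, "London": 8, "Brussels": 2, "Amsterdam": 8,
    "Frankfurt": 10, "GRA": 2, "SBG": 2,
}


def split_red_blue(links):
    """Assign each destination's links as evenly as possible to two networks."""
    plan = {}
    for dest, count in links.items():
        red = (count + 1) // 2  # red takes the extra link when count is odd
        plan[dest] = {"red": red, "blue": count - red}
    return plan


def surviving_links(plan, failed_network):
    """Return the links still up per destination when one network is lost."""
    other = "blue" if failed_network == "red" else "red"
    return {dest: nets[other] for dest, nets in plan.items()}


plan = split_red_blue(LINKS_100G)
after_red_failure = surviving_links(plan, "red")
print(sum(after_red_failure.values()), "of", sum(LINKS_100G.values()), "links remain")
print(all(count > 0 for count in after_red_failure.values()))
```

In this model every destination keeps at least one link when either network fails, which is the property the split is meant to guarantee.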

Update(s):

Date: 2017-11-20 13:57:08 UTC
Dear customers,

This morning an incident occurred on the optical fibre network which connects our Roubaix (RBX) data centre with 6 of the 33 points of presence (PoPs) that make up our backbone network: Paris (TH2 and GSW), Frankfurt (FRA), Amsterdam (AMS), London (LDN) and Brussels (BRU).

The RBX data centre is connected by 6 optical fibres to 6 PoPs: 2x RBX<>BRU, 2x RBX<>LDN, 2x RBX<>Paris (1x RBX<>TH2 and 1x RBX<>GSW). These links lead to a system of optical nodes that gives us 80 wavelengths of 100 Gbps on each fibre.

For each 100G link connected to the routers, we use 2 optical paths that are geographically separate. If a fibre is cut, for example during earthworks, the system reconfigures within 50 ms and all links remain active.
To connect Roubaix to the PoPs, we use 4.4 Tbps of capacity, that is 44 links of 100G each: 12x100G to Paris, 8x100G to London, 2x100G to Brussels, 8x100G to Amsterdam, 10x100G to Frankfurt, 2x100G to the Gravelines (GRA) data centre and 2x100G to the Strasbourg data centre.

At 8:01, all 44 of the 100G links suddenly lost connectivity. Given the redundancy system we have in place, the cause of the problem could not have been all 6 fibres being cut at once.
We could not run diagnostics remotely because the management interface was unavailable. We therefore had to intervene directly in the routing room, on the network equipment itself: we disconnected the network cables in order to restart the system and finally run diagnostics with the equipment vendor. The attempts to restart the equipment took a very long time, as each device took 10 to 12 minutes to boot. This was the main factor behind the duration of the outage.

Diagnostics: all the transponder cards we use, ncs2k-400g-lk9 and ncs2k-200g-cklc, went into "standby" mode. This happens when the configuration is lost. We therefore restored the previous configuration from backup, which allowed the system to reconfigure all the transponder cards.
Communication with the routers was restored, and the RBX connection to the six PoPs was re-established at 10:34.
The cause of the outage is a software bug in the network equipment. The configuration database is saved three times and copied to two monitoring cards. Despite all these safeguards, the database disappeared. We will continue working with the equipment manufacturer to find the cause of the problem and get the software bug fixed as quickly as possible. We are not withdrawing our trust in the equipment vendor, even though this type of bug is particularly critical. The required availability is a matter of design that accounts for every case, including situations where everything stops working. The limited-trust approach at OVH must be developed even further in all our designs.
Software bugs can exist; outages that affect our customers cannot. We are clearly dealing with a shortcoming on OVH's side, because despite significant investment in the network, fibre and technology, we have just experienced two hours of service interruption across our entire infrastructure in Roubaix.

One solution is to create 2 optical node systems instead of one. This means two databases, so a loss of configuration would take down only one system. If 50% of the links passed through one of the systems, today we would have lost 50% of the capacity rather than 100% of the connections.
This is one of the projects we started a month ago; the equipment has already been ordered and we expect delivery in the coming days. Within two weeks we will be able to start the configuration work and the migration. Given today's incident, this project becomes an absolute priority for our entire infrastructure, all data centres and points of presence (PoPs).

In the cloud provider business, only those who never fully trust are adequately protected. Quality of service is the result of two elements: all the incidents that follow from the infrastructure design, and the outages caused by shortcomings from which we learn. Today's incident prompts us to set the bar even higher so that we can reach a level of risk close to zero.
We are extremely sorry for today's service interruption of 2 hours 33 minutes at the Roubaix site. In the coming days, customers who were affected by the outage will receive an email regarding our SLA commitments.
Yours sincerely,
Octave Klaba

Posted Nov 20, 2017 - 11:20 UTC