OVHCloud Network Status

FS#4145 — gsw-4
Incident Report for Network & Infrastructure
Resolved
We have a routing problem on gsw-4.

Update(s):

Date: 2010-05-11 13:24:41 UTC
The card has been changed. The configuration is synchronized and BGP has been started.

Everything is up again.

We apologize for the length of the outage.
Hardware failures are not clear-cut.

Date: 2010-05-11 13:20:26 UTC
We are changing the card.

Date: 2010-05-11 13:18:51 UTC
We have shut down the remaining "up" port on the second card. Things seem better.
We have removed all routing through card 2. All of the clients
are up.

The problem is therefore probably a hardware fault on card #2 in the router,
so we will replace it in approximately 1 hour.

Date: 2010-05-11 13:12:26 UTC
Since approximately 21h, we have had a problem on gsw-4-c1 that affects 50% of our
customer racks at Global Switch. It sometimes affects gsw-3 as well.
We have moved the routing of our secondary DNS servers to a new router. Still down.
We have:
- looked for an attack targeting us and cannot find one
- looked for an attack coming from one of the clients, with the same result
- restarted one of the 2 routing cards, after which some ports
went into a fault state. This caused the second card to reboot,
and a few of its ports went into a fault state as well
- rebooted the entire router; 95% is back up

We are therefore betting on the following scenario: following this
morning's attack, something in the hardware was pushed to its limit and
broke in the afternoon.

We are looking for 2 spare routing cards and will replace the
cards one after the other. If we are lucky, that will bring everything back up. We think
the probability that the chassis itself is faulty is not zero.

In the first case (only the cards): everything restarts around midnight.
In the second case (the chassis): around 1:30 to 2:00 a.m.

Date: 2010-05-11 12:15:43 UTC
We are investigating.

Date: 2010-05-11 12:14:21 UTC
We have just been attacked. The attack is now blocked. It is not necessarily related to this morning's problem.

Date: 2010-04-29 09:27:49 UTC
The problem probably originates from the emergency maintenance task that
we performed this morning in Frankfurt on DECIX.
http://travaux.ovh.com/?do=details&id=4131

The shutdown/no shutdown of DECIX therefore probably caused
a small overload on the VSS while it recalculated the
BGP tables. This recurring VSS overload at the BGP level will be
resolved soon with the introduction of 2 ASR 1000 routers as
route collectors for the whole network. The ASR 1000 is a router
specifically designed for large BGP tables and many BGP operations.

Date: 2010-04-29 09:12:57 UTC
OVH comment - Thursday, April 29, 2010, 10:34

We are seeing some strange logs on a few routers concerning the IPs used
for the Global Switch routers.

Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22890: Apr 29 09:29:23 GMT: %COMMON_FIB-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22891: Apr 29 09:29:23 GMT: %COMMON_FIB-SW1_DFC8-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22892: Apr 29 09:29:23 GMT: %COMMON_FIB-SW2_DFC9-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22893: Apr 29 09:29:23 GMT: %COMMON_FIB-SW1_DFC9-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22894: Apr 29 09:29:23 GMT: %COMMON_FIB-SW2_DFC8-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22895: Apr 29 09:29:23 GMT: %COMMON_FIB-SW2_SPSTBY-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22896: Apr 29 09:29:23 GMT: %COMMON_FIB-SW1_SP-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22897: Apr 29 09:29:23 GMT: %COMMON_FIB-SW1_DFC1-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info

It seems that, this morning, the routers do not like 213.251.190.48/28 being
announced in both OSPF and BGP.
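
To see at a glance which prefix and which modules are reporting the recursion, log lines like the ones above can be tallied with a short script. This is only a sketch: the regular expression is written against the lines quoted in this report, the function name summarize and the "(none)" label for lines without a module tag are our own choices, and other IOS versions may format the message differently.

```python
import re
from collections import Counter

# Matches the FIB_RECURSION_VIA_SELF messages quoted in this report.
# The module tag (e.g. SW1_DFC8) is optional: the first quoted line has none.
LOG_RE = re.compile(
    r"%COMMON_FIB-(?:(?P<module>\w+)-)?(?P<severity>\d)-FIB_RECURSION_VIA_SELF: "
    r"(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+) is found to resolve via itself"
)


def summarize(lines):
    """Count matching messages per (prefix, module) pair; other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match:
            module = match.group("module") or "(none)"  # label is an assumption
            counts[(match.group("prefix"), module)] += 1
    return counts


# Three sample lines: two taken from the log above, one unrelated line.
SAMPLE = [
    "Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22890: Apr 29 09:29:23 GMT: "
    "%COMMON_FIB-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to "
    "resolve via itself during setting up switching info",
    "Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22891: Apr 29 09:29:23 GMT: "
    "%COMMON_FIB-SW1_DFC8-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found "
    "to resolve via itself during setting up switching info",
    "some unrelated log line",
]

if __name__ == "__main__":
    for (prefix, module), n in sorted(summarize(SAMPLE).items()):
        print(prefix, module, n)
```

If a single prefix shows up across every module, as it does here, the route itself is the likely suspect rather than one line card.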

We have just removed the BGP announcement and kept only the OSPF one.
Another nice bug.
Posted Apr 29, 2010 - 08:57 UTC