FS#9224 — RBX1 internal network

Scheduled Maintenance Report for Network & Infrastructure

Completed

As part of
https://status.ovh.co.uk/?do=details&id=5292
we are going to change the OSPF AREA of all
the small routers in RBX1.

Update(s):

Date: 2013-08-30 14:04:58 UTC
All done.

Date: 2013-08-30 14:04:46 UTC
Old RBX/GRA network shut down. The traffic is
flowing well between the DCs and across the new
internal network.

Date: 2013-08-30 14:02:58 UTC
Done.

Old SBG/RBX network shut down. Traffic mitigated by
VAC2 in SBG and VAC3 in BHS is correctly reaching
servers hosted in RBX1. Yes :)
Packets from SBG to RBX are passing over the new network.

Date: 2013-08-30 13:59:12 UTC
Done.

We can now isolate the internal network
from the backbone.

Date: 2013-08-30 13:55:14 UTC
Configuration finished. We have no more issues.

We will now isolate RBX1 from the backbone,
as on the routers in all other DCs.

Date: 2013-08-30 09:09:06 UTC
We will now finish reconfiguring the
remaining 42 routers. To do so, we will
set up the BGP configuration directly, then
shut down OSPF once it has been verified.
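The order of operations described in this update (BGP first, checks, then OSPF off) can be sketched as follows. This is only an illustration: the Router class, the migrate function and the bgp_verified check are hypothetical names, not the tooling actually used, and a real check would inspect BGP sessions and received routes on the device.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical in-memory model of one small router."""
    name: str
    ospf_enabled: bool = True
    bgp_enabled: bool = False
    log: list = field(default_factory=list)


def bgp_verified(router):
    # Placeholder check (assumption): here we only look at the flag.
    return router.bgp_enabled


def migrate(router, verify=bgp_verified):
    """Configure BGP, verify it, and only then shut down OSPF."""
    router.bgp_enabled = True
    router.log.append("bgp configured")
    if not verify(router):
        # Keep OSPF running so the router is never left without routing.
        router.log.append("verification failed, ospf kept")
        return False
    router.ospf_enabled = False
    router.log.append("ospf shut down")
    return True


# 42 placeholder routers, matching the number still to be reconfigured.
routers = [Router(f"router-{i}") for i in range(1, 43)]
results = [migrate(r) for r in routers]
```

The point of the ordering is that OSPF is only removed from a router whose BGP configuration has passed its check, so a failed check leaves the old routing in place.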

Date: 2013-08-30 01:36:51 UTC
In RBX1 we have a very particular network configuration based on 2 routers, rbx-1-6k and rbx-2-6k.
These 2 routers manage the interconnection for about 120-130 small routers.
This architecture has been in use since 2006 and exists only in RBX1. Its peculiarities made it complicated to set up all the new services (VAC, the vrack, etc.).
We had to simplify this configuration until we replace all these routers with 4 big ones, as we do in all other DCs (the 4 routers arrived 2 weeks ago and we expect to switch RBX1 over by the end of September).

We knew that simplifying this configuration would have an impact on the availability of the RBX1 DC, and we knew we would have to replace some routers with spares.
So we chose to perform the intervention during the day, when we have the maximum staff available, in order to intervene quickly on the hardware.
And it worked. In the end we had to remove OSPF completely and use only BGP.

88 routers have been reconfigured; 42 more remain.
We will perform the final configuration of these last 42 routers without causing new failures, then we will remove the old configuration.
After this experience, we wonder whether we should have done it this way right from the beginning...

The RBX1 problem impacted 2 other routers managing the vrack 1.0, which had not been updated for 2-3 months.
With an uptime of several years and with today's problems, we had RAM fragmentation and had to restart them.

The other DCs have not been impacted.
The problem affected the RBX1 DC for part of the evening, as well as the vrack 1.0 / IP LB.
VAC1 didn't work properly during this period.

We apologise for the outages caused and for their duration.

Date: 2013-08-29 23:31:20 UTC
Internal network resumed. RBX1 network is stable.

Date: 2013-08-29 23:31:05 UTC
Communication between the RBX routers is going through the backbone instead of the internal network. We are investigating in order to fix this problem.

Date: 2013-08-29 23:25:42 UTC
At least 1 router out of 2 is UP. OSPF is cut, and everything works, but on BGP.

Date: 2013-08-29 23:24:34 UTC
2 routers remain to be brought back up.

Date: 2013-08-29 23:24:05 UTC
While we try to resume OSPF router by router, we are adding the BGP configuration in order to remove OSPF from these small routers.

Date: 2013-08-29 23:20:26 UTC
We have finished with the routers of the first VLAN (70 routers), but nothing is going right.

The OSPF process could not be resumed. We are now trying to simplify the configuration to avoid announcing LSAs, then restart the routers that crashed. We already have some routers UP, but too many remain to be done.

Date: 2013-08-29 16:40:07 UTC
rbx-7/8/9/10/11/12/14/15/16/17-m1/m2: done
There is some card damage on rbx-14, rbx-4 and rbx-3.
We are replacing the damaged cards with spares.

Date: 2013-08-29 16:38:23 UTC
rbx-3/4/5/6-m1/m2 done
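
The router lists in these updates use a compact shorthand. A minimal sketch of how it could be expanded is shown here, assuming each "/" separates alternatives and that every combination names one device. The resulting names (for example rbx-3-m1) are an assumption about the naming scheme, and the expand function is a hypothetical helper.

```python
from itertools import product


def expand(shorthand):
    """Expand 'rbx-3/4-m1/m2' style shorthand into individual names."""
    # Split into dash-separated fields, then each field into its alternatives.
    fields = [part.split("/") for part in shorthand.split("-")]
    # One name per combination of alternatives, in listed order.
    return ["-".join(combo) for combo in product(*fields)]


first_batch = expand("rbx-3/4/5/6-m1/m2")
second_batch = expand("rbx-7/8/9/10/11/12/14/15/16/17-m1/m2")
```

Under that reading, the first list covers 8 devices and the second covers 20.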

Posted Aug 29, 2013 - 14:26 UTC