Date: 2013-08-30 14:04:46 UTC Old RBX/GRA network shut down. Traffic is
flowing well between the DCs and across the new
Date: 2013-08-30 14:02:58 UTC Done.
Old SBG/RBX network shut down. Mitigation by VAC2
in SBG and VAC3 in BHS is correctly reaching servers
hosted in RBX1. yes :)
Packets from SBG to RBX are passing over the new network.
Date: 2013-08-30 13:59:12 UTC Done.
We can now isolate the internal network
from the backbone.
Date: 2013-08-30 13:55:14 UTC Configuration finished. We have no more issues.
We will now isolate RBX1 from the backbone
like all other DC routers.
Date: 2013-08-30 09:09:06 UTC We will finish the reconfiguration of the
remaining 42 routers. To do so, we will
set up the BGP configuration directly, then
shut down OSPF once the checks have passed.
Date: 2013-08-30 01:36:51 UTC In RBX1 we have a very particular network configuration based on 2 routers, rbx-1-6k and rbx-2-6k.
Those 2 routers manage the interconnection for about 120-130 small routers.
This architecture has been in use since 2006. We have it only in RBX1, and the particularities of this configuration made it complicated to roll out new services (VAC, the vrack, etc.).
We had to simplify this configuration until we replace all these routers with 4 big ones, as we do in all other DCs (the 4 routers arrived 2 weeks ago and we expect to switch RBX1 over by the end of September).
We knew that simplifying this configuration would affect the availability of the RBX1 DC, and we knew we would have to replace some routers with spares.
So we chose to perform the intervention during the day, when we have the most staff available, in order to intervene quickly on the hardware.
And it worked. In the end we had to remove OSPF completely and use only BGP.
88 routers have been reconfigured; 42 more remain.
We will perform the final configuration of these last 42 routers without causing new failures, then we will remove the old configuration.
After this experience, we wonder whether we should have done it this way from the beginning...
The RBX1 problem impacted 2 other routers managing the vrack 1.0, which had not been updated for 2-3 months.
With an uptime of several years and with today's problems, we had RAM fragmentation and we had to restart it.
The other DCs have not been impacted.
The problem concerned the RBX1 DC and, for part of the evening, the vrack 1.0 / IP LB.
VAC1 didn't work properly during this period.
We apologize for the outages caused and for their duration.
Date: 2013-08-29 23:31:20 UTC Internal network resumed. RBX1 network is stable.
Date: 2013-08-29 23:31:05 UTC Communication between RBX routers doesn't go through the internal network, but through the backbone. We are checking in order to fix this problem.
Date: 2013-08-29 23:25:42 UTC At least 1 router out of 2 is UP. OSPF is cut, and everything works, but over BGP.
Date: 2013-08-29 23:24:34 UTC 2 routers remain to be brought back up.
Date: 2013-08-29 23:24:05 UTC While we try to resume OSPF router by router, we are adding the BGP configuration to move these small routers off OSPF.
Date: 2013-08-29 23:20:26 UTC We finished with the routers of the first vlan (70 routers), but nothing is going right.
The OSPF process could not be resumed. We are now trying to simplify the configuration to avoid announcing LSAs, then restart the routers that crashed. We already have some routers UP, but too much remains to be done.
Date: 2013-08-29 16:40:07 UTC rbx-7/8/9/10/11/12/14/15/16/17-m1/m2: done
There is some card damage on rbx-14, rbx-4 and rbx-3.
We are replacing them with spares.
Date: 2013-08-29 16:38:23 UTC rbx-3/4/5/6-m1/m2 done