For the Roubaix 2 datacentre, we designed the network with a target of 100% availability. That is why we used Cisco 6509 switches in a VSS configuration, a system built on two chassis that run as a single router. With two chassis everything is duplicated, so we should have had 100% availability.
In the real world, we have had several problems with the VSS that led to service interruptions, so it did not meet the original commitment. Basically, we have a chronic problem with BGP: at the slightest change to the routing table, the router's CPU sits at 100% for at least 15 minutes. On its own, we could live with that. But at the end of 2009 we put strong protections in place on the internal network, isolating each server from the others. This is done with private VLANs and proxy ARP, a very standard solution. The router answers in place of all the servers and routes traffic even within the same VLAN. It is all very secure. However, the router must answer the MAC requests of every server, and the process that handles this on the VSS uses a lot of CPU.
Normally this works properly. But the network only has to recalculate the routing tables for BGP to take 100% of the CPU and keep the process that answers MAC requests from running. The result: the servers no longer know the MAC addresses, and there is a break in service of 1, 3 or 8 minutes, depending on how extensive the recalculation of the BGP tables is.
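To make the mechanism concrete, here is a minimal sketch of it in Python. It is a toy model, not a measurement: it assumes that a server keeps a cached MAC entry for a while, and that the break lasts from the moment this entry expires until the router can answer again. The function name `outage_seconds` and both of its parameters are hypothetical, and the values are illustrative only.

```python
def outage_seconds(stall_seconds: float, cache_remaining_seconds: float) -> float:
    """Toy model: seconds a server stays cut off during one CPU stall.

    stall_seconds: how long the BGP recalculation keeps the router from
        answering MAC requests (assumption: no answers at all in that window).
    cache_remaining_seconds: how long the server's cached MAC entry is still
        valid when the stall begins (assumption, illustrative only).
    """
    # The server keeps working as long as its cached entry is valid.
    # Once the entry has expired, it is cut off until the stall ends.
    return max(0.0, float(stall_seconds) - float(cache_remaining_seconds))


if __name__ == "__main__":
    # A short recalculation that ends before the entry expires: no break.
    print(outage_seconds(stall_seconds=60, cache_remaining_seconds=300))
    # A longer recalculation outlives the cached entry: service breaks.
    print(outage_seconds(stall_seconds=600, cache_remaining_seconds=120))
```

In this model a longer recalculation gives a longer break, which is consistent with breaks whose length depends on the size of the BGP recalculation.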
We are trying to fix the BGP problem with dedicated routers that do only this job: route reflectors. We should normally have received the equipment this month, but the order was recorded incorrectly between the distributor and the manufacturer, and we will not get it until the end of September. We have decided not to wait for that delivery and to implement a solution this weekend.
But the MAC problem will still be there. So we have decided to break up the VSS configurations and go back to what has always worked well: a router in a single chassis. We have a little under 30 routers in single-chassis configuration that cause no problems. Only the dual-chassis configuration gives us problems. So we will split the chassis.
So, starting from the past week, we are converting the VSS to a configuration based on a single chassis.
We will do it in four steps:
- All the datacentre links connected to chassis 2 will be reconnected to chassis 1. No break in service, since everything will run on chassis 1.
- All the Internet links connected to chassis 2 will be reconnected to chassis 1. No break in service, since everything will run on chassis 1.
- Chassis 2 will be powered off. No break in service, since chassis 2 will no longer be in use.
- The configuration of chassis 1 will be changed to the single-chassis version. We will then have to do a hardware reboot of the router, which means a 15-minute break in service. We plan to start at 4:00 am at the end of next week if all goes well.
We will start with vss-2, which causes the most problems.
Normally, once step 4 is done, we should have no more BGP problems. It may be that, even with the dual-chassis configuration still in place, the problem is already solved at step 2 or 3, because everything will be running on a single chassis. But we are not sure. In any case, it will be fixed at the end of step 4.
And once BGP is fixed, we think it is likely that the MAC problems will be fixed too. If BGP does not work well in a dual chassis, then perhaps other processes do not work well in a dual chassis either? We will find that out too.
We regret the short disconnections that Roubaix 2 customers have suffered recently, which are mainly due to the problems described here. It was the wrong choice of hardware, and we take full responsibility for it. We thought the manufacturer would solve the CPU problems, but the manufacturer considers this behaviour normal. This equipment is therefore incompatible with our needs, and we will replace it. We mishandled the situation: we should not have asked the manufacturer for help but simply acted directly to find another solution. That was an error in how we managed the problem.
To continue in the spirit of transparency: you may have noticed the problems in London, Amsterdam and Frankfurt about 14 days ago. We had just added backup links between London/Amsterdam and Paris/Frankfurt. These were large investments, decided on to make the backbone fully protected and 100% available even in the event of a problem on the optical fibre. Adding these links to the routers saturated the available RAM and crashed the routers in London. The routers in Amsterdam and Frankfurt then went down for the same reasons. Router crashes mean BGP recalculation, and therefore 100% CPU on the vss, so those crashes in turn caused breaks in service in Roubaix 2 :( We fixed the problem by disabling MPLS, which we do not need but which takes 20% of the RAM. It has been stable since.
We had planned to replace all the routers during the holidays, but the equipment we wanted to deploy is not available, and what is available does not work. We did receive the new Cisco Nexus 7000, and BGP does not work on it; it only generates error messages. New equipment, and already... Once again a bad choice of equipment. So there is a big rethink ahead. It also delays the planned replacement of the routers. We will now contact every manufacturer on the market to see what we can deploy in place of what we had planned. This unplanned work will cause unplanned delays on other projects.
I think we could not be more transparent about the recent events.
Update(s):
Date: 2010-08-27 15:00:55 UTC
We have moved 1x10G to Frankfurt and 1x10G to Amsterdam from chassis #2 to chassis #1 in order to rebalance the traffic between the 2 chassis. We are continuing to move the rack switch uplinks.
Date: 2010-08-27 14:57:41 UTC
We will do the same operation on vss-1. We have started preparing the ground for moving the switch uplinks from chassis #2 to chassis #1.
Date: 2010-08-11 22:28:39 UTC
The vss-2 split was performed. It seems to work better and uses less CPU.
http://status.ovh.co.uk/?do=details&id=382
Date: 2010-08-11 00:34:34 UTC
We will permanently disconnect the "virtual link" links between the two chassis of vss-2.
Date: 2010-08-11 00:33:25 UTC
Still a CPU problem on ARP Input:
vss-2-6k#sh proc cpu | i \ ARP Input
11 130264040 1311514886 99 8.55% 9.33% 9.74% 0 ARP Input
11 130264200 1311514965 99 8.77% 9.28% 9.73% 0 ARP Input
11 130265580 1311515094 99 24.14% 10.47% 9.97% 0 ARP Input
11 130266244 1311515330 99 10.87% 10.51% 9.98% 0 ARP Input
11 130266764 1311515581 99 7.03% 10.23% 9.93% 0 ARP Input
11 130267360 1311515785 99 8.47% 10.09% 9.91% 0 ARP Input
11 130268056 1311516086 99 10.39% 10.11% 9.92% 0 ARP Input
11 130268812 1311516406 99 12.03% 10.27% 9.95% 0 ARP Input
11 130269608 1311516695 99 9.83% 10.23% 9.95% 0 ARP Input
Date: 2010-08-11 00:32:32 UTC
The migration of the 1G and 10G ports is complete. We are ready to disconnect chassis 2.
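The `ARP Input` samples in the 00:33:25 update above can be summarised with a few lines of Python. This is a minimal sketch, assuming the usual `show processes cpu` column order, in which the three percentages are the 5-second, 1-minute and 5-minute averages. The helper name `parse_arp_input` is hypothetical, and only the trailing columns of each sample are reproduced here.

```python
import re
from statistics import mean

# Matches the three CPU percentages that precede the TTY column and the
# process name. Leading columns (PID, runtime, invoked, uSecs) are ignored.
ARP_LINE = re.compile(
    r"(\d+\.\d+)%\s+(\d+\.\d+)%\s+(\d+\.\d+)%\s+\d+\s+ARP Input"
)


def parse_arp_input(line):
    """Return (five_sec, one_min, five_min) as floats, or None if no match."""
    match = ARP_LINE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


# Trailing columns of the nine samples shown in the log above.
SAMPLES = [
    "99 8.55% 9.33% 9.74% 0 ARP Input",
    "99 8.77% 9.28% 9.73% 0 ARP Input",
    "99 24.14% 10.47% 9.97% 0 ARP Input",
    "99 10.87% 10.51% 9.98% 0 ARP Input",
    "99 7.03% 10.23% 9.93% 0 ARP Input",
    "99 8.47% 10.09% 9.91% 0 ARP Input",
    "99 10.39% 10.11% 9.92% 0 ARP Input",
    "99 12.03% 10.27% 9.95% 0 ARP Input",
    "99 9.83% 10.23% 9.95% 0 ARP Input",
]

parsed = [parse_arp_input(line) for line in SAMPLES]
five_sec = [row[0] for row in parsed]

if __name__ == "__main__":
    print(f"peak 5-second load: {max(five_sec):.2f}%")
    print(f"mean 5-second load: {mean(five_sec):.2f}%")
```

On these nine samples the 5-second figure peaks at 24.14%, while the 1-minute and 5-minute averages stay close to 10%.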
Date: 2010-08-09 16:04:05 UTC
We have moved a little more than half of the uplinks from chassis #2 to chassis #1, which unbalances the traffic on the uplinks. So we are starting to migrate some of the 10G links from chassis #2 to #1 (eng, ams, rbx-97).
Date: 2010-08-09 15:57:32 UTC
We are starting work on vss-2. The uplinks of each VLAN are being migrated gradually from chassis #2 to chassis #1.