FS#3830 — Internal Routing Roubaix

Scheduled Maintenance Report for Network & Infrastructure

Completed

To manage the traffic between our backbone routers in Roubaix (rbx-1-6k<>rbx-2-6k<>vss-1-6k<>vss-2-6k<>rbx-99-6k), we are setting up a new routing architecture. The switch to this new architecture will take place tonight, starting at midnight.
This maintenance concerns the Roubaix <> Brussels links (bru-1-6k).

We are switching the links one by one, which should not have any impact on the traffic.

Update(s):

Date: 2010-07-31 00:25:03 UTC
The MTU problem is resolved by the move from Nexus 5000 to Nexus 7000:

http://status.ovh.co.uk/?do=details&id=345

Date: 2010-07-31 00:23:38 UTC
The switch is complete. One defective link remains (rbx-1<>sw.int-1); it has been set up provisionally tonight and will be fixed tomorrow.

Date: 2010-07-31 00:21:10 UTC
We are switching the traffic to the new links sw.int-1 <> vss-1/2 and rbx-99.

Date: 2010-07-31 00:20:31 UTC
We are starting the work.

Date: 2010-07-31 00:20:02 UTC
We are continuing the work tonight, hoping that handling the MTU fixes the problem once and for all and lets us switch fully to the new infrastructure.

Date: 2010-07-31 00:11:39 UTC
It is an MTU problem and a bug.

There is no problem between the Nexus 5000 and a standard 6509 and/or one running SXF.
We set the MTU to 9216 and that works properly.

Nexus 5000:
policy-map type network-qos jumbo
  class type network-qos class-default
    mtu 9216
system qos
  service-policy type network-qos jumbo

BOOTLDR: s72033_rp Software (s72033_rp-IPSERVICESK9-M), Version 12.2(18)SXF16, RELEASE SOFTWARE (fc2)
interface Port-channelXXX
 mtu 9216

The bug occurs between the Nexus 5000 and the VSS running SXI.
Cisco IOS Software, s72033_rp Software (s72033_rp-ADVIPSERVICESK9-M), Version 12.2(33)SXI3, RELEASE SOFTWARE (fc2)
2 bytes are missing.
With

interface Port-channelXXX
 mtu 9216

there are CRC errors on the interfaces.
With

interface Port-channelXXX
 mtu 9214

there are no more problems.

We noticed it from the frame size in the BGP sessions.
Datagrams (max data segment is 9214 bytes):

# ping ip XXXX size 9216 df-bit

Type escape sequence to abort.
Sending 5, 9216-byte ICMP Echos to XXXX, timeout is 2 seconds:
Packet sent with the DF bit set
.....
Success rate is 0 percent (0/5)

-> it works at 9214:

#ping ip XXXX size 9214 df-bit

Type escape sequence to abort.
Sending 5, 9214-byte ICMP Echos to XXXX, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 8/52/204 ms
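
The two pings above bracket the limit by hand (9216 fails, 9214 passes). The same search can be automated with a binary search over the datagram size. This is only a minimal sketch in Python: the probe callable is hypothetical and stands for "send a DF-bit ping of this size and report whether it got through", and the 1500 to 9216 search range is an assumption, not something taken from the routers' configuration.

```python
from typing import Callable


def max_passing_size(probe: Callable[[int], bool],
                     low: int = 1500, high: int = 9216) -> int:
    """Return the largest size in [low, high] for which probe(size) is True.

    Assumes that if a given size passes, every smaller size passes too.
    Raises ValueError if even the smallest size fails.
    """
    if not probe(low):
        raise ValueError(f"size {low} does not pass; nothing to search")
    while low < high:
        # Round the midpoint up so the loop always makes progress.
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid passes: the answer is mid or larger
        else:
            high = mid - 1   # mid fails: the answer is strictly smaller
    return low


# Simulated link that drops anything above 9214 bytes (hypothetical probe).
def simulated_probe(size: int) -> bool:
    return size <= 9214


largest = max_passing_size(simulated_probe)
```

With the simulated probe, the search settles on 9214 after about a dozen probes instead of testing sizes one at a time.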

We are going to finalise the internal routing infrastructure with this "workaround", then report the bug to Cisco ...

Date: 2010-07-31 00:09:39 UTC
The traffic is switched.

Date: 2010-07-31 00:09:21 UTC
We are starting the switch.

Date: 2010-07-31 00:08:54 UTC
Tonight there will be work on the Roubaix2 network. We are switching the ss-1 <> vss-2 traffic to the new Nexus infrastructure. If there is a problem, we will roll back immediately.

Date: 2010-07-31 00:06:48 UTC
We tried again to switch the 10G links to the new infrastructure, but we are still running into difficulties. We are switching back to the old configuration, except for rbx-1 <> rbx-2, which is the only link running correctly over the new infrastructure.

Date: 2010-07-31 00:04:30 UTC
The defective links are now repaired. We are taking the opportunity to repair other defective links.

Date: 2010-07-31 00:01:41 UTC
We are starting the maintenance.

Date: 2010-07-31 00:01:19 UTC
The defective links will be repaired tonight from 23:00. While we are working on this part, we will carry on with switching the routing links to the new internal routing switches.

Date: 2010-07-30 23:57:36 UTC
We found problems on the rbx-1<>vss-2 link even before the switch started. We have put a temporary fiber in place and are planning a maintenance intervention to repair it once and for all.

We are measuring abnormally high attenuation on the vss-2 <> rbx-99 links, which we will fix.

Date: 2010-07-30 23:53:20 UTC
We are switching the rbx-1<>vss-2 and rbx-2 <> vss-1 links.

Date: 2010-07-30 23:52:48 UTC
We modified the MTU configuration of the N5 switches and switched the rbx-1<>rbx-2 link onto them. The BGP session is currently stable. We are going to switch the other links progressively.

Date: 2010-07-30 23:50:38 UTC
The problem is probably due to the MTU, which is XXXX managed on the N5.
Replace the XXXX with "badly", "differently", etc.

Date: 2010-07-30 23:49:03 UTC
No luck.

We will put the links back as they were and forward the bugs to Cisco ...

Date: 2010-07-30 23:47:48 UTC
We believe the CRC problems are caused by incompatible optics (!?) between the Cisco N5 and the Cisco 6509 ...

We are retesting.

Date: 2010-07-30 23:45:52 UTC
The maintenance is not going well. We are getting CRC errors between the routers. We went back to the initial configuration, with extra trouble because of bugs:

rbx-99-6k#sh inter ten 9/1
[...]
30 second output rate 90000 bits/sec, 98 packets/sec
[...]

No way to get the traffic through.

rbx-99-6k#conf t
Enter configuration commands, one per line. End with CNTL/Z.
rbx-99-6k(config)#inter ten 9/1
rbx-99-6k(config-if)#shutdown
rbx-99-6k(config-if)#no shutdown
rbx-99-6k#sh inter ten 9/1
[...]
30 second output rate 2345596000 bits/sec, 384765 packets/sec
[...]

This is what we call a nice bug: it cost us 2 hours of the night.
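
A stall like this shows up as a collapsed output rate in the `sh inter` output, so it can be caught by parsing that one line. What follows is a minimal Python sketch, not something we run in production: the function names and the expected-rate threshold are hypothetical, and it assumes the command output has already been captured as text.

```python
import re
from typing import Optional

# Matches lines such as:
#   30 second output rate 90000 bits/sec, 98 packets/sec
RATE_RE = re.compile(r"output rate (\d+) bits/sec, (\d+) packets/sec")


def output_rate_bps(show_interface_text: str) -> Optional[int]:
    """Return the output rate in bits/sec, or None if the line is absent."""
    match = RATE_RE.search(show_interface_text)
    return int(match.group(1)) if match else None


def looks_stalled(show_interface_text: str, min_expected_bps: int) -> bool:
    """True when an output rate was found and it is below the threshold."""
    rate = output_rate_bps(show_interface_text)
    return rate is not None and rate < min_expected_bps


before = "30 second output rate 90000 bits/sec, 98 packets/sec"
after = "30 second output rate 2345596000 bits/sec, 384765 packets/sec"

# Hypothetical threshold: the link is expected to carry at least 1 Gbit/s.
stalled_before = looks_stalled(before, 1_000_000_000)
stalled_after = looks_stalled(after, 1_000_000_000)
```

With the two rates quoted above, the check flags the interface before the shutdown / no shutdown and clears it afterwards.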

Posted Jul 30, 2010 - 23:38 UTC