Saturday, January 18, 2020

Network meltdown caused by L2 failure - mitigating impact on L3 devices

I've been dealing with the total loss of a DC this weekend, caused by a server admin deciding to bridge Ethernet interfaces on a host. The directly connected switch was not running any kind of storm control, root guard, etc., which then caused the meltdown of 2 core routers: they ran at 100% CPU until the loop was stopped. I believe this was due to them trying to process "to the box" traffic in the CPU.

First task is to implement storm control etc. on the DC switches.
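To make the storm control idea concrete, here is a minimal Python sketch of the logic as I understand it: cap broadcast/unknown-unicast/multicast (BUM) traffic on a port to a percentage of link speed per measurement window and drop the excess. The class name, the 1% level and the 1 second window are my own assumptions for illustration, not any vendor's defaults or implementation.

```python
from dataclasses import dataclass


@dataclass
class StormControl:
    """Hypothetical per-port model: cap BUM traffic at a share of link speed."""
    link_bps: int             # port speed in bits per second
    level_percent: float      # assumed threshold, e.g. 1.0 means 1% of link speed
    interval_s: float = 1.0   # assumed measurement window in seconds

    def __post_init__(self):
        # Bits of BUM traffic the port may forward in one window.
        self.budget_bits = self.link_bps * self.level_percent / 100 * self.interval_s
        self.window_start = 0.0
        self.used_bits = 0

    def admit(self, now: float, frame_bytes: int, is_bum: bool) -> bool:
        """Return True if the frame is forwarded, False if it is dropped."""
        if not is_bum:
            return True  # known unicast is not rate limited in this model
        if now - self.window_start >= self.interval_s:
            # A new window starts: reset the counter.
            self.window_start = now
            self.used_bits = 0
        bits = frame_bytes * 8
        if self.used_bits + bits > self.budget_bits:
            return False  # over the threshold for this window
        self.used_bits += bits
        return True


# A looped host floods 100,000 broadcast frames of 1500 bytes at a 1 Gbps port.
port = StormControl(link_bps=1_000_000_000, level_percent=1.0)
forwarded = sum(port.admit(0.5, 1500, is_bum=True) for _ in range(100_000))
```

With these assumed numbers the port forwards only the frames that fit into 1% of the link for that window and drops the rest, so the routers behind the switch see a small fraction of the loop traffic instead of all of it.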

Our network has grown massively over the years, from a simple 1 rack presence with 2 edge routers into a multi-country MPLS network with 8 or so core nodes. The issue is that in legacy data centres these core devices are also acting as edge routers for the DC LAN.

What is best practice for moving services away from core devices? For example, DC1 will have 2 MX routers which are in a full mesh with the other DCs running BGP/MPLS, and then VRRP between the MX routers on the DC LAN into whatever switches we have in place.

In my mind, I see best practice as moving the DC LAN edge away from the core and inserting another L3 device in between. So instead of it being:

CORE ROUTER > DC SWITCH > DC LAN

moving to a model such as:

CORE ROUTER > DC EDGE ROUTER > DC SWITCH > DC LAN

then either eBGP between the edge router and the core, or just plain old statics for stability.
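As a rough way of reasoning about the two models above, the Python sketch below walks each chain from the DC LAN upwards and lists the devices that sit in the LAN's broadcast domain. It is a deliberately simplified model: the "L2"/"L3" link labels and the function name are assumptions of mine, and it ignores everything except whether a flood can reach a device.

```python
def exposed_devices(chain):
    """Return the devices that share the DC LAN's broadcast domain.

    `chain` is a list of (device, uplink_type) pairs ordered from the DC LAN
    upwards. A flood crosses "L2" uplinks and stops at the first device whose
    uplink is not L2 (a routed link, or no uplink at all).
    """
    exposed = []
    for device, uplink in chain:
        exposed.append(device)   # this device has an interface in the L2 domain
        if uplink != "L2":
            break                # a routed hop does not forward the flood
    return exposed


# Current model: the core router has an interface directly in the DC LAN.
current = [("DC LAN", "L2"), ("DC SWITCH", "L2"), ("CORE ROUTER", None)]

# Proposed model: a routed link separates the edge router from the core.
proposed = [
    ("DC LAN", "L2"),
    ("DC SWITCH", "L2"),
    ("DC EDGE ROUTER", "L3"),
    ("CORE ROUTER", None),
]

print(exposed_devices(current))
print(exposed_devices(proposed))
```

In this model the core router is exposed to a LAN loop in the current design, while in the proposed design the flood stops at the DC edge router. The edge router can still melt down, but the MPLS core would only see a failed eBGP session or unreachable static next hops.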

Is this the kind of model I should be looking at, or are there better solutions?


