Thursday, July 2, 2020

Memory behavior with full BGP tables - Brocade CER NetIron

I'm running a couple of Brocade CER 2024s, and across the board I'm seeing some memory utilization behavior that I don't understand.

First thing worth noting - the CERs behave differently from the MLX and XMR routers, in that they do not have CAM profiles. I'm reasonably sure this means I tune the memory allocation manually by changing the system-max values, which I think has given me enough rope to hang myself with.

I'm looking to ingest full BGP tables (yes: multiple locations, transit providers, and IX/peerings), so I've bumped the system-max ip-cache and ip-route entries up to their maximum (1572864) and rebooted. The memory usage jumped from 60% to 90%, which I assumed meant the memory was pre-allocated. But as soon as I started ingesting routes, the utilization started climbing again. It got to under 5% remaining before I decided to roll back, and the utilization has since come back down to 90%.

So, what's the deal? If the jump from 60% to 90% wasn't the memory being pre-allocated, what was it? I witnessed the exact same behavior across 4 units - so it has to be my configuration or a 'feature', rather than a bug.

I have soft-reconfiguration enabled on all peers - including iBGP. I've heard it can cause memory issues, but with the routing table sizes I've dealt with to date I've never encountered them - prior to this full table ingestion I only had 120k routes in the FIB. Courtesy of this issue I'm looking to turn soft reconfig off across the board - but that still doesn't really explain the behavior. Is the memory pre-allocated, but I can still exceed it? And if so, to what end? Is there any way to determine where it's consumed?
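To put a rough shape on the soft-reconfig concern, here is a back-of-the-envelope sketch in Python. It assumes the worst case, where inbound soft reconfiguration keeps one extra unmodified copy of every received path so policy can be re-applied without resetting the session. The peer count, routes per peer, and bytes per path below are made-up placeholders, not figures from NetIron documentation or from my routers, and the function name is purely illustrative.

```python
def estimate_bgp_path_memory(peers, routes_per_peer, bytes_per_path,
                             soft_reconfig_inbound):
    """Rough estimate, in bytes, of memory held for received BGP paths.

    All inputs are assumptions supplied by the caller; real per-path
    overhead depends on the platform and on attribute sharing.
    """
    paths = peers * routes_per_peer
    # Assumption: soft-reconfiguration inbound stores one additional,
    # pre-policy copy of each received path (worst case).
    copies = 2 if soft_reconfig_inbound else 1
    return paths * copies * bytes_per_path


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    PEERS = 4                # sessions sending a full table
    ROUTES_PER_PEER = 800_000
    BYTES_PER_PATH = 200     # placeholder, not a measured value

    for enabled in (False, True):
        total = estimate_bgp_path_memory(
            PEERS, ROUTES_PER_PEER, BYTES_PER_PATH, enabled
        )
        print(f"soft-reconfig inbound={enabled}: {total / 2**20:.0f} MiB")
```

The point is only that this memory scales with peers times routes, and is separate from whatever the system-max settings reserve for the FIB, so it could plausibly keep growing after the reboot-time jump.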

The show memory command output is just... unclear. What's the difference between SDRAM and Memory? How is the total memory divided between the MP and LP processes, and can I control it?

The only troubleshooting commands I've found are: show memory histogram pool [0-3]
(0-OS, 1-Shared, 2-MP, 3-LP) - I only get output for 2 and 3, and only really useful output for 2. The output is a snapshot which spits out a list of tasks and their allocations as of the last time it was alerted. It indicates some clear memory consumers: the "bgp_io" task consuming 60% of the memory and the "bgp" task consuming 20%. But I can't find anything in the documentation about either task or what they actually do.
And: show ip bgp debug memory - for which the output means very little to me.

If it's of any relevance, I'm running NetIron 6.3.0aT183 and have the RT_SCALE license - so according to the documentation it can support 1.5M routes in the FIB, though I've been advised that I'm unlikely to actually reach that because the resources are shared with VRF, IPv6, etc.

Any advice or direction would be greatly appreciated.


