Thursday, August 9, 2018

How to prove it's not the network

As we all know, the network gets blamed first every time something bad happens. In an enterprise network (30k users, 30 locations), what do you think would be needed to prove, in 9 out of 10 cases, that it's not the network?

At least for us, the first thing people ask is what has changed in the network. For that we're starting to use LibreNMS with Oxidized pushing configs to git, so that we can quickly show what config changes have been made. I'm wondering if I should also get routing tables into Oxidized? Or is there a better way to monitor routing tables in the network?
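
A minimal sketch of how two routing-table snapshots could be compared, whatever tool ends up collecting them. It assumes each snapshot has already been reduced to plain text with one `prefix next-hop` pair per line; that format and the function names are hypothetical, not something Oxidized or any particular device produces.

```python
def parse_routes(text):
    """Parse 'prefix next_hop' lines into {prefix: set of next hops}.

    Blank lines and lines starting with '#' are skipped. A prefix that
    appears several times (e.g. ECMP) collects all of its next hops.
    """
    routes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        prefix, next_hop = line.split()[:2]
        routes.setdefault(prefix, set()).add(next_hop)
    return routes


def diff_routes(before, after):
    """Return (added, removed, changed) between two snapshot texts."""
    old, new = parse_routes(before), parse_routes(after)
    added = {p: new[p] for p in new.keys() - old.keys()}
    removed = {p: old[p] for p in old.keys() - new.keys()}
    # Prefixes present in both snapshots whose next-hop set differs.
    changed = {p: (old[p], new[p])
               for p in old.keys() & new.keys() if old[p] != new[p]}
    return added, removed, changed
```

Running this against the snapshots taken before and after an incident would show whether any prefix appeared, disappeared, or moved to a different next hop. In a network with many dynamic routes the raw diff is likely to be noisy, so some filtering (for example only core prefixes) would probably be needed.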

Besides config auditing, it's probably all about monitoring the network? Apart from device availability, of course, these are some things I think would be useful to alert on and monitor at a higher priority:

  • BGP sessions in our network (we run our own MPLS network)
  • Bandwidth usage on core and uplinks (core <-> distribution)
  • Errors on core and uplink interfaces
  • Something else?
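
For the bandwidth and error items, a rough sketch of the arithmetic an alert would rest on: the difference between two readings of an interface counter, turned into a utilization percentage or an error ratio. It assumes the readings are plain integers that something else has already polled (for example over SNMP); the function names and any thresholds are illustrative assumptions.

```python
def counter_delta(prev, curr, bits=64):
    """Difference between two readings of an increasing counter.

    Allows for a single wrap of a counter that is `bits` wide. A device
    reboot also resets counters and would look like a wrap here, so real
    monitoring should check uptime as well.
    """
    if curr >= prev:
        return curr - prev
    return curr + (1 << bits) - prev


def utilization_percent(prev_octets, curr_octets, interval_s, speed_bps, bits=64):
    """Average link utilization over the polling interval, in percent."""
    delta_bits = counter_delta(prev_octets, curr_octets, bits) * 8
    return 100.0 * delta_bits / (interval_s * speed_bps)


def error_ratio(prev_errors, curr_errors, prev_packets, curr_packets, bits=64):
    """Errors per packet over the polling interval (0.0 if no packets)."""
    packets = counter_delta(prev_packets, curr_packets, bits)
    if packets == 0:
        return 0.0
    return counter_delta(prev_errors, curr_errors, bits) / packets
```

As an example, 18,750,000,000 octets in 300 seconds on a 1 Gbit/s link works out to 50% average utilization. Alerting on the ratio of errors to packets rather than the raw error count avoids paging on a busy link that has a handful of errors.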

We're also implementing NetFlow monitoring to understand the traffic patterns, and maybe to spot the situations where the client did send the traffic but the server didn't respond?
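
A minimal sketch of the "client sent, server didn't respond" check, assuming the flow records have already been exported and parsed into simple tuples. The `Flow` layout below is hypothetical and not the format of any particular collector.

```python
from collections import namedtuple

# Hypothetical, already-parsed flow record; proto is the IP protocol number.
Flow = namedtuple("Flow", "src sport dst dport proto packets")


def unanswered_flows(flows):
    """Return flows for which no flow in the reverse direction was seen."""
    seen = {(f.src, f.sport, f.dst, f.dport, f.proto) for f in flows}
    return [f for f in flows
            if (f.dst, f.dport, f.src, f.sport, f.proto) not in seen]
```

A flow showing up here is only a hint. Sampled NetFlow, asymmetric routing, or an exporter that sees just one direction can all make a healthy conversation look one-way, so the result would need to be checked against where the flows were actually collected.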

I'm wondering, though, how we could monitor application latencies. We've tried installing Raspberry Pis at our remote locations and having them run connection tests to see if some location suddenly has worse response times than the others. But that's quite hard to manage if you have lots of services. On the DC side we could probably put everything behind our F5s and use their monitoring tools to get at least some data on whether it's the client or the server.
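
A minimal sketch of the kind of connection test a small probe at a remote location could run: time how long a TCP connection to a service takes to establish. The function name and the default timeout are assumptions for illustration.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Return the TCP connect time in milliseconds, or None on failure.

    This measures name resolution plus the TCP handshake only. It says
    nothing about how long the application takes to answer a request.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and resolution errors.
        return None
```

Each probe could loop over a shared list of `(host, port)` pairs kept in git and report the timings centrally, which would keep the list of services in one place instead of on every device. Comparing the same service across locations then shows whether one site is slower than the rest.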

Thanks for any ideas!


