Thursday, September 27, 2018

ELK for network monitoring - where to go?

Hi all,

We've been setting up an ELK cluster which is supposed to be the centralised data lake for our monitoring services and apps. Currently, we have the following data sent to the cluster:

  1. NetFlow data from routers, firewalls, the datacentre, load balancers, etc.
  2. Syslog from all devices
  3. Custom metrics from network equipment. Using custom Python/REST APIs and agents, I can send pretty much any data I want (basically anything that can be displayed with a show command). We use this to send many metrics, from interface utilisation, STP events and MPLS routes to BGP-EVPN stats.
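
To make item 3 a bit more concrete, here is a minimal sketch of how a batch of custom metric samples could be rendered as a request body for Elasticsearch's `_bulk` endpoint (newline-delimited JSON: one action line, then one document line). The index name `network-metrics`, the field names and the sample values are made up for illustration and are not our actual schema.

```python
import json


def to_bulk_ndjson(samples, index="network-metrics"):
    """Render metric samples as an Elasticsearch _bulk request body (NDJSON).

    `index` and the document fields are hypothetical names for this sketch.
    """
    if not samples:
        return ""
    lines = []
    for sample in samples:
        # Action line: tells Elasticsearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the metric document itself.
        lines.append(json.dumps(sample, sort_keys=True))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical samples, e.g. parsed from the output of a show command.
samples = [
    {"@timestamp": "2018-09-27T10:00:00Z", "device": "core-sw1",
     "interface": "Eth1/1", "metric": "if_util_pct", "value": 42.5},
    {"@timestamp": "2018-09-27T10:00:00Z", "device": "core-sw1",
     "interface": "Eth1/2", "metric": "if_util_pct", "value": 7.1},
]

body = to_bulk_ndjson(samples)
print(body)
```

The resulting string would be POSTed to `_bulk` with the `Content-Type: application/x-ndjson` header; keeping one flat document per sample makes it easy to filter and aggregate by device, interface or metric later.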

My question is: what would be the best way to analyse this data and get useful insights about the network from it? I'd love to get some ideas or hear what you have seen or developed for your own environments (or any general thoughts on ELK and network monitoring with it). Currently we're struggling even to analyse network failures retrospectively, since some of the metrics and data (Syslog?) are not informative or don't seem to help much.

My current ideas:

  • Build custom ML apps using open-source tools (TensorFlow, scikit-learn, etc.) to predict failures based on all the gathered metrics
  • Create a trap-generating (alerting) system on top of ELK (Sentinl, ElastAlert, etc.)
  • Gather more advanced metrics, such as health measurements of an app or the path of a flow through the network (and possibly feed them to the ML app mentioned above)
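
As a rough idea of what a first step towards the ML and alerting ideas above might look like, here is a minimal sketch of a rolling z-score check over a single metric series, using only the Python standard library. This is a simple statistical baseline, not a trained model; the function name, window size, threshold and sample values are all assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=12, threshold=3.0):
    """Return (index, value) pairs that deviate strongly from the recent past.

    A point is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` points.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        # Only score once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip flat windows to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append((i, value))
        # Note: flagged points still enter the history, which widens the
        # baseline for the following samples.
        history.append(value)
    return anomalies


# Hypothetical interface utilisation samples (percent), one per poll.
utilisation = [40, 41, 39, 40, 42, 41, 40, 39, 41, 40, 42, 40, 95, 41]
print(rolling_zscore_anomalies(utilisation))  # [(12, 95)]
```

A check like this could run over the results of a periodic query against the cluster and raise an alert for each flagged point, which also gives a reference to compare any later ML model against.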

The way I see it, there are two main issues: how well such a solution fits ELK (just the methods, without even getting into ELK's limitations), and how hard it would be to develop and bring to production scale.

Cheers.


