Friday, March 16, 2018

Collecting metrics, trending, per-device baseline analysis: What to use?

inb4 Zabbix, Nagios, etc. for generic threshold monitoring and polling.

I wonder if there are success stories in per-device time series analysis for networking/security devices such as Cisco, Fortinet, Palo Alto, etc. The idea: take in critical counters and metrics from many devices, have the system trend and baseline the data per device, and alert on-call when a device is acting abnormally. Some devices may comfortably run at 60% with spikes to 80% every night at 9pm, and I don't want an alert for that, but other devices typically run at 20%, and a spike to 60% on one of those would be a critical alarm.
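To make the idea concrete, here is a minimal sketch of per-device, per-hour-of-day baselining in plain Python. It is only an illustration of the behavior described, not a recommendation of any tool: the class name `DeviceBaseline`, the device names, the sample values, and the tuning parameters (`min_samples`, `threshold`, `min_spread`) are all made up for the example. It uses the median and the median absolute deviation (MAD) as a robust stand-in for whatever statistics a real system would use.

```python
from collections import defaultdict
from statistics import median


class DeviceBaseline:
    """Hypothetical per-device, per-hour-of-day baseline (median + MAD)."""

    def __init__(self, min_samples=7, threshold=4.0, min_spread=2.0):
        self.min_samples = min_samples  # samples needed before alerting
        self.threshold = threshold      # allowed deviation, in robust "sigmas"
        self.min_spread = min_spread    # floor so very flat series don't over-alert
        self.history = defaultdict(list)  # (device, hour) -> past values

    def observe(self, device, hour, value):
        """Record one historical sample for this device and hour of day."""
        self.history[(device, hour)].append(value)

    def is_anomalous(self, device, hour, value):
        """Compare a new value with this device's own history for that hour."""
        samples = self.history[(device, hour)]
        if len(samples) < self.min_samples:
            return False  # not enough history yet: stay quiet
        center = median(samples)
        mad = median(abs(s - center) for s in samples)
        # 1.4826 * MAD approximates a standard deviation for normal data.
        spread = max(1.4826 * mad, self.min_spread)
        return abs(value - center) / spread > self.threshold


baseline = DeviceBaseline()

# Made-up history: ten nights of 9pm (hour 21) readings, in percent.
for v in [78, 80, 82, 79, 81, 80, 83, 77, 80, 81]:
    baseline.observe("fw-busy", 21, v)   # routinely spikes to ~80% at 9pm
for v in [19, 20, 21, 20, 22, 18, 20, 21, 19, 20]:
    baseline.observe("fw-quiet", 21, v)  # normally idles around 20%

busy_alert = baseline.is_anomalous("fw-busy", 21, 80)    # False: normal for it
quiet_alert = baseline.is_anomalous("fw-quiet", 21, 60)  # True: far off baseline
print(busy_alert, quiet_alert)
```

The same 9pm reading is judged against each device's own history, so 80% on the busy device stays silent while 60% on the quiet one alerts. A real deployment would also need to handle weekday/weekend patterns, baseline drift, and history expiry, none of which this sketch attempts.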

We have 1000+ devices my team wants to monitor like this, so automatic learning is preferred over researching and setting each device's threshold by hand.

Tools I've heard of: Timelion (Elastic Stack), Prometheus, and InfluxDB with visuals/alerting. Anything else? Most importantly, has anyone here actually accomplished what I'm asking?


