Wednesday, November 27, 2019

Let’s talk event-driven automation.

I think a lot of us here probably agree that network infrastructure configuration management and orchestration are relatively easy to automate, and will probably become the gold standard going forward.

What interests me a lot more is event-driven automation at the network infrastructure level. I think that the most exciting prospects live there. What do you guys think about event-driven automation? What is already out there? What is possible to attain?

Here’s an example from a NANOG presentation that was recently shared in another thread.

https://archive.nanog.org/sites/default/files/1_Ulinic_Network_Automation_At_v1.pdf (Cloudflare’s self resilient network, starting on slide 66.)

They basically aggregate network performance metrics, together with the results of configured IP SLA and RPM probes, in SaltStack, and automatically change configuration to either pull the anycast advertisement from certain nodes or disable peering with a transit provider, depending on things like interface load, errors, or packet loss.
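To make the idea concrete, here is a minimal Python sketch of that kind of decision logic. It is not Cloudflare’s implementation and it does not use any SaltStack API. The `LinkHealth` fields, the node and transit names, the thresholds, and the choice of which symptom triggers which action are all my own assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinkHealth:
    """One aggregated sample per node/transit pair (hypothetical shape)."""
    node: str            # anycast node name (made up)
    transit: str         # transit provider name (made up)
    load_pct: float      # interface utilisation, percent
    errors_per_s: float  # interface input errors per second
    loss_pct: float      # packet loss reported by the probe, percent


# Illustrative thresholds only; none of these come from the presentation.
MAX_LOAD_PCT = 90.0
MAX_ERRORS_PER_S = 100.0
MAX_LOSS_PCT = 2.0


def decide_actions(samples: List[LinkHealth]) -> List[str]:
    """Map unhealthy samples to the configuration change we would request."""
    actions = []
    for s in samples:
        if s.loss_pct > MAX_LOSS_PCT or s.errors_per_s > MAX_ERRORS_PER_S:
            # Assumption: loss or errors point at the transit link itself.
            actions.append(f"disable peering with {s.transit} on {s.node}")
        elif s.load_pct > MAX_LOAD_PCT:
            # Assumption: plain overload is handled by shedding anycast traffic.
            actions.append(f"withdraw anycast advertisement from {s.node}")
    return actions
```

In a real deployment the returned strings would be replaced by whatever pushes the change to the devices, and you would probably want some dampening so a flapping metric doesn’t flap the configuration too.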

According to the presentation this results in 120 configuration changes a day on average, all with zero human intervention. This means that their network basically detects certain problems and attempts to correct them automatically.

I find that incredibly cool. And yeah, it’s probably not perfect, and it didn’t seem to mitigate a large-scale outage they recently had due to a BGP route leak. But how many routine incidents do you think they are able to mitigate with these measures? Think of users in a certain region who would otherwise experience service degradation due to loss and congestion, and who don’t even notice anything because they’re suddenly routing down a different path, or even hitting a completely different anycast node, as soon as problems are detected.

From an enterprise perspective, I envision event-driven automation in the form of an incident being created and automatically triggering a script. The incident must include the user’s PC name or IP, and the destination URL they’re trying to reach. The script would basically check DNS resolution, trace the end-to-end path through the enterprise network from source to destination and back, dump out interface statistics along that entire path, check firewall logs based on source and destination, and even grab the MAC address and port info of the user, then automatically update the incident with all of the collected info. Within seconds of the ticket being put in, the responding technician gets all the information they need included in the ticket to quickly determine whether the problem is likely something on-premises or a distant-end problem.
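Here is a rough Python sketch of what the skeleton of such a script could look like. The incident shape (`source` and `url` keys) and the function names are hypothetical, and only the DNS check is actually implemented. Path tracing, interface statistics, and firewall log lookups depend on your environment, so they would be plugged in as additional check functions.

```python
import socket
from typing import Callable, Dict
from urllib.parse import urlparse

# Hypothetical incident shape: {"source": "<pc name or IP>", "url": "<destination>"}
Incident = Dict[str, str]


def dest_host(incident: Incident) -> str:
    """Extract the destination hostname from the URL in the incident."""
    return urlparse(incident["url"]).hostname or incident["url"]


def check_dns(incident: Incident) -> str:
    """Resolve the destination; raises socket.gaierror if resolution fails."""
    host = dest_host(incident)
    return f"{host} resolves to {socket.gethostbyname(host)}"


def triage(incident: Incident, checks: Dict[str, Callable[[Incident], str]]) -> str:
    """Run every check and build the note that gets attached to the ticket."""
    lines = [f"Automated triage for {incident['source']} -> {incident['url']}"]
    for name, check in checks.items():
        try:
            result = check(incident)
        except Exception as exc:
            # A failing check is itself useful information for the technician.
            result = f"FAILED ({exc})"
        lines.append(f"[{name}] {result}")
    return "\n".join(lines)
```

You would call it with something like `triage(incident, {"dns": check_dns, "path": my_path_trace})`, where `my_path_trace` is whatever you write for your own network, and then post the returned text back to the incident through your ticketing system.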

Going a step further you could even try to automate “fixing” the problem if certain tests fail.

When trying to think of what else could be automated, think of the last network problem you fixed at work. Could you identify what the symptoms were, what you observed to determine the root cause, and what the fix action was? How methodical was your troubleshooting process? Could it have been done by a script? Or, in other words: could you translate your troubleshooting methodology into a set of scripts, essentially “teaching” your network to “think” and act like you?

Maybe some of what we fix required deep dives into pcaps and conference calls with vendors, but aren’t there a lot of other tasks that were simple, quick finds based on some output we found in the CLI? “Oh, there’s no return route to the host,” or “Oh, 50k input errors a second. Let’s shut this port and try cleaning the fiber and checking the SFP.”
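Those two quick finds translate almost directly into rules. A tiny Python sketch, assuming some other script has already collected the observations into a dictionary (the key names are made up, and the 50k threshold is just borrowed from the example above):

```python
from typing import Any, Dict, List

# Illustrative threshold, taken from the "50k input errors a second" example.
INPUT_ERROR_LIMIT = 50_000


def diagnose(obs: Dict[str, Any]) -> List[str]:
    """Turn collected observations into suggested findings and fix actions."""
    findings = []
    # Hypothetical key: False when no route back to the host was found.
    if not obs.get("return_route_present", True):
        findings.append("no return route to the host")
    # Hypothetical key: input errors per second on the user's port.
    if obs.get("input_errors_per_s", 0) >= INPUT_ERROR_LIMIT:
        findings.append(
            "high input errors: shut the port, clean the fiber, check the SFP"
        )
    return findings
```

Every time you close a ticket with a one-line root cause, that is a candidate for another rule in a function like this.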

What else do you all see developing in the future? I know Cisco is doing some interesting things in the campus arena with their SD-Access. It may not be the most popular thing out there, but the general concept is extremely cool and has enormous potential. (Basically, certain configuration like VLAN, access levels, firewall rules and more can “follow” a user around the network wherever they go.)

How do you all think event-driven automation can integrate into the network to help us do our jobs better... or put us all out of a job, if that’s how you see things? ;) Kidding on that last one. Happy Turkey Day everyone!


