Friday, January 8, 2021

The whole internet was down after one tiny little mistake

First of all, I'm using a throwaway account for this post. Something really weird happened and I thought I would share the story with you guys.

I work for a major telecom provider in a country with a population of about 40 million. We have around 15 million clients (consumers and businesses).

Last week, an engineer in the maintenance/operations team was migrating some public /30 subnets (enterprise clients) configured in our global public internet VRF. He was moving them from the PE router to a smaller aggregation router.

However, for one client (/30), when he configured the interface on the new router, he put /3 instead of /30.

As a result, thousands of public addresses on our network were duplicated and ended up blackholed, including our DNS servers.
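To put the typo in perspective, here is a quick sketch of the arithmetic using Python's standard `ipaddress` module. The address `192.0.2.1` is a documentation address picked purely for illustration; the real client prefix is not given here.

```python
import ipaddress

# 192.0.2.1 is an illustrative documentation address, not the real client IP.
intended = ipaddress.ip_interface("192.0.2.1/30").network
typo = ipaddress.ip_interface("192.0.2.1/3").network

print(intended, intended.num_addresses)  # a /30 holds 4 addresses
print(typo, typo.num_addresses)          # a /3 holds 2**29 addresses

# Fraction of the whole IPv4 space covered by the mistyped prefix.
fraction_of_ipv4 = typo.num_addresses / 2**32
print(f"{fraction_of_ipv4:.3f} of all IPv4 addresses")

# How many times larger the mistyped network is than the intended one.
blowup = typo.num_addresses // intended.num_addresses
print(f"{blowup}x larger than intended")
```

A /3 covers one eighth of the entire IPv4 space, so one dropped digit can make a router claim an enormous range as directly connected. That is presumably how it ended up overlapping addresses in use elsewhere on the network.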

So there was a nationwide outage for a few hours, before anyone could figure out what was going on.

The guy is still keeping his job by the way.

And to be honest, mistakes like these do happen, but I think we should put some kind of safeguard in place so that a single typo like this can't cause a huge outage.
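One possible safeguard, as a minimal sketch only: a pre-commit sanity check that rejects interface prefixes shorter than some policy limit. The function name `check_interface_prefix` and the /24 threshold are hypothetical choices for illustration, not anything our tooling actually has.

```python
import ipaddress

# Hypothetical policy: customer-facing interfaces may not be wider than a /24.
MIN_PREFIXLEN = 24


def check_interface_prefix(cidr: str, min_prefixlen: int = MIN_PREFIXLEN) -> None:
    """Raise ValueError if the interface prefix is wider than policy allows."""
    iface = ipaddress.ip_interface(cidr)  # also rejects malformed input
    if iface.network.prefixlen < min_prefixlen:
        raise ValueError(
            f"{cidr}: /{iface.network.prefixlen} covers "
            f"{iface.network.num_addresses} addresses; "
            f"policy requires /{min_prefixlen} or longer"
        )


# Usage: the intended config passes, the typo is refused before deployment.
check_interface_prefix("192.0.2.1/30")
try:
    check_interface_prefix("192.0.2.1/3")
except ValueError as err:
    print("rejected:", err)
```

A check like this would only catch gross prefix-length typos. It would not catch, say, a /30 typed as /29, so it is one layer of protection rather than a complete fix.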

Has anything like this ever happened to you guys?


