Tuesday, August 20, 2019

Summary of CenturyLink's December 27, 2018 outage.

The FCC report on CenturyLink's DWDM outage that occurred last year is out.

Full Report: https://docs.fcc.gov/public/attachments/DOC-359134A1.pdf

In summary:

In the early morning of December 27, 2018, a switching module in CenturyLink’s Denver, Colorado node spontaneously generated four malformed management packets. Malformed packets, while not rare, are not typically generated on a network, and they are usually discarded immediately because their characteristics indicate that they are invalid. In this instance, the malformed packets included fragments of valid network management packets that are typically generated.

Each malformed packet shared four attributes that contributed to the outage:
1) a broadcast destination address, meaning that the packet was directed to be sent to all connected devices;
2) a valid header and valid checksum;
3) no expiration time, meaning that the packet would not be dropped for being created too long ago; and
4) a size larger than 64 bytes.

CenturyLink and Infinera state that, despite an internal investigation, they do not know how or why the malformed packets were generated.
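
To make the four attributes concrete, here is a minimal Python sketch that checks whether a packet matches that profile. The management channel is proprietary, so everything here is an assumption made for illustration: the ManagementPacket class, its field names, and the use of the conventional Ethernet broadcast address as a stand-in for whatever broadcast address the real channel uses.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in only: the real management channel is proprietary and its
# broadcast address format is not described in the summary above.
BROADCAST = "ff:ff:ff:ff:ff:ff"


@dataclass
class ManagementPacket:
    """Hypothetical model of a management packet (field names are assumed)."""
    destination: str
    header_valid: bool
    checksum_valid: bool
    expires_at: Optional[float]  # None models "no expiration time"
    size_bytes: int


def has_outage_attributes(packet: ManagementPacket) -> bool:
    """Return True if the packet has all four attributes listed above."""
    return (
        packet.destination == BROADCAST                    # 1) broadcast destination
        and packet.header_valid and packet.checksum_valid  # 2) valid header and checksum
        and packet.expires_at is None                      # 3) no expiration time
        and packet.size_bytes > 64                         # 4) larger than 64 bytes
    )


bad = ManagementPacket(BROADCAST, True, True, None, 128)
expiring = ManagementPacket(BROADCAST, True, True, 1545900000.0, 128)
print(has_outage_attributes(bad))       # all four attributes present
print(has_outage_attributes(expiring))  # has an expiration time, so no match
```

The point of the sketch is that no single attribute is unusual on its own; it is the combination that let the packets look valid, reach every node, and never age out.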

Due to the packets’ broadcast destination address, the malformed network management packets were delivered to all connected nodes. Each node that received a packet then retransmitted it to all of its own connected nodes, including the node where the malformed packets originated. Each connected node continued to retransmit the malformed packets across the proprietary management channel to every node to which it was connected, because the packets appeared valid and did not have an expiration time. This process repeated indefinitely.
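
The retransmission loop described above can be sketched with a toy simulation. This is a simplified model under stated assumptions, not a reconstruction of CenturyLink's network: the four-node TOPOLOGY, the flood function, and the hop_limit parameter are all hypothetical, and every node is assumed to forward every copy it receives to all of its neighbors (including the sender) once per round.

```python
from collections import defaultdict

# Hypothetical four-node topology, for illustration only.
TOPOLOGY = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}


def flood(topology, origin, rounds, hop_limit=None):
    """Count the packet copies delivered in each round of flooding.

    Every node forwards each copy it holds to all of its neighbors.
    hop_limit=None models packets with no expiration; an integer
    hop_limit drops all copies once they have travelled that many hops.
    """
    at_node = {origin: 1}  # number of copies currently held at each node
    history = []
    for hop in range(1, rounds + 1):
        if hop_limit is not None and hop > hop_limit:
            at_node = {}  # copies have expired and are discarded
        arriving = defaultdict(int)
        for node, copies in at_node.items():
            for neighbor in topology[node]:
                arriving[neighbor] += copies
        at_node = dict(arriving)
        history.append(sum(at_node.values()))
    return history


print(flood(TOPOLOGY, "A", rounds=4))               # [2, 6, 14, 38]
print(flood(TOPOLOGY, "A", rounds=4, hop_limit=2))  # [2, 6, 0, 0]
```

Even in this tiny model, a single packet with no expiration grows to 38 copies in flight after four rounds, while the same packet with a hop limit of 2 dies out. Real equipment may behave differently (for example, by deduplicating packets), so the numbers only illustrate the shape of the growth.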

The exponentially increasing transmittal of malformed packets resulted in a never-ending feedback loop that consumed processing power in the affected nodes, which in turn disrupted the ability of the nodes to maintain internal synchronization. Specifically, instructions to output line modules would lose synchronization when instructions were sent to a pair of line modules, but only one line module actually received the message. Without this internal synchronization, the nodes’ capacity to route and transmit data failed. As these nodes failed, the result was multiple outages across CenturyLink’s network.


