Friday, July 27, 2018

ARP Broadcast Flood

I have a somewhat unusual issue on our network that is starting to stretch beyond my skill set to diagnose. Does anyone have any ideas?

We have two stacks of Cisco 3500 switches, one at each end of our manufacturing plant. Several VLANs are configured, but the two primary ones are called OFFICE and PLC. OFFICE carries around 60 WYSE thin clients and label printers, and PLC carries around 70 PLCs of various makes, each connected to an individual piece of manufacturing equipment. This configuration has worked fine for the year it has been in place, and network utilization is extremely low.

Over the last month a peculiar issue has appeared on five pieces of manufacturing equipment that have Rockwell 5500 PLCs in them. Every 7-10 days an event disrupts PLC-to-PLC communication within each of these five machines and forces them to crash. The equipment as a whole does not drop off the network, but the communication internal to the machine is affected. What is more interesting is that all five machines are hit at the same time, while nothing else running on the floor is: there is no disruption whatsoever to other PLC-equipped manufacturing equipment or to the PCs and printers. The issue also does not occur when these five machines are disconnected from the primary network.

I was able to capture the last crash with Wireshark and saw that, in a 2-second stretch before the crash, our Cisco switch sent out a storm of thousands of ARP broadcasts looking for 3 IP addresses on the PLC VLAN. Under normal traffic patterns we see 5-6 ARP requests per 3 seconds. This flood of requests seems to be enough to affect these particular PLCs, throwing them out of sync with each other and crashing the machine. Thus far I have tried:

  1. Enabling Storm Control on the Ethernet port these devices are plugged into.
     a. I set the threshold at 5% and the event didn’t trip it.
  2. I searched the floor and found that two of the pieces of equipment had been plugged into ports configured for the OFFICE VLAN instead of the PLC VLAN.
     a. Can this generate the flood of ARP requests we saw?
        i. Our plant floor is fairly dynamic, so pieces of equipment move in and out of lines at any time, 24/7.
     b. There hasn’t been a crash since making this change, but it has only been a couple of days.

I have Wireshark still running and am hoping to catch another event when it occurs. Does anyone have any other thoughts on what might be going on or where I could look next?
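
One possible reason the 5% Storm Control threshold never tripped is that ARP frames are tiny, so even thousands of them add up to very little bandwidth. The sketch below is a rough back-of-the-envelope estimate, not a statement of how the switch actually meters traffic. The 5,000-frame count and the 100 Mbps port speed are assumptions for illustration (the capture only shows "thousands" in about 2 seconds), and the function names are hypothetical.

```python
# Rough estimate: how much of a link does a burst of ARP broadcasts consume?
# ASSUMPTIONS (not taken from the capture): frame count and port speed.
# How a given switch measures the storm-control level is platform-specific.

ARP_FRAME_BYTES = 64      # ARP requests are padded to the minimum Ethernet frame size
WIRE_OVERHEAD_BYTES = 20  # preamble + start delimiter (8) and inter-frame gap (12)
BITS_PER_FRAME = (ARP_FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8


def burst_utilization(frames, seconds, link_bps):
    """Fraction of the link used by `frames` minimum-size frames sent over `seconds`."""
    return frames * BITS_PER_FRAME / seconds / link_bps


def frames_to_reach(threshold, seconds, link_bps):
    """Number of minimum-size frames needed in `seconds` to hit `threshold` (0..1)."""
    return threshold * link_bps * seconds / BITS_PER_FRAME


if __name__ == "__main__":
    link = 100e6  # assumed 100 Mbps access port
    used = burst_utilization(5000, 2, link)
    needed = frames_to_reach(0.05, 2, link)
    print(f"5000 ARP frames in 2 s use about {used:.2%} of the link")
    print(f"Reaching a 5% threshold would take about {needed:.0f} frames in 2 s")
```

Under those assumptions the burst stays well below 5% of the link, and on a gigabit port the gap would be ten times larger. Where the platform supports it, a packets-per-second threshold would be a closer fit for this kind of burst than a bandwidth percentage.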

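To make the next capture easier to read, the ARP requests can be bucketed into fixed time windows and compared against the normal rate of roughly 2 per second. This is a minimal sketch, assuming the ARP requests have already been exported from Wireshark as (timestamp in seconds, target IP) pairs. The function name, the 100-per-second threshold, and the sample addresses are all hypothetical.

```python
from collections import Counter


def arp_bursts(events, window_s=1.0, threshold=100):
    """Find time windows with an unusually high number of ARP requests.

    events: iterable of (timestamp_seconds, target_ip) pairs.
    Returns a list of (window_start, request_count, top_targets) tuples for
    every window whose request count exceeds `threshold`, in time order.
    """
    buckets = {}
    for ts, target in events:
        start = int(ts // window_s) * window_s  # start of the window holding ts
        buckets.setdefault(start, Counter())[target] += 1

    bursts = []
    for start, targets in sorted(buckets.items()):
        total = sum(targets.values())
        if total > threshold:
            # Keep the three most-requested IPs so the burst can be traced
            bursts.append((start, total, targets.most_common(3)))
    return bursts


if __name__ == "__main__":
    # Made-up sample data: a quiet baseline plus one burst aimed at three
    # addresses (documentation-range IPs, not our real PLC addresses).
    sample = [(s + 0.1, "192.0.2.1") for s in range(10)]
    hot = ["192.0.2.7", "192.0.2.8", "192.0.2.9"]
    sample += [(5.0 + i / 3000.0, hot[i % 3]) for i in range(3000)]
    for start, total, top in arp_bursts(sample):
        print(start, total, top)
```

Listing the top target addresses for each burst matters here because the storm was looking for only 3 IP addresses, so knowing which ones recur across events should narrow down where to look.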
