Hi there,
I have an issue that is driving me nuts, and I can't see any way to dig deeper to find out what's going on.
I have a Linux host running several VMs under KVM. The VMs use virtio_net drivers and are connected to an Open vSwitch bridge, which in turn carries several VLANs on an Intel XL710-QDA2.
The Intel XL710-QDA2 is connected to two Cisco Nexus 3132Q-X switches with two MTP fiber patch cables (vPC port-channel).
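For reference, here is a rough sketch of how I map each OVS port to its VLAN tag (handy for the per-VM checks further down). The bridge name "br0" is just a placeholder for whatever your bridge is called:

#!/usr/bin/env python3
"""Rough sketch: list the ports on the OVS bridge together with their VLAN
tags, so each VM's tap/vif port can be identified. The bridge name "br0" is
a placeholder; adjust to your environment."""
import subprocess

BRIDGE = "br0"  # placeholder: your actual OVS bridge name

def ovs(*args):
    return subprocess.check_output(["ovs-vsctl", *args], text=True).strip()

for port in ovs("list-ports", BRIDGE).splitlines():
    # "tag" is the access VLAN; trunk ports show "[]" here
    tag = ovs("get", "port", port, "tag")
    print(f"{port:20s} vlan tag: {tag}")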
This is a standard and quite typical setup in our datacenter; we have a lot of machines running exactly the same hardware, drivers, kernel, firmware, etc.
My problem is that I'm getting input CRC errors on both ports, on both Nexus switches, and I'm 100% sure this is not a physical issue because:
1.- I started getting CRC errors on both ports at exactly the same time (I consider the chances of two MTP patch cables having issues at exactly the same time extremely low; see the counter-polling sketch right after this list for one way to confirm they really move in lockstep).
2.- This is not the first time I've seen this issue (I usually managed to track it down to a process on the Linux box doing something weird; once I restarted the process, the CRC errors would disappear).
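To back up point 1, a crude way to confirm the two ports really error out in lockstep is to poll their input error counters over SNMP and print a timestamp on every increment. This is only a sketch: the switch hostnames, community string and ifIndex values are placeholders, and it assumes the net-snmp snmpget CLI is available on some monitoring box.

#!/usr/bin/env python3
"""Sketch: poll the input error counters of both switch ports via SNMP and
print a timestamp whenever they increase. Switch hostnames, community string
and ifIndex values are placeholders; requires the net-snmp snmpget CLI."""
import subprocess, time, datetime

# Placeholder (switch, ifIndex) pairs; find the ifIndex values by walking
# IF-MIB::ifDescr on the switches.
PORTS = [("nexus-a.example.com", 1), ("nexus-b.example.com", 1)]
COMMUNITY = "public"                     # placeholder SNMP community
IF_IN_ERRORS = "1.3.6.1.2.1.2.2.1.14"    # IF-MIB::ifInErrors

def read_counter(host, ifindex):
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", host,
         f"{IF_IN_ERRORS}.{ifindex}"], text=True)
    return int(out.strip())

last = {p: read_counter(*p) for p in PORTS}
while True:
    time.sleep(30)
    for p in PORTS:
        cur = read_counter(*p)
        if cur != last[p]:
            print(f"{datetime.datetime.now().isoformat()} {p[0]} "
                  f"ifIndex {p[1]}: +{cur - last[p]} input errors")
            last[p] = cur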
The problem this time is that I can't manage to pinpoint the source of the CRC errors. I've been checking kernel logs, process logs, ethtool stats, netstat stats, etc.
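On the ethtool side, what I find most useful is diffing the ethtool -S output over an interval and looking only at error/drop counters that moved. A rough sketch, with a placeholder name for the XL710 uplink interface:

#!/usr/bin/env python3
"""Sketch: snapshot "ethtool -S" twice and print only the error/drop-looking
counters that changed in between. The interface name is a placeholder."""
import subprocess, time, re

IFACE = "enp3s0f0"   # placeholder: the XL710 uplink interface
INTERVAL = 60        # seconds between the two snapshots

def snapshot(iface):
    out = subprocess.check_output(["ethtool", "-S", iface], text=True)
    stats = {}
    for line in out.splitlines():
        m = re.match(r"\s*([\w.\-]+):\s+(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

before = snapshot(IFACE)
time.sleep(INTERVAL)
after = snapshot(IFACE)

for name, value in sorted(after.items()):
    delta = value - before.get(name, 0)
    if delta and re.search(r"err|drop|bad|csum", name):
        print(f"{name}: +{delta}")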
So I'm trying to find a way to get more information about the problematic frames. The CRC error counts are really low compared to all the traffic the machine handles, but being able to extract details like the source or destination MAC, or the source or destination IP, would really help me find the issue.
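Since the FCS never makes it into a capture on the host, the best workaround I can think of is to keep a short rotating capture on the uplink and, whenever the switch-side counter jumps (the SNMP poller above can flag that), look at what was on the wire around that timestamp. It won't show which exact frame was bad, but it narrows down the candidate flows. A sketch, again with placeholder names:

#!/usr/bin/env python3
"""Sketch: keep a small ring buffer of captures on the uplink so the traffic
around a CRC-counter jump can be inspected afterwards. Interface name and
capture path are placeholders; needs tcpdump and root."""
import subprocess

IFACE = "enp3s0f0"            # placeholder uplink interface
CAPTURE = "/var/tmp/uplink"   # capture file prefix

# -s 128 : headers only, keeps the files small
# -C 100 : start a new file every ~100 MB
# -W 10  : keep only 10 numbered files, overwriting the oldest (ring buffer)
subprocess.run([
    "tcpdump", "-i", IFACE, "-s", "128",
    "-C", "100", "-W", "10", "-w", CAPTURE,
], check=True)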
I don't think I can really find that information on the Linux host, as appending the CRC (FCS) is offloaded to the NIC, so everything looks fine in a capture taken before the frames hit the wire.
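On the off chance it's an offload interaction rather than the FCS generation itself (for example something going wrong with large TSO frames), one cheap experiment is to dump the offload settings and temporarily disable TSO/GSO/GRO and TX checksumming, then watch whether the CRC counters stop. A sketch, placeholder interface name again; expect some extra CPU load while the offloads are off:

#!/usr/bin/env python3
"""Sketch: dump the current offload settings and, with --disable, temporarily
turn off TSO/GSO/GRO and TX checksumming as an experiment. Interface name is
a placeholder; remember to re-enable the offloads afterwards."""
import subprocess, sys

IFACE = "enp3s0f0"   # placeholder uplink interface

# Show the current offload settings
print(subprocess.check_output(["ethtool", "-k", IFACE], text=True))

if "--disable" in sys.argv:
    # The kernel will now segment and checksum packets in software
    subprocess.run(["ethtool", "-K", IFACE, "tso", "off", "gso", "off",
                    "gro", "off", "tx", "off"], check=True)
    print(f"offloads disabled on {IFACE}; re-enable with: "
          f"ethtool -K {IFACE} tso on gso on gro on tx on")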
So I've been trying to find more information on the Nexus side, but it looks like there is no way to log details about frames with CRC errors (besides the counter incrementing).
Sooo, does anybody know if there is any way to log some details about bad-CRC frames on the Cisco Nexus 3000 platform?
My main suspect right now is an issue in the Intel XL710-QDA2 driver or firmware (it would not be the first time these cards have had issues). Or maybe one of our customers (this is a public cloud hypervisor) trying to do nasty stuff (which would not be the first time either).
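On the driver/firmware suspicion: since we have identical machines that are not showing the problem, the first thing I'll do is diff the exact i40e driver and NVM/firmware versions between a bad host and a good one; ethtool -i exposes both. A trivial sketch (placeholder interface name):

#!/usr/bin/env python3
"""Sketch: print driver and firmware versions of the uplink so they can be
diffed against a known-good host. Interface name is a placeholder."""
import subprocess

IFACE = "enp3s0f0"   # placeholder uplink interface

# "ethtool -i" reports the driver name/version and the firmware (NVM) version
info = subprocess.check_output(["ethtool", "-i", IFACE], text=True)
for line in info.splitlines():
    if line.startswith(("driver:", "version:", "firmware-version:")):
        print(line)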
Right now, the only idea I have left is to live-migrate all the VMs to another host one by one and see if I can find which one is causing the issue (if it's really a VM issue), or just empty the host, reboot it and move the load back onto it (which seems like an ugly patch that would not help identify the source of the issue).
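Before migrating blindly, it might be cheaper to first map each VM to its tap/vnet interface and watch per-VM TX counters, so that only the suspicious guests need to be moved. This won't show the bad frames directly, but it makes it easy to see which guests were pushing traffic whenever the switch counters jump. A sketch assuming libvirt/virsh; adjust to your setup:

#!/usr/bin/env python3
"""Sketch: map each running libvirt domain to its tap/vnet interface and print
its TX counters from sysfs, to see which guest's traffic lines up with the
CRC counter jumps on the switches. Assumes virsh is available on the host."""
import subprocess
from pathlib import Path

def running_domains():
    out = subprocess.check_output(["virsh", "list", "--name"], text=True)
    return [d for d in out.splitlines() if d.strip()]

def domain_interfaces(dom):
    # "virsh domiflist" prints Interface / Type / Source / Model / MAC rows
    out = subprocess.check_output(["virsh", "domiflist", dom], text=True)
    ifaces = []
    for line in out.splitlines()[2:]:   # skip the two header lines
        fields = line.split()
        if fields:
            ifaces.append(fields[0])    # tap/vnet interface name
    return ifaces

for dom in running_domains():
    for iface in domain_interfaces(dom):
        stats = Path("/sys/class/net") / iface / "statistics"
        if not stats.exists():
            continue
        tx_packets = (stats / "tx_packets").read_text().strip()
        tx_errors = (stats / "tx_errors").read_text().strip()
        print(f"{dom:30s} {iface:12s} tx_packets={tx_packets} tx_errors={tx_errors}")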
Any help would be extremely appreciated.
Thanks!