Wednesday, September 29, 2021

CRC errors when swapping Gigabit switch with new Ten Gigabit switch

Hey guys,

I'm scratching my head on this one and decided to post to hopefully get some feedback.

We're replacing the iSCSI switches in our VMware environment with 10 Gigabit switches. We currently have 2x Cisco 2960X switches, and they've been running this environment for years with no issues. The Ciscos are basically used as dumb switches and are not connected to each other.

We have 2x DELL R730 ESXi servers using Broadcom BCM5720 Gigabit NICs to connect to the iSCSI switches.

The SAN, a DELL ME4024, has multiple 10Gb NICs connected to the iSCSI switches.

Everything is configured with jumbo frames (MTU >= 9000).
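As a sanity check on the MTU numbers, here is a minimal sketch of the largest ping payload that fits in one unfragmented packet. It assumes IPv4 with no header options (20 bytes) and a standard 8-byte ICMP echo header; the function name is just illustrative.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in a single packet of the given MTU."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return payload


if __name__ == "__main__":
    print(max_ping_payload(9000))  # 8972
    print(max_ping_payload(1500))  # 1472
```

So with a 9000-byte MTU end to end, a ping with the don't-fragment bit set should pass with a payload of up to 8972 bytes, and fail above that.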

We're slowly upgrading our environment to 10Gb, so we decided to replace the 2960Xs with DELL S4128T-ON 10Gb switches (formerly the Force10 line).

As soon as I swap the Ciscos for the DELLs, I start seeing CRC errors on the new DELL switch interfaces connected to the ESXi hosts. The SAN-facing interfaces show no CRC errors at all; only the ESXi host interfaces do. We never had any CRC errors with the Ciscos. VMware is heavily impacted when using the DELL switches: it reports high latency, and you can feel it. I rolled back to the Ciscos for now.

Here is some information I gathered:

  • The CRC errors happen on both new switches, for all iSCSI NICs on the hosts. The CRC count increases with datastore utilization; I've seen as many as 1500 CRC errors on a single interface within an hour of normal VMware operations.
  • The switch interfaces for the hosts and the hosts themselves in vCenter report the same speed and duplex (1000, Full), so it does not look like a speed / duplex mismatch.
  • I tried limiting the speed to 1000 on the ESXi hosts and on the switch interfaces for the hosts; no change.
  • On those S4128T-ON, I can't find a way to disable auto-negotiation and force full duplex, so I couldn't try this.
  • Cables are CAT6 and known to be working fine.
  • I verified that jumbo frames are working in VMware by using vmkping with a high MTU. When the hosts are writing to the SAN (vMotion for instance), I lose some pings (< 10%).
  • Firmware for the hosts (including the BCM5720) is up to date as of April 2021.
  • Firmware for the DELL switches is the latest available.
  • Running an ESXi 7.0U2 build from the end of August 2021, so the drivers for the Broadcom NICs are quite recent.
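To compare interfaces on equal footing, the raw CRC counters can be turned into a rate. This is only a rough sketch: the function name is made up, and it assumes you note the counter value from the switch by hand at two points in time.

```python
def crc_errors_per_minute(count_start: int, count_end: int, elapsed_seconds: float) -> float:
    """Average CRC errors per minute between two readings of an interface counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if count_end < count_start:
        # The counter was cleared or wrapped between the two readings.
        raise ValueError("counter went backwards; take a fresh pair of readings")
    return (count_end - count_start) * 60.0 / elapsed_seconds


if __name__ == "__main__":
    # Worst case seen so far: 1500 CRC errors on one interface within an hour.
    print(crc_errors_per_minute(0, 1500, 3600))  # 25.0
```

That worst case works out to about 25 CRC errors per minute on a single host-facing port, compared to zero on the SAN-facing ports.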

The Ciscos are Gigabit switches while the DELLs are 10Gb switches. Could there be an incompatibility between the BCM5720 Gigabit NICs and the DELL switches? I haven't found anything relevant about this yet. There is newer firmware available for the hosts that I can try, and maybe newer drivers from VMware or Broadcom, but most of the firmware and drivers are already pretty recent.

At this point I don't know whether this is a networking issue or a VMware issue. I'm troubleshooting the network first, since the equipment I changed is a network switch. I'm ruling out a defective switch for now, since both switches do the exact same thing on the host interfaces.

If you guys have any input or advice, that would be appreciated.

Thanks!

Neo.


