Saturday, February 2, 2019

Followup to: 10gbe 70% packet loss- solved... M4300 CARP issues

First, thank you all for your support and help and ideas. Even y'all that were spiteful. I love you guys too :)

SO as of today everything is (mostly) working as planned. I say "mostly" because some other unexpected (expected?) issues arose, but otherwise everything is flowing along correctly.

First, let's see: CARP/VIP issues. One HA unit (2 machines) had a bad interconnect. I called it from day one, but I didn't know squat so it was ignored. I was told that 'it must be something new' when I finally whittled it down to the missing interconnect port on one of the nodes. We're waiting to RMA that.

As for the other HA box, the reason the VIP constantly broke? The sysad at that site had an IP conflict with another piece of hardware. Combine that with the Netgear M4300 switch, which apparently does NOT raise the correct warnings or enforce the protocols. I don't know what to say here/there yet, but I'm going to try and raise the issue with Netgear to see if that's an outstanding bug for VIPs or if something else is weird. It was diagnosed by watching the local ARP table on a Windows machine and matching the MAC addresses line by line against the other machines. Since the MAC of a CARP VIP sits in a certain prefix, it was easy to find once you knew what to look for.
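For anyone wanting to do the same ARP matching without eyeballing it, here is a rough sketch of the idea, not the exact procedure used on site. It assumes the usual Windows `arp -a` layout (IP, dash-separated MAC, type) and relies on CARP using the same virtual MAC range as VRRP, 00:00:5e:00:01:XX, where XX is the VHID. The sample addresses and the helper names are made up for illustration.

```python
import re
from collections import defaultdict

# CARP (like VRRP) uses virtual MACs of the form 00:00:5e:00:01:<VHID>.
CARP_PREFIX = "00:00:5e:00:01:"

# One entry of Windows `arp -a` output: IP, dash-separated MAC, entry type.
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+((?:[0-9a-fA-F]{2}-){5}[0-9a-fA-F]{2})\s+(\w+)"
)

def parse_arp(text):
    """Return (ip, mac) pairs, with MACs normalised to lower-case colon form."""
    entries = []
    for line in text.splitlines():
        m = ARP_LINE.match(line)
        if m:
            entries.append((m.group(1), m.group(2).lower().replace("-", ":")))
    return entries

def carp_entries(entries):
    """Entries whose MAC is in the CARP virtual range, plus the decoded VHID."""
    return [(ip, mac, int(mac[-2:], 16))
            for ip, mac in entries if mac.startswith(CARP_PREFIX)]

def conflicts(snapshots):
    """IPs that showed up with more than one MAC across several ARP snapshots."""
    seen = defaultdict(set)
    for text in snapshots:
        for ip, mac in parse_arp(text):
            seen[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}

# Hypothetical snapshots taken a few seconds apart (documentation addresses).
SNAPSHOT_A = """
Interface: 192.0.2.50 --- 0x4
  Internet Address      Physical Address      Type
  192.0.2.1             00-00-5e-00-01-01     dynamic
  192.0.2.10            3c-ec-ef-00-00-01     dynamic
"""
SNAPSHOT_B = """
Interface: 192.0.2.50 --- 0x4
  Internet Address      Physical Address      Type
  192.0.2.1             aa-bb-cc-00-00-02     dynamic
  192.0.2.10            3c-ec-ef-00-00-01     dynamic
"""

print(carp_entries(parse_arp(SNAPSHOT_A)))
print(conflicts([SNAPSHOT_A, SNAPSHOT_B]))
```

If the VIP's address flips between the CARP MAC and some ordinary vendor MAC from one snapshot to the next, something else on the segment is answering for that IP.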

Second issue: the switch wasn't properly configured for IGMP. Many of you pointed to that, and I certainly spent tens of hours running it down. So (improperly) I turned it all on, and it's been working fine. That's not the correct solution, but it'll do until I get the customer to sign off on accepting the hardware. That, and pegging down each of the settings. There are still VLANs and management interfaces that need to be done too, so some of this will be corrected then.

Third, the packet loss: See above.

Fourth, the 1x40gbE to 4x10gbe breakouts: Well, that was interesting. For the Chelsio cards to function properly the switch had to have static LAG turned off, so basically dynamic LACP. Once LACP was enabled everything was goodish.
In addition, it was discovered that the Chelsio adapters were NOT flashed correctly from the factory. Reflashing them to the correct firmware did the trick.
In even MORE addition, my wonderful purchasing department couldn't follow instructions and bought the wrong adapters... again... for the 3rd time. Once I engaged the supplier directly and shipped out the gear for reflashing, they came back with the right firmware to match the hardware. Geezus, I can't imagine doing this in a data center.

Fifth, performance: Even with 2x 10gbE connections that are not teamed (THAT is still an issue: it used to work, now it's broken with Intel), I can move around almost all the data I need. Here's an iperf run (done in a hurry because I had 20 mins to get it done before the customer pulled my cable):

[ ID] Interval        Transfer     Bandwidth
[ 4]  0.00-1.00 sec   928 MBytes   7.78 Gbits/sec
[ 4]  1.00-2.00 sec   751 MBytes   6.30 Gbits/sec
[ 4]  2.00-3.00 sec   783 MBytes   6.56 Gbits/sec
[ 4]  3.00-4.00 sec   788 MBytes   6.62 Gbits/sec
[ 4]  4.00-5.00 sec   792 MBytes   6.64 Gbits/sec
[ 4]  5.00-6.00 sec   752 MBytes   6.31 Gbits/sec
[ 4]  6.00-7.00 sec   92.2 MBytes  774 Mbits/sec
[ 4]  7.00-8.00 sec   93.6 MBytes  785 Mbits/sec
[ 4]  8.00-9.00 sec   111 MBytes   932 Mbits/sec
[ 4]  9.00-10.00 sec  109 MBytes   917 Mbits/sec

[ ID] Interval        Transfer     Bandwidth
[ 4]  0.00-10.00 sec  5.08 GBytes  4.36 Gbits/sec  sender
[ 4]  0.00-10.00 sec  5.08 GBytes  4.36 Gbits/sec  receiver

You can see some weird stuff there (throughput falls off a cliff to under 1 Gbit/sec from the 6-second mark on), but most of the other runs were just fine.
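If you'd rather not squint at iperf output, something like the sketch below can flag the bad seconds automatically. It parses the per-second lines from the run above, averages them, and marks any interval that drops below half the median rate. The "half the median" cutoff and the function names are my own arbitrary picks for illustration, not anything iperf provides.

```python
import re
from statistics import median

# Per-second lines copied from the iperf run in this post.
IPERF_OUTPUT = """\
[ 4] 0.00-1.00 sec 928 MBytes 7.78 Gbits/sec
[ 4] 1.00-2.00 sec 751 MBytes 6.30 Gbits/sec
[ 4] 2.00-3.00 sec 783 MBytes 6.56 Gbits/sec
[ 4] 3.00-4.00 sec 788 MBytes 6.62 Gbits/sec
[ 4] 4.00-5.00 sec 792 MBytes 6.64 Gbits/sec
[ 4] 5.00-6.00 sec 752 MBytes 6.31 Gbits/sec
[ 4] 6.00-7.00 sec 92.2 MBytes 774 Mbits/sec
[ 4] 7.00-8.00 sec 93.6 MBytes 785 Mbits/sec
[ 4] 8.00-9.00 sec 111 MBytes 932 Mbits/sec
[ 4] 9.00-10.00 sec 109 MBytes 917 Mbits/sec
"""

# Captures interval start, interval end, rate value and rate unit (G or M).
LINE = re.compile(
    r"\[\s*\d+\]\s+([\d.]+)-([\d.]+)\s+sec\s+[\d.]+\s+\w+\s+([\d.]+)\s+([GM])bits/sec"
)

def parse_intervals(text):
    """Return (start_sec, end_sec, rate_in_mbits_per_sec) for each interval line."""
    rows = []
    for line in text.splitlines():
        m = LINE.search(line)
        if m:
            start, end, value, unit = m.groups()
            mbps = float(value) * (1000 if unit == "G" else 1)
            rows.append((float(start), float(end), mbps))
    return rows

def find_drops(rows, fraction=0.5):
    """Intervals whose rate is below `fraction` of the median rate."""
    cutoff = fraction * median(r[2] for r in rows)
    return [r for r in rows if r[2] < cutoff]

rows = parse_intervals(IPERF_OUTPUT)
mean_mbps = sum(r[2] for r in rows) / len(rows)
print(f"mean: {mean_mbps / 1000:.2f} Gbits/sec")
for start, end, mbps in find_drops(rows):
    print(f"drop: {start:.0f}-{end:.0f} sec at {mbps:.0f} Mbits/sec")
```

The mean of the ten per-second rates comes out at about 4.36 Gbits/sec, which matches the summary line, and the four intervals from 6 to 10 seconds are the ones that get flagged.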

SO, thank you all. Quite grateful for the ideas. Doing this all remotely was practically impossible but it got done.

Src Links:
https://www.reddit.com/r/Cisco/comments/a7s2em/sg350xg48_carp_ha_compatibility_netgear_m4300_and/
https://www.reddit.com/r/networking/comments/a6bzx4/10gbe_70_packet_loss/