Monday, April 19, 2021

PMTUD problem with ECMP?

I know PMTUD has a history of breaking, mostly because of ICMP blackholes, but this case seems to be different.

This is happening on our K8s servers, but I could also reproduce it on a minimal CentOS 8.3 installation.

ECMP is configured with a static route on the client, PMTUD is enabled (the default), and the kernel version is 4.18:

[root@centos8-client ~]# ip route
default
	nexthop via 192.168.39.177 dev enp1s0 weight 1 # PMTUD works via this one
	nexthop via 192.168.39.178 dev enp1s0 weight 1 # PMTUD does not work
<...>

The MTU is set to 9000 on the client and on the internal interfaces of the routers, while it’s only 1500 on the external interfaces of the routers.

The router is therefore expected to send an ICMP "fragmentation needed" message to the client when the client tries to send a packet larger than 1500 bytes to the outside. In response, the client is expected to lower the path MTU it uses for subsequent packets towards that destination.
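To make the sizes concrete, here is a small sketch of the arithmetic behind the ping tests in this post. It assumes a plain 20-byte IPv4 header without options and the 8-byte ICMP echo header; the function names are made up for illustration.

```shell
#!/bin/sh
# Header sizes in bytes. Assumption: IPv4 header without options.
IP_HDR=20
ICMP_HDR=8

# Total IPv4 packet size produced by `ping -s <payload>`.
ping_packet_size() {
    echo $(( $1 + ICMP_HDR + IP_HDR ))
}

# Largest `ping -s` payload that still fits into a given MTU unfragmented.
max_ping_payload() {
    echo $(( $1 - ICMP_HDR - IP_HDR ))
}

ping_packet_size 1600   # 1628, the "1600(1628)" that ping reports
max_ping_payload 1500   # 1472
max_ping_payload 9000   # 8972
```

So a `-s1600` ping leaves the client as a 1628-byte packet, which fits the 9000-byte internal MTU but not the 1500-byte external one, and that is what should trigger the ICMP message from the router.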

PMTUD works as expected when the ECMP hash picks the first nexthop entry, .177 in my case:

[root@centos8-client ~]# ping -c2 -s1600 192.168.122.4
PING 192.168.122.4 (192.168.122.4) 1600(1628) bytes of data.
From 192.168.39.177 icmp_seq=1 Frag needed and DF set (mtu = 1500)
1608 bytes from 192.168.122.4: icmp_seq=2 ttl=64 time=0.468 ms

--- 192.168.122.4 ping statistics ---
2 packets transmitted, 1 received, +1 errors, 50% packet loss, time 3ms
rtt min/avg/max/mdev = 0.468/0.468/0.468/0.000 ms

However, when the second nexthop gets picked, PMTUD does not work:

[root@centos8-client ~]# ping -c2 -s1600 192.168.122.3
PING 192.168.122.3 (192.168.122.3) 1600(1628) bytes of data.
From 192.168.39.178 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 192.168.39.178 icmp_seq=2 Frag needed and DF set (mtu = 1500)

--- 192.168.122.3 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 2ms

The client evidently receives the ICMP frag needed message but apparently ignores it. I have ruled out iptables and any other firewall on the client.
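One way to check whether the client actually stored the lower path MTU is to look at the route the kernel resolves for each destination. The sketch below assumes that iproute2 prints a learned PMTU as an `mtu N` token on the `cache` line of `ip route get`; the helper name `cached_pmtu` is made up for illustration.

```shell
#!/bin/sh
# Print the path MTU found in `ip route get` output read from stdin,
# or "none" if the output carries no "mtu N" token.
cached_pmtu() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i == "mtu") { print $(i + 1); found = 1; exit } }
         END { if (!found) print "none" }'
}

# Compare the destination that works with the one that does not.
if command -v ip >/dev/null 2>&1; then
    for dst in 192.168.122.4 192.168.122.3; do
        printf '%s: ' "$dst"
        ip route get "$dst" 2>/dev/null | cached_pmtu
    done
fi
```

If the working destination shows 1500 after the first failed ping while the other one keeps showing "none", that would suggest the ICMP message arrives but no PMTU exception is recorded for the second nexthop.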

I ruled out the routers too, because PMTUD works just fine through either of them when ECMP is not in use.

So it's really just ECMP breaking PMTUD, and breaking it only for the second nexthop, not for the first one.

Did I really stumble upon a kernel issue, or am I missing something fundamental about PMTUD/ECMP?

Disabling PMTUD of course solves the problem, but I'm reluctant to do that. Should I just accept it as a necessity?
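For reference, a minimal sketch for checking the current setting on a host. It assumes the knob in question is the `net.ipv4.ip_no_pmtu_disc` sysctl, where 0 is the kernel default and leaves PMTUD enabled; the function name `pmtud_state` is made up for illustration.

```shell
#!/bin/sh
# Map a net.ipv4.ip_no_pmtu_disc value to a short description.
# 0 leaves PMTUD enabled; any non-zero mode disables or restricts it.
pmtud_state() {
    case "$1" in
        0) echo "enabled" ;;
        *) echo "disabled or restricted (mode $1)" ;;
    esac
}

# Report the setting of the running kernel, if the file is readable.
f=/proc/sys/net/ipv4/ip_no_pmtu_disc
if [ -r "$f" ]; then
    pmtud_state "$(cat "$f")"
fi
```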


