Thursday, January 23, 2020

DHCP Failover offering clients new IPs well before the lease is up

Hi. I'm at a total loss here and could really use some pointers, or possibly an explanation of some network logic I'm not seeing. I am relatively new to networking, so please excuse me if I use the wrong terminology when describing the issues here, but I've done my utmost to read all the pertinent documentation and I'm still stumped. I'm convinced I must be missing something about how this works or should be working, so I'm going to describe what my network looks like and how I think DHCP failover works, and if there are any red flags, let me know.

DHCP leases on my network in a particular VLAN don't seem to be consistently honored. We're running a DHCP failover pair with the default 50/50 split. Rather than expand the subnet, the previous engineer combined three subnets under one SVI (one primary IP, two secondaries), and extended this one VLAN to an absurd number of users. The SVI has IP helpers pointing to both DHCP failover peers (Infoblox), does not have proxy ARP enabled, and has IP redirects disabled. In Infoblox, the three subnets are configured as a shared network, and each subnet has a DHCP range serviced by the failover pair. This seems to be the correct configuration, based on DHCP failover documentation from ISC and Infoblox themselves.

It is critical that devices retain a single IP, without receiving a new one. If I'm correct, a DHCPDISCOVER arrives at both failover peers, and the MAC is hashed to a value between 0 and 255, with hashes < 128 going to peer 1 (m1) and hashes >= 128 going to peer 2 (m2), since we have it configured to 50/50 load balance. Here's where I think I understand, but correct me if I'm wrong: only one peer handles the offer, request, and ack, and when done, updates the other peer with the lease information. If the other peer continues to see DHCPDISCOVERs whose elapsed time value has passed the max load balance delay, it will send an offer as well.
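To make that mental model concrete, here is a minimal Python sketch of the decision each peer would make as I understand it. It is not Infoblox's or ISC's actual code. The function names, the `SPLIT` and `MAX_LB_SECS` values, and the XOR hash are all assumptions for illustration only (RFC 3074 specifies a Pearson-style hash, which is not reproduced here).

```python
from functools import reduce

SPLIT = 128        # assumed 50/50 split: primary owns buckets 0..127
MAX_LB_SECS = 3    # assumed "max load balance delay", in seconds


def bucket(mac: str) -> int:
    """Map a MAC address to a bucket in 0..255.

    Stand-in hash (XOR of the six octets), NOT the RFC 3074 hash.
    """
    octets = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    return reduce(lambda a, b: a ^ b, octets, 0)


def should_answer(mac: str, secs: int, is_primary: bool,
                  split: int = SPLIT, max_secs: int = MAX_LB_SECS) -> bool:
    """Return True if this peer should answer the client's DHCPDISCOVER."""
    # Once the client's elapsed-time ("secs") field passes the cutoff,
    # load balancing is bypassed and either peer may answer.
    if secs > max_secs:
        return True
    # Otherwise only the peer that owns the client's bucket answers.
    owned_by_primary = bucket(mac) < split
    return owned_by_primary if is_primary else not owned_by_primary


if __name__ == "__main__":
    mac = "00:11:22:33:44:55"
    for secs in (0, 4):
        print(secs,
              should_answer(mac, secs, is_primary=True),
              should_answer(mac, secs, is_primary=False))
```

Under this model, a DHCPDISCOVER with a low elapsed-time value should draw an offer from exactly one peer, and the second peer should only join in once the elapsed time passes the cutoff.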

The devices that connect to the network can be expected to reboot extremely frequently, connecting to the network to get instructions from a controller, before rebooting again. It is critical that these devices retain their IPs across reboots, since changing IPs will break their relationship with the controller and ruin a lot of people's afternoons.

Occasionally, one of these devices will, at some point in this reboot cycle, not receive the same IP it had before, which completely derails these tests. It's maddening! In the logs for both members, I can see that for some reason, both members seem to be responding to all DHCPDISCOVERs, despite the elapsed time value not increasing. They usually both respond with the correct IP in their offer, until one of them doesn't.

This has apparently been happening for years. The previous engineer worked with Infoblox TAC and concluded that it was an issue with IP Device Tracking, but this does not seem to be the case currently. I personally wonder if shifting the load balancing from 50/50 to 95/5 would reduce the chance of a peer offering the wrong lease, but that seems like a workaround rather than a fix.



