Monday, August 5, 2019

HPE Layer 3 Switches - Slow Routing?

I’m stumped. I’ve had two separate instances of this happening on two separate models of HP equipment.

My organization is a small business with 500-700 devices on premise, both Ethernet and wireless. We have four main Ethernet access VLANs, two voice VLANs, and separate VLANs for our secure wireless SSID and guest wireless SSID. We often have professional development sessions or conferences which can bring hundreds of guests on premise with their devices.

Previously, we had a single appliance serving as the gateway device for each VLAN, with two router-on-a-stick (ROAS) trunk links to it from our switches. This device is also our firewall/NAT device. We have two large layer 3 switches serving as the distribution switches for each side of our campus, and their layer 3 functionality was not being used at all. So, I moved the gateway interfaces for each of our VLANs to the layer 3 switches and configured OSPF for learning routes. The firewall appliance now serves as the core, routing between the two layer 3 switches. This seems like a more efficient architecture to me than bridging VLANs across to the firewall, and it should reduce load on that device (we have more printers than staff members, which is insane, and some can be rather chatty on the network; I like the idea of limiting our broadcast domains as much as possible).

As soon as this configuration change was made, users started complaining about web pages loading slowly. I have seen it myself. Pages that are not cached spend about 10 seconds loading, error out in the browser, and then connect. After that initial connection has been established, the site responds totally normally for that client. This behavior is only present when using the HP L3 switches as gateways. It totally disappears when using the firewall appliance.

Our department provides managed services for other organizations as well. In an organization of a similar size and with similar design and equipment (HP Aruba L3 switches), I have observed this exact same phenomenon. Using the switches for routing results in the same delayed web TCP connections. This organization has been suffering from this for some time (we started providing service only this year, and “web pages sometimes loading slowly” was and still is the number one user complaint).

Here is what I have ruled out:

- DNS - name resolution works normally; some sites even move to a redirected URL or a CNAME alias BEFORE having the connection latency issue
- Routing - the routing design is very simple: OSPF for internal subnets, then default routes on the L3 switches to the firewall appliance
  - ICMP shows no such latency, so why just web TCP?
- Switching - nothing abnormal in the broadcast domains; we use rapid PVST+ in both networks and don’t suffer from excess broadcast traffic (plus we have multiple VLANs for such a small client group)
- Browsers - this is happening on multiple browsers; the issue does not appear to be related to browser caches as it is happening on uncached sites upon first connection attempt
  - clearing the browser cache and reconnecting does NOT suffer the same latency
- Switch firmware/software - I have upgraded software on all the L3 switches involved without any change
- QoS - I do not have a single shaping or policing policy or other QoS tool running on these switches at the moment
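One way to back up the DNS and TCP observations with numbers is to time name resolution and the TCP handshake separately from a client behind each gateway. The following is only a rough sketch using Python's standard `socket` module; the host names, the port, and the 1-second threshold are placeholder assumptions, and `timed`, `probe`, and `slow_phase` are hypothetical helper names, not part of any tool.

```python
import socket
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start


def probe(host, port=443, timeout=30.0):
    """Time name resolution and the TCP handshake separately for one host.

    Raises OSError if resolution or the connection attempt fails.
    """
    infos, dns_s = timed(
        socket.getaddrinfo, host, port, socket.AF_INET, socket.SOCK_STREAM
    )
    addr = infos[0][4]  # (ip, port) of the first IPv4 result
    # Connect to the resolved IP so no second lookup is counted here.
    sock, connect_s = timed(socket.create_connection, addr, timeout)
    sock.close()
    return {"host": host, "ip": addr[0], "dns_s": dns_s, "connect_s": connect_s}


def slow_phase(sample, threshold_s=1.0):
    """Name the phase that met or exceeded the threshold: 'dns', 'connect', or None."""
    if sample["dns_s"] >= threshold_s:
        return "dns"
    if sample["connect_s"] >= threshold_s:
        return "connect"
    return None


if __name__ == "__main__":
    # Placeholder hosts (assumption); use sites the client has not visited yet.
    for name in ("example.com", "example.org"):
        sample = probe(name)
        print(sample, slow_phase(sample))
```

Running the same script from a client whose gateway is the L3 switch and from one whose gateway is the firewall appliance would show whether the delay sits in the handshake itself rather than in the lookup.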

I don’t get why this is happening ONLY when using these L3 switches. The whole point of this change was to enhance routing performance by lessening the layer 2 load on the firewall. Has anyone experienced this with HPE and have any insight? What else can I check for? It’s difficult to get a good packet capture because the problem is so sporadic.
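Because the problem is sporadic, one option is to leave a small probe running and note when slow connections occur, then line a capture up with those times. The sketch below is an assumption-laden illustration, not a tested tool: the target address is a documentation placeholder, the thresholds are guesses, and `connect_time`, `capture_windows`, and `watch` are hypothetical names. Since the delay seems to hit first connections to a site, probing a rotating set of destinations may reproduce it better than hitting one address repeatedly.

```python
import socket
import time
from datetime import datetime


def connect_time(ip, port=443, timeout=30.0):
    """Return seconds taken by the TCP handshake, or None if it failed."""
    start = time.monotonic()
    try:
        socket.create_connection((ip, port), timeout).close()
    except OSError:
        return None
    return time.monotonic() - start


def capture_windows(events, pad_s=30.0):
    """Merge slow-event timestamps (epoch seconds) into padded (start, end) windows."""
    windows = []
    for t in sorted(events):
        start, end = t - pad_s, t + pad_s
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window, so extend it instead of adding one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


def watch(ip, rounds, interval_s=5.0, threshold_s=1.0):
    """Probe repeatedly; return timestamps of slow or failed connects."""
    slow = []
    for _ in range(rounds):
        took = connect_time(ip)
        if took is None or took >= threshold_s:
            now = time.time()
            slow.append(now)
            print(datetime.fromtimestamp(now).isoformat(), ip, took)
        time.sleep(interval_s)
    return slow


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-range placeholder (assumption); replace it.
    events = watch("192.0.2.10", rounds=3)
    print(capture_windows(events))
```

The merged windows give a short list of time ranges to pull from a rolling capture on the switch uplink or the client, instead of sifting through hours of traffic.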


