Monday, October 7, 2019

Meraki Rant

Advance warning: this is going to be long. First, some background. I bought a full Meraki stack for personal use back in April 2019. I got great pricing on it and purchased an MX67, three MR42 APs, and four MS120 switches. At first everything was great; I enjoyed configuring and tinkering with the various options and features. Sadly, that wore off shortly after I started encountering bugs. MX bugs, MR bugs, Dashboard bugs; there seems to be no end to what doesn't work correctly. As a side note, I've been an active CCIE for the last 13 years and learn best when I have the gear on hand to work with. I'm not some goofball who bought gear he knows nothing about.

On the MX side, my MX67 is afflicted by the following two really great ones in MX14.40 code:

Due to issues still under investigation, MX67(C,W) and MX68(W,CW) appliances may become inoperable after a device reboot occurs.

For a brief period of time upon boot, MX67(C,W) and MX68(W,CW) platforms can become bridged. This increases the likelihood of network loops forming in topologies with multiple inter-connected network devices for this brief period of time.

Super. So, if my MX67 happens to reboot, it might just be a brick after. Maybe. Who knows? This defect has been in every 14.x release since I bought this MX67 back in April 2019. A device-bricking defect is still around after 8 months? How is that possible? To go along with that, the MX67 bridges the WAN and LAN ports during bootup. That's super handy, because it can bridge my cable modem to my internal LAN. At best this means some internal devices get funky IP addresses; at worst it leaves open a nice security hole, especially if the device bricks itself while rebooting. Thankfully, MX 15.19 has a fix for the device-bricking issue, but it comes with this known issue:

Due to issues still under investigation, there are significant performance regressions. 

Awesome. So, I can stop my device from turning into a brick at the cost of some unspecified "significant performance regressions." I opened a case to get clarity on what those regressions were, but wasn't given any specifics. Instead I was told that they don't list bugs publicly and couldn't share which regressions might apply to me. I don't care which ones might apply; just tell me what they all are and I'll make the determination myself. Instead of being given the guidance to make an informed decision, whether to risk bricking my device or accept several performance impacts, I'm left to figure it out on my own. Then there's this bug, which is still open in all available versions of MX software:

After making some configuration changes on MX67(C,W) and MX68(W,CW) appliances, a period of packet loss may occur for 10 or more seconds. 

Why are there no specifics about which changes will cause a 10-or-more-second period of packet loss? Surely they know what those changes are; why aren't they listed? I haven't opened a case on this one, but I assume I wouldn't be given any more information.

On the MR side I've run into equally frustrating bugs. First up, group policies. Meraki pushes these as a fancy way to do all kinds of interesting things. For example, one SSID can push different clients to different VLAN assignments based on which group policy they're assigned. At first this sounded awesome: I could push my IoT devices into one VLAN and keep all my primary devices in another, all while keeping the same SSID and WPA2 PSK. Sadly, there was a huge bug where the access points wouldn't apply group policies correctly after a reboot. After an MR or MS software upgrade, or a power outage, the MRs would let clients connect, but some would get the wrong VLAN assignment because the group policy wasn't being applied. Then the group policy WOULD get applied, but only after the devices in question had already gotten a DHCP address. That left devices in the correct VLAN with an IP address from the wrong VLAN, so nothing worked. This was recently fixed in MR26.5, 8 months after I reported the defect.

Great, I'll just upgrade! The problem is MR26.5 causes frequent, random disconnects with my Nest cameras. They drop off the WiFi randomly for 10-20 minutes at a time, then come back. This appears to be DHCP related, as my DHCP server logs show DHCP DISCOVER / DHCP OFFER loops for the entire time a camera is offline. Sadly, this is (another) known issue in MR26.x where DHCP on bridged SSIDs sometimes just doesn't work. Seriously? DHCP on bridged SSIDs is broken? How basic is that function? It's been broken for MONTHS.

Ok, no problem, I'll just skip group policies, which means I can skip MR26.x and run MR25.13. Not quite. MR25.x has a VoIP RTP packet loss defect that makes all VoIP calls over wireless completely unusable. It's not just VoIP handsets that are affected, either. FaceTime on iOS/macOS also gets hosed because of this. In a household filled with iOS and macOS devices, this is unworkable.
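For what it's worth, the Nest camera symptom is at least easy to confirm from the DHCP server side. Here's a minimal sketch that flags clients stuck in a DISCOVER/OFFER loop, meaning they keep discovering without ever completing the REQUEST/ACK handshake. The log line shapes are assumptions based on dnsmasq-style logging; adjust the regexes for whatever DHCP server you actually run:

```python
import re
from collections import defaultdict

# Assumed dnsmasq-style log lines, e.g.:
#   "Oct  7 10:01:02 dnsmasq-dhcp[1234]: DHCPDISCOVER(eth0) aa:bb:cc:dd:ee:ff"
#   "Oct  7 10:01:02 dnsmasq-dhcp[1234]: DHCPOFFER(eth0) 192.168.1.50 aa:bb:cc:dd:ee:ff"
DISCOVER_RE = re.compile(r"DHCPDISCOVER\(\S+\)\s+([0-9a-f:]{17})")
MAC_RE = re.compile(r"([0-9a-f:]{17})")

def find_stuck_clients(lines, threshold=5):
    """Return {mac: discover_count} for clients that sent `threshold` or
    more DISCOVERs but never completed the handshake (no REQUEST/ACK)."""
    discovers = defaultdict(int)
    completed = set()
    for line in lines:
        if "DHCPREQUEST" in line or "DHCPACK" in line:
            m = MAC_RE.search(line)
            if m:
                completed.add(m.group(1))
            continue
        m = DISCOVER_RE.search(line)
        if m:
            discovers[m.group(1)] += 1
    return {mac: n for mac, n in discovers.items()
            if n >= threshold and mac not in completed}
```

A camera that shows up here with dozens of DISCOVERs and no ACK during its "offline" window points the finger at the wireless bridge rather than the camera itself.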

On the MS side, I haven't run into too many issues, thankfully. These are just simple L2-only switches; how bad could it be? Sadly, I recently had an issue where the aggregation switch decided it had lost connection to the Meraki cloud and therefore stopped forwarding any traffic. This took out connectivity for all devices behind it, which meant I couldn't connect to anything to troubleshoot the problem. I'd be ok with this if my connection had actually gone out, but the MX67 did not report its own connectivity problem during the exact same timeframe. Just this one MS120 switch decided to take everything down. I had to power cycle it to fix the problem. I opened a case to determine why and was told they didn't know why it dropped. Yesterday I changed STP priorities, a simple task, and it caused a 5 minute outage. STP doesn't take 5 minutes to propagate, and there are no redundant paths or loops in my network, just single point-to-point links. Why should that take 5 minutes?

I knew what I was getting into from a feature / functionality perspective before I bought the equipment. However, there's still a number of things I find particularly annoying. In no specific order:

---Troubleshooting information is NONEXISTENT. You absolutely cannot syslog anything beyond simple firewall permit/deny messages. And you can't count on support having any additional info, so some things are just impossible to troubleshoot.

---With regards to MX firewall permit/deny messages, the ONLY way to get them is via syslog. You can create the firewall rules in Dashboard, but there is zero facility to see what traffic, if any, is actually hitting those rules.
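Since syslog is the only way to see rule hits, one workaround is to point the MX's syslog export at a small collector and count decisions yourself. This is a rough sketch under stated assumptions: the exact flow-log field layout varies by MX release, so the regex here just assumes the general `src=` / `dst=` / `pattern:` shape of MX flow messages and may need adjusting against your own logs:

```python
import re
import socket
from collections import Counter

# Assumed MX flow-log shape (fields may differ per release):
#   "1570000000.1 MX67 flows src=10.0.1.5 dst=8.8.8.8 ... pattern: allow all"
FLOW_RE = re.compile(r"src=(\S+)\s+dst=(\S+).*pattern:\s+(\S+)")

def tally(messages):
    """Count allow/deny decisions per (verdict, src, dst) triple."""
    counts = Counter()
    for msg in messages:
        m = FLOW_RE.search(msg)
        if m:
            src, dst, verdict = m.groups()
            counts[(verdict, src, dst)] += 1
    return counts

def listen(port=5514):
    """Yield raw syslog messages from a UDP socket (blocks; run separately
    and feed the generator into tally())."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        data, _ = sock.recvfrom(4096)
        yield data.decode(errors="replace")
```

It's a sad state of affairs that a homegrown UDP listener is the only way to answer "is anything hitting this rule?", but it at least makes the data visible.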

---Group policy firewall rules aren't actually firewall rules. They're ACLs. They aren't stateful. If you actually use group policy, which you shouldn't because of the above issue, have fun creating return ACL entries for all possible return traffic flows. There are even fewer options here: destination IP, port, and protocol. No source IP, no source port, nothing. This limited selection of fields makes dealing with return traffic practically impossible.
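To make the statefulness point concrete, here's a toy model, purely illustrative and not Meraki's implementation, contrasting the two behaviors. A stateful firewall remembers the outbound flow and admits the matching reply automatically; a destination-only ACL has to match the reply on its own, and since the client's ephemeral source port is unpredictable, the "return rule" ends up absurdly broad:

```python
class StatefulFirewall:
    """Toy stateful filter: outbound flows create implicit return entries."""
    def __init__(self):
        self.expected_replies = set()

    def outbound(self, src, sport, dst, dport):
        # Remember the reply we expect: dst answers back to src's ephemeral port.
        self.expected_replies.add((dst, dport, src, sport))
        return True

    def inbound(self, src, sport, dst, dport):
        return (src, sport, dst, dport) in self.expected_replies

def stateless_inbound_allowed(rules, dst, dport, proto):
    """Group-policy-style ACL check: only destination IP, port, and protocol
    are available, so a reply to an ephemeral port only passes if a rule
    names that exact port (or the rule is written wide open)."""
    return (dst, dport, proto) in rules
```

With the stateful model, one outbound rule covers the conversation; with the stateless one, you either enumerate ephemeral ports (impossible) or permit broad destination ranges, which is exactly the complaint above.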

---Packet capturing via the Dashboard is a handy feature, except it's broken. Want to capture for 3 minutes? Sorry, it'll stop after 10-15 seconds and show you no data. According to support this is a known issue, and I've had a case open since late June with no fix.

---Stats via the Dashboard are handy, but the byte counts aren't accurate. Values shown at different places on the screen don't add up to the totals shown at the top. According to support this is another known issue.

---Absolutely no IPv6 support to speak of. In 2019.

---Want to configure specific traffic shaping rules so you can put particular traffic in a higher priority queue? Sorry, you can only specify destination IP addresses here; no source IP/port/DSCP combinations. Or you can use the nebulous predefined "Layer 7" rules, but there are no specifics about what those rules actually match.

I could go on, but honestly I'm tired of thinking about it. I spent a decent chunk of change on this gear and it has been a very frustrating experience. I've hit way more bugs with it than with the Ubiquiti gear I had before. I'm throwing in the towel. I'm gutting the Meraki gear, I've purchased a new Ubiquiti stack, and I'll use that going forward. This was truly a disappointing experience.


