Wednesday, April 11, 2018

[Troubleshooting] Networked PDUs become intermittently unreachable. (xpost /r/datacenter)

Disclaimer: please forgive me if this post is questionable. I think this is a network issue, but I'm not sure, and I'm trying to get to the bottom of it.

I have a couple of crypto mining "datacenters"; each one consists of about 50 servers in a shipping container.

For power, the machines are plugged into TrippLite networked PDUs (these ones: http://www.provantage.com/tripp-lite-pdumv30hvnet~7TRP904L.htm).

Everything (servers, PDUs, a couple of Raspberry Pi devices) is on the same flat network. Each container has a router I bought from pfsense.org, which is more than sufficient for the task. It handles DHCP, and things are generally fine.

The physical topology of the network looks like this:

WAN <- pfsense (gateway & DHCP) <- ethernet switch #1 <- ethernet switch #2

Everything that becomes a DHCP client is plugged into switch #1 or #2. Switch #2 is also plugged into switch #1. (These are just unmanaged gigabit ethernet switches.)

The issue that I run into is that occasionally one or more of my TrippLite PDUs become unreachable.

The PDUs show up fine right after being plugged in: they get an address and respond normally on the network, and all of them work as expected at first. However, after a day or so, PDUs randomly become unreachable.

I can't ssh into them or ping them, and they don't even show up in arp -a; they're just not present on the network. Nevertheless, the unit is powered up and the servers plugged into its outlets are running just fine.

I can work around the issue by physically shutting off the breaker that powers the PDU, then turning it back on again. The PDU comes back, finds the network, and responds normally.

However, this defeats the whole purpose of networked PDUs when I have to actually go there to get the PDU back online.

Has anybody else seen a similar issue with this brand of PDU? Have you seen this very thing? How did you solve it?

I'm considering writing a little process that runs on one of my rpi's and just fires a tcp ping at each PDU every other minute, to see if that keeps them active. But I'm honestly not hopeful about that, and it's a shitty little hack even if it does work.
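For reference, here's a minimal sketch of what that keepalive process could look like in Python. The addresses in PDU_HOSTS are made up, and probing port 22 assumes the PDUs' SSH service listens on the default port. A "tcp ping" here just means opening and closing a TCP connection. It also logs a timestamp per probe, so at the very least it would show when a PDU drops off.

```python
import socket
import time
from datetime import datetime

# Hypothetical addresses; substitute the static leases assigned to the PDUs.
PDU_HOSTS = ["192.168.1.201", "192.168.1.202"]
SSH_PORT = 22  # assumption: the PDUs' SSH service listens on the default port


def tcp_probe(host, port=SSH_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def probe_all(hosts, port=SSH_PORT, timeout=3.0):
    """Probe every host once and return a {host: reachable} mapping."""
    return {host: tcp_probe(host, port, timeout) for host in hosts}


def keepalive_loop(hosts, interval_s=120, rounds=None):
    """Probe all hosts every interval_s seconds; rounds=None runs forever."""
    done = 0
    while rounds is None or done < rounds:
        stamp = datetime.now().isoformat(timespec="seconds")
        for host, up in probe_all(hosts).items():
            print(f"{stamp} {host} {'up' if up else 'UNREACHABLE'}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_s)
```

Calling keepalive_loop(PDU_HOSTS) would run it indefinitely at the two-minute interval. Whether the traffic actually keeps the PDUs reachable is untested.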

Any help you can provide much appreciated!

[Edit] I have some PDUs with their network cables plugged into switch #1 and some into switch #2; there doesn't seem to be any correlation as to which ones eventually go offline.

[Edit] FWIW the PDUs get a statically configured lease with assigned IP address based on MAC address (they are the only things that do, all other devices just ask for a lease and get a random address).


