Wednesday, July 3, 2019

Need help with a routine package installation that broke my cluster’s network connection

Hello,

I have an older high-performance computing cluster running CentOS 7 that had a very strange problem today. I was installing some prerequisite packages for Python 3.7. They installed fine on the head node, but the installation then failed on the child nodes because hostname resolution was failing. I figured it was a DNS issue, so I checked the network status. The network was running fine on each node, but there was no DNS. So I elected to restart the nodes, which didn’t fix anything. After some more troubleshooting I restarted the head node, and now the network interface that handles ssh and other public connections is not detected on the network. Instead, the local interface that handles communication between the nodes seems to be multicasting on the network. Perhaps this was an issue before I installed these packages, but things were working fine the last time I did this during server maintenance just a few weeks ago. Both network interfaces appear to be up and running, since I can ping both of their IPs. The problem is that I can’t ping or connect to anything else. Everything else is unreachable.

I’m quite new to HPC and am at a loss. I haven’t changed anything on this system, and neither has the networking team. Any suggestions on what to look for to fix this issue would be wonderful.

Thanks!


