Host Disconnects Unexpectedly

Without going into too much detail, we had some networking issues on one of our ESXi hosts. One of the physical NICs would flap between 100Mbps and 1Gbps. This was one of two physical NICs in an active/active team for my Management vmkernel adapter. The NICs are in a team, so there’s nothing to worry about, right? Ehhh. We started to see random, unexpected host disconnects. What gives?

[Screenshot: PSOD’dy goodness]

One of my colleagues was able to find this in the vpxa.log on the host:

Agent can't send heartbeats: No buffer space available
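If you want to check your own hosts for the same symptom, the log lives at /var/log/vpxa.log on the ESXi host, so a quick search from the ESXi shell along these lines should turn it up (just plain grep, nothing from the KB):

grep -i "no buffer space" /var/log/vpxa.log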

This brought us to KB2149738, “‘Agent can’t send heartbeats: No buffer space available’ error in the VPXA log”. Essentially, it states that when the TCP/IP stack determines that a network adapter is more than 80% utilized, VPXA stops sending management traffic for 70ms to help relieve the congestion. If a heartbeat comes due during this 70ms pause, VPXA prevents the packet from being sent and logs the “Agent can’t send heartbeats: No buffer space available” error in the vpxa.log on the host.

The resolution states that the error is expected and can be safely ignored as long as the host isn’t dropping out of vCenter. Well, ours dropped out of vCenter. Thankfully, the KB does provide a couple of workarounds. The first is to modify the network adapter utilization threshold. As an example, the following command sets the utilization threshold to 90%:

esxcfg-advcfg -s 90 /Net/TcpipTxqMaxUsageThreshold
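If you want to double-check what the host is currently set to (before or after making the change), the same setting can be read back from the ESXi shell. These are just the standard esxcfg-advcfg/esxcli get commands, not something the KB calls out:

esxcfg-advcfg -g /Net/TcpipTxqMaxUsageThreshold

esxcli system settings advanced list -o /Net/TcpipTxqMaxUsageThreshold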

The second recommended solution is to increase the amount of time vCenter will wait before marking a host as not responding (the default is 120 seconds). To accomplish this, adjust the following key in the vCenter Server’s Advanced Settings:

config.vpxd.heartbeat.notRespondingTimeout

Upon further investigation into our issue, we found that the one physical NIC had flapped back down to 100Mbps and the second physical NIC had gone down entirely. So all of the management traffic was funneling over a single 100Mbps connection, which choked under the pressure!
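For anyone chasing the same thing, the physical NICs’ link state and negotiated speed can be checked right from the ESXi shell; this is generic esxcli and not specific to this issue:

esxcli network nic list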

Because of this, we opted not to implement any of the recommended workarounds. We are continuing to work with our networking team to have them fix the ports.
