Unresponsive VMware Images

Over the past week I have had two VMware images become unresponsive.  When trying to access the images via the VMware console, any action reports:

rejecting I/O to offline device

A reboot fixes the problem, but for a Linux guy that isn’t exactly acceptable.  Digging a little deeper, it appears the problem is disk latency, or more specifically a loss of communication (or timeout) between the guest and the SAN.  I looked at the problem with the VMware admin and we did see a latency issue, which we reported to the storage team.  That, however, does not fix my problem.  What to do…  The real problem is that systems do not tolerate even a temporary loss of I/O communication with their disks.  This tends to result in a kernel panic or, in this case, never-ending I/O errors.
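If you want to confirm what the console is showing, the same errors should also be visible in the kernel log (assuming the ring buffer hasn’t already wrapped):

dmesg | grep -i "rejecting i/o"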

Since this is really a problem of latency (or traffic) there are a couple of things that can be done on the Linux system to reduce the chances of this happening while the underlying problem is addressed.

There are two things you can address.  The first is swappiness (how aggressively the kernel frees memory by writing runtime memory to disk, aka swap).  The default setting is 60 out of 100, which generates a lot of I/O.  Setting swappiness to 10 works well:

vi /etc/sysctl.conf
vm.swappiness = 10
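
To pick up the change without a reboot, and to confirm the value the kernel is actually using, something along these lines should do it:

sysctl -p
sysctl vm.swappiness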

Unfortunately for me, my systems already had this setting (I verified it), so that isn’t my culprit.

The only other setting I could think of tweaking was the disk timeout threshold.  If you check your system’s timeout, it is probably set to the default of 30 seconds:

cat /sys/block/sda/device/timeout
30
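
If your VM has more than one virtual disk, a quick loop will show the timeout on all of them (a sketch, assuming the disks all show up as sd*):

for t in /sys/block/sd*/device/timeout; do echo "$t: $(cat $t)"; done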

Increasing this value to 180 seconds will hopefully be enough to avoid problems in the future.  You can make the change persist across reboots by adding an entry to /etc/rc.local:

vi /etc/rc.local
echo 180 > /sys/block/sda/device/timeout
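
One caveat: that rc.local entry only covers sda.  If the VM has several disks, or disks can be hot-added, a loop in rc.local or a udev rule covers them all.  Here is a rough sketch of both (the rule file name is just my own choice):

# /etc/rc.local - set the timeout on every SCSI disk at boot
for t in /sys/block/sd*/device/timeout; do echo 180 > "$t"; done

# /etc/udev/rules.d/99-scsi-timeout.rules - set the timeout as disks appear
ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0", ATTR{timeout}="180"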

I’ll see how things go and report back if I experience any more problems with I/O.

UPDATE (24 Sep 2015):

The above setting, while good to have, did not resolve the issue.  Fortunately I was logged into a system when it began having the I/O errors and was still able to perform some admin functions.  Poking around the system and digging through the system logs and dmesg output at the same time led me to a VMware knowledge base article about Linux 2.6 systems and disk timeouts.

I passed this on to our VMware team.  They dug deeper and determined that installing VMware Tools would accomplish the same thing.  I installed VMware Tools on the server and the problem went away!  It seems VMware Tools masks certain disk events that Linux servers are susceptible to.  There you go, hope that helps.
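
If you want to try the same fix, installing the tools is straightforward.  On the RHEL/CentOS systems I run, the open-vm-tools package provides the VMware Tools functionality (the package name and install command are assumptions about your distro):

yum install open-vm-tools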