Unresponsive VMware Images

Over the past week I have had two vmware images become unresponsive. When trying to access the images via the vmware console, any action reports:

rejecting I/O to offline device

A reboot fixes the problem, however for a Linux guy that isn't exactly acceptable. Upon digging a little deeper it appears the problem is disk latency, or more specifically a communication loss or timeout between the guest and the SAN. I looked at the problem with the vmware admin and we did see a latency issue, which we reported to the storage team. That however does not fix my problem. What to do…  The real problem is that systems do not like a temporary loss of I/O communication with their disks. This tends to result in a kernel panic or, in this case, never-ending I/O errors.

Since this is really a problem of latency (or traffic) there are a couple of things that can be done on the Linux system to reduce the chances of this happening while the underlying problem is addressed.

There are two things you can address. The first is swappiness (freeing memory by writing runtime memory to disk, aka swap). The default setting is 60 out of 100, which generates a lot of I/O. Setting swappiness to 10 works well:

vi /etc/sysctl.conf
vm.swappiness = 10
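The sysctl.conf entry only takes effect at boot. To apply the new value immediately, or to confirm what the running kernel is using, you can use sysctl directly (as root):

```shell
# reload the settings from /etc/sysctl.conf
sysctl -p

# or set the value directly
sysctl -w vm.swappiness=10

# verify the running value
cat /proc/sys/vm/swappiness
```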

Unfortunately for me, my systems already have this setting (I verified it), so that isn't my culprit.

The only other setting I could think of tweaking was the disk timeout threshold. If you check your system's timeout, it is probably set to the default of 30 seconds:

cat /sys/block/sda/device/timeout

Increasing this value to 180 will hopefully be sufficient to avoid problems in the future. You can do that by adding an entry to /etc/rc.local:

vi /etc/rc.local
echo 180 > /sys/block/sda/device/timeout
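The rc.local approach only covers sda and only runs at boot, so a disk added later would keep the 30-second default. A udev rule can apply the timeout to every SCSI disk as it appears. This is a sketch; the rule file name 99-disk-timeout.rules is my own choice:

```
# /etc/udev/rules.d/99-disk-timeout.rules (hypothetical filename)
# Set the command timeout to 180s on every SCSI disk (type 0) as it is added.
ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0", ATTR{timeout}="180"
```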

I’ll see how things go and report back if I experience any more problems with I/O.

UPDATE (24 Sep 2015):

The above setting, while good to have, did not resolve the issue. Fortunately I was logged into a system when it began having the I/O errors and was still able to perform some admin functions. Poking around the system and digging through the system logs (dmesg) at the same time led me to a vmware knowledge base article about linux 2.6 systems and disk timeouts.

I passed this on to our vmware team. They dug deeper and determined that installing vmware tools would accomplish the same thing. I installed vmware tools on the server and the problem went away! It seems vmware tools masks certain disk events that linux servers are susceptible to. There you go, hope that helps.

After-Clone Network Configuration

The best part about virtual environments is the ability to clone new hosts from old ones. Most of our infrastructure resides in vmware, so when we clone a system it retains the old NIC settings. This is how you get the network interface working after cloning a system:

vi /etc/udev/rules.d/70-persistent-net.rules
       remove eth0
       rename eth1 to eth0
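As a sketch of what that edit should leave behind (the MAC address is a placeholder for the clone's actual NIC), the remaining rule names the new interface eth0:

```
# /etc/udev/rules.d/70-persistent-net.rules after editing:
# the old eth0 line is deleted; the clone's NIC (formerly eth1) is renamed
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:50:56:xx:xx:xx", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
```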

vi /etc/sysconfig/networking/devices/ifcfg-eth0
       change the MAC address, IP, netmask
       make sure ONBOOT=yes
       If UUID is present, delete it.
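A minimal ifcfg-eth0 along those lines might look like this (the MAC address, IP, and netmask are placeholders for the clone's values):

```
# sketch of ifcfg-eth0 with placeholder values
DEVICE=eth0
HWADDR=00:50:56:xx:xx:xx
BOOTPROTO=static
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes
# no UUID line -- deleted per the step above
```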

vi /etc/sysconfig/networking/profiles/default/hosts
       change the host file definitions

vi /etc/sysconfig/network
       change the HOSTNAME

That is it for configuring the system. Sometimes you will also need to reload the network driver; I run the following commands as part of the process since they do no harm:

modprobe -r vmxnet3
modprobe vmxnet3

That’s it.  I usually reboot the system to clean up anything cached, however you should be able to ifdown/ifup eth0.

Don’t forget to clean up your other configuration files that will have the old system information.  Backup definitions, monitoring, etc.