Disk Management

Disk Woes

I hope to never need this document again, but thought it worth writing up in case someone else needs the information.  I powered my desktop off for a planned power outage.  When I powered it back on, the system failed to boot, reporting either “Error 17” or “Error 25”; in short, the software RAID (mirrored disks) was corrupted…  The timing could not have been worse: the outage included our data center, so I had to power on over 100 systems without my desktop!  Thank God for Live CDs!!  Following the power-on there were other issues to deal with, so it was almost a week before I could get to my failed desktop.  Here is what I tried:

A SATA-to-USB cable: since the drive was part of a RAID pair this didn’t work, and I didn’t waste a lot of time on it.  What it did help me discover was which disk was bad.

Knowing which disk was suspect, I confirmed the failed drive using the BIOS and boot sequence on my desktop.  It was /dev/sda that had failed.  I was able to get a replacement disk of the same size from our desktop support team.  With the new disk installed, here is what I did and the results.

Boot the system to an Ubuntu Live CD

I don’t have time to add much description right now, but the commands and their sequence should hopefully help.  Feel free to post a question in the comments if you have any.
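Before the command sequence, it can help to double-check from the Live CD which disk is the bad one; assuming smartmontools is available on the live image (and that /dev/sda is the suspect disk, as it was in my case), something like this works:

# A degraded mirror shows up as [_U], and a failed member as (F)
cat /proc/mdstat
# Query the suspect drive's SMART health status
sudo smartctl -H /dev/sda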

# Check the state of the arrays and try to assemble them from the Live CD
sudo mdadm --query --detail /dev/md/1
sudo mdadm --assemble --scan
sudo mdadm --query --detail /dev/md/1
sudo mdadm --assemble --scan

# Check the detail of each array
sudo mdadm --query --detail /dev/md/1
sudo mdadm --query --detail /dev/md/0
sudo mdadm --query --detail /dev/md/2
sudo mdadm --query --detail /dev/md/3

# Stop the arrays so the disks can be examined and repartitioned
sudo mdadm --stop /dev/md/0
sudo mdadm --stop /dev/md/1
sudo mdadm --stop /dev/md/2
sudo mdadm --stop /dev/md/3

# Verify each array is stopped, stopping again where needed
sudo mdadm --query --detail /dev/md/0
sudo mdadm --query --detail /dev/md/1
sudo mdadm --query --detail /dev/md/2
sudo mdadm --stop /dev/md/2
sudo mdadm --query --detail /dev/md/3
sudo mdadm --stop /dev/md/3

sudo fdisk -l

# Re-assemble what can be assembled and check the RAID status
cat /proc/mdstat
sudo mdadm --assemble --scan

# Mount the arrays/partitions to verify the data is intact
cat /proc/mdstat
sudo mount /dev/md3 /mnt
cat /proc/mdstat
sudo mount /dev/sdb1 /mnt

sudo fdisk -l

sudo mdadm --stop /dev/md/3

# Make sure the old sda members are marked failed in each array
cat /proc/mdstat
sudo mdadm --manage /dev/md0 --fail /dev/sda1
sudo mdadm --manage /dev/md1 --fail /dev/sda2
sudo mdadm --manage /dev/md2 --fail /dev/sda3
cat /proc/mdstat

# Dump the replacement disk's partition table, copy the partition layout
# from the good disk (sdb) to the new disk (sda), then dump again to verify
sudo sfdisk -d /dev/sda > sda.out
sudo sfdisk -d /dev/sdb | sudo sfdisk /dev/sda
sudo sfdisk -d /dev/sda > sda.out

# Verify the new partition table, then add the new partitions back into the arrays
sudo fdisk -l
sudo mdadm --manage /dev/md0 --add /dev/sda1
sudo mdadm --manage /dev/md1 --add /dev/sda2
sudo mdadm --manage /dev/md2 --add /dev/sda3
sudo mdadm --manage /dev/md3 --add /dev/sda5

# Watch the mirrors rebuild
cat /proc/mdstat
watch cat /proc/mdstat

Every 2.0s: cat /proc/mdstat          Mon Aug 17 13:15:31 2015

Personalities : [raid1]
md0 : active raid1 sda1[2] sdb1[1]
      4093888 blocks super 1.1 [2/2] [UU]

md1 : active raid1 sda2[2] sdb2[1]
      819136 blocks super 1.0 [2/2] [UU]

md3 : active raid1 sda5[2] sdb5[1]
      278538048 blocks super 1.1 [2/1] [_U]
      [==============>......]  recovery = 70.4% (196127360/278538048) finish=15.0min speed=91334K/sec
      bitmap: 0/3 pages [0KB], 65536KB chunk

md2 : active raid1 sda3[2] sdb3[1]
      204668800 blocks super 1.1 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>

Good Luck

Unresponsive VMware Images

Over the past week I have had two VMware images become unresponsive.  When trying to access the images via the VMware console, any action reports:

rejecting I/O to offline device

A reboot fixes the problem; however, for a Linux guy that isn’t exactly acceptable.  Digging a little deeper, it appears the problem is disk latency, or more specifically a loss of communication (or timeout) between the guest and the SAN.  I looked at the problem with the VMware admin and we did see a latency issue, which we reported to the storage team.  That, however, does not fix my problem.  What to do…  The real problem is that systems do not tolerate even a temporary loss of communication with their disks.  This tends to result in a kernel panic or, in this case, never-ending I/O errors.

Since this is really a problem of latency (or traffic) there are a couple of things that can be done on the Linux system to reduce the chances of this happening while the underlying problem is addressed.

The first thing you can address is swappiness (freeing memory by writing runtime memory to disk, a.k.a. swap).  The default setting is 60 out of 100, which generates a lot of I/O.  Setting swappiness to 10 works well:

vi /etc/sysctl.conf
vm.swappiness = 10
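An entry in /etc/sysctl.conf only takes effect at boot, so you will probably also want to apply the change immediately; either of these should do it:

# Set the value at runtime
sudo sysctl vm.swappiness=10
# Or reload all settings from /etc/sysctl.conf
sudo sysctl -p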

Unfortunately for me, my systems already had this setting (I verified it), so that wasn’t my culprit.

The only other setting I could think of tweaking was the disk timeout threshold.  If you check your system’s timeout, it is probably set to the default of 30 seconds:

cat /sys/block/sda/device/timeout
30

Increasing this value to 180 will hopefully be enough to avoid problems in the future.  You do that by adding an entry to /etc/rc.local so it is set at every boot:

vi /etc/rc.local
echo 180 > /sys/block/sda/device/timeout
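If the guest has more than one disk, the same entry can be turned into a small loop so every sd* device gets the longer timeout; a rough sketch for /etc/rc.local:

# Raise the SCSI timeout to 180 seconds on every sd* disk at boot
for t in /sys/block/sd*/device/timeout; do
    echo 180 > "$t"
done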

I’ll see how things go and report back if I experience any more problems with I/O.

 UPDATE (24 Sep 2015):

The above setting, while good to have, did not resolve the issue.  Fortunately I was logged into a system when it began having the I/O errors, and I was still able to perform some admin functions.  Poking around the system and digging through the system logs and dmesg output at the same time led me to a VMware knowledge base article about Linux 2.6 systems and disk timeouts.

I passed this on to our VMware team.  They dug deeper and determined that installing VMware Tools would accomplish the same thing.  I installed VMware Tools on the server and the problem went away!  It seems VMware Tools hides certain disk events that Linux servers are susceptible to.  There you go, hope that helps.

6GB free = 100% disk usage?!

What do you do when you have plenty of available disk space but the system is telling you the disk is full?!  I was working on a server migration, moving 94GB of user files from the old server to the new one.  Since we aren’t planning on much growth on the new server, I provisioned a 100GB partition for the user files.  A perfect plan, right?…  So I thought.  After rsync’ing the user files, the new server was showing 100% disk usage:

Filesystem*            Size  Used Avail Use% Mounted on*
/dev/mapper/my_lv_name
                       99G   94G  105M 100% /user_dir

Given competing tasks, at first glance I only saw the 100%.  Naturally I assumed something had gone wrong with my rsync, or that I had forgotten to clear the target partition.  So I deleted everything from the target partition and rsync’d again.  When the result was the same, it gave my brain pause to say… what?!

My first thought was that the block size was different on the two servers: the old server’s block size was 4kB, so perhaps the new server had a larger block size.  As we joked, too much air in the files!  It turns out, using the following command, the block size was the same on both systems:

usage: blockdev --getbsz <partition>

# blockdev --getbsz /dev/mapper/my_lv_name
4096

So the block size of the file system on both servers is 4kB.

I started digging through the man pages of tune2fs and dumpe2fs (and Google) to see if I could figure out what was consuming the disk space.  Perhaps there was a defunct process that was holding the blocks (like from a deletion); there wasn’t.  In my research I found the root cause: new ext2/3/4 filesystems set aside a 5% reserve for file system performance and to ensure available space for “important” root processes.  Not a bad idea for the root and var partitions, but this approach doesn’t make sense in most other use cases, in this case user data.

Using the tune2fs command, we can see the “Reserved block count” like this:

tune2fs -l /dev/mapper/vg_name-LogVol00

The specific lines we are interested in are:

Block count:              52165632
Reserved block count:     2608282

These lines show that there is a 5% reserve on the disk/logical volume (2,608,282 / 52,165,632 ≈ 5%).  We fix it with this command:

tune2fs -m 1 /dev/mapper/vg_name-LogVol00

This reduces the reserve to 1%.  The resulting reserved block count reflects this:

Block count:              52165632
Reserved block count:     521107
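For what it’s worth, if you know up front that a filesystem will only hold user data, the reserve can also be set when the filesystem is created rather than tuned afterwards; a sketch using the same (renamed) device:

# Create an ext4 filesystem with a 1% (instead of the default 5%) reserved-block percentage
mkfs.ext4 -m 1 /dev/mapper/vg_name-LogVol00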

While this situation is fairly unusual, hopefully this will at least answer your questions and help you better understand the systems you manage.

*The names in the above have been changed to protect the innocent.

Flush This!

I came across this today and learned something new, so I thought I would share it here.

After killing two processes that had hung, I noticed the following in the ps output:

root       373     2  0 Jun11 ?        00:00:00 [kdmflush]
root       375     2  0 Jun11 ?        00:00:00 [kdmflush]
root       863     2  0 Jun11 ?        00:00:00 [kdmflush]
root       867     2  0 Jun11 ?        00:00:00 [kdmflush]
root      1132     2  0 Jun11 ?        00:01:03 [flush-253:0]
root      1133     2  0 Jun11 ?        00:00:43 [flush-253:2]

Now kdmflush I am used to seeing, but flush-253: was something I had never noticed, so I decided to dig.  I started with man flush, but that led nowhere since I am not running sendmail or any mail server.  I turned to Google (not too proud to admit it) and searched “linux process flush”.  It turns out ‘flush-<major>:<minor>’ is a kernel writeback thread that flushes dirty pages to disk so the RAM can be reused.  So ‘flush’ was trying to write out dirty pages from virtual memory, most likely associated with the processes I had just killed.

I discovered these commands that shed more light on what is actually happening:

grep 253 /proc/self/mountinfo 
20 1 253:0 / / rw,relatime - ext4 /dev/mapper/vg_kfs10-lv_root rw,seclabel,barrier=1,data=ordered
25 20 253:3 / /home rw,relatime - ext4 /dev/mapper/vg_kfs10-lv_home rw,seclabel,barrier=1,data=ordered
26 20 253:2 / /var rw,relatime - ext4 /dev/mapper/vg_kfs10-LogVol03 rw,seclabel,barrier=1,data=ordered
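Another way to map the 253:N numbers back to their device-mapper/LVM names, assuming the dmsetup utility is installed:

# List device-mapper devices along with their major:minor numbers
sudo dmsetup ls
# The /dev/mapper symlinks also point at the matching dm-N devices
ls -l /dev/mapper/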

Remember, my listings were for flush-253:0 and flush-253:2, so I now know which filesystems are being written to.  Another interesting command is the following, which shows the activity of writing out dirty pages:

watch grep -A 1 dirty /proc/vmstat
nr_dirty 2
nr_writeback 0

If these numbers are significantly higher you may have a bigger problem on your system, though from what I have read this can also simply indicate syncing.  If this becomes a problem on your server, you can tune the kernel’s writeback behavior in /etc/sysctl.conf to head it off by adding the following lines:

vm.dirty_background_ratio = 50
vm.dirty_ratio = 80

Then (as root) execute:

# sysctl -p

The “vm.dirty_background_ratio” setting controls at what percentage of memory the Linux kernel starts the background task of writing out dirty pages.  The above increases this setting from the default of 10% to 50%.  The “vm.dirty_ratio” setting controls at what percentage all I/O writes become synchronous, meaning we cannot issue I/O calls without waiting for the underlying device to complete them (which is something you never want to happen).
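If you want to see what your system is currently using before touching anything, sysctl can print the live values:

# Show the current writeback thresholds
sysctl vm.dirty_background_ratio vm.dirty_ratio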

I did not add these to the sysctl.conf file but thought it worth documenting.

Changing the Volume Group Name

One of the problems with cloning a system is that it has the same volume group names as the server it was cloned from.  Not a huge problem but it can limit your ability to leverage the volume group.  The fix appears easy but there is a gotcha.

Red Hat provides a nice utility: vgrename

If you use that command and think you are done, you will be sorely mistaken if your root file system is on a volume group!  I’m speaking from experience there, so listen up!

If you issue the vgrename command:

# vgrename OldVG_Name NewVG_Name

it works like a charm.  But if you happen to reboot your system at this point you are in big trouble… the system will kernel panic.

If you updated/changed the name of the VG that contains the root file system you need to modify the following two files to reflect the NewVG_Name.

  1. In /etc/fstab.   This one is obvious and I usually remember.
  2. In /etc/grub.conf.  Otherwise the kernel tries to mount the root file system using the old volume group name.

The change is easy using ‘vi’.  Open the file in vi, then run a global search and replace (sed-style syntax) from within vi; for example:

vi /etc/fstab
:%s/OldVG_Name/NewVG_Name/g
:wq

Don’t forget to save the changes, and to make the same edit in /etc/grub.conf.
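If you would rather make the change non-interactively, sed itself can edit both files in one pass; a rough sketch using the same placeholder names as above (the .bak suffix keeps a backup copy of each file):

sed -i.bak 's/OldVG_Name/NewVG_Name/g' /etc/fstab /etc/grub.conf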

Increasing the size of a filesystem

fdisk -l          # list the disks and find the new (unused) one
fdisk /dev/sdc    # partition the new disk

In fdisk

c  (toggle DOS compatibility mode; optional)
p  (print the partition table to make sure the disk is not in use)
n (new partition)
p (primary partition)
1 (give it a number 1-4, then set start and end sectors)
w (write table to disk and exit)

Now create a physical volume, add it to the VG, extend the LV and then the file system.

pvcreate /dev/sdc1            # initialize the new partition as an LVM physical volume
vgextend VG_NAME /dev/sdc1    # add it to the volume group
lvextend -L +5G LV_PATH       # LV_PATH, e.g. /dev/VG_NAME/LV_NAME
resize2fs LV_PATH             # grow an ext2/3/4 file system to fill the LV
xfs_growfs MOUNT_POINT        # or, if using XFS, grow it via its mount point

Done.
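As a side note, newer versions of lvextend can grow the file system in the same step using the -r (--resizefs) flag, which saves you from remembering the right resize tool; a quick sketch with the same placeholder names:

# Extend the LV by 5GB and resize the file system on it in one command
lvextend -r -L +5G /dev/VG_NAME/LV_NAME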

Other useful commands when working with disks include:

# lsblk
NAME                             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sr0                               11:0    1  1024M  0 rom  
sda                                8:0    0 501.1M  0 disk 
└─sda1                             8:1    0   500M  0 part /boot
sdb                                8:16   0  29.5G  0 disk 
└─sdb1                             8:17   0  29.5G  0 part 
  ├─vg_name-lv_root (dm-0) 253:0    0  40.6G  0 lvm  /
  └─vg_name-lv_swap (dm-1) 253:1    0   3.7G  0 lvm  [SWAP]
sdc                                8:32   0    20G  0 disk

The lsblk command lists all block devices.  As shown above, it is an easy way to see disks, their sizes, and LVM affiliations.  Of course, if you just want the block device names, this will work too:

ls /sys/block/ | grep sd
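If you want a trimmed-down listing, lsblk also takes output options; for example, to show just the whole disks and their sizes:

# -d limits the output to whole disks, -o selects the columns to print
lsblk -d -o NAME,SIZE,TYPE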