Troubleshooting

md5sum differences on identical systems

While troubleshooting a problem, a colleague noticed md5sum differences between the same files on two identical systems:

ServerOne # md5sum /lib64/libc-2.12.so
27a605fdeaf7c835493a098213c9eec1  /lib64/libc-2.12.so

ServerTwo # md5sum /lib64/libc-2.12.so
13e3eb598abd09279efc05e215e77ae2  /lib64/libc-2.12.so

Analyzing a hex dump of the .so files showed cyclical differences at matching locations.  This is where I began helping to look at the problem.  After walking through the above (these files were the focus of an application error), I suggested we do a yum reinstall.  This resulted in md5sums different from those listed above.  The cyclical differences at matching offsets turn out to be the link locations in the hex dump; as expected, those locations differ between systems.
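
For reference, on RHEL 6 /lib64/libc-2.12.so is owned by the glibc package, so the reinstall was along these lines:

# yum reinstall glibc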

After a little research I found the reason the md5sums don’t match: yum uses prelink.  ‘prelink’ is a program that modifies ELF shared libraries and ELF dynamically linked binaries to speed up their start times, and since it rewrites those files on each system independently, there will be differences in the md5sums even on identical systems.

To see the actual md5sum of the file as it was before prelink modified it, use the following command:

# prelink -y /lib64/libc-2.12.so | md5sum
14486c78922a8dc8376a715d4c78d85a  -
I ran this on the two systems in question and they match.  The files are the same on each system, therefore this was a red herring and the search continues.
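
For the record, checking both systems in one pass can be done with a quick loop over ssh (the hostnames are the ones from above):

for host in ServerOne ServerTwo; do
    echo -n "$host: "
    ssh "$host" 'prelink -y /lib64/libc-2.12.so | md5sum'
done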

Using redhat-support-tool in 10 space

OK, private IP space, but you should know that “10 space” means the private 10.0.0.0/8 address range.

The redhat-support-tool command is useful when working with a Red Hat support ticket. Once a ticket is opened with Red Hat, your next step should be to create and attach an sosreport to the ticket. If you don’t, you will waste valuable time, as their first response will be, you guessed it, please attach an sosreport. Even attaching one is no guarantee they won’t still ask, as they follow the script pretty closely.

The 90% use case for the redhat-support-tool command is adding attachments, like this:

redhat-support-tool addattachment -c CASE_NUMBER /tmp/sosreport.tar.xz
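
If you have not created the sosreport yet, generating one is as simple as running the sosreport command and answering its prompts; on RHEL 6 the archive lands in /tmp (the exact filename varies):

# sosreport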

If you have not configured /root/redhat-support-tool/redhat-support-tool.conf you will be prompted for your RHN user name and password.  Since I mentioned it, note that /root/redhat-support-tool contains your configuration file and a log file.  Also note that if you configure global settings (more on that below), those settings are stored in /etc/redhat-support-tool.conf.

Back to private IP space use.  Supposedly you can configure the proxy using the config option within redhat-support-tool, for example:

# redhat-support-tool
Command (? for help): config proxy_url proxy.your-url.domain

OR

# redhat-support-tool
Command (? for help): config proxy_url http://proxy.your-url.domain

OR setting it globally (this writes to /etc/redhat-support-tool.conf):

# redhat-support-tool
Command (? for help): config -g proxy_url http://proxy.your-url.domain

This however doesn’t always work.  Here is why, with an explanation thanks to my colleague Doug B.:

I figured out the redhat-support-tool issue.

– It’s always connecting to proxy via https, so you have to use “http://proxy.url.edu:80” in order to force it.
– It may conflict with an http_proxy environment variable.

Even unsetting the variable within the tool (with --unset proxy_url) didn’t seem to clear out an incorrect entry, even though nothing was in the config file!

In the end it’s easiest to just export http_proxy=http://proxy.url.edu:80 and not modify anything within the support tool itself.
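
Putting the workaround together, the upload then looks something like this (the proxy host and case number are placeholders):

export http_proxy=http://proxy.url.edu:80
redhat-support-tool addattachment -c CASE_NUMBER /tmp/sosreport.tar.xz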

As you can see, a frustrating problem.  Yes, we could have just transferred the file and uploaded it using the web UI or from another system, but what would we have learned from that?!

Again, thanks to Doug B. for working with me on this.

Here is a link (account required) to more details about the redhat-support-tool: https://access.redhat.com/articles/445443

yum Invalid System Credential error

I ran across the following yum error after migrating a system from being a client of Satellite 5.6 to Satellite 6.1.  First here is the error:

# yum update
Loaded plugins: package_upload, priorities, rhnplugin, search-disabled-repos, security, subscription-manager
There was an error communicating with RHN.
RHN Satellite or RHN Classic support will be disabled.

Error Message:
    Please run rhn_register as root on this client
Error Class Code: 9
Error Class Info: Invalid System Credentials.
Explanation: 
     An error has occurred while processing your request. If this problem
     persists please enter a bug report at bugzilla.redhat.com.
     If you choose to submit the bug report, please be sure to include
     details of what you were trying to do when this error occurred and
     details on how to reproduce this problem.

Setting up Update Process
rhel-6-server-rpms                                                                                                                                                            | 2.0 kB     00:00     
rhel-6-server-satellite-tools-6.1-rpms                                                                                                                                        | 2.1 kB     00:00     
No Packages marked for Update
This left me scratching my head for a bit, and a quick search didn’t produce much, so I thought I should document this for posterity.
The problem was with the contents of the file /etc/yum/pluginconf.d/rhnplugin.conf
Part of my transition is running this command:
sed -i -e 's/enabled=1/enabled=0/g' /etc/yum/pluginconf.d/rhnplugin.conf
The problem was that, unlike all of my other systems, this file must have been edited at some point, because instead of containing “enabled=1” it contained “enabled = 1”.
To correct that I modified my sed command to allow for white space around the equals sign:
sed -i -e 's/enabled\s*=\s*1/enabled=0/g' /etc/yum/pluginconf.d/rhnplugin.conf
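
A quick sanity check afterwards confirms the plugin ended up disabled regardless of the spacing used:

grep -E 'enabled\s*=' /etc/yum/pluginconf.d/rhnplugin.conf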

More details can be found in the yum.conf man page.

Hope that is helpful.

Working with Repositories

Pulling packages from multiple sources can lead to problems.  If you are running RHEL and have EPEL enabled, an update could inadvertently pull down a newer version from the wrong repository.  This doesn’t always cause a problem, but it can.  If you need to find all the EPEL packages on your system, here is how to list all packages installed from a given repo:

yum list installed | grep @epel
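
Another option, if yum-utils is installed, is yumdb, which records the repository each package was actually installed from:

yumdb search from_repo epel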


Disk Woes

I hope to never need this document again but thought it worth writing up in case someone else has need of the information.  I powered my desktop off for a planned power outage.  When I powered it back on, the system failed to boot, reporting either “Error 17” or “Error 25”; in short, the software RAID (mirrored disks) was corrupted…  The timing of this event could not have been better: the power outage included our data center, so I had to power on over 100 systems without my desktop!  Thank God for Live CDs!!  Following the power-on there were other issues to deal with, so it was almost a week before I could get to my failed desktop.  Here is what I tried:

A SATA-to-USB cable: since the drive was part of a RAID pair this didn’t work, and I didn’t waste a lot of time on it.  What it did help me discover was which disk was bad.

Knowing which disk was bad, I confirmed the failed drive using the BIOS and boot sequence on my desktop: it was /dev/sda that had failed.  I was able to get a replacement disk of the same size from our desktop support team.  With the new disk installed, here is what I did and the results.

Boot the system to an Ubuntu Live CD

I don’t have time to add much description now but the commands and sequence should hopefully help for now.  Feel free to post a question in the comments if you have any.

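# Survey the arrays and try to auto-assemble whatever mdadm can find: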
sudo mdadm --query --detail /dev/md/1
sudo mdadm --assemble --scan
sudo mdadm --query --detail /dev/md/1
sudo mdadm --assemble 
sudo mdadm --assemble --scan

sudo mdadm --query --detail /dev/md/1
sudo mdadm --query --detail /dev/md/0
sudo mdadm --query --detail /dev/md/2
sudo mdadm --query --detail /dev/md/3

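# Stop all of the arrays before working on the disks: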
sudo mdadm --stop /dev/md/0
sudo mdadm --stop /dev/md/1
sudo mdadm --stop /dev/md/2
sudo mdadm --stop /dev/md/3

sudo mdadm --query --detail /dev/md/0
sudo mdadm --query --detail /dev/md/1
sudo mdadm --query --detail /dev/md/2
sudo mdadm --stop /dev/md/2
sudo mdadm --query --detail /dev/md/3
sudo mdadm --stop /dev/md/3

sudo fdisk -l

cat /proc/mdstat 
sudo mdadm --assemble --scan

cat /proc/mdstat 
sudo mount /dev/md3 /mnt
cat /proc/mdstat 
sudo mount /dev/sdb1 /mnt

sudo fdisk -l

sudo mdadm --stop /dev/md/0

cat /proc/mdstat
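# Mark the failed disk's partitions as failed in each array: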
sudo mdadm --manage /dev/md0 --fail /dev/sda1
sudo mdadm --manage /dev/md0 --fail /dev/sda
sudo mdadm --manage /dev/md1 --fail /dev/sda2
sudo mdadm --manage /dev/md2 --fail /dev/sda3
cat /proc/mdstat
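# Copy the partition table from the good disk (sdb) to the new disk (sda):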
sudo sfdisk -d /dev/sda > sda.out
sudo sfdisk -d /dev/sdb |sudo sfdisk /dev/sda
sudo sfdisk -d /dev/sda > sda.out

sudo fdisk -l
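# Add the new disk's partitions back into each array, then watch the rebuild: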
sudo mdadm --manage /dev/md0 --add /dev/sda1
sudo mdadm --manage /dev/md1 --add /dev/sda2
sudo mdadm --manage /dev/md2 --add /dev/sda3
sudo mdadm --manage /dev/md3 --add /dev/sda5
cat /proc/mdstat 
watch cat /proc/mdstat 

Every 2.0s: cat /proc/mdstat          Mon Aug 17 13:15:31 2015

Personalities : [raid1]
md0 : active raid1 sda1[2] sdb1[1]
      4093888 blocks super 1.1 [2/2] [UU]

md1 : active raid1 sda2[2] sdb2[1]
      819136 blocks super 1.0 [2/2] [UU]

md3 : active raid1 sda5[2] sdb5[1]
      278538048 blocks super 1.1 [2/1] [_U]
      [==============>......]  recovery = 70.4% (196127360/278538048) finish=15.0min speed=91334K/sec
      bitmap: 0/3 pages [0KB], 65536KB chunk

md2 : active raid1 sda3[2] sdb3[1]
      204668800 blocks super 1.1 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>

Good Luck

Unresponsive VMware Images

Over the past week I have had two VMware images become unresponsive.  When trying to access the images via the VMware console, any action reports:

rejecting I/O to offline device

A reboot fixes the problem; however, for a Linux guy that isn’t exactly acceptable.  Upon digging a little deeper it appears the problem is with disk latency, or more specifically a loss or timeout of disk communication with the SAN.  I looked at the problem with the VMware admin and we did see a latency issue, which we reported to the storage team.  That however does not fix my problem.  What to do…  The real problem is that systems do not like temporary loss of I/O communication with their disks.  This tends to result in a kernel panic or, in this case, never-ending I/O errors.

Since this is really a problem of latency (or traffic) there are a couple of things that can be done on the Linux system to reduce the chances of this happening while the underlying problem is addressed.

There are two things you can address.  The first is swappiness (freeing memory by writing runtime memory to disk, aka swap).  The default setting is 60 out of 100, and this generates a lot of I/O.  Setting swappiness to 10 works well:

vi /etc/sysctl.conf
vm.swappiness = 10
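
To apply the new value right away without waiting for a reboot, sysctl can set it live:

sysctl -w vm.swappiness=10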

Unfortunately for me, my systems already had this setting (but I verified it anyway), so that isn’t my culprit.

The only other setting I could think of tweaking was the disk timeout threshold.  If you check your system’s timeout it is probably set to the default of 30 seconds:

cat /sys/block/sda/device/timeout
30

Increasing this value to 180 will hopefully be sufficient to help me avoid problems in the future.  You do that by adding an entry to /etc/rc.local:

vi /etc/rc.local
echo 180 > /sys/block/sda/device/timeout
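
Note that the rc.local entry above only covers sda; if the VM has more than one virtual disk, a small loop covers them all (a sketch, assuming the disks all appear as sd*):

for t in /sys/block/sd*/device/timeout; do
    echo 180 > "$t"
done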

I’ll see how things go and report back if I experience any more problems with I/O.

UPDATE (24 Sep 2015):

The above setting, while good to have, did not resolve the issue.  Fortunately I was logged into a system when it began having the I/O errors and was still able to perform some admin functions.  Poking around the system and digging through the system logs and dmesg output at the same time led me to a VMware knowledge base article about Linux 2.6 systems and disk timeouts.

I passed this on to our VMware team.  They dug deeper and determined that installing VMware Tools would accomplish the same thing.  I installed VMware Tools on the server and the problem went away!  It seems VMware Tools masks certain disk events that Linux servers are susceptible to.  There you go, hope that helps.

screen your work…

The Linux screen command is a very useful tool for many reasons. For one, you don’t need to worry about losing your session.  Sometimes long-running jobs with little or no output can lead to your remote session terminating, not usually a helpful thing.  Other benefits of the screen command are session logging (think documentation), multitasking, and session sharing.

The screen command is pretty darn easy to use but it does have some nice features that you may have to dig through the documentation to find.  I’ll give some highlights and add to this as I find new uses or useful features.  So let’s get started.

You can just issue the command ‘screen’ and you will immediately be in a screen session; by itself that’s not very useful. Of course, now that you are in a session, how do you get out?! To exit but leave the screen session open/active, type:

Ctrl-a d

To exit and terminate the screen session type:

Ctrl-a \

Terminating the screen session will prompt you with the following, potentially misleading question:

Really quit and kill all your windows [y/n]

Choosing ‘y’ only kills the current session; all other screen sessions that may be running are unaffected.

Once you leave a screen session you need to know how to re-enter it.  You need the screen session ID to do this, or you can assign a label of your own (covered shortly).  To list the active screen sessions, issue this command:

# screen -ls
There are screens on:
13986.pts-0.hostname (Detached)
13488.pts-0.hostname (Detached)
16156.mylabel (Detached)

The last session listed was assigned a label (see below).  To reattach to a session you use the label or ID number like this:

screen -r 13986
OR
screen -r mylabel

Now that you have the basics I am going to speed things up and give a bunch of examples with explanations where necessary.  You can always refer to the screen man page.

From within a screen session, “Ctrl-a n” will move you to the next window.  “Ctrl-a p” will move you to the previous window.  “Ctrl-a c” will create a new window.

The screen option -S allows you to assign a session name/label, which makes multiple screen sessions easier to manage.  The screen option -L enables logging for the session.

screen -S "mylabel" -L
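
You can also launch a long-running command directly inside a named, logged session so it survives a dropped connection; the rsync job here is only a hypothetical example:

screen -S backup -L rsync -av /home/msaba /mnt/backup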

Cleaning up your Screen Log

The log that screen produces contains a lot of special characters from typing mistakes, escape sequences, and the like, which can make it difficult to read.  This command cleans out the majority of the cruft and makes the file easier to read:

perl -ne 's/\x1b[[()=][;?0-9]*[0-9A-Za-z]?//g;s/\r//g;s/\007//g;print' < screen.0 > screen.0.readable

How do you switch between windows (say, a mail client in one and a download running in another)?

  • Switching between windows is the specialty of the screen utility. To move between two windows press Ctrl+a followed by the n key (first hit Ctrl+a, release both keys, and press n).
  • To list all windows use the command Ctrl+a followed by the " key (first hit Ctrl+a, release both keys, and press ").
  • To switch to a window by number use the command Ctrl+a followed by ' (first hit Ctrl+a, release both keys, and press '; it will prompt for the window number).

Press Ctrl-a d and screen will detach from the screen session.

Press Ctrl-a H and screen will start recording everything to a file called screenlog.X (where X is a number starting at 0).

Using screen for shared command-line interaction:

  1. Set the screen binary (/usr/bin/screen) setuid root. By default, screen is installed with the setuid bit turned off, as this is a potential security hole.
  2. The first-user starts screen in a local xterm, for example via screen -S SessionName. The -S switch gives the session a name, which makes multiple screen sessions easier to manage.
  3. The second-user uses SSH to connect to the target system.
  4. The first-user then has to allow multiuser access in the screen session via the command Ctrl-a :multiuser on (all screen commands start with the screen escape sequence, Ctrl-a).
  5. Next the first-user grants permission to the second-user to access the screen session with Ctrl-a :acladd second-user where second-user is their login ID.
  6. The second-user can now connect to the first-user’s screen session. The syntax to connect to another user’s screen session is screen -x sessionID/name.
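
Condensed into a transcript, the whole flow looks something like this (SessionName and second-user are placeholders):

first-user$ screen -S SessionName
(inside the session) Ctrl-a :multiuser on
(inside the session) Ctrl-a :acladd second-user

second-user$ ssh target-system
second-user$ screen -x first-user/SessionName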

Common screen commands

Screen command        Task
Ctrl+a c              Create a new window
Ctrl+a k              Kill the current window/session
Ctrl+a w              List all windows
Ctrl+a 0-9            Go to the window with that number (use Ctrl+a w to see the numbers)
Ctrl+a Ctrl+a         Toggle/switch between the current and previous window
Ctrl+a S              Split the terminal horizontally into regions; press Ctrl+a c to create a new window there
Ctrl+a :resize        Resize the current region
Ctrl+a :fit           Fit the screen size to the new terminal size (Ctrl+a F does the same)
Ctrl+a :remove        Remove/delete the current region (Ctrl+a X does the same)
Ctrl+a tab            Move to the next region
Ctrl+a D (Shift-d)    Power detach and logout
Ctrl+a d              Detach but keep the shell window open
Ctrl+a \              Quit screen, killing all windows
Ctrl+a ?              Display the help screen (a list of commands)

Cleaning Up Memory Usage

I noticed my Ubuntu desktop was using a rather large portion of available memory.  I usually have a lot running on my system (multiple terminals, background jobs, etc.), so this is nothing unusual.  Today however I noticed my system was sluggish, so I started digging.  Memory use was near 100%.  I closed all of my programs to see what effect that would have, but the memory usage stayed very high, around 90%.  I started to suspect a memory leak in one of the processes or programs I was running.  I really didn’t want to reboot the system, since it isn’t a Windows desktop!  What to do?  I needed to force memory cleanup on the system.  How do I analyze the memory usage on a system?  I thought I would document a few of the ways to see memory use.

You can use commands like ‘top’ and ‘vmstat’ to get an idea of what your system is chewing on.  Specifically looking at memory I tend to use:

watch -n 1 free -m

For a more detailed look use:

watch -n 1 cat /proc/meminfo

If you suspect a program of having a leak you can use valgrind to dig even deeper:

valgrind --leak-check=yes program_to_test

‘valgrind’ is great for testing; however, it is not too helpful with currently running processes, or without some experience.

So you analyze the system and determine there is memory that has not been properly freed; what do you do?  You can reboot, but that isn’t always an option.  You can force-clear the cache by doing the following:

sudo sysctl -w vm.drop_caches=3

This frees up unused but claimed memory in Ubuntu (and most Linux flavors).  This command shouldn’t affect system stability or performance; it just cleans up memory held by the Linux kernel’s caches.  That said, I have noticed the system is more responsive (contradiction, you decide).  Here is an example of how much memory you can free up with this command:

$ free
             total       used       free     shared    buffers     cached
Mem:      16287672   15997176     290496       5432     404120   14415648
-/+ buffers/cache:    1177408   15110264
Swap:      4093884          0    4093884
[msaba@nfc ~]$ sudo sysctl -w vm.drop_caches=3
[sudo] password for msaba: 
vm.drop_caches = 3
[msaba@nfc ~]$ free
             total       used       free     shared    buffers     cached
Mem:      16287672     948076   15339596       5432       1268      92708
-/+ buffers/cache:     854100   15433572
Swap:      4093884          0    4093884

Another command that can free up used or cached memory (inodes, page cache, and ‘dentries’):

sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

I have not seen any significant difference between the results of this or the first command.

I’ll add updates to this page as I think of them.  Good luck for now.

Testing Database Connectivity

Working with databases and new application installations can be really fun.  The problem is, when there is a problem, everyone starts the blame game.  Nothing unusual about that; part of an administrator’s job is to troubleshoot and prove where the problem starts.  When dealing with external databases there can be numerous problems: the firewall could be blocking, the local or remote port could be blocked on the system, or the database credentials could be incorrect.  Testing for the last helps troubleshoot all of these, and ruling out the database connection helps focus the application administrator on the real problem!  Testing a remote Oracle database is pretty simple if you have the Oracle client configured with tnsnames, etc.  But if that isn’t normally needed, you may not have it configured.  When you don’t, this is the easiest way to test the database connection via the command line:

sqlplus 'user/pass@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=hostname.network)(Port=1521))(CONNECT_DATA=(SID=remote_SID)))'

Prior to this, make sure your ORACLE_HOME environment variable is set correctly.  You may also need LD_LIBRARY_PATH set to $ORACLE_HOME/lib.
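
For example, something along these lines (the client location here is an assumption; adjust to wherever your Oracle client lives):

export ORACLE_HOME=/usr/lib/oracle/11.2/client64
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
export PATH=$ORACLE_HOME/bin:$PATH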

UPDATE: 21may2015:

Now that you are in you may want to check a few things out.  To give you a quick reminder of the syntax here are a few to get the lay of the land:

To list all tables in a database (accessible to the current user):

SQL> select table_name from user_tables;

To list the contents of a specific table, type:

SQL> select * from name_of_table;

You can find more info about the views all_tables, user_tables, and dba_tables in the Oracle documentation.

Pain often equals Progress

It has been one of those weeks.  Not fun, too many hours worked, personal events missed; you know the kind of week I am talking about.  If not… what do you do for a living?!

Despite all the pain and stress, this week has resulted in progress: an increased understanding of certain products and new ways to use old tools.  I won’t share the details of my story, just insert yours here, but I will share/document the lessons and commands I learned or rediscovered.  Here we go…

Starting a long-running process from home last night around 9PM and forgetting to start screen… priceless!  At 5:30AM this morning the process was still chugging along, and by my calculations it would be running for another 18+ hours.  Off to work with no way to grab the terminal (an ssh session); what to do?  Why, use strace of course!  Here is how:

strace -pPROCESS_PID -s9999 -e write

e.g.: strace -p3918 -s9999 -e write

Now even if my ssh session at home dies, I can still see the process output and know when it finishes and whether it had any problems.  Yes, I could have piped output to a file; you never forgot anything after working 15+ hours?

Dealing with a system that had some package inconsistencies and a yum update that failed, followed by a package-cleanup --cleandupes that erased many complete packages, I thought about using the ‘yum history’ command to revert the system, until I read this: “Use the history option for small update rollbacks.”  Here are some of the commands I used, which, due to the system’s package inconsistencies, did not perform as expected.

# yum check
# package-cleanup --cleandupes
# yum-complete-transaction
# yum check
# package-cleanup --problems
# rpm -Va --nofiles --nodigest
# yum distribution-synchronization

The rest is pretty standard stuff, at least not worth noting in this post.  The end result this week is a lot of lessons learned and a much deeper understanding of an application that I support on my server.  In all, ignoring the backlog, I’d say that is what progress looks like.