Forensics

When Tomcat stops responding to Apache

Today our multi-node tomcat servers became unresponsive to user/web traffic.  A quick look at our monitoring tools indicated that the tomcat servers were running healthily.  While the application administrator looked at catalina.out to see if we were missing something, I dug into the load balancer logs.  I immediately saw the following errors:

[Date] [error] ajp_read_header: ajp_ilink_receive failed
[Date] [error] (70007)The timeout specified has expired: proxy: read response failed from IP_ADDRESS:8009 (HOSTNAME)
[Date] [error] proxy: BALANCER: (balancer://envCluster). All workers are in error state for route (nodeName)

So we understood the problem, next we needed to understand why and how to fix it.  The AJP documentation confirmed that the default AJP connection pool is configured with a size of 200 and an accept count (request queue when all connections are busy) of 100. We confirmed that these matched our tomcat AJP settings.  Increasing the MaxClients setting in Apache’s configuration and a quick restart put us back in business.

Note:

Examining logs we could see that today there was a marked increase in testing activity that exposed this problem.  A further read of the Tomcat AJP documentation revealed that the connections remain open indefinitely until the client closes them unless the ‘keepAliveTimeout’ is set on the AJP connection pool.  Ideally, the AJP connections should grow as load increases and then reduce back to an optimal number when load decreases. The ‘keepAliveTimeout’ has the effect of closing all connections that are inactive.  Our keepAliveTimeout settings were set and working but I thought I should include that information here since if we didn’t have that setting this problem would most likely have manifested much earlier than it did.

Solution:

Configure Apache ‘MaxClients’ to be equal to the total number of Tomcat AJP ‘maxConnections’ accross all nodes.

This was already set, however you will also want to make sure you configure Tomcat AJP ‘keepAliveTimeout’ to close connections after a period of inactivity.

References:
Tomcat AJP: http://tomcat.apache.org/tomcat-7.0-doc/config/ajp.html
Apache MPM Worker: http://httpd.apache.org/docs/2.2/mod/worker.html

 

Putting ‘lsof’ to use

lsof is a powerful tool that has proven very userful over the years in troubleshooting and forensic investigations.  Here are some useful lsof command examples:

In this example we are looking at all the files a given process has open (pid=1655 here this is the zabbix agent)

lsof -p 1767

Note you can clean up the output with something like the ‘cut’ or ‘awk’ command to focus in on the columns you are most interested in.  The output from the above command looks like this:

COMMAND    PID   USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
zabbix_ag 1767 zabbix  cwd    DIR  253,0     4096       2 /
zabbix_ag 1767 zabbix  rtd    DIR  253,0     4096       2 /
zabbix_ag 1767 zabbix  txt    REG  253,0   209432 1315973 /usr/sbin/zabbix_agentd
zabbix_ag 1767 zabbix  mem    REG  253,0   156872  917626 /lib64/ld-2.12.so
zabbix_ag 1767 zabbix  mem    REG  253,0  1922152  917633 /lib64/libc-2.12.so
zabbix_ag 1767 zabbix  mem    REG  253,0   145720  917661 /lib64/libpthread-2.12.so
zabbix_ag 1767 zabbix  mem    REG  253,0    22536  917663 /lib64/libdl-2.12.so
zabbix_ag 1767 zabbix  mem    REG  253,0    91096  917658 /lib64/libz.so.1.2.3
zabbix_ag 1767 zabbix  mem    REG  253,0   598680  917655 /lib64/libm-2.12.so
zabbix_ag 1767 zabbix  mem    REG  253,0   113952  917683 /lib64/libresolv-2.12.so
zabbix_ag 1767 zabbix  mem    REG  253,0    43392  917665 /lib64/libcrypt-2.12.so
zabbix_ag 1767 zabbix  mem    REG  253,0   386040  917664 /lib64/libfreebl3.so
zabbix_ag 1767 zabbix  mem    REG  253,0   224328 1317809 /usr/lib64/libssl3.so
zabbix_ag 1767 zabbix  mem    REG  253,0  1286744 1317807 /usr/lib64/libnss3.so
zabbix_ag 1767 zabbix  mem    REG  253,0    21256  917689 /lib64/libplc4.so
zabbix_ag 1767 zabbix  mem    REG  253,0   243096  917688 /lib64/libnspr4.so
zabbix_ag 1767 zabbix  mem    REG  253,0   177952 1317480 /usr/lib64/libnssutil3.so
zabbix_ag 1767 zabbix  mem    REG  253,0    17096  917690 /lib64/libplds4.so
zabbix_ag 1767 zabbix  mem    REG  253,0   108728 1312777 /usr/lib64/libsasl2.so.2.0.23
zabbix_ag 1767 zabbix  mem    REG  253,0   183896 1317813 /usr/lib64/libsmime3.so
zabbix_ag 1767 zabbix  mem    REG  253,0    63304  917530 /lib64/liblber-2.4.so.2.5.6
zabbix_ag 1767 zabbix  mem    REG  253,0   317168  917569 /lib64/libldap-2.4.so.2.5.6
zabbix_ag 1767 zabbix  DEL    REG    0,4                0 /SYSV6c0004c9
zabbix_ag 1767 zabbix  mem    REG  253,0    65928  917605 /lib64/libnss_files-2.12.so
zabbix_ag 1767 zabbix    0r   CHR    1,3      0t0    3662 /dev/null
zabbix_ag 1767 zabbix    1w   REG  253,2      386     120 /var/log/zabbix/zabbix_agentd.log
zabbix_ag 1767 zabbix    2w   REG  253,2      386     120 /var/log/zabbix/zabbix_agentd.log
zabbix_ag 1767 zabbix    3wW  REG  253,2        4  389438 /var/run/zabbix/zabbix_agentd.pid
zabbix_ag 1767 zabbix    4u  IPv4  13481      0t0     TCP *:zabbix-agent (LISTEN)
zabbix_ag 1767 zabbix    5u  IPv6  13482      0t0     TCP *:zabbix-agent (LISTEN)

In the above: the FD column represents the File Descriptor and Mode (Read/Write).  The 4th line from the bottom has an FD value of (2w) meaning it is open for writing, makes sense since it is a log.

The -Z option for ‘lsof’ specifies how SELinux security contexts are to be handled.  This option is only available of Linux systems that have an SELinux enabled kernel.

# lsof -Z -p 1767
COMMAND    PID SECURITY-CONTEXT                USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  cwd    DIR  253,0     4096       2 /
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  rtd    DIR  253,0     4096       2 /
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  txt    REG  253,0   209432 1315973 /usr/sbin/zabbix_agentd
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   156872  917626 /lib64/ld-2.12.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0  1922152  917633 /lib64/libc-2.12.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   145720  917661 /lib64/libpthread-2.12.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0    22536  917663 /lib64/libdl-2.12.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0    91096  917658 /lib64/libz.so.1.2.3
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   598680  917655 /lib64/libm-2.12.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   113952  917683 /lib64/libresolv-2.12.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0    43392  917665 /lib64/libcrypt-2.12.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   386040  917664 /lib64/libfreebl3.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   224328 1317809 /usr/lib64/libssl3.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0  1286744 1317807 /usr/lib64/libnss3.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0    21256  917689 /lib64/libplc4.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   243096  917688 /lib64/libnspr4.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   177952 1317480 /usr/lib64/libnssutil3.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0    17096  917690 /lib64/libplds4.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   108728 1312777 /usr/lib64/libsasl2.so.2.0.23
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   183896 1317813 /usr/lib64/libsmime3.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0    63304  917530 /lib64/liblber-2.4.so.2.5.6
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0   317168  917569 /lib64/libldap-2.4.so.2.5.6
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  DEL    REG    0,4                0 /SYSV6c0004c9
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix  mem    REG  253,0    65928  917605 /lib64/libnss_files-2.12.so
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix    0r   CHR    1,3      0t0    3662 /dev/null
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix    1w   REG  253,2      386     120 /var/log/zabbix/zabbix_agentd.log
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix    2w   REG  253,2      386     120 /var/log/zabbix/zabbix_agentd.log
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix    3wW  REG  253,2        4  389438 /var/run/zabbix/zabbix_agentd.pid
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix    4u  IPv4  13481      0t0     TCP *:zabbix-agent (LISTEN)
zabbix_ag 1767 system_u:system_r:initrc_t:s0 zabbix    5u  IPv6  13482      0t0     TCP *:zabbix-agent (LISTEN)

I’ll add more when I have time, comment if you want to see something specific.