RHEL 5.3 excessive file handles

Following on from my previous post, I’d like to share some conclusions and how they came about.

A scheduled reboot was performed and the Nagios agent and scripts were corrected. However, this wasn’t the cause, which we had largely determined prior to the reboot, as I had them stopped for about 2 cycles of sar reports (20 minutes) with no change in the pattern.

After the reboot, the problem of course happened again: the sar -v output showed file-sz increasing over time, somewhat like 3 steps forward and 2 steps back. So you get a slow overall increase, although, as seen in the previous post, around 500 handles are consumed at a time.

I suspected something Oracle-related, so pointed out some processes that seemed to be respawning every 7 minutes, but it turned out these were database scripts accessing the database. However, one of the DBAs investigated further and believed the emagent process was acting a bit odd. Upon further investigation, stopping the agent and checking the current cat /proc/sys/fs/file-nr output showed the allocated-handles value drop by over 10000. Thus it looks like we found the issue, so of course they had to defer it to Oracle support. At least we now have a way to bring the value down: a restart of emagent (which is Oracle Grid Control related, I am told).
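The before/after check above can be sketched as a small script. This is a sketch assuming a RHEL 5 box: the first field of /proc/sys/fs/file-nr is the number of allocated file handles (the same value sar -v reports as file-sz), and the `emctl stop agent` step and the numbers are illustrative, not measurements from our system.

```shell
#!/bin/sh
# released_handles BEFORE AFTER -> how many handles a stop/restart freed.
# BEFORE and AFTER are the first (allocated) field of /proc/sys/fs/file-nr.
released_handles() {
    echo $(( $1 - $2 ))
}

# Typical use on a live system (commands commented out here):
#   before=$(awk '{print $1}' /proc/sys/fs/file-nr)
#   # stop the suspect agent as the oracle user, e.g. emctl stop agent
#   after=$(awk '{print $1}' /proc/sys/fs/file-nr)
#   released_handles "$before" "$after"

released_handles 16200 5900   # example values only
```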

To aid others, if anyone is seeing something similar: I’d recommend making sure that the system-wide fs.file-max value is greater than the limits.conf hard limit for the user you suspect might be causing it. On our system, fs.file-max was set to 65536, but so was the oracle user’s hard limit for nofile. Thus when the oracle user consumed all the handles, the system also had issues, e.g. we were unable to ssh in. So increase the system-wide setting, which can be done without a reboot. Then for any users you suspect of causing the issue, set limits via the limits.conf file and monitor. If a process running as one of those users consumes all of its file handles, only the software run by that user will suffer, and the greater system-wide value will still allow you to ssh in and do various other things.
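As a rough sketch of the two settings involved (the 131072 value is just an example, pick something suited to your workload; the commands need root):

```shell
# 1. System-wide ceiling, applied immediately with no reboot:
#      sysctl -w fs.file-max=131072
#    and persisted across reboots in /etc/sysctl.conf:
#      fs.file-max = 131072
#
# 2. Per-user hard limit in /etc/security/limits.conf, kept below
#    the system-wide ceiling so the box stays reachable:
#      oracle  soft  nofile  65536
#      oracle  hard  nofile  65536
#    New logins for the oracle user pick up the nofile limit.
```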

Hope this helps anyone else out on the world wide web, as it was certainly a good problem to investigate.

RHEL 5.3 sar -v output continues to show a file-sz increase

Has anyone out on the world wide web seen this before?

sar -v

16:50:01    dentunusd   file-sz  inode-sz  super-sz %super-sz  dquot-sz %dquot-sz  rtsig-sz %rtsig-sz
17:00:01        22144      5100     18581         0      0.00         0      0.00         0      0.00
17:10:01        22274      5610     18591         0      0.00         0      0.00         0      0.00
17:20:01        22631      5610     18832         0      0.00         0      0.00         0      0.00
17:30:01        22744      5610     18822         0      0.00         0      0.00         0      0.00
17:40:01        23233      6120     19172         0      0.00         0      0.00         0      0.00
17:50:01        23563      6120     19381         0      0.00         0      0.00         0      0.00
18:00:01        23702      5610     19395         0      0.00         0      0.00         0      0.00
18:10:01        24023      6120     19583         0      0.00         0      0.00         0      0.00
18:20:01        24093      6630     19522         0      0.00         0      0.00         0      0.00
18:30:01        24441      6630     19738         0      0.00         0      0.00         0      0.00

What I am referring to is the file-sz value increasing; in fact, on the system in question it continues to increase until it hits the system limit, and then processes start to fail. Which is not what I want.
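For anyone wanting to watch this between sar samples: as far as I can tell, sar -v’s file-sz column comes from the first (allocated) field of /proc/sys/fs/file-nr, so you can read the live value directly. A minimal sketch, assuming a RHEL 5 style /proc layout:

```shell
#!/bin/sh
# /proc/sys/fs/file-nr has three fields on RHEL 5:
# allocated handles, free handles, and the fs.file-max ceiling.
read allocated free maximum < /proc/sys/fs/file-nr
echo "allocated=$allocated free=$free max=$maximum"
```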

Any tips for trying to pinpoint the application that might be causing it, and any associated commands? The platform is RHEL 5.3 x64, and the system runs three Oracle Database instances.
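One way to narrow it down, sketched here without needing lsof installed, is to count the entries under each /proc/&lt;pid&gt;/fd directory and rank processes by open descriptors (run as root to see other users’ processes; field 2 of /proc/&lt;pid&gt;/stat is the process name):

```shell
#!/bin/sh
# fd_count PID -> number of file descriptors currently open by PID,
# read straight from /proc (works on RHEL 5's 2.6.18 kernel).
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Rank the top descriptor consumers; processes that exit mid-loop
# are silently skipped by the 2>/dev/null redirections.
for p in /proc/[0-9]*; do
    pid=${p#/proc/}
    echo "$(fd_count "$pid") $pid $(awk '{print $2}' "$p/stat" 2>/dev/null)"
done 2>/dev/null | sort -rn | head
```

If lsof is available, a per-user view is also possible with something like `lsof -n | awk 'NR>1 {print $3}' | sort | uniq -c | sort -rn`, since the third lsof column is the owning user.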

I have a suspicion about a particular application, and intend to have it shut down at some point and then monitor the sar -v output for several samples to determine whether I see the same pattern as per the output above.

EDIT: I think I found the cause, but won’t know until I can get approval to make the change and reboot the host. I had a feeling it might be something to do with Nagios and some scripts that check various things. I still believe this to be the case, as I have found an issue with nrpe itself on the host. I’ll get approval to make the changes and reboot, then post back the outcome.

More Traxxas Rustler VXL video clips

I have another few video clips to share from the other day. Unfortunately I couldn’t post them sooner, as I didn’t have the internet bandwidth available to upload them, so it had to wait until I got home.

Hope people enjoy them; I know I certainly enjoy using the car. Although I managed to break a drive shaft in the last few days, so the Rustler VXL is out of action. Going to order a set of steel drive shafts, as the standard plastic ones just don’t handle the power produced by the 11.1V 3S LiPo.