Following up on my previous post here, I’d like to share some conclusions and how they came about.
A scheduled reboot was performed, and the nagios agent and scripts were corrected. However, they were not the cause, which I had more or less established before the reboot by stopping them for about two cycles of sar reports (20 mins).
After the reboot the problem, of course, happened again. The sar -v output showed file-sz increasing over time, somewhat like three steps forward and two steps back. The net result is a slow increase, although, as seen in the previous post, around 500 handles are consumed at a time.
I suspected something with Oracle, so I pointed out some processes that seemed to be respawning every 7 minutes, but these turned out to be database scripts accessing the database. However, one of the DBAs investigated further and concluded that the emagent process seemed to be acting a bit odd. On further investigation, stopping the agent and then checking the output of cat /proc/sys/fs/file-nr showed the allocated handle count (the value sar reports as file-sz) drop by over 10000. So it looks like we found the issue, and of course they had to escalate it to Oracle support. At least we now have a way to bring the value down: restart emagent (which is Oracle Grid Control related, I am told).
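For anyone wanting to repeat the before-and-after check around a service restart, here is a minimal Python sketch that reads /proc/sys/fs/file-nr. The names FileNr, parse_file_nr and read_file_nr are my own for illustration; the only thing taken from the system is the three-field layout of the file (allocated handles, unused handles, system-wide maximum).

```python
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # file handles currently allocated (sar -v reports this as file-sz)
    unused: int     # allocated but unused handles (usually 0 on 2.6 and later kernels)
    maximum: int    # system-wide limit, the same value as fs.file-max


def parse_file_nr(text: str) -> FileNr:
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {text!r}")
    return FileNr(*(int(f) for f in fields))


def read_file_nr(path: str = "/proc/sys/fs/file-nr") -> FileNr:
    """Read and parse the live counters (Linux only)."""
    with open(path) as fh:
        return parse_file_nr(fh.read())
```

Calling read_file_nr() once before stopping the suspect process and once after, then subtracting the two allocated values, gives the same kind of drop we saw with emagent.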
To aid others who may be seeing something similar: I’d recommend making sure that the system-wide file-max value (fs.file-max, also visible as /proc/sys/fs/file-max) is greater than the limits.conf hard limit for the user you suspect might be causing the problem. On our system, file-max was set to 65536, but so was the oracle user’s hard limit for nofile. So when the oracle user used up all the handles, the whole system had issues as well, such as being unable to ssh in. So first increase the system-wide setting, which can be done without a reboot (for example with sysctl -w fs.file-max=<value>). Then, for any users you suspect might be causing the issue, set limits via the limits.conf file and monitor. If a process running as one of those users consumes all of its file handles, only the software run by that user will suffer, and the larger system-wide value will still let you ssh in and do various other things.
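The comparison described above can be sketched roughly in Python. This is only an illustration under a few assumptions: the function names hard_nofile_limits and users_at_risk are hypothetical, the input is the text of a limits.conf-style file in the usual "domain type item value" layout, and the file-max value is passed in by the caller. Also bear in mind that nofile is applied per process, so a user running many processes can still add up to more than this one number.

```python
from typing import Dict, List


def hard_nofile_limits(limits_text: str) -> Dict[str, int]:
    """Return {domain: limit} for every numeric hard (or '-') nofile entry."""
    limits: Dict[str, int] = {}
    for line in limits_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        parts = line.split()
        if len(parts) != 4:
            continue  # not a "domain type item value" line
        domain, kind, item, value = parts
        # "-" in the type column sets both the soft and the hard limit
        if item == "nofile" and kind in ("hard", "-") and value.isdigit():
            limits[domain] = int(value)
    return limits


def users_at_risk(limits_text: str, file_max: int) -> List[str]:
    """Domains whose hard nofile limit is not below the system-wide file-max."""
    found = hard_nofile_limits(limits_text)
    return sorted(d for d, n in found.items() if n >= file_max)
```

With our original settings (oracle hard nofile 65536 and file-max 65536), users_at_risk would flag oracle; after raising file-max it would not.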
Hope this helps anyone else out on the world wide web, as it was certainly a good problem to investigate.