I’ve been meaning to write this one up for ages, and having just seen Dave Burnham’s interesting presentation on this, I’ve finally gotten round to it.
I’ve upgraded a few RAC clusters to 10.2.0.4 now. These are clusters that have been running with high levels of stability for a couple of years at 10gR2. I had heard that the Oracle Process Clusterware Daemon, OPROCD, which had been in previous 10gR2 releases on UNIX platforms without vendor clusterware, was being included for the first time on Linux with the 10.2.0.4 patchset.
[jason@linuxrac1 ~]$ ps -efl |grep oprocd|grep -v grep
4 S root 21041 20244 0 76 0 - 1644 wait Jun25 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 21357 21041 0 -40 - - 2114 - Jun25 ? 00:00:00 /var/opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f
These options to oprocd are saying -t 1000: wake up every 1000ms and -m 500 allow upto 500ms margin of error on the time that oprocd wakes up before rebooting, i.e. if oprocd wakes up after > 1.5 secs it’s going to force a reboot.
When you hear about changes to Oracle Clusterware and that the patchset you are about to apply has the “first” version of some piece of the software you do start to have doubts/worries. Don’t get me started about patchsets being just “bug fixes”, I think both 10.2.0.4 and 184.108.40.206 seem to have functionality changes.
The idea of OPROCD is quite straightforward it’s goal is to provide I/O fencing, basically oprocd works by setting a timer, then sleeping. If when it wakes up again and gets scheduled onto cpu it sees that a longer time has passed than the acceptable margin, oprocd will decide to reboot the node.
This is also what the hangcheck-timer kernel module was responsible for doing pre-10.2.0.4 Linux.
I was quite surprised when one of the nodes in a 10.2.0.4 RAC cluster did reboot, as I said these were really very stable clusters. When I investigated, there really was not very much to go on, the following was in /var/log/messages:
Aug 20 18:28:34 node2 kernel: warning: many lost ticks.
Aug 20 18:28:38 node2 kernel: Your time source seems to be instable or some driver is hogging interupts
Aug 20 18:28:38 node2 kernel: rip __do_softirq+0x4d/0xd0
OPROCD itself maintains a log file, and in 10.2.0.4 this was under /etc/oracle/oprocd/, all it said was the following:
Jul 20 09:28:34.299 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
Jul 20 09:28:34.300 | INF | fatal mode startup, setting process to fatal mode
Aug 20 17:51:38.133 | INF | TrackHistoricalTrends: added first sample 1460000000 in 90 to 100 percentile
The first two lines on Jul 20 are actually when we installed the 10.2.0.4 patchset. The final line is symptomatic of an OPROCD reboot.
I kicked this round Oracle support for a couple weeks and their advice was to set the CSS diagwait parameter to 13:
crsctl set css diagwait 13 -force
This actually changes what parameters oprocd runs with:
[jason@linuxrac1 ~]$ ps -efl |grep oprocd |grep -v grep
4 S root 28906 27651 0 78 0 - 1644 wait 16:36 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 29159 28906 0 -40 - - 2114 - 16:36 ? 00:00:00 /var/opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
You can see that the margin has changed to 10000ms, that is 10 seconds in place of the 0.5 seconds that the default is, a 20-fold increase and this basically makes oprocd less trigger happy in rebooting the node. You can also see that a new parameter has been introduced the -hsi one. I think these are the percentiles (of the full margin) at which logging information will be generated at, i.e. 5%, 10% etc.
Unfortunately this requires a full shutdown of Oracle Clusterware on ALL nodes. For us this has been one off, and I was not prepared to immediately have downtime to instigate this fix, the above output was therefore on one of my test clusters.
I think it is now RAC PACK "best practice" to actually set the diagwait to this 13 value and I'd certainly do this on all new installs/upgrades to either 10.2.0.4 or 220.127.116.11. Next time I have a maintenance window this value will get changed on the the cluster I saw the reboot on, no doubt that downtime will be for the next "rolling" Critical Patch Update 😦