OPROCD Rebooting a 10.2.0.4 Linux RAC Cluster

I’ve been meaning to write this one up for ages, and having just seen Dave Burnham’s interesting presentation on this, I’ve finally gotten round to it.

I’ve upgraded a few RAC clusters to 10.2.0.4 now. These are clusters that have been running with high levels of stability for a couple of years at 10gR2. I had heard that the Oracle Process Clusterware Daemon, OPROCD, which had been in previous 10gR2 releases on UNIX platforms without vendor clusterware, was being included for the first time on Linux with the 10.2.0.4 patchset.


[jason@linuxrac1 ~]$ ps -efl |grep oprocd|grep -v grep
4 S root 21041 20244 0 76 0 - 1644 wait Jun25 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 21357 21041 0 -40 - - 2114 - Jun25 ? 00:00:00 /var/opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f

These options to oprocd are saying -t 1000: wake up every 1000ms and -m 500 allow upto 500ms margin of error on the time that oprocd wakes up before rebooting, i.e. if oprocd wakes up after > 1.5 secs it’s going to force a reboot.

When you hear about changes to Oracle Clusterware and that the patchset you are about to apply has the “first” version of some piece of the software you do start to have doubts/worries. Don’t get me started about patchsets being just “bug fixes”, I think both 10.2.0.4 and 11.1.0.7 seem to have functionality changes.

The idea of OPROCD is quite straightforward it’s goal is to provide I/O fencing, basically oprocd works by setting a timer, then sleeping. If when it wakes up again and gets scheduled onto cpu it sees that a longer time has passed than the acceptable margin, oprocd will decide to reboot the node.

This is also what the hangcheck-timer kernel module was responsible for doing pre-10.2.0.4 Linux.

I was quite surprised when one of the nodes in a 10.2.0.4 RAC cluster did reboot, as I said these were really very stable clusters. When I investigated, there really was not very much to go on, the following was in /var/log/messages:

Aug 20 18:28:34 node2 kernel: warning: many lost ticks.
Aug 20 18:28:38 node2 kernel: Your time source seems to be instable or some driver is hogging interupts
Aug 20 18:28:38 node2 kernel: rip __do_softirq+0x4d/0xd0

OPROCD itself maintains a log file, and in 10.2.0.4 this was under /etc/oracle/oprocd/, all it said was the following:

Jul 20 09:28:34.299 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
Jul 20 09:28:34.300 | INF | fatal mode startup, setting process to fatal mode
Aug 20 17:51:38.133 | INF | TrackHistoricalTrends: added first sample 1460000000 in 90 to 100 percentile

The first two lines on Jul 20 are actually when we installed the 10.2.0.4 patchset. The final line is symptomatic of an OPROCD reboot.

I kicked this round Oracle support for a couple weeks and their advice was to set the CSS diagwait parameter to 13:


crsctl set css diagwait 13 -force

This actually changes what parameters oprocd runs with:


[jason@linuxrac1 ~]$ ps -efl |grep oprocd |grep -v grep
4 S root 28906 27651 0 78 0 - 1644 wait 16:36 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 29159 28906 0 -40 - - 2114 - 16:36 ? 00:00:00 /var/opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f

You can see that the margin has changed to 10000ms, that is 10 seconds in place of the 0.5 seconds that the default is, a 20-fold increase and this basically makes oprocd less trigger happy in rebooting the node. You can also see that a new parameter has been introduced the -hsi one. I think these are the percentiles (of the full margin) at which logging information will be generated at, i.e. 5%, 10% etc.

Unfortunately this requires a full shutdown of Oracle Clusterware on ALL nodes. For us this has been one off, and I was not prepared to immediately have downtime to instigate this fix, the above output was therefore on one of my test clusters.

I think it is now RAC PACK "best practice" to actually set the diagwait to this 13 value and I'd certainly do this on all new installs/upgrades to either 10.2.0.4 or 11.1.0.7. Next time I have a maintenance window this value will get changed on the the cluster I saw the reboot on, no doubt that downtime will be for the next "rolling" Critical Patch Update 😦

Advertisements

14 thoughts on “OPROCD Rebooting a 10.2.0.4 Linux RAC Cluster

  1. Check /etc/init.d/init.cssd and, if I recall correctly, you should see that diagwait changes quite a bit there and introduces another delay in node eviction so that you have higher chances that logs are actually written to disk. That’s indeed useful for troubleshooting often sporadic node reboots. Btw, you normally don’t see any log entries when OPROC resets the node because it does reboot instantly and attempt to log something and flush buffers to disk would compromise quick eviction.

  2. Hi Alex,

    Thamks for reading!

    I’ve just compared /etc/init.d/init.cssd on a system with diagwait set to 13 and one that does not:

    [jason@linuxrac1sb init.d]$ ps -ef|grep oprocd.bin|grep -v grep
    root 29159 28906 0 Oct05 ? 00:00:00 /var/opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
    [jason@linuxrac1sb init.d]$ md5sum /etc/init.d/init.cssd
    e8444f95b5913a0fdbe6416c22050de9 /etc/init.d/init.cssd

    [jason@linuxrac1 init.d]$ ps -ef|grep oprocd.bin|grep -v grep
    root 21357 21041 0 Jun25 ? 00:00:00 /var/opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f
    [jason@linuxrac1 init.d]$ md5sum /etc/init.d/init.cssd
    e8444f95b5913a0fdbe6416c22050de9 /etc/init.d/init.cssd

    As far as I can see, the md5’s are the same value, so the /etc/init.d/init.cssd has not changed by changing the diagwait. This is linux x86-64.

    I’m sure it is a trade off between quickly rebooting the node and thereby possibly preventing i/O that could cause corruption and actually gathering any useful debugging information.

    jason.

  3. Hi Bradd,

    Thanks!

    All I can say is that the change to the diagwait parameter to 13 is now considered best practice. I certainly will be changing all my RAC clusters.

    By the way, as you probably know, oprocd is far more hasty at rebooting than the hangcheck-timer, I think the setting on that was/is as much as 30 seconds.

    jason.

  4. Thanks for the information, Jason. We saw a similar situation in our 10.2.0.4 RAC cluster on HPUX. One of the nodes was being shutdown seconds after being coming up. Here are the events the nightmare:

    (1) According to HP Backline Support, the oracle user was sending a TOC message to shut itself down. Oracle support should be able to tell why.
    (2) Oracle Support said the reason for this was a bad hardware and the exact cause would be found in the system crash dump, which would be analyzed by HP support.
    (3) HP support said the crash dump merely said OPROCD sent a TOC; nothing else. “What bad hardware?”, they ask.
    (4) We forced HP support to do a full system scan; they found no bad hardware!
    (5) Oracle support: well, it seems oprocd is getting false negatives. So, reduce the priority at the OS level.

    And that’s all we did – reduce the priority; and oprocd didn’t send the TOC!

    What you found directly confirms what we observed. What’s really funny is, why would oprocd run with such high priority by default.

  5. Hi Arup,

    Thanks for dropping by!

    Just for absolute clarity though, the process priority for oprocd does NOT change if you modify the diagwait setting: it still runs with priority -40, which I believe to be Real Time priority class.

    Obviously it wants to do this so it gets scheduled on to cpu when it needs to. Changing diagwait just makes oprocd have a wider acceptable margin for being delayed getting onto cpu before it decides to reboot the node.

    jason.

  6. Jason – agreed. The diagwait parameter value change is the real fix; not the priority.

    Actually I was just agreeing with you how sloppily things have been put together in RAC by ignoring these minute but impacting details. If the install guide mentioned the value to be 13, all these headache would have been non-events.

    I had a similar post on my blog about the un-diagnosibility of RAC. Aug 4: Issues in RAC with Wrong Listener Configuration. Why can’t RAC developers be more cognizant of the needs of the real world where an error code “2” means nothing. It’s not an Oracle error; nor it’s an HPUX error. It’s just some error number the developer probably concoted on the fly with no regard to the context.

    Oh, well!

  7. Jason,

    very good blog post indeed. There is a presentation of Julian Dyke (http://www.juliandyke.com/Presentations/RACTroubleshooting.ppt) where he gives a formula on how the oprocd values change when diagwait is configured:

    If diagwait > reboottime then OPROCD_DEFAULT_MARGIN := (diagwait – reboottime) * 1000

    BTW, do you know whether diagwait also has an effect in 10.2.0.3 where no oprocd yet exists? MetaLink Note 559365.1 is mentioning 10.2.0.3 upwards but then again, it states the shutdown of oprocd, which isn´t yet available.

    Martin
    ora-solutions.net

  8. Hi Martin,

    Thanks for the kind words!

    Also thank you for posting the link to Julian’s work. Strange metalink doc as you say, states 10.2.0.3 but no oprocd in 10.2.0.3.

    I’m afraid I don’t know if changing diagwait in 10.2.0.3 will help you or not! I have no 10.2.0.3 RAC cluster to test on.

    jason.

  9. Jason, thanks for this, you just saved me some time waiting for Oracle Support (still I have an SR opened). We are on HP Integrity RAC 11.1.0.7.

    We were increasing the mount point where resides Oracle’s binaries and the node suddenly rebooted with the same dump stating: User requested reset of the system.Oracle CRS (oprocd) TOC for clusterware integrity …

    We will look into changing the diagwait parm to 13.

    Thanks.

  10. Hi Gilles,

    Until Oracle RAC 10.2.0.3 oprocd is not used by Linux as it is using the Hang-Check timer module. In all other operating system, Oracle Clusterware used oprocd.

    Starting 10.2.0.4, Oracle indicated oprocd will be used on all operating system.

    Please note that setting diagwait to 13 is not a best practice at this time as in fact it is a workaround they provided for us when I raised the scheduling issue of oprocd on HP-UX 11.31 Itanium that caused the reboots. It took them a while to release Metalink Note:559365.1(14-Apr-2008) to assure us that setting it to 13 will not cause data corruption as the main purpose of oprocd is I/O fencing.

    This eventuate after we had a conference call with Oracle Development and HP Mission Critical Support.

    Regards,
    Art (http://artthedba.blogspot.com)

  11. Hi Art,

    Thanks for reading and commenting.

    I would point out that I have heard it on good authority that the Oracle RAC pack are by default setting the diagwait to 13.

    cheers,

    jason.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s