A Busy Weekend Patching

July 21, 2008

I’ve just completed a busy weekend of patching, upgrading a 10.2.0.3 RAC cluster to 10.2.0.4. Because the downtime was booked anyway, I decided on a quick test cycle of the July CPU and rolled that into the upgrade window.

So far, 10.2.0.4 seems good, with thankfully none of the fun and games that Jeff Hunter seems to be experiencing. The actual upgrade itself was exceptionally smooth and without incident.

Note that with 10.2.0.4 there is a new clusterware process, oprocd. This used to be UNIX platform specific but has now made it across to the Linux version.

ps -efl|grep oprocd
4 S root 29978 29208 0 76 0 - 1643 wait Jul20 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 30326 29978 0 -40 - - 2112 - Jul20 ? 00:00:00 /opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f

The oprocd daemon is there to provide fencing. It seems to set a timer, and if oprocd fails to wake up within a certain margin of that timer, it will reboot the node.
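To make the timer-and-margin idea concrete, here is a minimal Python sketch of that kind of watchdog loop. It is not how oprocd is implemented. I am assuming the -t 1000 and -m 500 arguments in the ps output above are the sleep interval and the allowed margin in milliseconds, and the function names and the on_fence callback are made up for illustration. The real daemon reboots the node, whereas this sketch only calls a callback.

```python
import time


def missed_deadline(expected_wake, actual_wake, margin_ms):
    """Return True if the wake-up overshot the expected time by more than the margin.

    expected_wake and actual_wake are timestamps in seconds.
    """
    overshoot_ms = (actual_wake - expected_wake) * 1000.0
    return overshoot_ms > margin_ms


def watchdog_loop(interval_ms=1000, margin_ms=500, cycles=3, on_fence=None):
    """Sleep for interval_ms repeatedly; call on_fence if a wake-up is too late.

    interval_ms and margin_ms mirror the assumed meaning of -t and -m.
    Returns True if every wake-up was within the margin, False otherwise.
    """
    for _ in range(cycles):
        expected = time.monotonic() + interval_ms / 1000.0
        time.sleep(interval_ms / 1000.0)
        if missed_deadline(expected, time.monotonic(), margin_ms):
            # A stalled host would land here; the real daemon reboots the node.
            if on_fence is not None:
                on_fence()
            return False
    return True
```

With the default values, a wake-up that arrives more than 500 ms after the 1000 ms timer should have fired would trigger on_fence.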

During the whole upgrade process I had a physical standby open for read-only queries. I admit this completely disregards Metalink Note:278641.1 on how to apply a patchset with a physical standby in place, but on this occasion I think Metalink is just plain wrong. Having your standby blindly performing managed recovery while you are upgrading the primary is not a good solution, particularly when you have forked over all those licensing pounds for a high availability system!

Speaking of high availability, is it any wonder that so few people are applying the CPUs when everything running out of the Oracle Home being patched has to be shut down? Sure, with RAC you may have a shot at performing a rolling upgrade (assuming you’ve done the view recompilation). However, there are many, many non-RAC systems out there that don’t really want the downtime involved with these security patches.

I’m sure the days of hot-patching must be just around the corner, 11gR2 anyone?


UKOUG RAC & HA Meeting

February 5, 2008

Today I attended the UKOUG RAC & HA SIG and here are the verbatim notes that I took at the event. I probably should not complain about the journey, but an hour on a packed train for 38 miles seems “extravagant”. Still, it probably took one of the speakers longer to arrive from Geneva!

Introduction

Next RAC SIGs: Thursday 15th May, London, and Thursday 2nd October, Heritage Motor Centre.

Call for papers for the UKOUG conference opens on March 17th. This seems astonishingly early!

A really packed agenda today.

Survey

8 on 9.2
2 on 10.1
majority on 10.2
none on 11 - yet

few Itanium
sprinkling of Solaris SPARC
probably majority on Linux 64 bit
another sprinkling of Windows

Vast majority with 2 nodes, a handful with more than this, including 8 nodes from CERN and someone with 10.

Vast majority using SAN for storage, a few on NAS.

A lot using ASM, with a handful on OCFS, a sprinkling on Veritas, 1 with PolyServe.

A lot of people with a physical standby, a few with a logical standby, no one using auto failover, a handful with stretched clusters.

Hardly anyone using Standard Edition.

Phil Davies - Support update

Whispers of the first patchset for 11g - probably not for ages though. 10.2.0.4 surely coming soon. Interesting problem with an ASM hang and a controlfile enqueue problem. This is on 10.2.0.3, fixed in 10.2.0.4.

On the January 2008 CPU, one audience member claimed Oracle support stated to them that the CPU was rolling upgradeable. My support analyst definitely stated it was not, an interesting contradiction. Nominet got a mention, as I have a documentation bug out for the CPU.

Dave Burnham - Highly available Oracle Databases

High level overview of building highly available databases. Downtime = Time to notice an issue + Time to resolve the problem.
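As a rough worked example of that formula, the Python sketch below adds the two components and turns the result into an availability percentage. The 10 and 50 minute figures and the 30 day period are invented for illustration; they are not numbers from the talk.

```python
def downtime_minutes(time_to_notice, time_to_resolve):
    """Downtime = time to notice an issue + time to resolve the problem."""
    return time_to_notice + time_to_resolve


def availability_percent(total_downtime_minutes, period_minutes):
    """Share of the period the service was up, as a percentage."""
    return 100.0 * (1.0 - total_downtime_minutes / period_minutes)


# Hypothetical incident: 10 minutes to notice, 50 minutes to fix.
outage = downtime_minutes(10, 50)
# One 30 day month is 30 * 24 * 60 = 43200 minutes.
monthly = availability_percent(outage, 30 * 24 * 60)
print(f"{outage} minutes down, {monthly:.2f}% available")
```

The point of splitting the sum is that faster detection cuts downtime just as much as a faster fix does.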

Complexity kills availability. I certainly agree with this: Keep It Simple Stupid really is the way to go - the fewer moving parts, the less that can go wrong. Concept of an availability benchmark system, which is a single server Oracle database server - does your infrastructure improve on this config? That is, the high availability solution is the modern commodity system, which can have many hot swappable and redundant parts - not fancy clustering solutions.

However, several things are not protected by the single server solution: host failure, site failure, and of course the number one cause of reduced availability, human error.

One alternative to running RAC is to use a single instance database with a clustering solution from Veritas (like VCS), or Sun, or any of the other hardware vendors. Basically on failure, the clustering solution will restart Oracle on a different node. No expensive RAC license, and it’s fairly well understood technology.

Dave has lots of experience of stretched RAC clusters but states they are quite complex, and that Dataguard is far simpler, though perhaps he was still preferring stretched RAC for HA.

Miguel Anjo - Multiple RAC clusters

Running around 20 RAC clusters, 2-8 nodes.

Oracle Home is same everywhere, they deploy the ORACLE_HOME as an image.

3 stage environment
development: 8/5
integration: 8/5
Production 24/7

Custom-built GUI - browser based, to allow developers to see what is happening to their sessions, including SQL, DML & DDL, and the ability to kill their session.

They have a 2 node clustered server for monitoring (runs single instance Oracle). They have auditing turned on and generate weekly/monthly reports. Custom written monitoring, based on Python, Bash, XML.

1 RAC cluster per physics experiment.

They use a wiki for a logbook, database procedures.

Martin Bach - Lessons Learned from Migrating 10.2.0.2 to 10.2.0.3

This talk was based on using Standard Edition. They not only upgraded release but also migrated hardware. Old hardware: single core CPU with 3GB memory, run queue sometimes exceeding 12. New hardware: 2 x dual core Opteron and an upgraded SAN.

They have NO RAC test environment - scary stuff! Oh, and they have no device naming persistence - no ASMLIB or udev. They encountered some wacky bugs with SUSE and OEM. dbms_scheduler was failing to schedule jobs to run on time, with jobs running 5-45 minutes late.

ASM 11g Experience in Extended Cluster - Bernhard de Cock Buning

Seems to be running RDBMS at 9i with Clusterware and ASM instance at 10.2.0.3, considering upgrading Clusterware and ASM to 11g. RDBMS moving to 10. They can’t use ASM_PREFERRED_READ_FAILURE_GROUPS as the RDBMS was not 11. ASM SYSASM user - separate user to own ASM home, not required in 11gR1 but is required in 11gR2. Audience member stated they saw a 2x increase in rebalance performance in 11g compared to 10g. Possibility to perform rolling ASM upgrade with 11g.

Simulating one site failure, 10g continued uninterrupted but 11g generated an ORA-600[kfdOffline01]. Seems like ASM rebooted on the surviving site. They used Swingbench for load testing and had node crashes a couple of times, but once they were using HugePages they had NO node crashes. It’s an interesting idea to run 11g ASM with a 10g database instance.

Split Mirror Backups with RAC & ASM - Howard Jones

General consensus is that it’s costly - requiring high end storage and complex. Using Symantec SMB integrating with Netbackup.

Using Dataguard for hardware migration - Miguel Anjo

CERN is using Oracle Streams to send LHC data around the world. They use RMAN duplicate target database for standby to create the standby. They switch over to the standby and upgrade this, only using the (now old) primary should they encounter a failure.

I don’t get it really, perhaps it was still too close to lunch for me to understand fully: why don’t they upgrade the primary, saving the failover, and use a Dataguard standby for the protection it offers should something go wrong? The CERN mechanism still encounters downtime. It seems like they do some of the upgrade before the failover to reduce the outage, but for, say, a 10.2.0.2 to 10.2.0.3 upgrade you can install in a new ORACLE_HOME anyway, and you still need the outage for the catupgrd script. If you are out there, CERN guys, what am I missing?

Logical Standby in the real time world - Graham Cameron

Old system: single instance running on a Serviceguard cluster. Queries were hurting performance, so they chose physical and logical standbys.

Small DB, only 22GB, with 2GB of logs per day, running Oracle 9.2.0.8. They are running the physical and logical standbys on the same server, and creating the standby in 9.2.0.8 required the database to be quiesced. They still had major issues with their logical standby and found it failing on many occasions. Interestingly, they are using Oracle Streams far more successfully on a different project.

Still, a cracking day and thoroughly enjoyable.


Upgrading to Oracle 11g Clusterware

January 31, 2008

I have just done a couple of 10g to 11g Oracle Clusterware upgrades on a pair of 2 node RAC clusters. These are now happily running 11g Clusterware with 10g ASM and database instances.

First off, I have found the documentation a little bit on the sparse side in terms of how to actually do a clusterware upgrade. It took a little while for me to realise that it is very possible to perform a rolling upgrade when upgrading your clusterware. Now, I knew this was possible when patching from 10.2.0.X to 10.2.0.Y, but it took a little longer for me to understand that this can be done when going up to 11g.

The best place in the online documentation for information about this is Appendix B of the Oracle Clusterware Installation Guide. Another useful thing to look at is Metalink note 338706.1, which tells you about the prerequisites you need to fulfill before you can upgrade your clusterware to 11g. Of course it is only with hindsight that I have seen the information there in the Clusterware Installation Guide. Here is what I did to upgrade. You are far better off, unlike myself, running the preupdate.sh script as recommended - but hey, this is what testing is all about ;-)

From the unzipped clusterware directory run the cluster verification utility to check your system is ready to upgrade:

runcluvfy.sh stage -pre crsinst -n node1,node2 -verbose

Make sure you upgrade any RPMs that need changing.

Bring down the database and ASM instances on the first node you want to upgrade and then stop crs:

/opt/oracle/product/crs/bin/crsctl stop crs

If you run the preupdate.sh script that is in the clusterware/upgrade directory you don’t need to stop crs yourself, or indeed perform the next step of changing the permissions of the crs directory, as it’s taken care of for you.

The permissions on my crs directory were incorrect and the directory was owned by root. I changed them with:

chown -R oracle:oinstall crs/

Run the installer and it will detect your CRS_HOME and offer to upgrade it. Make sure that on the Specify Hardware Cluster Installation Mode screen you select just the node you want to upgrade, assuming you are doing it rolling:

[Screenshot: clusterware install]

Once the upgrade has done its thing you are prompted to run the rootupgrade script:

[root@linuxrac2 install]# ./rootupgrade
Checking to see if Oracle CRS stack is already up…


copying ONS config file to 11.1 CRS home
/bin/cp: `/opt/oracle/product/crs/opmn/conf/ons.config’ and `/opt/oracle/product/crs/opmn/conf/ons.config’ are the same file
/opt/oracle/product/crs/opmn/conf/ons.config was copied successfully to /opt/oracle/product/crs/opmn/conf/ons.config
WARNING: directory ‘/opt/oracle/product’ is not owned by root
Oracle Cluster Registry configuration upgraded successfully
Adding daemons to inittab

Attempting to start CRS stack
The CRS stack will be started shortly
Oracle CRS stack has failed to start. Check the file /var/adm/messages or the crsd, cssdd, and evmd logs in
/opt/oracle/product/crs/log/linuxrac2 directory for more details

You don’t need to worry when it says the CRS stack has failed to start, because after a few moments CRS is running happily! Your database and/or ASM instance will now be automatically restarted as well.
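If you are scripting the upgrade, one way to cope with this is to poll for a short while instead of trusting the first failure message. The Python sketch below is a generic poll-with-timeout helper. wait_until and its parameters are my own names, and the check callable is left to you (for example a wrapper around crsctl check crs, which I have not included here).

```python
import time


def wait_until(check, timeout_s=300.0, interval_s=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s seconds have passed.

    check is any zero-argument callable returning a bool; it is always
    called at least once. sleep and clock are injectable so the helper
    can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The default of five minutes with a ten second interval is an arbitrary choice for the sketch, not a value taken from the Oracle documentation.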

It is also worth pointing out that the active CRS version only becomes the 11.1.0.6.0 version after all nodes are upgraded:

[root@linuxrac2 crsd]# crsctl query crs softwareversion
CRS software version on node linuxrac2 is 11.1.0.6.0
[root@linuxrac2 crsd]# crsctl query crs activeversion
CRS active version on the cluster is 10.2.0.3.0
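If you want to check this from a script, a small parser over those two lines is enough. The Python sketch below assumes the output format shown above (a line ending in "is" followed by a dotted version number). parse_version and rolling_upgrade_complete are hypothetical helper names, not part of any Oracle tooling.

```python
import re

# Matches a trailing "is 11.1.0.6.0" style version at the end of a line.
VERSION_RE = re.compile(r"\bis\s+(\d+(?:\.\d+)+)\s*$")


def parse_version(line):
    """Extract the dotted version at the end of a crsctl query line as a tuple of ints."""
    match = VERSION_RE.search(line.strip())
    if match is None:
        raise ValueError(f"no version found in: {line!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def rolling_upgrade_complete(software_line, active_line):
    """True once the cluster-wide active version has caught up with this node's software."""
    return parse_version(active_line) >= parse_version(software_line)


software = "CRS software version on node linuxrac2 is 11.1.0.6.0"
active = "CRS active version on the cluster is 10.2.0.3.0"
print(rolling_upgrade_complete(software, active))
```

Comparing tuples of integers rather than strings avoids ordering mistakes such as "10.2" sorting after "9.2".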

You now basically proceed to perform the same on the other nodes in your cluster, and there you have it, a rolling clusterware upgrade from 10g to 11g. I was actually well impressed with how smooth and painless the upgrade was and there really were no brown trouser moments.

It remains to be seen how stable the new 11g clusterware is but I’m sure it’s just a coincidence that about 12 hours after the upgrade one of the nodes on one of the clusters had a kernel panic and froze!


Is the January Oracle CPU rolling upgradeable?

January 22, 2008

I’ve been running RAC for many years, but I have been consistently frustrated at having to lose availability to perform patches/upgrades. Obviously, I have heard a lot of buzz about rolling upgrades, but so far, in the versions I have worked on up to and including 10.2.0.3, I have never seen an upgrade I was going to apply that could be performed on my RAC cluster without incurring downtime. Perhaps “rolling upgradeable” is more marketing smoke and mirrors than something actually based on reality?

The January quarterly Critical Patch Update has recently come out, and at first I thought it was going to be possible to apply this as a rolling upgrade and not incur any downtime. You will probably be surprised to find a DBA owning up to applying a CPU, but if it was possible to apply this CPU without downtime then I thought I should give it a shot. So reading the README for the CPUJAN2008 patch, I page down to the section regarding patch installation instructions for a RAC environment, and it clearly states that you can patch 1 node at a time. Great, is this a rolling upgradeable CPU?

I then look at the post-installation instructions, which state the following:

Select one node to execute the post installation steps. Follow the same set of instructions as mentioned in the Section 3.3.3, “Post Installation Instructions for a Non-RAC Environment”.

Users can continue to access the database during the post-installation steps.

So I go to the post-install instructions for the Non-RAC Environment, and sure, I can run catcpu.sql online. Looking at this script, Oracle have obviously thought about availability in a RAC environment:


.
Rem Check open UPGRADE status; set session attributes
Rem Following call to check_server_instance is commented as it checks if
Rem database
Rem is started up in UPGRADE mode else quits.
Rem CPU patches needs to be RAC compliant means database should not be started
Rem in UPGRADE mode.
-- EXECUTE dbms_registry.check_server_instance;
.

But there is something new with this CPU: you must run a SQL script called view_recompile_jan2008cpu.sql, and here is the problem. You have to run this script after the database has been started using startup upgrade. This is bad news for system availability, as only users with the sysdba privilege can access the system. Oracle are aware of this as a problem, as they have the following note:

Depending on these considerations and your downtime schedule, you can choose to schedule the recompilation of views independent of the rest of the CPU installation. If you do this, your system will continue to work; however, the CPU installation will not be complete until the view recompilation is completed.
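Putting the README steps side by side makes the availability problem easier to see. The Python sketch below is just my reading of the sequence described in this post, with made-up names (Step, cpu_plan, is_fully_rolling). It models which steps can be done with the cluster up and does not run anything against a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    needs_full_outage: bool


def cpu_plan(nodes, recompile_views=True):
    """Build the ordered list of steps for applying the CPU to a RAC cluster."""
    # Binary patching can go one node at a time while the others stay up.
    steps = [Step(f"patch Oracle Home on {node} while other nodes stay up", False)
             for node in nodes]
    # Post-installation: catcpu.sql is run from one node with users connected.
    steps.append(Step("run catcpu.sql from one node", False))
    if recompile_views:
        # The view recompilation needs startup upgrade, so only sysdba can connect.
        steps.append(Step("startup upgrade, then run view_recompile_jan2008cpu.sql", True))
    return steps


def is_fully_rolling(steps):
    """True only if no step requires taking the whole database away from users."""
    return not any(step.needs_full_outage for step in steps)


plan = cpu_plan(["node1", "node2"])
print(is_fully_rolling(plan))
```

Passing recompile_views=False corresponds to deferring the recompilation as the note allows, which keeps the system up but leaves the CPU installation incomplete.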

There are two thoughts for me here. First, why does the documentation state that users can access the system during the post-installation phase, when to fully patch the security holes you must have downtime on your RAC cluster? Second, it is no wonder that hardly anyone is applying these CPUs if they require system unavailability. If you have made a large investment in RAC to improve your availability, then having this compromised by a security patch is not a nice choice.

You should not have to make the choice between availability and security but it seems like for now, you cannot have both.


UKOUG RAC & HA Meeting

October 21, 2007

Well thankfully there was no falling off the stage when I gave my “Adventures in Dataguard” talk to the UKOUG RAC & HA Sig last week. Personally, I felt my talk went ok, with some positive feedback at the end and in the lunch break. Joel Goodman gave me some helpful comments regarding one or two points where I was slightly ambiguous in what I was saying, in particular with respect to the performance impact of the different protection modes on the primary databases - I must blog about protection modes soon. I did not feel I gave a great performance (I think presenting is always about giving a “performance”), but it was probably good enough. There were one or two other talks I found interesting, but for me it was not the best RAC & HA event I’ve been to, though it was the first RAC SIG to be branded as such.

I found a talk from British Airways interesting. It covered the mechanisms/procedures they have used in migrating some of their near 700 databases on to a stretched RAC cluster architecture. This is an enormous set of projects: it has taken them nearly two years just to have the architecture in place and they are still in the process of migrating databases. This was not a technical talk, more a look at this from the project management perspective. This was the first time I’d heard lean methodologies mentioned in a DBA context, though taking two years seems less than lean to me, but then at Nominet we do like to work fast.

Perhaps, in some ways the most disappointing talk was the “ASM 11g new features” by Joel Goodman. As I’ve said previously, Joel knows Oracle inside out and is an Oracle Master. The disappointment came perhaps from the lack of snazzy new features. I like ASM and it has proved extremely reliable for us (we use external redundancy, by the way). The obvious thing ASM lacks is that it is not a proper filesystem, and ASMCMD can only do a fraction of the things you would want to do in a filesystem (e.g. copying things to it with a simple cp command). This has not changed in 11g, even though it is the obvious next step for ASM (maybe 12?).

Of the other talks, Piet de Visser’s was very agreeable (apart from his dislike of ASM - in our 2 node RAC cluster shop, ASM simplifies things as opposed to Veritas, rather than complicates), but his point about simplicity is one I can completely agree with, in fact I did like his Albert Einstein quote: “as simple as it needs to be, but no simpler”. The other talk was pure marketing from the HP “adaptive enterprise”.