Today I attended the UKOUG RAC & HA SIG and here are the verbatim notes that I took at the event. I probably should not complain about the journey, but an hour on a packed train for 38 miles seems “extravagant”. Still, it probably took one of the speakers longer to arrive from Geneva!
Next RAC SIGs: Thursday 15th May, London, and Thursday 2nd October, Heritage Motor Centre.
Call for papers for the UKOUG conference opens on March 17th. This seems astonishingly early!
A really packed agenda today.
majority on 10.2
none on 11 – yet
sprinkling of Solaris SPARC
probably majority on Linux 64 bit
another sprinkling of Windows
Vast majority with 2 nodes, a handful with more than this, including 8 nodes from CERN and someone with 10.
Vast majority using SAN for storage, a few on NAS.
A lot using ASM, with a handful on OCFS, a sprinkling on Veritas and 1 with PolyServe.
A lot of people with a physical standby, a few with a logical standby, no one using auto failover, a handful with stretched clusters.
Hardly anyone using Standard Edition.
Phil Davies – Support update
Whispers of the first patchset for 11g, though probably not for ages. 10.2.0.4 is surely coming soon. Interesting problem with an ASM hang and a controlfile enqueue problem: this is on 10.2.0.3 and fixed in 10.2.0.4.
On the January 2008 CPU, one audience member claimed Oracle Support told them that the CPU was rolling upgradeable. My support analyst definitely stated it was not: an interesting contradiction. Nominet got a mention, as I have a documentation bug out for the CPU.
Dave Burnham – Highly available Oracle Databases
High level overview of building highly available databases. Downtime = Time to notice an issue + Time to resolve the problem.
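To put some numbers on that downtime formula, here is a minimal Python sketch. The incident figures are made up purely for illustration and the function names are my own, not anything from Dave's talk.

```python
def downtime_minutes(notice_minutes, resolve_minutes):
    """Downtime for one incident = time to notice + time to resolve."""
    return notice_minutes + resolve_minutes


def availability_percent(incidents, period_minutes):
    """Percentage of the period the service was up.

    incidents is a list of (minutes to notice, minutes to resolve) pairs.
    """
    total_down = sum(downtime_minutes(n, r) for n, r in incidents)
    return 100.0 * (period_minutes - total_down) / period_minutes


# Hypothetical incidents in a 30-day month: (minutes to notice, minutes to resolve)
incidents = [(10, 50), (5, 25)]
month_minutes = 30 * 24 * 60

print(round(availability_percent(incidents, month_minutes), 3))
```

The point of splitting the two terms is that a minute saved on noticing the problem counts exactly as much as a minute saved on fixing it, so monitoring is as much a part of availability as the recovery procedure.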
Complexity kills availability. I certainly agree with this: Keep it Simple Stupid really is the way to go, since the fewer moving parts there are, the less that can go wrong. Concept of an availability benchmark system, which is a single server Oracle database server: does your infrastructure improve on this config? That is, the high availability solution is the modern commodity system, which can have many hot-swappable and redundant parts, not fancy clustering solutions.
However, several things are not protected by the single server solution: host failure, site failure and, of course, the number one cause of reduced availability, which is human error.
One alternative to running RAC is to use a single instance database with a clustering solution from Veritas (like VCS), Sun, or any of the other hardware vendors. Basically, on failure, the clustering solution will restart Oracle on a different node. No expensive RAC license, and it’s fairly well understood technology.
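The restart-on-another-node idea can be sketched in a few lines of Python. This is only a toy model under my own assumptions (the Node class, the pick_node function and the node names are all hypothetical); it is not how VCS or any other vendor's product actually works.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def pick_node(nodes, active):
    """Return the node that should run the single database instance.

    Stay on the active node while it is healthy; otherwise move to the
    first other node that is healthy.
    """
    if active.healthy:
        return active
    for node in nodes:
        if node.healthy and node is not active:
            return node
    raise RuntimeError("no healthy node left to restart the instance on")


nodes = [Node("node1"), Node("node2")]
active = nodes[0]

nodes[0].healthy = False           # simulate a host failure on node1
active = pick_node(nodes, active)  # the instance would be restarted here
print(active.name)
```

Unlike RAC, only one node runs the instance at any time, so the outage is the time to detect the failure plus the time to restart and recover the database on the other node.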
Dave has lots of experience of stretched RAC clusters but states they are quite complex and that Data Guard is far simpler, though he was perhaps still preferring stretched RAC for HA.
Miguel Anjo – Multiple RAC clusters
Running around 20 RAC clusters, 2-8 nodes.
Oracle Home is the same everywhere; they deploy the ORACLE_HOME as an image.
3 stage environment
Custom built, browser based GUI to allow developers to see what is happening to their sessions, including SQL, DML & DDL, with the ability to kill their session.
They have a 2 node clustered server for monitoring (runs single instance Oracle). They have auditing turned on and generate weekly/monthly reports. Custom written monitoring, based on Python, bash and XML.
1 RAC cluster per physics experiment.
They use a wiki for a logbook and database procedures.
Martin Bach – Lessons Learned from Migrating 10.2.0.2 to 10.2.0.3
This talk was based on using Standard Edition. They not only upgraded the release but also migrated hardware: the old hardware had a single core CPU with 3GB memory, with the run queue sometimes exceeding 12; the new hardware has 2 x dual core Opteron and an upgraded SAN.
They have NO RAC test environment, scary stuff! Oh, and they have no device naming persistence: no ASMLIB or udev. They encountered some wacky bugs with SUSE and OEM, and dbms_scheduler failing to schedule jobs to run on time, running late by 5-45 minutes.
ASM 11g Experience in Extended Cluster – Bernhard de Cock Buning
Seems to be running the RDBMS at 9i with Clusterware and the ASM instance at 10.2.0.3, and considering upgrading Clusterware and ASM to 11g, with the RDBMS moving to 10. They can’t use ASM_PREFERRED_READ_FAILURE_GROUPS as the RDBMS is not 11. ASM SYSASM user: a separate user to own the ASM home, not required in 11gR1 but required in 11gR2. An audience member stated they saw a x2 increase in rebalance performance in 11g compared to 10g. Possibility of performing a rolling ASM upgrade with 11g.
Simulating one site failure, 10g continued uninterrupted but 11g generated an ORA-600[kfdOffline01]. It seems ASM rebooted on the surviving site. They used Swingbench for load testing and had node crashes a couple of times, but once they were using HugePages they had NO node crashes. It’s an interesting idea to run 11g ASM with a 10g database instance.
Split Mirror Backups with RAC & ASM – Howard Jones
General consensus is that it’s costly, requiring high end storage, and complex. Using Symantec SMB integrated with NetBackup.
Using Dataguard for hardware migration – Miguel Anjo
CERN uses Oracle Streams to send LHC data around the world. They use RMAN duplicate target database for standby to create the standby. They switch over to the standby and upgrade it, only using the (now old) primary should they encounter a failure.
I don’t really get it; perhaps it was still too close to lunch for me to understand fully. Why don’t they upgrade the primary, saving the failover, and use the Data Guard standby for the protection it offers should something go wrong? The CERN mechanism still incurs downtime. It seems they do some of the upgrade before the failover to reduce the outage, but for, say, a 10.2.0.2 to 10.2.0.3 upgrade you can install into a new ORACLE_HOME anyway, and you still need the outage for the catupgrd script. If you are out there, CERN guys, what am I missing?
Logical Standby in the real time world – Graham Cameron
The old system was a single instance running on a Serviceguard cluster. Queries were hurting performance, so they chose both a physical and a logical standby.
Small db, only 22GB, with 2GB of logs per day, running Oracle 220.127.116.11, with the physical and logical standby on the same server. Creating the server in 18.104.22.168 required the database to be quiesced. They still had major issues with their logical standby and found it failing on many occasions. Interestingly, they are using Oracle Streams far more successfully on a different project.
Still, a cracking day and thoroughly enjoyable.