To London on a pretty quiet train for the UKOUG RAC & HA SIG. It had been a while since I’d been to one, and I was really looking forward to renewing acquaintances and learning a bit more about Exadata.
Sadly it was a bit of a disappointing turnout, at the beautiful venue of RIBA London, for what will turn out to be the final ever RAC & HA SIG event as it is merging with the Management & Infrastructure SIG. This new SIG will be titled the “Availability, Infrastructure & Management SIG”. I do hope this still has something of a focus on HA, and RAC, as without question, the RAC SIG has been brilliant to attend over the years. Not only have I learned so much from these events, but it has enabled me to meet and interact with so many fantastic DBAs. I have always considered it the premier database SIG to attend and I hope the quality of both presentations and attendees can be maintained!
I understand the changes are intended to broaden the remit, including in the direction of cloud, virtualisation and monitoring. This does seem like a reasonable progression, and given the sad decline in numbers attending events it is a pragmatic decision.
Despite Dave Burnham warning that people should be “keyboard considerate”, I decided to whip out the laptop and record some notes for the blog anyway.
As has been traditional, proceedings started off with a survey of who is using which flavours of hardware and software. These have been very informative over the years, and the surveys have shown a trend away from UNIX and on to commodity Linux servers. No surprise, but it has been interesting to see it on the ground, as it were. With a lower turnout it becomes a bit less interesting, but it is worth recording the highlights nonetheless.
As for the number of nodes per cluster, the majority had 2 and it seemed to top out at 4, though a couple had more than this. 64-bit Linux was by far the most popular OS, with only a sprinkling of others: AIX, HP-UX, and even a Windows cluster (which also seemed to draw a bit of derision). There were 2 users of Exadata.
The majority of users were running 10.2, with a reasonable number on 11.2. It still looks like there is a good number of upgrades to be done out there. The vast majority of people were on Fibre Channel, with just 1 on iSCSI. Almost all were using Oracle Clusterware alone, without any vendor clusterware.
And on to the presentations:
Phil Davies – Support Update
Phil pointed out he has been with Oracle an impressive 15 years, and he gave his usual rattle through notable bugs and available software updates. What stood out for me were a couple of Data Guard issues: one where doing a switchover after upgrading to 18.104.22.168 basically invalidates the new standby (see Metalink note 11664046.8), and one where an RFS process can overwrite a datafile (maybe even one belonging to another instance)! See Metalink note 1081961.1.
I must say with the number of issues shown, you have to question the QA that is being done within Oracle. Are they letting the ball drop on the bread and butter database releases?
Joel Goodman – Plugging in the Database Machine
Joel Goodman is a living legend, a colossus. The man is a walking Oracle documentation set, and his presentations are like drinking from a fire hydrant. You can guess I was looking forward to his presentation. He really focussed on the monitoring aspect of Exadata, the nuts and bolts of what you really need to do once you have wheeled your rack on to site. Crikey, Joel even gave you the dimensions of the rack!
Joel started with a broad sweep of the components and the various pieces that you need to monitor. A broad skillset is required to manage an Exadata box: DBA, storage, OS admin, network and knowledge of hardware – it is the combination that so excites me, but not every DBA is necessarily on this page.
There are 6 plugins available for Exadata for use in Grid Control.
You cannot mix and match drive types (performance and capacity), even across different cells within the same rack! An expansion rack has only cells, no db servers. It holds up to 18 cells and is only available with high capacity drives.
adrci is available on the storage cells as well as the db servers.
SNMP – traps generated from both db servers and storage cells (plus switches, pdus, kvm)
IPMI – manage server hardware independently of operating system. Available on db servers and storage servers
ILOM – provides out of band monitoring and management. Generates alerts for hardware issues found on db servers and storage servers. Management server on storage server gets sent ILOM traps
There are basically Grid Control plugins for every component: storage cells, InfiniBand switches, Cisco switches, PDUs, ILOM (db server only), and KVM. The plugins extend Grid Control. Joel made the point that metrics and thresholds are not set up out of the box and need to be configured.
There may be a lot of changes with Grid Control 12.1 that may make these plugins disappear.
A trap forwarder is required to catch Cisco switch and KVM traps due to a port mismatch.
Filesystem free space is monitored on cell by Management Server, which will automatically purge old log files if space starts to run out.
Joel mentioned exachk, a utility that collects data about the db machine components and checks it against best practices. He noted that the checks take a long time to run, up to 45 minutes on a 1/4 rack. It produces a detailed report to look through, including recommendations for fixes.
Corrado Mascioli – Exadata Storage – Architecture and Administration
Corrado has extensive in-the-field experience with Exadata – it sounds like not everything in the garden is rosy! Corrado started off with a look at the various versions and options. Exadata cells are shipped with all the software pre-installed, with an O/S based on OEL. There are 3 main accounts on the cells: root, celladmin, and cellmonitor. celladmin is for day-to-day maintenance; cellmonitor is read only, and Grid Control uses this user.
Corrado covered the Exadata processes:
He moved on to talking about flash, which can be used in two ways. The recommended approach is to use all the flash as Exadata Flash Cache, and this is the standard build.
There was a good discussion of the relationship between physical disks, LUNs, cell disks and grid disks. He presented a nice diagram of this stack and then showed diskgroups being carved onto the grid disks. It is important to keep to the naming convention here. All disks in a cell are part of the same failure group, but different cells are in different failure groups.
Finally, Corrado showed how to create the grid disks and flash cache from bare bones with a minimal set of commands.
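The failure group rule can be made concrete with a small Python sketch. It is only an illustration, not how ASM is implemented: it assumes grid disk names of the form `<prefix>_CD_<nn>_<cell>`, and the cell names (`cell01` and so on) and helper functions are made up for the example.

```python
from collections import defaultdict


def failure_group(grid_disk: str) -> str:
    """Derive the failure group (the cell name) from a grid disk name.

    Assumes names shaped like DATA_CD_00_cell01; both the pattern and the
    cell names used below are illustrative assumptions.
    """
    return grid_disk.rsplit("_", 1)[1]


def group_by_failure_group(grid_disks):
    """Map each failure group (cell) to the grid disks it contains."""
    groups = defaultdict(list)
    for disk in grid_disks:
        groups[failure_group(disk)].append(disk)
    return dict(groups)


def valid_mirror(primary: str, mirror: str) -> bool:
    """A mirror copy only protects against a cell failure if it sits
    in a different failure group from the primary copy."""
    return failure_group(primary) != failure_group(mirror)


# Three hypothetical cells with two grid disks each.
disks = [
    f"DATA_CD_{n:02d}_{cell}"
    for cell in ("cell01", "cell02", "cell03")
    for n in range(2)
]
groups = group_by_failure_group(disks)
```

With this layout, a mirror of `DATA_CD_00_cell01` on `DATA_CD_01_cell01` would be rejected, since losing `cell01` would take out both copies, while a mirror on any disk in `cell02` or `cell03` is acceptable.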
At the end of this, had a good discussion with Joel Goodman on ASM normal redundancy and how things become a lot more complex than just using external redundancy.
After lunch there was a panel session on cloud featuring Dev Nayak, Joel Goodman, Martin Bach, and Stuart Bensley. It is clear cloud has still to gain much traction in the database sphere, but management are keen on the potential cost savings. There were good contributions from the audience, and a lot of people have been tasked with seeing if there are cost savings in moving to a cloud-type solution.
Of course “cloud” means different things to different people.
Martin Bach – The private cloud
Thankfully Martin was on hand to explain the various cloud options, including SaaS, IaaS, and PaaS and then discuss a project he has worked on effectively building a private cloud with self-service portal. Martin gave a very interesting insight into how a large organisation operates and the challenges this can lead to, when trying to deliver change. Clearly from the previous panel discussion, these types of projects are going to become more common.
Martin is an accomplished presenter, and this was no exception. I particularly liked his use of a couple of Dilbert cartoons to illustrate his point. Martin even got interrupted by a ringing telephone that seemed to go on and on, but he handled this with aplomb and even did a 3 Amigos style dance during the interruption.
Julian Dyke – Managing ASM Redundancy
It was a very fitting end to proceedings that the man who started the RAC & HA SIG way back in 2004 was the final presenter. Julian started off by showing some numbers for how his benchmark stacks up on various chips and architectures. It is pretty clear from this that Intel is the way to go.
Julian then went on to discuss the various ways ASM stripes its data across disks, including with different levels of redundancy. He then mentioned an issue where tablespace creation was taking a long time in a RAC environment with a 1MB AU but was much faster when using a 32MB AU. The waits were mostly on KSV Master Wait. There is an interesting Metalink doc on this: 1308282.1.
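As a rough feel for the scale involved, the arithmetic below counts how many allocation units a file needs at each AU size. This is only a back-of-the-envelope sketch: the 32 GB datafile is a made-up figure, the function is hypothetical, and the count says nothing definitive about the cause of the waits Julian saw.

```python
MB = 1024 * 1024


def allocation_units(file_size_bytes: int, au_size_bytes: int) -> int:
    """Number of allocation units needed to hold the file, rounding up."""
    return -(-file_size_bytes // au_size_bytes)


# Hypothetical 32 GB datafile, chosen only for illustration.
datafile = 32 * 1024 * MB

small_au = allocation_units(datafile, 1 * MB)   # 1MB allocation units
large_au = allocation_units(datafile, 32 * MB)  # 32MB allocation units
ratio = small_au // large_au
```

For this example the 1MB AU needs 32768 allocation units against 1024 for the 32MB AU, a factor of 32 more units to allocate for the same file.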
Julian mentioned in a throw away remark that with high redundancy there are 2 primary extents!?!
He showed that kfed can be used to show that a disk contains a voting disk. Julian also showed some new asmcmd features that are useful for backing up your spfile in ASM.
Julian finished off with a series of tests he had performed to simulate what happens when SANs fail in a stretched RAC environment. Julian emphasised that having more failure groups might have made the ASM mirroring more robust in his test environment. It still left me with the feeling that stretched RAC is not high availability!
I’m really looking forward to more discussions in the pub when I finally start doing some work in London.