Book Review: Tales of the Oak Table

February 13, 2008

I’ve just finished reading Oracle Insights: Tales of the Oak Table though I admit I’m coming to this book a little bit late as it’s been available for some 3 1/2 years already. I’m also reviewing this with a little bit of trepidation, as I feel I’m more formica table than oak, and others have reviewed this already, far more comprehensively than I will. I too was skeptical, and only recently got around to obtaining a copy, for some reason I thought it would not have enough technical content - doh!

Tales of the Oak Table is quite unlike any other Oracle book I have ever read. Each chapter is written by a different author, all of whom are world renowned Oracle experts. Though each chapter has a different author, there are universal themes running throughout the book. I don’t think it is a coincidence that the vast majority of the authors, at one point or another, actually used to work for Oracle and there are some very illuminating insider information. There is no filer at all, and you will not find a single screen shot.

I think one of the major themes throughout the book is how the improvement in instrumentation has transformed the diagnosis of Oracle performance problems. In several places the book describes quite well how tackling performance problems was almost like a black art, with “try it and hope” being the order of the day. This was before the wait interface came along, and there is excellent insight into why Oracle came to be so well instrumented. With the advent of this instrumentation, diagnosing performance problems gets put on a far more scientific footing.

Another major thrust of the book is examples of projects that had “design issues”. I’m sure you will recognise the perennial favourites like bind variable. The Jonathan Lewis, Tim Gorman, and Connor McDonald chapters being particularly redolent of this theme.

The book justifies it’s entrance price in the chapter introductions from Mogens alone, they are exceedingly funny. I also thoroughly enjoyed the James Morle chapter as well. In fact the book made me start thinking that perhaps a lot of performance problems today are masked by having faster cpus and a whole tonne more of RAM to play with. How comfortable would you be running your database on 50MHz processors?

I found this book to be an excellent read, full of insight, and if you have not had the opportunity to take a look at, I highly recommend you put it onto your reading list.


Interviewing Oracle DBAs

February 10, 2008

You may have seen in an earlier article I was trying to drum up interest in a recruitment drive we were having. Well, we whittled down the applicants and recently held the interviews. There was one candidate that stood out quite away above the other hopefuls. I thought it might be of interest to go through some of the interview questions that we used. I know Howard J. Rogers (currently and perhaps permanently off air) has in the past written a fantastic series of pieces on his interview questions, but I feel our experience is worth sharing. Though it seems the lack of good quality Oracle DBAs is a universal theme.

First off, I think it is fair to say the interview was split roughly in two, with the first half covering more attitudinal questions and the second half covering more technical Oracle questions. We certainly place as much weight on finding the correct sort of candidate, one with a good attitude to learning and customer service, as much as what particular level of Oracle skills they have. Also it’s worth pointing out this is for a job in a small company, where there is no strong demarcation of responsibilities, so the DBAs are expected to do a goodly amount of system administration work, and have ownership of configuring the SANs. The candidate we chose I would say displayed excellent answers in both the attitudinal and technical questions. Here are the Oracle questions we asked, I think these are pretty straightforward and gentle questions, there are not even that many of them!

What approach would you take if asked to investigate a query that was running slowly?

A couple of candidates answered this pretty well. I was looking for things like checking the explain plan, and tracing the session. I was not looking for someone to mention the buffer cache hit ratio - though someone did!

What SQL commands will cause a sort to take place?

Unfortunately a few candidates did not get past order by (which all got), some said distinct and group by. The candidate we hired mentioned these and UNION. No one mentioned create index.

How does Oracle implement read-consistency?

Quite a few candidates stumbled at this question. It is such a fundamental part of the Oracle RDBMS that you would hope a candidate would at least have go at answering. A couple of candidates said they were stumped. I was looking for them to mention that a query only sees data committed before the start of the query and uses undo to reconstruct any data if it finds it with a commited time that was after the start of the query. No one mentioned SCN.

Do you know what ORA-01555 is?

This is somewhat related to the previous question. Practically every candidate did not know this. I do not expect a candidate to know every single Oracle error message, I certainly don’t. However 1555 is such a classic error that I would expect good dbas to know what it is. The one candidate who did know that it’s snapshot too old, got the job.

What resources do you use to keep up-to-date with Oracle?

Well I probably was a little disappointed with the responses to this one, only one candidate mentioned a good spread of bloggers, or indeed blogs at all. Everyone mentioned OTN, but no one mentioned actually downloading a new release and trying it out for themselves. Oh and this question lead on to which authors the candidates liked, Tom Kyte was the definite winner with Steven Feuerstein in second place. I was astounded not one candidate mentioned Jonathan Lewis.

Explain the architecture of RAC?

So, I realise this could potentially lead to a wide ranging answer, but most of the candidates had RAC on their CV so you’d expect them to able to explain it at least to a level that they new it was different from single instance Oracle. Minimum requirement was multiple instances accessing the same database datafiles, communicating through a private interconnect, which most met.

Finally, here is one of the non-technical questions we asked:

What kind of working environment do you need to be most effective?

I’ll leave it to the reader to work out what we expected from this one!


UKOUG RAC & HA Meeting

February 5, 2008

Today I attended the UKOUG RAC & HA SIG and here are the verbatim notes that I took at the event. I probably should not complain at the journey, but an hour on a packed train for 38 miles seems “extravagant”. Still, probably took one of the speakers longer to arrive from Geneva!

Introduction

Next RAC Sigs, Thursday 15th May, London and Thursday 2nd October, heritage motor centre

Call for papers for UKOUG conference opens on March 17th, This seems astonishingly early!

A Really packed agenda today.

Survey

8 9.2
2 10.1
majority on 10.2
none on 11 - yet

few itanium
srpinkling of solaris sparc
probably majority on linux 64 bit
another sprinkling of windows

Vast majority with 2 nodes a handful with more than this, including 8 nodes from CERN and someone with 10

Vast majority using SAN for storage a few on NAS.

A lot using ASM, with a handful on ocfs, sprinkling on veritas 1 with polyserve

A lot of people with a physical standby, a few with a logical standby no one using auto failover a handful with stretched clusters.

Hardly anyone using standard editon

Phil Davies - Support update

Whispers of the first patchset for 11g - probably not for ages though. 10.2.0.4 surely coming soon. Interesting problem with ASM hang, and controlfile enqueue problem this is on 10.2.0.3, fixed 10.2.0.4.

On the January 2008 CPU, one audience member, claimed Oracle support stated to them that the cpu was rolling upgradeable. My support analyst definately stated it was not, interesting contradiction. Nominet got a mention, as I have a Documentation bug out for the CPU.

Dave Burnham - Highly available Oracle Databases

High level overview of building higly available databases. Downtime = Time to notice an issue + Time to resolve the problem.

Complexity kils availability, I certainly agree with this, Keep it Simple Stupid really is the way to go - the less moving parts the less that can go wrong. Concept of an availability benchmark system, which is a single server oracle database server - does your infrastructure improve on this config? That is the high availability solution is the modern comodity system which can have many hot swappable and redundant parts- not fancy clustering solutions.

However, several things are not protected by the single server solution, host failure, site failure, and of course the number one cause of reduced availability is human error.

One alternative to running RAC is to use a single instance database with clustering solution from veritas (like VCS), or SUN, or any of the other hardware vendors. Basically on failure, the clustering solution will restart Oracle on a different node. No expensive RAC license, and it’s fairly well understood technology.

Dave has lots of experience of stretched RAC clusters but states they are quite complex, and that dataguard is far simpler, though perhaps was still prefering stretched RAC for HA.

Miguel Anjo - Multiple RAC clusters

Running around 20 RAC clusters, 2-8 nodes.

Oracle Home is same everywhere, they deploy the ORACLE_HOME as an image.

3 stage environment
development: 8/5
integration: 8/5
Production 24/7

Custom built gui - browser based, to allow developers to see what is happening to their sessions, including sql, DML & DDL and ability to kill their session.

They have a 2 node clustered server for monitoring (runs single instance oracle). The have auditing turned on and generate weekly/monthly reports. Custom written monitoring, based on python, bash, xml.

1 RAC cluster per physics experiment.

They use a wiki for a logbook, database procedures.

Martin Bach - Lessons Learned from Migrating 10.2.0.2 to 10.2.0.3

This talk was based on using Standard Edition. They not only upgraded release but also migrated hardware, old hardware single core cpu with 3GB memory, run queue sometimes exceeding 12. new hardware 2 x dual core opteron and an upgraded SAN.

They have NO RAC test environment - scary stuff! Oh they have no device naming persistance - no ASMLIB or udev. They encountered some wacky bugs with SUSE and OEM. dbms_scheduler failing to schedule jobs to run on time, running by 5-45 minutes.

ASM 11g Experience in Extended Cluster - Bernhard de Cock Buning

Seems to be running RDBMS at 9i with Clusterware and ASM instance at 10.2.0.3 considering upgrading Clusterware and ASM to 11g. RDBMS moving to 10. They can’t use the ASM_PREFERED_READ_FAILURE_GROUP as the RDBMS was not 11. ASM Sysasm user - separate user to own ASM home, not required in 11gR1 but is required in 11gR2. Audience member stated they saw x2 increase in rebalance performance in 11g compared to 10g. Possibility to perform rolling ASM upgrade with 11g.

Simulating one site failure, 10g continued uninterrupted but 11g generated an ORA-600[kfdOffline01]. Seems like ASM rebooted on the surviving site. They used swingbench for testing load and had node crashes a couple of times, but once they were using Hugepages they had NO node crashes. It’s an interesting idea run 11g ASM with 10g database instance.

Split Mirror Backups with RAC & ASM - Howard Jones

General consensus is that it’s costly - requiring high end storage and complex. Using Symantec SMB integrating with Netbackup.

Using Dataguard for hardware migration - Miguel Anjo

Cern Using oracle streams to send LHC data around the world. Uses rman duplicate target database for standby for creation of standby. they switchover to the standby and upgrade this, only using the (now old) primary should they encounter a failure.

I don’t get it really, perhaps it was still too close to lunch for me to understand fully: why they don’t upgrade the primary saving failing over, but using a dataguard standby for the protection it offers should something go wrong? The CERN mechanism still encounters downtime, seems like they do some of the upgrade before the failover and reduce the outage, but for example a 10.2.0.2 to 10.2.0.3 upgrade you can install in a new ORACLE_HOME and you still need the outage for the catupgrd scropt? if you are out there CERN guys, what am I missing?

Logical Standby in the real time world - Graham Cameron

Old system single instance running on service guard cluster queries were hurting performance, chose physical and logical

small db only 22gb, 2GB/s of logs per day running Oracle 9.2.0.8, running the physical and logical on same server, creating server in 9.2.0.8 required the database to be quiesced. They still had major issues with their logical standby and found it failing on many occasions, interestingly they are using oracle streams far more successfully on a different project.

Still, a cracking day and thoroughly enjoyable.


Database jobs & Oracle Active Session History

January 11, 2008

A few days ago I found myself investigating a session that had become blocked during the middle of the night. This session happened to belong to a critical application so caused an on call engineer to be paged, but by the time they looked the session had become unblocked and the application was back happy and running again. I only found out the next morning so was investigating somewhat after the fact. I decided to have a look at what the active session history could shed on the problem. There is a lot of information out there about what ASH can do for you and I heartily recommend reading them.

I was fortunate in that the application whose session was being blocked was still connected to the database, so I could easily identify a SID, and had logged when it was paused. I could then could generate a report on active session history based on just that session_id (it’s quite frustrating that, for example, V$session uses the column SID for the session identifier, while V$active_session_history uses SESSION_ID for the session identifier, am I alone in finding inconsistencies like this really annoying and a clear case of a lack of communication within oracle!?!) Anyway I ran some very simple sql along the lines of:


select session_id, event, sql_id, blocking_session
from V$active_session_history
where sample_time between (sysdate - 6/24) and (sysdate - 5/24)
and session_id = &sid
/

It produced output like the following:

picture-3.png

So I could see that this application had been blocked because some other session had locked a row my application was interested in updating. But There is no misprint in the blocking session information, it was not present in the case I was examing, this was Oracle version 10.2.0.3. What a pain. I could see sql_id and could find out easily what sql my blocked application was trying to run, but V$active_session_history was not telling me what was causing the blockage.

In the end I looked at an awr report for the period covering the blockage and saw a database job was running and a knowledge of what the job was showing enabled us to be sure this was the culprit. But it lead to the conclusion that jobs scheduled with DBMS_JOB do not show up as a blocking session in the V$active_session_history view.

It’s pretty trivial to reproduce:

SQL> create table t (time timestamp);

Table created

SQL> insert into t select sysdate from dual

1 row created.

SQL> commit;

Commit complete.

SQL> create or replace procedure update_t as
   begin
   update t set time = (select sysdate from dual);
   dbms_lock.sleep(120);
    commit;
    end;
    /

Procedure created.

SQL> declare
    num_job BINARY_INTEGER;
    begin
      dbms_job.submit(job => num_job,
      what => ‘update_t;’,
      next_date => sysdate + 1/(24*20),
     interval => ‘(sysdate+1/86400)
  	    + mod(trunc(sysdate+(1/86400)+(1/24),”HH24”)
  	    - (sysdate+1/86400),1/96)’);
     commit;
     end;
 /

Now all you have to do is run the update in another session while the database job is running. Then when you query V$active_session_history, you too will be able to a session that is blocked with row lock contention but has no blocking session information. Not earth shattering perhaps, but it was surprising to me to see a session shown as blocked but not blocking_session information. So now you know, if you see these symptoms, your blocking session could well have been a database jobs. Of course, you are probably all using DBMS_SCHEDULER by now, anyone know if this shows as a blocking session in v$active_session_history?


50% of a UKOUG Conference

December 6, 2007

Well I never did make it to the Wednesday of the 2007 UKOUG Conference. Our new baby arrived on Wednesday at 11:22, after 12 hours of painful progress. She really could pass for Winston Churchill. I was kinda grateful that She hung on until after my talk and that I did not have to make a mad dash back from Brum in a blind panic. Certainly, 50% of a UKOUG event is better than none at all!


UKOUG Tuesday

December 4, 2007

Well, nightmare night the evening before, woke at 02:00am and failed to get much sleep after that. Pretty much gave up on making it up to Brum for Larry Carpenters talk on 11g dataguard new features. I was feeling a little uneasy by the time I was in the speakers lounge, having a final run through, it’s funny I’ve given this talk before, but knowing that in the audience there would be real dataguard experts made me slightly more nervous than giving the talk previously. I was Hall 9, which is actually quite large, though with being scheduled at the same time as Tom Kyte probably reduced the audience a little.

I had not really changed the talk much since giving it previously, and I’m not sure the spiel about our new baby being 8 days overdue went down to great. I was surprised by how long it went on for, when the chair said that there were just 6 minutes to go, I could hardly believe it. So if you were in the audience and were yawning away, apologies for going on so long. Seriously the talk went pretty well, with very few hiccups. I was somewhat chuffed when Jonathan Lewis passed me by somewhat later and said he enjoyed it - that’s good enough for me!

High Availability round table

After lunch I was at the HA round table which turned into a ask Larry Carpenter session - obviously no bad thing. active dataguard is a cost option perhaps because oracle feel it offloads large amounts of activity from primary. Hmm, still you must buy your EE license for your standby anyway.

Possibly they may in the future allow SE to use dataguard. 11g physical standby will change audit parameter to O/S when opened read only

Using flashback to test an upgrade on a standby and then flashback to the lower version, I’d already actually tried this in going 10.2.0.2 -> 10.2.0.3 - I must try testing this 10.2.0.3 -> 10.2.0.4 or maybe to 11g. Wow not many people using DGMRL. Larry Carpenter saying use DGMRL in RAC, 10.1.0.5 is almost viable, but prefers 10.2.

Jonathan Lewis statspack

I was pretty drained by this time and made this the last talk of the day, which may not be the attitude required.

jonathan likes level 7 statspack., this includes top sql, plans and segment level stats.

disk I/O
cpu consumption
network delay
application contention locks rac
internal contention latch

it’s almost always the sql. The most eye opening thing for me, was the fact that there is disk I/O time histograms with statspack in 10.2 - these are not available in AWR reports - which I kinda switched over to using.

Can’t say I’m enjoying the commuting, it’s adding a couple of hours to both ends of the day, not what I’m used to. The train has been packed every day and I’m missing a few beers at night. I should not complain, I should just be grateful daughter #2 has been delayed enough to allow me to attend. Now, I wonder if I’ll make Wednesday….


UKOUG 2007 Monday Morning

December 3, 2007

Well I had a bit of a leisurely morning, thought I would skip the thoughts of Ian Smith - it was bound to all be about “fusion” anyway. These are the notes I made during the presentations I attended on the first day:

Phil Grice RAC V Dataguard

Interesting demo of the dataguard broker, and it did seem to do what it said on the tin. There was a demo of fast-start failover and it seemed pretty quick at performing the failover. Though note that in 10g you must be using maximum availability for this, though there is the prospect of doing fast-start failover with maximum performance in 11g.

Perhaps the best thing about the demo was the reinstating the failed primary as a standby using flasback database, the broker did this automatically once the failed primary was brought back up - very impressive and one of the few arguments for using flashback on a primary - hmm. what performance impact and how reliable is it really?

Jonathan Lewis Playing Russian Roulette

Several examples from the forums where people were asking for one line solutions without presenting sufficient data.

One example was using optimizer_cost_ind_adj which obviously affects all queries, jonathan showed a query that by default used a good index but when the paramter was set used the bad one.

how to stop chasing silver bullets:

• put the data in the right place

• build appropriate indexes

• avoid labour intensive operations

Andy Bryant Building and running a 60TB standby

created standby using tapes - LTO2’s took 3.5 days to backup and 6 days to recover

veritas 5.1 failing on around 50 datafiles (out of 1500) - fixed in 6.0

archive log gaps restore from tape?

workaround for gaps is to kill ARCn on primary!

Backup & Recovery roundtable

Surprising that some people are using cold backups even though they are running in archivelog mode. Media management and rman retention policy could be in opposition. Interesting that around 50% were using a recovery catalog, this seems to have it’s own issues. 11g parallelize backup of individual files (cuts up a datafile into pieces so more than 1 channel can backup the same datafile).

Thomas Pustynen rman incremental backups

V$datafile_backup keeps track of upto what scn a backup relates to. block change tracking keeps a bitmap of changed blocks. You have to specify a datafile to be used to tracking the changes. Incrementally update image copies, can switch to image copy instead of restoring - requires enough disk space. It’s very easy to switch to the copy and recover this copy before bringing the copy online.

Larry Carpenter Creating a physical standby - best practices

Interestingly quite a large percentage of people were using the broker/EM to manage their standby, but only a tiny fraction used EM to actually create their standby. Larry argued that it was more sensible to use EM to create as it’s easier, but I profoundly disagree with this, as it might be easier to create but I bet it gives you far less understanding of what make a standby tick.

Overall in truth no great revelations, but the roundtables are a great idea and I hope to be at the HA one on the Tuesday. It was kinda nice being able to get into the speaker lounge (well useful being able to get some power into the MacBook Pro). As of circa 9:30pm no sign of baby #2 so it does look like I’ll be doing my dataguard after all!


UKOUG Conference Dilema

December 2, 2007

I find myself in a real dilema over this years UKOUG annual conference in Birmingham. No I’m not talking about whether I go to Tom Kyte’s “11g new features for DBAs” which clashes with Larry Carpenter’s “Best practices for creating your dataguard standby databases” (it’d be the dataguard talk, by the way). No I find myself on the horns of a much more personal dilema. My wife was due to give birth to our 2nd daughter 1 week ago. It still has not arrived.

The UKOUG conference is in Birmingham, which is just over an hour away by train from where I live, which is Oxford. I’ve been really looking forward to the conference for quite some time, probably more so than in the last couple of years. In fact I’ve been to the annual UKOUG conference now for the last 7 years, so it is quite a fixture in my calendar. For the first time though I was going to be a speaker, while I’ve given a few presentations at UKOUG Special Interest Group (SIG) events, this would be the first time I have presented at the annual conference. Sure it is one I had prepared earlier, but it’s still good to get an abstract accepted for the conference.

Normally, I would stay in Birmingham for the duration of the conference, and yes there is usually a few beers a bit of networking going on, but this year I’m going to commute, I was fearing I would have to miss the whole event, and it is still possible I will miss it if my standby offspring ;-) does arrive. I have also prepared a continuity plan for my presentation in the event I am assisting in a delivery suite in that I have enlisted the assistance of my colleague patrick hurley, fine upstanding deputy chair of the UNIX SIG.

Hopefully I’ll see you in Brum, but if I don’t make it, I think I’ve got a legitimate reason!


The Oracle VM bandwagon

November 14, 2007

Seems like the blogosphere is all over the Oracle VM. I’d would definately say I was on the skeptical side of the fence over this announcement. I guess in certain markets it could have it’s place, but I don’t see it in production. A little while ago Doug was talking about whether 1 instance per server, or multiple instances per server was more common. I can’t see where the advantage really comes in having multiple VM’s per machine over having multiple instances per machine, except with the multiple VM’s you get the additional overhead of managing the seperate VM’s. Sure, you get some benefit of different OS’s, patch levels etc, so maybe some management benefit. But compared to 1 database per server with multiple schemas, the performance of multiple VM’s just has to be degraded. Tom Kyte’s point about the multiple pots compared to one pot for me is spot on, but surely VM’s really can only make this worse. So that is my thought about production.

But I start to hesitate when it comes to test/dev environments. I have been invovled with a setup whereby every developer gets his/her own rig. With full fibre connectivity to the SAN. The management overhead of them does not scale. These boxes are mostly idle, and sure we have a provisioning server to install an OS and the Oracle binary is on the SAN, but it still does not scale. I really like the look of Solaris Zones which we are using on web server type applications, but not for Oracle as yet. So a couple of 4-way x86 boxes virtualised might be a good alternative for us.

I don’t actually like the license/support position of running Oracle in a VM. This article seemed to be hinting at an Oracle VMware love in, obviously this turned sour at somepoint. I understood it that it was not supported to run Oracle in a VM, and with the Oracle VM announcement I assume the only VM that is officially supported is the Oracle one, though the following is quoted from metalink:

“NOTE: Oracle has not certified any of its products on VMWare, but use of Oracle products in the RAC environment is specifically not supported.”

Though Oracle does go on to say:

“Oracle Support will provide support for Oracle products when running on VMware in the following manner: Oracle will only provide support for issues that either are known to occur on the native OS, or can be demonstrated not to be as a result of running on VMware.”

So this level of support might be fine in test/dev, I’m not so convinced about it in a production environment, and if you want some scale out with RAC, forget a VMware environment.

This EMC guy is also less than convinced, though I accept he may have his own agenda! I think he has a fair argument, and already loads of people are running Oracle in a range of different VM’s. So is this Oracle jumping on the VM bandwagon rather than concentrating on core techologies? And now I see another company climbing aboard the VM bandwagon, looks like the (virtual) train is about to leave the station, better hop aboard.


11g Result Cache - a missed opportunity?

November 11, 2007

One of the new features of 11g that initially caught my attention, (apart of course from versioning, that never actually made the cut into the production release) was the server side result cache. I thought this could really, really help repeated queries fly which on the database I was working on, a large number of queries were repeated. Recently, at the DBMS sig, Julian Dyke gave a presentation, based partly on this that reawakend my interest.

Sadly though this feature is NOT as useful as I first imagined, the problem with the result cache is stated plainly enough in the Oracle documentation “The cached results stored become invalid when data in the dependent database objects is modified”, naively I assumed the cached results would only be invalidated if the rows they relied upon actually changed. At the worse you could have it if the block that contains the rows that are part of your results set changes then invalidate the result cache. But no, the result cache of your query is invalidated if ANY row of the tables that you are querying changes. Clearly this makes the result cache not all that useful in an OLTP environment, to me this smacks of a missed opportunity, and I wonder if there is room for this feature to be enhanced in future releases.

SQL> set autotrace traceonly stat
SQL> select /*+ result_cache */ owner from t where object_id=33333;

Statistics
———————————————————-
1 recursive calls
0 db block gets
814 consistent gets
808 physical reads
0 redo size
419 bytes sent via SQL*Net to client
420 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL> /

Statistics
———————————————————-
0 recursive calls
0 db block gets
0 consistent gets
0 physical reads
0 redo size
419 bytes sent via SQL*Net to client
420 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

I love the look of the 0 consistent gets, but you won’t be seeing that much in an OLTP environment until the granularity of what invalidates the result cache improves, so definitely a missed opportunity.