jarneil

October 12, 2009

Oracle OpenWorld 2009 – Day 0

Filed under: Oracle OpenWorld — jarneil @ 5:45 am

It has been an absolute fantastic and hectic first day of Oracle OpenWorld 2009. It feels like it has been far, far busier compared to the Sunday last year.

It was a bit of a mad dash from registering in Moscone West to the Hilton to catch the opening keynote of the Oracle Develop track by Tom Kyte. It looked like there were several thousand other people crammed into the Grand Ballroom B at 09:00am and Tom did not disappoint – I really enjoyed this talk titled:


“What are we still doing wrong?”

The music was blaring out as the ballroom filled up, and when Tom came on it sounded almost like ceremonial marching music being played. It was quite odd to see the close ups of Tom fill the two large screens at either end of the stage. Tom started off calling everyone “data processing professionals”. This presentation is a collection of interesting things Tom has seen over the past year, quite a lot of amusing content.

underestimating complexity

Even what on the surface seems like a sm re change, under the surface may have many, many ramifications. Hilarious picture of an error message complete with astonishingly bad spelling mistakes on Tom’s mobile phone,

Giving good advice that developers should look at the goal the business wants to achieve rather than having specs that specify a way of doing something.

Not knowing how to ask questions

Tom shows a few questions asked on ask Tom that basically do not give enough information for anyone to be able to troubleshoot the problem – he likes the analogy “My car won’t start -why?”.

Basically saying be very specific, but give details of the actual error.

we write/generate too much code

Very much of the opinion that more code = more bugs

lots of funny examples of writing a ton of code to accomplish what could have been done in very few lines. Classic was how to find make a negative number into the positive equivalent.

Dealing with failure

Example of t-mobile sidekick product completely wiped out by a server failure and no backups. What a terrible way of running something. Now talking about Ask Tom’s infrastructure and how he used to run the infrastructure himself, and that it was getting a bit dodgy – now hosted on apex.oracle.com

Errors happen – deal with it.

Do not catch unknown errors, need to raise them and log them.

quite a few times mentioned the when others then null – saying the logic of this escapes him. Saying the code before the when others then null seems to not matter whether it runs or not, in which case why not just remove it!

developers get into the mind set that users should never see an error as that would be embarrassing. But only catch errors that you can deal with, raise the errors and make sure you log them and become aware of them – no good just letting the users know.

security matters

discussion of why people not doing security, on quote was “besides it’s not as important as having pretty screens”

example of a funny site comparing computer security and star trek.

After Tom, I was back of to the Moscone and took in a couple of BI presentations. This was quite a change for me, normally I’d have gone to some of the RAC presentations. First one was on visualisation:

Lies, Damned Lies & Dashboards

Some interesting material around presentation and cognition – how we interpret what we see. They gave clear examples showing before and after, in particular the “using less ink” was pretty clear when they showed a graph with and without grids – the without was much clearer.

OAUG BI SIG

This was a really good 2 hour meeting. Now, I’m not sure how often you’d hear that phrase said, but I’m pretty new to BI, an I found this really interesting. There was a 20 minute bit from Oracle followed by 2 customers who had implemented OBIEE in their organisation, and I found it really fascinating to hear their experiences of an implementation. The event was rounded off with a good question and answer session.

Whenever oracle acquire a company they seek to replace (rewire) the BI software used within that company and use OBIEE within 90 days

Claims OBIEE is only solution for federated queries accros multiple apps.

Essbase is forward looking analytic engine

Customer Implementation from Black and Decker

claims they were in a train wreck, with cognos – 7 figures to upgrade cognos, dependent on IT for new reports. manual process to extract data, quite an insight into a business without control on it’s data

had end users choose which tool they would migrate to.

multiple apps into the DW delivering the BI content

because he had business users on board he could sell the depreciation of the cognos solution

he is claiming OBIEE is really slick

2 1/2 years of sales data from about 4000 stores

had end users do the data cleaning, that was one of the hardest things

keen on identifying Super Users within departments

Capital Power Corporation

EBS reporting insufficient

Risk and Trading system with almost no reporting

there seems to be a theme that everytime the business wanted a new report they had to call IT – business found this frustrating

no easy analytics on positions and risks – were using excel

did a vendor bake-off

separate presentation and BI server components

was a business project – not a geeky IT project – claims there was a shadow IS project to do their own evaluation

think about security can integrate to AD

For EBS project length 4-6 months developed a large number of custom dashboards. went into pilot mode and made lots of changes based on user feedbacks

claims complexity is very high, you need an expert when not using out-of-box dashboards.

Used Oracle Data Integrator to develop custom ETL to pull data from 5 different databases

designed custom star schema based DataWarehous 850GB 15 fact tables 75 dimensions have around 100+ different dashboards project length 15months

got the users to build the dashboards

dashboard creation/maintenance has been transitioned to end users

This is a business project not an IT project

Must be agile style approach – big bang does not work

get the architecture designed before engaging the business – not sure how that squares with the above.

poor performance will overshadow everything

Basically the messages coming over loud and clear:

Business can see the IT department as the problem – so get the business on board asap and try and ensure that the business are empowered to have full ownership of the end product so that they are self-sufficient.

October 11, 2009

It’s never the same after the 1st time

Filed under: Oracle OpenWorld — jarneil @ 4:35 am

So my second trip to Oracle OpenWorld started with a lot less excitement than the first. As the title says the second time you do something is really never as exciting as the first. I had not bothered to pack until the day of flight, so it was quite nice having my eldest daughter chuck stuff into the case for me just a handful of minutes before the car to airport arrived. Actually seems to have worked out pretty well, as I seem to be missing less stuff than usual.

The journey itself was really straightforward and as an added bonus I had plenty of room on the plane, but 11 hours in the one seat is never going to be a comfy experience, barely managed to sleep so awaiting the jet-lag to kick in. I fully expect to be awake at 03:00am San Francisco time.

Going through immigration control, I managed to go to the border agent who signed my passport coming to OpenWorld last year. He was a little surprised at that. As I was leaving I told him I’d see him next year. We will see.

I think there may be a bit more interest in the presentations in general this year than last, what with the new 11gR2 release to learn about. I’m currently working on a RAC upgrade to 11gR2 and we certainly have plenty as yet unanswered questions. It will be good to be able to pose some of them to RAC team, as well as the ASM people.

The other thing I’ve been tasked to find out about, is the skinny on data warehousing and Business Intelligence. Quite a new field for me, so hopefully there will be some good content to help me navigate my way.

Roll on tomorrow.

5am Update: Well 4 hours sleep seems standard. To the bloke in an adjoining room whose cough woke me up at 02:45, you really should see a medic that cough sounds terrible.

September 22, 2009

Charles Phillips on Oracle and VMware

Filed under: virtualisation — jarneil @ 10:03 am

Recently there has been a bit of a campaign stirred up by EMC and VMware to get Oracle to change it’s support position vis-a-vis VMware. Have a look at Oracle Storage Guy asking for people to get in touch with Charles Phillips about the Oracle/VMware position.

VMware is becoming the big thing for my organisation and we are having a massive mind-shift from running large numbers of physical tin to hosting our applications (including many Oracle instances) onto VMware. We need one VM solution, and our corporate choice is VMware, we don’t want to run our Oracle instances on Oracle VM while the rest of the infrastructure is on VMware.

I emailed Charles Phillips myself:

Hello Charles,

I suspect you might get quite a lot of emails on this subject.

Nominet are one among (I’m sure) many Oracle customers who are very keen to run their Oracle estate on ESX hosts.

It would be great if Oracle had a change of heart regarding your Support/Licensing position vis-a-vis running Oracle on VMware.

Heck, with OpenWorld coming up, it sure would be a great time for an announcement.

jason .

What I was not expecting was to get a response:

Jason, thanks for your note.
There is a misperception out there so let me clarify. We never said customers were not interested in virtualization and we already recognize customers want to run their applications in a virtualized environment and we encourage them to do so which is why we offer Oracle VM with no license charge. We do support, test, and certify aginst the Oracle VM environment. We’re always happty to hear from customers but I didn’t say we needed to hear from customers who wanted to use VMware; we arleady agree that virtualization is important.
What we’ve said on other VMs is that we will respond to support issues as they arise but so far we’ve elected not to formally certify any third party VMs because virtualization is intricately linked to the rest of our stack. A lot of our customers run database grids and we support virtual clusters which requires our clusterware for internode communication interacting with our VM for HA features which is why people use RAC in the first place. We also use virtualization to provision RAC clusters and do live migrations.
But that requires an engineered product family that takes advantage of our shared storage, clustered file system, clusterware, and our shared everything environment. We are supporting a complete system of deployment for virtualized computing and not just a hypervisor. That’s what allows the advanced functionalty. e.g. availability that doesn’t rely on network pings to determine whether a guest is running or not. This reduces the chances for false positives/negatives when determining whether a VM has failed. It also assures that a VM is restarted correctly without any risk of shared data corruption. We of course built our business on protecting data at all costs and the VM interacts at such a low level with our stack that we’ve elected to certify the one we could fully engineer and test. We have to protect the data and reduce the risk.
We’re trying to make it as easy as possible for people to run our database in virtualized environments. We are providing additional add ons for free such as VM templates that are downloadable and already preconfigured e.g. Oracle Enterprise Linux and the Oracle database. Quicker ramp up town with a single configuration installation script. Our version of Linux distribution is paravirtualized and we get significant performance benefits as a result. We can also do live migrations as well as capacity and power management (turn machines off an on). Much like the mainframe where virtualization was created, its a complete system from end to end.
We do respond when customers call and work with VMware and other third parties if an issue arises. We are always opening to revisiting issues but that’s where we are right now. Our VM is free so the only motivation is to provide a reliable and well tested environment for HA, clustering, and grid computing
As for licensing, most of our customers now have unlimited license agreements which means it doesn’t matter how many VMs, cores, or processors you decide to deploy. We think this is easier for both sides and encourages adoption of the technology. Thanks Charles

I don’t think it really helps me run my Oracle estate on VMware any easier, but I think it is to Oracle’s credit they actually provide a response, whether the response helps is a different matter, but I take my hat of to Charles in the sure-footedness of the response, at least he did not just ignore the email, and attempted to at least interact with 1 customer.

If you are interested in running Oracle on VMware do drop him a line yourself, it’s charles.phillips@oracle.com.

September 10, 2009

UKOUG RAC & HA Meeting September 2009

Filed under: UKOUG — jarneil @ 7:34 pm

Introduction

It was a beautifully bright and sunny, but cold day as I cycled to Oxford rail station to head to the Oracle offices at Blythe Valley Park for the third RAC & HA SIG of the year.

Future dates for the next couple of RAC & HA SIGs:

10th Feb
10th June

Quite a small turnout really, down to 32. Audience this size makes the traditional round of surveying who is running what a bit less interesting.

Some people running 11.1, no one obviously running on 11.2

Practically everyone using physical standby with their RAC systems, no one on active dataguard yet.

Oracle Support Update – Andy Giddons

Andy showed a refreshing level of honesty about the Oracle database product for an Oracle employee.

Very surprised to hear him say that 11.1.0.7 is the terminal release for 11gR1. Quite unusual just having 1 patchset for a release, remember 11.1.0.6 is the base release for 11gR1. 9.2 has 8, 10.1 has 5, as will 10.2, 10.2.0.5 being the terminal release for 10.2.

I pointed out we had hit a number of bugs in testing 11gR1 that were marked only as fixed in 11.2. It really struck me that 11.2 is the release to go for.

Oracle Linux Test Kit – Martin Bach

“The more I looked the more impressed I was”

Martin had a project to replace ageing hardware running on rhel3 moving to rhel5 u3 also a 32bit -> 64bit migration.

He wanted to know who much better performance would be, how much improvement the new hardware would bring, so he could show to the business the value of purchasing this new hardware.

There are several load generators, including swingbench, hammerora, ORION. Also option to use Real Application Testing though this is not free, and they would have had to upgrade their oracle install to 11g.

Using swingbench, found a five-fold increase in transactions per second on the new hardware. found hammerora too complex, particularly the tcl component.

The Oracle Linux Test Kit is an open source project available at the oracle open source software site.

Designed to verify linux kernel functionality and stability

it is a testing tool and not meant for performance comparisons. Used in the oracle validated configuration programme andhas a handy silent install of the oracle software.
Requires a lot of disk space. Comes with both OLTP and DSS benchmarks. You’ve really got to have complete use of the box when running this, as it will bring all instances running on a box down when it is finished.

Supports a variety of filesystems, including NFS, OCFS2 and ASM.

Sounds like it could be a useful tool look at if you have a new hardware install project. Don’t think it supports 11gR2 yet though.

Oracle High Availability for an Age of Austerity – Steve Shaw

The whole driver of this talk was to show how to build a Highly-Available database system for as cheaply as possible. Essentially using Oracle VM to provide cold-failover of a single instance database.

Oracle VM has HA built in. This uses OCFS2 which is built into the linux kernel from 2.6.16

Both Oracle VM and Oracle Enterprise Linux are free to download and use including in production.

The OCFS2 part is used to manage the guest OS’s

Oracle VM can allow your single instance db to start upon another host should the host fail. Apparently live migration will be supported in the future.

talking about using commodity hardware and open-e software to fail between storage nodes.

Mirroring Arrays with Open-E

Steve showed an example of hot adding extra cpu capacity to a VM and then a graph of how the database immediately picked up the increased cpu count.

Claiming nehalem around x15 performance compared to 2004 cpu’s and double compared to previous generation. Also claims Oracle VM gives 84% performance of native performance.

Oracle on VMWARE – Carl Bradshaw

Interesting discussion on Co-scheduling and CPU skewing. Can use esxtop to determine whether va’s are suffering skew.

good idea to reserve the size of the SGA but don’t over reserve as upon failure you may not be able to failover that particular VM.

comparing RAC against VMware HA and Fault tolerance also comparing Dataguard to SRM.

not to keen on Fault tolerance, particularly the fact it’s the 1.0 release of it, and that it is limited to 1vCPU.

Showing how flexible dataguard is, though has mentioned that SRM is for the whole datacenter not just the database.

Virtulisation Panel Discussion

An unusual one this for a UKOUG meeting, and I think there was a reasonable discussion, even if the questions took a while to get started. Virtualisation is clearly a hot topic and it has filtered down to even the database level. I think the panel were still stating not to run mission critical high volume of transactions databases onto a virtualised platform, I’m sure VMware may have a different opinion on that one.

I really was questioning exactly how much cheaper virtualisation truly is, as I’m finding migrating to beefier boxes to run VMware on, once you add in all the additional cards (hba, and extra nics) as well as increased software licenses, and of course each VMware needs far more memory than your standard single application box, a VMware host is a *lot* more expensive than our previous hosts.

Sure I understand that there are savings in power and cooling, and of course your datacenter is fantastically more agile, and you have wonderful HA features, but these are bit harder to put down on a budget spreadsheet, bottom line is the servers you are buying are going to cost more money.

I think this can be a good format and I hope the RAC & HA organising committee pursue this format in future events.

Practical Considerations of a Stretched RAC Cluster Implementation – Deepak Singh

This was the presenters first presentation and he did seem a bit nervous at first. Basically they implemented a Stretched RAC cluster for a baggage handler, sighting the nodes in different terminals.

They used 11gR1 and ASM to mirror across the two storage arrays in the different terminals. Utilised the ASM_PREFERRED_READ_FAILURE_GROUPS to ensure reads only occurred to the local storage array. They had dark fibre connection between the sites, which were only 1KM apart. Oh, used Standard Edition to cut the licensing costs.

Raised an interesting question at the end about what happens if the connection between the arrays goes, but the nodes on either side of the cluster can still communicate with each other, and the voting disk in the 3rd location. There was an interesting discussion saying that this situation would crash the cluster, but that you could then decide to start one side of the cluster back up again.

I had to pull the emergency release cord and leave before the final presentation on Oracle Fusion Middleware.

September 1, 2009

11gR2 Slips out the door

Filed under: 11g upgrade — jarneil @ 12:51 pm

11gR2 has been released without too much fanfare on September 1st. Really surprised this was not held back to OpenWorld, and Larry doing the honours.

June 18, 2009

UKOUG RAC & HA SIG June 2009

Filed under: UKOUG — jarneil @ 8:03 pm

It was a funny start to my day, my kids woke up in the middle of night, and at 03:00am my fire alarm went off, which gave me a quite a fright, though thankfully nothing was actually on fire. So I was pretty knackered as I got into London on a beautifully sunny day for the UKOUG RAC & HA meeting. So many familiar faces in the audience – which is great to see.

Preliminary Introduction – Dave Burnham

Before the survey of who is using what, Dave Burnham mentioned that the following meeting in Blythe Valley Park, may include an “expert panel” discussion forum. I think this could prove to be an excellent idea, though of course it’s highly dependent on the quality of questions discussed. Also there is the possibility of doing something on virtualisation – which is something I’m very interested in at the minute.

So onto the survey and as always most people have 2 nodes in a cluster with a handful on 3, 4. The highest in attendance today was 6 nodes. Vast majority running Linux and 64-bit, apart from a somewhat fed-up Martin Bach having the misfortune to be on 32-bit Linux. Not one person running RAC on windows, even though there was a presentation on running RAC on windows. A good quantity running RAC on Solaris SPARC.

The vast majority of users are using 10.2 with a handful having upgraded to 11.1, most people using Fibre Channel connectivity with HP being the most popular storage vendor, though EMC close behind, and netapp, IBM, Sun all on similar ammounts. A forest of hands for people using ASM, this is by far and away the most popular choice. A handful using Veritas, but again majority just using clusterware with no 3rd party clusterware.

As for App servers, Weblogic quite popular along with Oracle App Server. Only a few on tomcat, vast, vast majority using home written apps on their RAC clusters. A good amount using physical standby, one guy running active dataguard. Almost everyone using oem. Only 1 person has used Real Application Testing with their RAC install.

Oracle Support Update – Phil Davies

Thing I took away was the increasing importance of patch bundles, several of which were mentioned, in particular ones for Physical/Logical Standby. Phil had a throwaway point that he thought 10.2.0.5 (terminal 10.2 release) was likely next spring. This seems incredible to me as 10.2.0.4 was released (for x86-64) in March 2008 so that would be a 2 year gap. Seems quite astonishing, it’s not like there are not already enough one of patches to make up a patchset I think there are already 500+ one of patches on top of 10.2.0.4.

Complete Upgrade to 11g – Karen Ambrose

This presentation detailed how Karen got on upgrading a 10.2.0.2 cluster up to 11.1.0.6, including Clusterware, ASM, and the RDBMS. She chose to use DBUA and it all seemed to work fairly well, apart from the ASM instance which she said she had to upgrade manually. They also then went upto 11.1.0.7 with an ASM rolling migration. Strange point that they are using multiple listeners, one for ASM and one for the RDBMS. Joel Goodman pointed out this was unnecessary, and I have to say there did not seem to be a compelling reason for running the 2 listeners. The wanted to go to 11g to utilise the secure files feature. They have separate homes for the ASM and RDBMS instances, and have multiple RDBMS versions.

Whats the Point of Oracle Checkpoints? – Harald Van Breederode

For me, this was the stand out presentation of the day. What in the face of it is a fairly mundane topic was explained in such a clear way, with outstanding demonstrations.

SGA Buffer management via double linked lists. List for buffers in use, and list for buffers that available to be used for i/o – this means full SGA does not have to be scanned to find free buffers. Another double linked list is the ckpt-queue.

A checkpoint is a synchronisation event There are lots of different types of checkpoints:

  • full checkpoint
  • This writes all dirty buffers from all instances.

  • thread checkpoint
  • This does all dirty buffers from one instance in RAC cluster.

  • file checkpoint
  • This writes dirty buffers belonging to one tablespace. Caused by taking a tablespace offline, setting it read only or a begin backup statement.

  • parallel query direct reads
  • Parallel query does direct reads and this causes dirty buffers to be written out in a parallel query checkpoint. Can be odd that a query can cause lots of write activity.

  • object checkpiont
  • An object checkpoint can come about due to drop table and truncate table commands. Caused by the need for Point In Time Recovery.

  • incremental checkpoint
  • Incremental checkpointing writes some of the contents out from the CKPT-queue continually, the idea is to dribble out the writing of the dirty buffers rather than having a checkpoint occur in one big splurge. Ensures the fast_start_mttr_target parameter can be maintained.

    Sizing your redo logfiles is critical for keeping checkpoints under control. Harald was emphasizing that you can’t make your redo logs too big but you can make them too small. Point was made about using archive_lag_target to control how frequently you have a log switch when you have very large log files.

    V$instance_recovery can be very useful in finding out what is driving your checkpointing and what is an optimal size for your redo logs.

    Harald stated that in some future version of Oracle LGWR will be the only mechanism for sending redo to a standby no more ARCH – though if you have a network disconnect and have to ship archived redo, not sure how it will do that with LGWR.

    Achieving High Availability using Open Source Technology – Andrew Hughes

    This talk was how they were using OCFS2 to provide shared storage and connect in multiple nodes, but then use cold failover to provide some form of high availability of instances. It kinda did what they set out to achieve, but as Joel Goodman pointed out, so much more could have been achieved using Oracle Clusterware to automatically restart failed instances and would provide VIPS to avoid the net timeout issue.

    Oracle 11g RAC On Windows – Dave Bennet

    I think Dave had a bit of an uphill struggle here. Not a single person in the room had a RAC database running on windows, and the main thrust of the presentation seemed to be issues encountered in running on 32-bit Windows, and SE edition of windows at that.

    Database Links Masterclass – Joel Goodman

    Attending a Joel Goodman presentation really is like trying to drink from a fire hydrant. Joel’s knowledge is truly immense, both in breadth and depth. This was no exception, though it was a shame he ran out of time with the demo, and I think instead of saving the demo all up for the end, that doing a similar thing to Harald and presenting some theory, then a demo of that theory, rather than attempting to digest all the theory may have been a bit easier on the audience!

    The next RAC & HA SIG will be on 10th September in Blythe Valley Park in the West Midlands. Hope to see you there.

    June 8, 2009

    Fixing up ASM Disk Header Corruption

    Filed under: ASM — jarneil @ 4:02 pm

    I’m sure this sort of stuff would never happen to you, you would be far too smart for that.

    I was involved in a migration project, that was moving data from one set of drives to another, for space reasons just the drives were being replaced the actual chassis was remaining in place. So this involved replacing a handful of drives at a time, migrating data then rinse and repeat until the capacity of the array was increased. So this project did not require Albert Einstein levels of genius just a little bit of forethought and planning.

    Documentation, Documentation, Documentation

    abstraction

    You can see that there are several degrees of abstraction in going from the physical disks all the way up to what Diskgroup ASM is presenting to the actual RDBMS. So in this example the RDBMS is using a Diskgroup called DATA to store the various datafiles that make up the database.

    This system was also using ASMLib as well as EMC Powerpath for device multipathing.

    Now, none of this should have been a problem, in a well documented system the linkage between which physical devices were actually being used in which diskgroups should have been clear. Unfortunately, this system has grown somewhat organically over time accumulating more and more devices.

    I went around checking which devices were in use, just in case there were any devices free: the more free physical devices the better to migrate onto the new larger drives.

    In particular I was checking which particular devices were marked as being used with ASM. In our use of ASMLib the naming convention was each device was stamped with a volume name of the form VOL#. So in theory each device should have been marked liked that, any device not in use by ASM should have been able to be reclaimed.

    Corruption Leading to Confusion

    In performing this check I was the /etc/init.d/oracleasm querydisk command and feeding in a device path:

    
    [jason@bdc]$ sudo /etc/init.d/oracleasm querydisk /dev/emcpowera1
    Disk "/dev/emcpowera1" is marked an ASM disk with the label "VOL1"
    

    So that is all well and good, and then I ran into the following:

    
    [jason@bdc]$ sudo /etc/init.d/oracleasm querydisk /dev/emcpowerm1
    Disk "/dev/emcpowerm1" is marked an ASM disk with the label ""
    

    Huh? Now that did seem odd. I was sure all devices in use had the label VOL#, So I did what a DBA in a hurry to migrate drives might do, and thought this device could not be in use. So I tried to delete it:

    
    [jason@bdc]$ sudo /etc/init.d/oracleasm deletedisk /dev/emcpowerm1
    Removing ASM disk "/dev/emcpowerm1":                       [FAILED]
    

    When In a Hole – Stop Digging

    At this point I should have stopped and really had a think. In fact I should have checked the disk header to see exactly what was going on with device. I did not not. I incorrectly assumed this was a device that had been in use and was in use no longer. I removed it at the storage level.

    After this ASM started up fine and the database even got to the mount stage. Do you think the diskgroup would come online that the datafiles were on? Nope. It was a goner.

    I’d just removed a Volume that the diskgroup containing the RDBMS datafiles were depending on. Not only had I removed it from the server, I’d even gone as far to unbind the LUN at the storage array level. Just to make sure it really was a goner.

    It was looking like a career limiting move. Thankfully, 7 hours later on the telephone to EMC support, the LUN was able to be resurrected. But that was not the end of the story. ASM still could not understand what to do with this device stamped with “”. I now checked the header of the device:

    od

    So this device was actually called VOL7 and part of DATA4 diskgroup, which contained the datafiles for the RDBMS. However now compare this to a device that is labelled correctly:

    od_working

    Seems like a part of the disk header has become corrupted. The following line:

    
    0000040 O R C L D I S K
    

    Should in fact contain the following:

    
    0000040 O R C L D I S K V O L 7
    

    Somehow the VOL7 part of this line has been removed.

    KFED to the Rescue!

    So the database was down, a volume was missing from the diskgroup because the diskheader was corrupted. Not a good place to be, but I was sure the data was still intact, I was sure it was just a matter of fixing up the header and all would be well. I had heard of kfed before this, and I was wondering if this would be the key. I ran it against my corrupt device:

    kfed

    I could see that the line that had the problem was the following:

    
    kfdhdb.driver.provstr: ORCLDISK ; 0x000: length=8
    
    
    
    While running a metalink search for kfed, I came across Note: 787082.1 which, while about a completely separate bug, shows you how to edit the provstr of the disk header:
    
    
    
    [jason@bdc]$ sudo /etc/init.d/oracleasm force-renamedisk /dev/emcpowero1 VOL7
    

    And that was it! The ASM diskgroup could now find all the volumes it needed to bring the diskgroup back online and the database came back fine. I'm pretty sure any reboot of this server would have led to this device being unrecognised by ASM so was really just an accident waiting to happen, but still maintaining good documentation can never be underestimated.

    May 18, 2009

    Improving an Index Range Scan

    Filed under: tuning — jarneil @ 2:34 pm
    Tags:

    I recently had the opportunity to perform a little bit of tuning work, and I thought I would share this here. I think tuning is one of the more satisfying things about being a dba, where you can actually see that you have a made a concrete change for the better.

    The query in question was quite simple of the form:


    SQL> select count(*) from ns where id = :B1 and deleted is null

    The table is of a reasonable size with around 40 Million rows. This is an OLTP environment. Yes there was an index on the id column and indeed the optimizer was choosing this:

    
    
    ----------------------------------------------------------------------------------------------------
    | Id  | Operation		     | Name		   | Rows  | Bytes | Cost (%CPU)| Time	   |
    ----------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT	     |			   |	 1 |	11 |	 2   (0)| 00:00:01 |
    |   1 |  SORT AGGREGATE 	     |			   |	 1 |	11 |		|	   |
    |*  2 |   TABLE ACCESS BY INDEX ROWID| NS                  |	 1 |	11 |	 2   (0)| 00:00:01 |
    |*  3 |    INDEX RANGE SCAN	     | IX_ID_NS            |	 1 |	   |     2   (0)| 00:00:01 |
    ----------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - filter("DELETED" IS NULL)
       3 - access("ID"=TO_NUMBER(:B1))
    
    

    It’s all Skew-Whiff

    So far not all that interesting. However this data is extremely skewed. The id column is not a primary key, and the distribution of id’s looks like this:

    
    
    SQL> select min(id), max(id), median(id), avg(id)
              from ns;
    
       MIN(ID)    MAX(ID) MEDIAN(ID)    AVG(ID)
    ---------- ---------- ---------- ----------
    	 1	10774	       1 4.64528267
    
    

    That does not really do justice to how this data is skewed. So I’ve counted up the number of times each particular id value appears in the table, then done an aggregate of how often each count appears. A little snippet of the results set should explain the data:

    
    
    1   26816448
    2    2064686
    3     219876
    4      34708
    .
    .
    .
    135931     1
    136166     7
    136167     1
    140111     3 
    
    

    So what are these numbers saying? Well the most popular number of times an id value appears is 1, and there are 26M id values that appear just the once. Next is the count of id values that appear twice in the table and so on.

    You can seen from this that there are 3 id values that appear 140,111 times in the table.

    Plotting this with on a logarithmic scale looks like this:

    Frequency Count

    So how is this affecting the query?

    Well as stated the simple range scan is just fine for the 26Million id values that just appear this once, but whenever an id is used that appears far more frequently the index range scan is not a great idea:

    
    
    SQL> select count(*) from ns where id=16318609 and deleted is null;
    
      COUNT(*)
    ----------
        108559
    
    Elapsed: 00:02:30.35
    
    

    2 and a 1/2 minutes for an OLTP system is a bit of a problem. Looking at the extended plan information we can see the issue:

    
    -----------------------------------------------------------------------------------------------------------------------
    | Id  | Operation		     | Name		   | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
    -----------------------------------------------------------------------------------------------------------------------
    |   1 |  SORT AGGREGATE 	     |			   |	  1 |	   1 |	    1 |00:01:36.12 |   64371 |	56298 |
    |*  2 |   TABLE ACCESS BY INDEX ROWID| NS           	   |	  1 |	   1 |	  102K|00:01:35.80 |   64371 |	56298 |
    |*  3 |    INDEX RANGE SCAN	     | IX_ID_NS            |	  1 |	   1 |	  132K|00:00:05.83 |	 458 |	  444 |
    -----------------------------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - filter("DELETED" IS NULL)
       3 - access("ID"=16318609)
    
    

    The above is an extended trace and it shows how many rows the optimizer thinks it will get from each operation and the actual number of rows it really found when the database had to do the work for real. You can see the statistics are are out by 6 orders of magnitude for this particular chosen id value.

    However even if the optimizer exactly knew the correct number of rows to return, the only other plan available is a full table scan, which is not all that great on a 40M row table.

    Improving the response time

    Hopefully I took some of Jonathan Lewis’ advice and considered how best to get the data and with this simple query the best implementation really is to avoid going back to the table at all, so creating a fatter index on both the id and deleted column seemed to be the way to do it:

    
    
    SQL> create index id_removed on ns(id, deleted) parallel 3 nologging;
    
    SQL> select count(*) from ns where id=16318609 and deleted is null;
    
      COUNT(*)
    ----------
        108559
    
    Elapsed: 00:00:00.54
    
    

    That is a pretty pleasing two orders of magnitude improvement in the response time of this query.

    Of course best response time of all would have been not to have to run the query at all, but that’s another story ;-)

    April 30, 2009

    UKOUG 2009 Annual Conference Call for Papers

    Filed under: UKOUG — jarneil @ 7:34 pm

    The call for papers for the 2009 UKOUG Annual Conference Server Technology & E-Business has gone out.

    The conference will take place Between 30th November – 2nd December, and as usual takes place at the ICC Birmingham.

    The deadline for submissions is 5th June, so get your papers in while you can. Question is, will I be up against Tom Kyte for the 3rd year running?

    If you are going, I may well see you there!

    April 28, 2009

    Mistake in April2009 CPU Release Instructions

    Filed under: RAC, patchset — jarneil @ 11:30 am

    I thought as soon as I saw this set of instructions that something was a bit fishy, a bit unusual:

    table2

    The instructions are pretty clear and unambiguous: Apply the April2009 CPU to the CRS Home. I’d never seen this recommended on a clusterware patch before. This is actually really the only place in the release instructions for the CPU where it is mentioned.

    I assumed it was just a case a changing the ORACLE_HOME to point to the CRS_HOME and running OPatch as normal:


    oracle@linuxrac1:DBA1 /var/opt/oracle/product/admin/TAR/8290506> /var/opt/oracle/product/10.2.0/OPatch/opatch napply -skip_subset -skip_duplicate
    Invoking OPatch 10.2.0.4.6

    Oracle Interim Patch Installer version 10.2.0.4.6
    Copyright (c) 2009, Oracle Corporation. All rights reserved.

    UTIL session

    Oracle Home : /var/opt/oracle/product/crs
    Central Inventory : /home/oracle/oraInventory
    from : /etc/oraInst.loc
    OPatch version : 10.2.0.4.6
    OUI version : 10.2.0.4.0
    OUI location : /var/opt/oracle/product/crs/oui
    Log file location : /var/opt/oracle/product/crs/cfgtoollogs/opatch/opatch2009-04-28_09-38-05AM.log

    Patch history file: /var/opt/oracle/product/crs/cfgtoollogs/opatch/opatch_history.txt

    OPatchSession cannot load inventory for the given Oracle Home /var/opt/oracle/product/crs. Possible causes are:
    No read or write permission to ORACLE_HOME/.patch_storage
    Central Inventory is locked by another OUI instance
    No read permission to Central Inventory
    The lock file exists in ORACLE_HOME/.patch_storage
    The Oracle Home does not exist in Central Inventory

    UtilSession failed: Locker::lock() mkdir /var/opt/oracle/product/crs/.patch_storage

    OPatch failed with error code 73

    I then looked at the permissions on the CRS_HOME directory:


    [jason@linuxrac1 ~]$ ls -ltar /var/opt/oracle/product/
    drwxr-xr-x 45 root oinstall 4096 Jun 25 2008 crs

    I even attempted to apply this patch to the CRS_HOME as root:


    root@linuxrac1 8290506]# /opt/oracle/product/10.2.0/OPatch/opatch napply -skip_subset -skip_duplicate
    Invoking OPatch 10.2.0.4.6
    The user is root. OPatch cannot continue if the user is root.

    OPatch failed with error code 255
    [root@linuxrac1 8290506]#

    Lots of head scratching ensued until Oracle support pointed out Support Note 786803.1, where it states:

    If the home has CRS installed, then CPUApr2009 does not need to be applied.

    Gee, thanks Oracle, might be an idea to remove that line from the actual instructions that get downloaded with the CPU!

    As a bootnote, the CPU applied without a hitch into my 10.2.0.4 RAC instance.

    « Previous PageNext Page »

    Blog at WordPress.com.