I’m sure this sort of thing would never happen to you; you would be far too smart for that.
I was involved in a migration project moving data from one set of drives to another. Space was the driver: only the drives were being replaced, while the actual chassis remained in place. This meant swapping a handful of drives at a time, migrating the data, then rinsing and repeating until the capacity of the array had been increased. So this project did not require Albert Einstein levels of genius, just a little bit of forethought and planning.
Documentation, Documentation, Documentation
You can see that there are several layers of abstraction in going from the physical disks all the way up to what the ASM diskgroup presents to the actual RDBMS. So in this example the RDBMS is using a diskgroup called DATA to store the various datafiles that make up the database.
This system was also using ASMLib, along with EMC PowerPath for device multipathing.
Now, none of this should have been a problem: in a well-documented system, the linkage between which physical devices were being used by which diskgroups would have been clear. Unfortunately, this system had grown somewhat organically over time, accumulating more and more devices.
I went around checking which devices were in use, just in case any were free: the more free physical devices, the easier the migration onto the new larger drives.
In particular I was checking which devices were marked as being in use by ASM. Our ASMLib naming convention stamped each device with a volume name of the form VOL#. So in theory every device in use should have been labelled like that, and any device without such a label should have been reclaimable.
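That sweep can be sketched as a small loop. The /dev/emcpower[a-z]1 glob and the init-script path mirror this particular system’s setup and will differ elsewhere:

```shell
#!/bin/sh
# Sketch: ask ASMLib about every PowerPath partition in turn.
# The device glob and the oracleasm init-script path match this system;
# adjust both for your own environment.
for dev in /dev/emcpower[a-z]1; do
    [ -e "$dev" ] || continue   # glob matched nothing, or device is gone
    sudo /etc/init.d/oracleasm querydisk "$dev"
done
```

Any device that comes back without a VOL# label deserves a much closer look before you assume it is free, as the rest of this story shows.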
Corruption Leading to Confusion
In performing this check I was using the /etc/init.d/oracleasm querydisk command and feeding in a device path:
[jason@bdc]$ sudo /etc/init.d/oracleasm querydisk /dev/emcpowera1
Disk "/dev/emcpowera1" is marked an ASM disk with the label "VOL1"
So that is all well and good, and then I ran into the following:
[jason@bdc]$ sudo /etc/init.d/oracleasm querydisk /dev/emcpowerm1
Disk "/dev/emcpowerm1" is marked an ASM disk with the label ""
Huh? Now that did seem odd. I was sure all devices in use had a VOL# label, so I did what a DBA in a hurry to migrate drives might do and assumed this device could not be in use. So I tried to delete it:
[jason@bdc]$ sudo /etc/init.d/oracleasm deletedisk /dev/emcpowerm1
Removing ASM disk "/dev/emcpowerm1": [FAILED]
When In a Hole – Stop Digging
At this point I should have stopped and really had a think. In fact I should have checked the disk header to see exactly what was going on with the device. I did not. I incorrectly assumed this was a device that had once been in use and no longer was, and I removed it at the storage level.
After this, ASM started up fine and the database even got to the mount stage. Do you think the diskgroup that the datafiles were on would come online? Nope. It was a goner.
I’d just removed a volume that the diskgroup containing the RDBMS datafiles was depending on. Not only had I removed it from the server, I’d even gone as far as to unbind the LUN at the storage array level. Just to make sure it really was a goner.
It was looking like a career-limiting move. Thankfully, after 7 hours on the telephone to EMC support, the LUN was resurrected. But that was not the end of the story: ASM still could not understand what to do with this device stamped with “”. I now checked the header of the device:
So this device was actually called VOL7 and was part of the DATA4 diskgroup, which contained the datafiles for the RDBMS. Now compare this to a device that is labelled correctly:
It seems part of the disk header had become corrupted. The following line:
0000040 O R C L D I S K
Should in fact contain the following:
0000040 O R C L D I S K V O L 7
Somehow the VOL7 part of this line has been removed.
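For reference, the 0000040 offset in od’s output is octal, i.e. byte 32, which is where ASMLib writes its ORCLDISK marker followed by the volume label. You can demonstrate the layout on a scratch file rather than a real device (the file name is made up; only the offset and marker text come from the dumps above):

```shell
# Sketch: fake an ASMLib-style header in a scratch file and dump it with od.
# Byte 32 (octal 0000040) carries "ORCLDISK" followed by the volume label.
dd if=/dev/zero of=header.img bs=32 count=1 2>/dev/null  # 32 bytes of padding
printf 'ORCLDISKVOL7' >> header.img                      # marker + label
od -a header.img | grep '^0000040'
```

The grep picks out the same line shown above, with the label intact; on the corrupt device the bytes after ORCLDISK were simply gone.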
KFED to the Rescue!
So the database was down and a volume was missing from the diskgroup because the disk header was corrupted. Not a good place to be, but I was sure the data was still intact; it was just a matter of fixing up the header and all would be well. I had heard of kfed before this, and I wondered if this would be the key. I ran it against my corrupt device:
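The invocation is along these lines (a sketch: kfed ships with the ASM/Grid installation, typically under $ORACLE_HOME/bin, and the device path here is this system’s):

```shell
# Sketch: read the ASM disk header with kfed and pick out the provider string.
# Skip quietly on machines where kfed is not on the PATH.
command -v kfed >/dev/null 2>&1 || exit 0
kfed read /dev/emcpowerm1 | grep provstr
```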
I could see that the line that had the problem was the following:
kfdhdb.driver.provstr:       ORCLDISK ; 0x000: length=8

While running a Metalink search for kfed, I came across Note 787082.1 which, while about a completely separate bug, shows you how to edit the provstr of the disk header:
[jason@bdc]$ sudo /etc/init.d/oracleasm force-renamedisk /dev/emcpowero1 VOL7
And that was it! ASM could now find all the volumes it needed, the diskgroup came back online, and the database came back fine. I’m pretty sure any reboot of this server would have left this device unrecognised by ASM anyway, so it was really just an accident waiting to happen. But still: the value of maintaining good documentation should never be underestimated.
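On that note, even a crude snapshot of the label-to-device mapping would have saved the day here. A sketch using the same init script (listdisks and querydisk are standard ASMLib commands; the output file name is made up):

```shell
#!/bin/sh
# Sketch: record which ASMLib label maps to which device, for next time.
# Skips quietly on machines without ASMLib installed.
[ -x /etc/init.d/oracleasm ] || exit 0
for vol in $(sudo /etc/init.d/oracleasm listdisks); do
    sudo /etc/init.d/oracleasm querydisk "$vol"
done > asm-disk-map.txt
```

Run it after every storage change and keep the output with the rest of the system documentation, and the "is this device really free?" question answers itself.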