Exadata Storage Cells also use hardware RAID? – Yep it’s true

We have already seen that the compute nodes in an Exadata system use hardware RAID to offer increased availability and serviceability for their internal disk drives. What about the Storage Cells themselves?

At this point you are quite possibly thinking I’ve gone a bit nuts. Everyone knows Exadata uses ASM to offer highly resilient storage with all the benefits that ASM brings to the table, and everyone knows you don’t need hardware RAID to have these benefits.

So surely an Exadata Storage Cell does not use hardware RAID, right?

Storage Cell Hardware

So how can you tell you are working on a Storage Cell, as opposed to a compute node? Well, let's check what dmidecode reports:

[root@cel01 ~]# dmidecode -s system-product-name 

SUN FIRE X4275 SERVER

This is actually a V2 box, while the X2-2 box is different in a couple of ways:

[root@cel01 ~]# dmidecode -s system-product-name 

SUN FIRE X4270 M2 SERVER       

The X4270 M2 can actually take 24 2.5″ drives or 12 3.5″ drives. Currently only the 12 disk option is available.

The schematic for this server is shown above; basically it is a 2U box that can take up to 12 drives. In Exadata these storage cells run Linux:


[root@cel01 ~]# uname -r 
2.6.18-194.3.1.0.3.el5
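
Another quick way to confirm you are on a storage cell rather than a compute node is to ask the cell server software itself. Assuming the cell software is installed and cellcli is on the PATH (it should be on any working cell), this will report the cell details:

[root@cel01 ~]# cellcli -e list cell detail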

However, they have our old friend the LSI MegaRAID controller installed:



[root@cel01 ~]# lsscsi -v

[0:2:0:0]    disk    LSI      MR9261-8i        2.12  /dev/sda
  dir: /sys/bus/scsi/devices/0:2:0:0  [/sys/devices/pci0000:00/0000:00:05.0/0000:13:00.0/host0/target0:2:0/0:2:0:0]
[0:2:1:0]    disk    LSI      MR9261-8i        2.12  /dev/sdb
  dir: /sys/bus/scsi/devices/0:2:1:0  [/sys/devices/pci0000:00/0000:00:05.0/0000:13:00.0/host0/target0:2:1/0:2:1:0]
[0:2:2:0]    disk    LSI      MR9261-8i        2.12  /dev/sdc
  dir: /sys/bus/scsi/devices/0:2:2:0  [/sys/devices/pci0000:00/0000:00:05.0/0000:13:00.0/host0/target0:2:2/0:2:2:0]
.
.

I’ve abbreviated the output to just the first 3 drives; the full output shows all 12, plus the flash cards. OK, so it’s pretty clear the LSI MegaRAID MR9261-8i card is there, just like in the compute nodes.
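
As an aside, if you just want a quick count of how many disk devices the controller is presenting rather than the full listing, a simple grep over the lsscsi output should do it (the MR9261-8i string is taken straight from the output above):

[root@cel01 ~]# lsscsi | grep -c MR9261-8i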

MegaRAID Configuration

Let's take a look at what our old friend is doing in the storage cell:


[root@cel01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -ShowSummary -aALL
                                    
System
        OS Name (IP Address)       : Not Recognized
        OS Version                 : Not Recognized
        Driver Version             : Not Recognized
        CLI Version                : 8.00.23

Hardware
        Controller
                 ProductName       : LSI MegaRAID SAS 9261-8i(Bus 0, Dev 0)
                 SAS Address       : 500605b00250ef70
                 FW Package Version: 12.12.0-0048
                 Status            : Optimal
        BBU
                 BBU Type          : Unknown
                 Status            : Healthy
        Enclosure
                 Product Id        : HYDE12         
                 Type              : SES
                 Status            : OK

                 Product Id        : SGPIO          
                 Type              : SGPIO
                 Status            : OK

        PD
                Connector          : Port 0 - 3<Internal><Encl Pos 0 >: Slot 11
                Vendor Id          : SEAGATE
                Product Id         : ST360057SSUN600G
                State              : Online
                Disk Type          : SAS,Hard Disk Device
                Capacity           : 557.861 GB
                Power State        : Active

                Connector          : Port 0 - 3<Internal><Encl Pos 0 >: Slot 10
                Vendor Id          : SEAGATE
                Product Id         : ST360057SSUN600G
                State              : Online
                Disk Type          : SAS,Hard Disk Device
                Capacity           : 557.861 GB
                Power State        : Active

                Connector          : Port 0 - 3<Internal><Encl Pos 0 >: Slot 9
                Vendor Id          : SEAGATE
                Product Id         : ST360057SSUN600G
                State              : Online
                Disk Type          : SAS,Hard Disk Device
                Capacity           : 557.861 GB
                Power State        : Active
.
.
.

Storage

       Virtual Drives
                Virtual drive      : Target Id 0 ,VD name
                Size               : 557.861 GB
                State              : Optimal
                RAID Level         : 0

                Virtual drive      : Target Id 1 ,VD name
                Size               : 557.861 GB
                State              : Optimal
                RAID Level         : 0

                Virtual drive      : Target Id 2 ,VD name
                Size               : 557.861 GB
                State              : Optimal
                RAID Level         : 0
.
.
.

Again, output chopped after 3 drives for brevity. Basically we have 12 physical drives mapped to 12 virtual drives, all at RAID level 0, but each RAID 0 "stripe" covers only a single drive.
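
If you want to verify that one-to-one mapping for yourself, MegaCli can list each virtual drive together with the physical drive(s) sitting behind it; a quick check, using the same MegaCli64 path as above:

[root@cel01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aALL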

You can even see that the LSI RAID controller has the same 512MB battery-backed cache:



[root@cel01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -cfgDsply -aALL                                    
==============================================================================
Adapter: 0
Product Name: LSI MegaRAID SAS 9261-8i
Memory: 512MB
BBU: Present
Serial No: SV03902812
==============================================================================
Number of DISK GROUPS: 12


DISK GROUP: 0
Number of Spans: 1
SPAN: 0
Span Reference: 0x00
Number of PDs: 1
Number of VDs: 1
Number of dedicated Hotspares: 0
Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 557.861 GB
State               : Optimal
Stripe Size         : 1.0 MB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disabled
Encryption Type     : None
Physical Disk Information:
Physical Disk: 0
Enclosure Device ID: 20
Slot Number: 0
Device Id: 19
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 557.861 GB [0x45bb9000 Sectors]
Firmware state: Online, Spun Up
SAS Address(0): 0x5000c50028c59721
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST360057SSUN600G08051047E1P6N9         
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive:  Not Certified
.
.

Output chopped after 1 drive, as it does not get any more interesting. You can see that, again, the virtual drives are running in WriteBack mode, which means writes are acknowledged as soon as the data reaches the controller cache rather than when it is physically on disk. As before, you have to make sure the battery is healthy to give yourself some protection against power failure.
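
Given the WriteBack policy, the battery is the component to keep an eye on. MegaCli can report the BBU state directly; a quick check, again assuming the same MegaCli64 location:

[root@cel01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL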

Of course, a single-drive RAID 0 gives you no protection in the event of a hard disk failure, but you can still say it's true that an Exadata Storage Cell is using hardware RAID.

Joel Goodman has written an excellent account of how two of the 12 drives, the system disks, are used to create the various O/S devices.

We can see the differences between a system drive and a non-system drive with the following:

[root@cel01 /]# fdisk -l

Disk /dev/sda: 598.9 GB, 598999040000 bytes 
255 heads, 63 sectors/track, 72824 cylinders 
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System 
/dev/sda1   *           1          15      120456   fd  Linux raid autodetect 
/dev/sda2              16          16        8032+  83  Linux 
/dev/sda3              17       69039   554427247+  83  Linux 
/dev/sda4           69040       72824    30403012+   f  W95 Ext'd (LBA) 
/dev/sda5           69040       70344    10482381   fd  Linux raid autodetect 
/dev/sda6           70345       71649    10482381   fd  Linux raid autodetect 
/dev/sda7           71650       71910     2096451   fd  Linux raid autodetect 
/dev/sda8           71911       72171     2096451   fd  Linux raid autodetect 
/dev/sda9           72172       72432     2096451   fd  Linux raid autodetect 
/dev/sda10          72433       72521      714861   fd  Linux raid autodetect 
/dev/sda11          72522       72824     2433816   fd  Linux raid autodetect

So that is one of the two system drives, while a non-system drive looks like this:


[root@cel01 /]# fdisk -l /dev/sdc

Disk /dev/sdc: 598.9 GB, 598999040000 bytes 
255 heads, 63 sectors/track, 72824 cylinders 
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdc doesn't contain a valid partition table
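
If you want to see at a glance which of the drives carry a partition table, and therefore which two are the system drives, a quick shell loop does the job. This is just a sketch and assumes the cell disks appear as /dev/sda through /dev/sdl, consistent with the lsscsi output earlier:

[root@cel01 /]# for d in /dev/sd[a-l]; do echo "== $d"; fdisk -l $d 2>&1 | grep -E '^/dev/|valid partition'; done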

So from all these partitions on the system drives, mdadm is used to create software RAID devices by mirroring the matching partition from each of the two system drives:

[root@cel01 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md6              9.9G  4.4G  5.0G  47% /
tmpfs                  12G     0   12G   0% /dev/shm
/dev/md8              2.0G  618M  1.3G  33% /opt/oracle
/dev/md4              116M   52M   59M  47% /boot
/dev/md11             2.3G   88M  2.1G   4% /var/log/oracle

And we can see that these /dev/md devices are made up from the /dev/sd[a-b] devices:

[root@cel01 ~]# mdadm -Q -D /dev/md6
/dev/md6:
        Version : 0.90
  Creation Time : Fri Dec 31 14:08:30 2010
     Raid Level : raid1
     Array Size : 10482304 (10.00 GiB 10.73 GB)
  Used Dev Size : 10482304 (10.00 GiB 10.73 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 6
    Persistence : Superblock is persistent

    Update Time : Fri Nov 11 16:42:07 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 87891e9e:e9bb6307:1e49e958:271166fe
         Events : 0.4

    Number   Major   Minor   RaidDevice State
       0       8        6        0      active sync   /dev/sda6
       1       8       22        1      active sync   /dev/sdb6
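
And if you would rather see all of the md arrays and their member partitions in one go, rather than querying them one at a time, the standard /proc/mdstat view (plain Linux md, nothing Exadata specific) gives the summary:

[root@cel01 ~]# cat /proc/mdstat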

So while the Exadata storage server does indeed have a hardware RAID capability, the O/S on the storage cell gets its higher availability from mdadm software RAID. This allows the unused space on the system drives to still be used in the ASM diskgroups.


7 thoughts on “Exadata Storage Cells also use hardware RAID? – Yep it’s true”

  1. Weird – layer upon layer for no apparent good reason. I assume Oracle have taken this decision so that all servers are configured in roughly the same way. This has the advantage of a consistent set of admin tools (but then again presumably the customer is not supposed to need them), but makes me wonder whether these systems are really engineered to extract the last drop of performance out of the hardware. Me – I’d just prefer to keep things simple instead.

  2. Hi Simon,

    Yeah, on the RAID card as you say there is some sense of standardisation.

    What I really scratch my head at is two different types of software RAID being deployed – that most definitely is not helping the customer with a consistent set of tools!

    jason.

  3. Ah – I hadn’t quite grasped that. I suppose we are making the assumption here that one team with a clear vision is designing Exadata machines… perhaps that’s not actually the case – perhaps the compute node people are completely separate from those designing the storage cells (and in different continents etc)!

  4. I believe (though I’m not positive) that the devices are configured as single-drive RAID-0 virtual devices so that the OS will see them as disks. Also, what other type of software RAID is being used on Exadata? The only type of software RAID being used on Exadata is the RAID-1 utilized on the storage servers.

  5. Jarneil, as you know all RAID levels require “multiple disks”. If you map 1 physical disk to 1 virtual disk (diskgroup), then there’s no RAID. Even software RAID-0 requires two partitions (to be able to stripe data to different locations). According to the documentation for the LSI MegaRAID SAS controller, you must use “virtual disks” to access physical disks. It explains why you see these 12 virtual disks. It’s also possible that these virtual disks are created implicitly (by the controller card) to provide access to physical disks.

    In my opinion, Exadata storage cells have “RAID-capable” SAS controllers but they do not use hardware RAID technology.

    By the way, using the same SAS controller card for both storage cells and compute nodes is a strategic decision and sounds very reasonable to me.

    • Hi Gokhan,

      Thanks for reading and responding.

      I agree very much with what you are saying. Indeed I can see the sense in having same hardware RAID controller card in both compute and storage cells. I just found it amusing how much Oracle play up ASM as the best way of protecting the data, when actually the machine they are doing it on is capable of hardware RAID.

      I accept the article title is a little tongue in cheek.

      jason.
