Exadata Flash Storage

Exadata flash storage is provided by the Sun Flash Accelerator F20 PCIe card shown above. Four of these cards are installed in every Exadata storage cell. A documentation set for the card is available to peruse.

First, we can see these devices using lsscsi:

[root@cel01 ~]# lsscsi |grep  MARVELL 
[8:0:0:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdn 
[8:0:1:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdo 
[8:0:2:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdp 
[8:0:3:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdq 
[9:0:0:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdr 
[9:0:1:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sds 
[9:0:2:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdt 
[9:0:3:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdu 
[10:0:0:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdv 
[10:0:1:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdw 
[10:0:2:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdx 
[10:0:3:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdy 
[11:0:0:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdz 
[11:0:1:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdaa 
[11:0:2:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdab 
[11:0:3:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdac

You can see they are bunched into 4 groups of 4, under SCSI hosts 8, 9, 10 and 11. This is because each of the 4 cards has 4 flash modules (FMODs), so on every Exadata storage cell the flash is presented as 16 separate devices.
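The grouping can be checked mechanically rather than by eye. The following is a minimal sketch, not part of the original post: it parses the lsscsi lines shown above (pasted in as a string; on a real cell you would feed it the live command output) and groups the devices by SCSI host number. The function name `group_by_host` is made up for illustration.

```python
from collections import defaultdict

# The 16 lines from the lsscsi output above, pasted in as sample data.
LSSCSI_SAMPLE = """\
[8:0:0:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdn
[8:0:1:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdo
[8:0:2:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdp
[8:0:3:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdq
[9:0:0:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdr
[9:0:1:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sds
[9:0:2:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdt
[9:0:3:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdu
[10:0:0:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdv
[10:0:1:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdw
[10:0:2:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdx
[10:0:3:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdy
[11:0:0:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdz
[11:0:1:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdaa
[11:0:2:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdab
[11:0:3:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdac
"""


def group_by_host(lsscsi_text):
    """Map SCSI host number -> list of device names for the MARVELL disks."""
    groups = defaultdict(list)
    for line in lsscsi_text.splitlines():
        if "MARVELL" not in line:
            continue
        fields = line.split()
        # First field looks like [host:channel:target:lun]
        host = fields[0].strip("[]").split(":")[0]
        groups[int(host)].append(fields[-1])  # last field is the /dev name
    return dict(groups)


groups = group_by_host(LSSCSI_SAMPLE)
for host in sorted(groups):
    print(host, len(groups[host]), groups[host])
```

Each of the four hosts should report four devices, one per FMOD on that card.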

We can also use the flash_dom command:


[root@cel01 ~]# flash_dom -l

Aura Firmware Update Utility, Version 1.2.7

Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved..

U.S. Government Rights - Commercial Software. Government users are subject 
to the Sun Microsystems, Inc. standard license agreement and 
applicable provisions of the FAR and its supplements.

Use is subject to license terms.

This distribution may include materials developed by third parties.

Sun, Sun Microsystems, the Sun logo, Sun StorageTek and ZFS are trademarks 
or registered trademarks of Sun Microsystems, Inc. or its subsidiaries, 
in the U.S. and other countries.



 HBA# Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC     WWID                 Serial Number

 1.  /proc/mpt/ioc0    LSI Logic SAS1068E C0     105      011b5c00     0       5080020000fe34c0     465769T+1130A405XA

        Current active firmware version is 011b5c00 (1.27.92) 
        Firmware image's version is MPTFW-01.27.92.00-IT 
        x86 BIOS image's version is MPTBIOS-6.26.00.00 (2008.10.14) 
        FCode image's version is MPT SAS FCode Version 1.00.49 (2007.09.21)


          D#  B___T  Type       Vendor   Product          Rev    Operating System Device Name 
          1.  0   0  Disk       ATA      MARVELL SD88SA02 D20Y   /dev/sdn    [8:0:0:0] 
          2.  0   1  Disk       ATA      MARVELL SD88SA02 D20Y   /dev/sdo    [8:0:1:0] 
          3.  0   2  Disk       ATA      MARVELL SD88SA02 D20Y   /dev/sdp    [8:0:2:0] 
          4.  0   3  Disk       ATA      MARVELL SD88SA02 D20Y   /dev/sdq    [8:0:3:0]

 2.  /proc/mpt/ioc1    LSI Logic SAS1068E C0     105      011b5c00     0       5080020000fe3440     465769T+1130A405X7

        Current active firmware version is 011b5c00 (1.27.92) 
        Firmware image's version is MPTFW-01.27.92.00-IT 
        x86 BIOS image's version is MPTBIOS-6.26.00.00 (2008.10.14) 
        FCode image's version is MPT SAS FCode Version 1.00.49 (2007.09.21)


          D#  B___T  Type       Vendor   Product          Rev    Operating System Device Name 
          1.  0   0  Disk       ATA      MARVELL SD88SA02 D20Y   /dev/sdr    [9:0:0:0] 
          2.  0   1  Disk       ATA      MARVELL SD88SA02 D20Y   /dev/sds    [9:0:1:0] 
          3.  0   2  Disk       ATA      MARVELL SD88SA02 D20Y   /dev/sdt    [9:0:2:0] 
          4.  0   3  Disk       ATA      MARVELL SD88SA02 D20Y   /dev/sdu    [9:0:3:0]
.
.

The output above has been edited for brevity. You can also look at the corresponding entries, such as /proc/mpt/ioc1, on the filesystem.

We can also of course look at these devices via cellcli:


CellCLI> list physicaldisk where diskType='FlashDisk' 
         FLASH_1_0       1113M086V3      normal 
         FLASH_1_1       1113M086V4      normal 
         FLASH_1_2       1113M086V0      normal 
         FLASH_1_3       1113M086UY      normal 
         FLASH_2_0       1113M0892K      normal 
         FLASH_2_1       1113M086TR      normal 
         FLASH_2_2       1113M0891P      normal 
         FLASH_2_3       1113M0892L      normal 
         FLASH_4_0       1113M086UP      normal 
         FLASH_4_1       1113M086UQ      normal 
         FLASH_4_2       1113M086UT      normal 
         FLASH_4_3       1113M086UN      normal 
         FLASH_5_0       1113M08AGJ      normal 
         FLASH_5_1       1112M07V6U      normal 
         FLASH_5_2       1113M08AKJ      normal 
         FLASH_5_3       1113M08AH5      normal

Again they are presented as 4 lots of 4, with a diskType of FlashDisk. Looking at the detail of one of the flash disks:


CellCLI>  list physicaldisk where diskType='FlashDisk' detail

         name:                   FLASH_5_3 
         diskType:               FlashDisk 
         errCmdTimeoutCount:     0 
         errHardReadCount:       0 
         errHardWriteCount:      0 
         errMediaCount:          0 
         errOtherCount:          0 
         errSeekCount:           0 
         luns:                   5_3 
         makeModel:              "MARVELL SD88SA02" 
         physicalFirmware:       D20Y 
         physicalInsertTime:     2011-12-07T19:00:02+00:00 
         physicalInterface:      sas 
         physicalSerial:         1113M08AH5 
         physicalSize:           22.8880615234375G 
         sectorRemapCount:       0 
         slotNumber:             "PCI Slot: 5; FDOM: 3" 
         status:                 normal

I’ve edited the above to show just the detail for the FLASH_5_3 device, which is the last FDOM on the highest numbered PCI slot. Each FDOM is 22.8880615234375G in size, which multiplied by 16 gives 366.21G of flash per storage cell.
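As a quick check of that arithmetic, here is a small sketch using only the figures reported above (the constant and function names are my own, chosen for illustration):

```python
FDOM_SIZE_G = 22.8880615234375  # physicalSize of one FDOM, from CellCLI above
CARDS_PER_CELL = 4              # F20 cards per storage cell
FDOMS_PER_CARD = 4              # flash modules per card


def raw_flash_per_cell(fdom_size=FDOM_SIZE_G,
                       cards=CARDS_PER_CELL,
                       fdoms=FDOMS_PER_CARD):
    """Total raw flash per storage cell, in the same unit CellCLI reports."""
    return fdom_size * cards * fdoms


print(f"{raw_flash_per_cell():.2f}G")
```

This prints 366.21G, matching the figure quoted above.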

We can also look at the lun level:

CellCLI> list lun where id='5_3' detail 
         name:                   5_3 
         cellDisk:               FD_15_cel01 
         deviceName:             /dev/sdy 
         diskType:               FlashDisk 
         id:                     5_3 
         isSystemLun:            FALSE 
         lunAutoCreate:          FALSE 
         lunSize:                22.8880615234375G 
         overProvisioning:       100.0 
         physicalDrives:         FLASH_5_3 
         status:                 normal

You can see each LUN has a cell disk associated with it, and the cell disks follow a sensible naming convention. Finally, drilling down into the celldisk detail:

CellCLI> list celldisk where name='FD_15_cel01' detail 
         name:                   FD_15_cel01 
         comment: 
         creationTime:           2012-01-10T10:13:06+00:00 
         deviceName:             /dev/sdy 
         devicePartition:        /dev/sdy 
         diskType:               FlashDisk 
         errorCount:             0 
         freeSpace:              0 
         id:                     8ddbd2c8-8446-4735-8948-d8aea5744b35 
         interleaving:           none 
         lun:                    5_3 
         size:                   22.875G 
         status:                 normal
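
Note that the celldisk size (22.875G) is slightly smaller than the lunSize reported earlier (22.8880615234375G). The sketch below simply works out the difference from those two reported values. What the difference is used for is not stated in the output, so the code makes no claim about it, and the names are illustrative only.

```python
LUN_SIZE_G = 22.8880615234375  # lunSize from 'list lun ... detail'
CELLDISK_SIZE_G = 22.875       # size from 'list celldisk ... detail'
FLASH_DISKS_PER_CELL = 16      # 4 cards x 4 FDOMs


def celldisk_summary(lun_size=LUN_SIZE_G,
                     celldisk_size=CELLDISK_SIZE_G,
                     disks=FLASH_DISKS_PER_CELL):
    """Return (per-disk difference, total celldisk capacity) in G."""
    difference = lun_size - celldisk_size
    total = celldisk_size * disks
    return difference, total


difference, total = celldisk_summary()
print(f"per-disk difference: {difference}G")
print(f"total celldisk capacity: {total}G")
```

With the values above, the 16 flash celldisks add up to 366.0G per cell.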

The final point of interest on the flash cards is the white part, middle top on the card. That is the Energy Storage Module (ESM), and it has a set lifetime. According to the F20 docs, on a V2 its lifetime was expected to be 3 years. You can monitor the health and lifetime of your modules with the following ipmitool command:

[root@cel01 ~]# for RISER in RISER1/PCIE1 RISER1/PCIE4 RISER2/PCIE2 RISER2/PCIE5; do ipmitool sunoem cli "show /SYS/MB/$RISER/F20CARD/UPTIME"; done

Connected. Use ^D to exit. 
-> show /SYS/MB/RISER1/PCIE1/F20CARD/UPTIME

 /SYS/MB/RISER1/PCIE1/F20CARD/UPTIME 
    Targets:

    Properties: 
        type = Power Unit 
        ipmi_name = PCIE1/F20/UP 
        class = Threshold Sensor 
        value = 9844.000 Hours 
        upper_nonrecov_threshold = 26220.000 Hours 
        upper_critical_threshold = 25806.000 Hours 
        upper_noncritical_threshold = 25254.000 Hours 
        lower_noncritical_threshold = N/A 
        lower_critical_threshold = N/A 
        lower_nonrecov_threshold = N/A 
        alarm_status = cleared

    Commands: 
        cd 
        show

-> Session closed 
Disconnected

I’ve edited the output above to show just one card, to prevent boredom. You want to ensure that the value, here 9844.000 Hours, is less than the upper_noncritical_threshold, which in this case it is. If the value is greater than the threshold, have the ESM replaced.
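
The comparison can be scripted. This is a minimal sketch that assumes the property lines look exactly like the sample output above. It works on pasted text rather than calling ipmitool itself, and the function names are made up for illustration.

```python
import re

# Property lines copied from the ipmitool output above.
UPTIME_SAMPLE = """\
        value = 9844.000 Hours
        upper_nonrecov_threshold = 26220.000 Hours
        upper_critical_threshold = 25806.000 Hours
        upper_noncritical_threshold = 25254.000 Hours
        lower_noncritical_threshold = N/A
"""


def parse_hours(text):
    """Return {property: hours} for every 'name = <number> Hours' line."""
    pattern = re.compile(r"^\s*(\w+)\s*=\s*([\d.]+)\s+Hours\s*$", re.MULTILINE)
    return {name: float(hours) for name, hours in pattern.findall(text)}


def esm_status(text):
    """Compare ESM uptime with upper_noncritical_threshold.

    Returns ("ok" or "replace", hours left before the threshold).
    """
    props = parse_hours(text)
    uptime = props["value"]
    limit = props["upper_noncritical_threshold"]
    status = "ok" if uptime < limit else "replace"
    return status, limit - uptime


status, remaining = esm_status(UPTIME_SAMPLE)
print(status, remaining, "hours left, roughly", round(remaining / 24), "days")
```

For the card shown above this reports the ESM as ok, with 15410 hours (about 642 days) to go before the non-critical threshold.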

So far I’ve found the flash cards on both V2 and X2-2 to be very reliable. I’d be interested in hearing other thoughts on their reliability.


8 Comments

  1. From our experience, replacements of flash cards have been few and far between, especially when compared to the loss of hard drives. Out of all the systems I’ve worked on, most have lost at least one disk over time, and I could count on one hand the number that needed a flash card replacement. Also, it’s nice that you get a spare flash card in case one goes out.

    • jarneil

       /  March 15, 2012

      Hi Andy,

      thanks for confirming! Oldest V2s I’m managing are at their 2nd birthday. Hope the cards see out their 3rd!

  2. These devices are very reliable in current generation. I remember the shaky days though!

    • jarneil

       /  March 15, 2012

      Kevin, I’ve heard some interesting stories on fusion I/O from circa 2 years ago and it sounded a bit challenging on the reliability.

      • Hi Jason,

        All systems have bugs.

        I’m not a Fusion I/O expert. I have never touched a system with fusion on it (although my friends at Fusion I/O would have it another way if possible :-) ).

        I pointed out that these cards are pretty reliable these days. I shouldn’t think my earlier reply should spawn one of those but-so-and-so-suck-too sort of threads.

        My view on flash is quite simple: applying it as a cache is a fad. And, yes, I am aware that EMC has a product in this space (VFCache). That fact doesn’t curb my viewpoint. The word “fad” is not so pejorative as it may sound. VME was a fad. Are there any systems that still support a VME bus for main bus or even peripheral attach? Nope.

        It is also my view that Exadata architecture is a fad. It took me a few years of toiling with the technology to come to that conclusion, but as my recent posts show I can make a pretty good argument in favor of using plain old Oracle Database (+RAC) for extreme high-bandwidth query processing. Do I say Oracle Database is a fad? No. It remains a very good technology that can scale to exploit high bandwidth storage. The problem with Exadata is the fact that it does not possess as favorable scalability characteristics as RAC *without* Exadata and I’ve made that point very clear on more than one occasion. It’s really quite simple. If you chop off filtration, relegate it to a set of servers on the other side of a miserably slow (compared to a system bus), low bandwidth IB data path separate from joins/agg/sort you have a bottleneck. I don’t like bottlenecks. Never did. Never will. Sure, today’s Exadata offers more in-bound data bandwidth to the RAC grid than you’d get if you attached low-bandwidth conventional storage. That should be obvious. But it is quite simple with today’s technology to attach ample conventional storage (data flow) to totally obliterate host CPUs processing complex queries. And, in DW/BI, all that really matters is plumbing sufficient data flow to busy up the CPUs you can afford to license (RAC).

        Time will tell but one thing is for certain. I wouldn’t be typing these words if Larry Ellison hadn’t squandered the war chest on Sun. The face of Exadata would be ***entirely*** different. It would remain to this day (what it still actually is) a software solution (cellsrv) portable to pretty much any system with a C++ compiler. Best of breed competitors would battle to have the best Exadata implementation. Oracle would still have partners, Oracle customers would have choice ( and less strong-arm sales tactics to suffer through), and Oracle’s quarterly earnings calls would not be so, um, uncomfortable. But most importantly we would likely have never seen an advertisement claiming an Exadata rack is the world’s “First OLTP Machine.”

        And that, as they say, is that.

  3. ashminder ubhi

     /  March 16, 2012

    Watch out for this bug:
    Bug 13454147 : FLASH CARDS DISAPPEAR AFTER 6 MONTHS OF UPTIME
