Exadata Smart Flash Logging

With the 11.2.2.4.0 release of the Exadata storage server software (and providing you are at least at 11.2.0.2 BP11), you will have the opportunity to utilise Exadata Smart Flash Logging. I thought I’d take a look at how much (if any) improvement this feature would provide to a busy production environment.

Have a look at this blog entry on Exadata Smart Flash Loggin by Luis Moreno Campos for an introduction on how it works. Basically you now issue two writes, 1 to flash 1 to your disk based redo logs, fastest write is the winner.

First lets check that we actually have some Exadata Smart Flash Log’s available to be used:

CellCLI> list flashlog detail

name                      cel07_FLASHLOG 
cellDisk FD_00_cel07,FD_08_cel07,FD_09_cel07,FD_01_cel07,FD_15_cel07,FD_06_cel07,FD_03_cel07,FD_04_cel07,FD_07_cel07,FD_02_cel07,FD_12_cel07,FD_14_cel07,FD_13_cel07,FD_11_cel07,FD_05_cel07,FD_10_cel07
 creationTime              2012-03-17T15
 degradedCelldisks 
 effectiveSize             512M 
 efficiency                100.0 
 id                        a24a25e5-062e-4be1-bb6b-3168113a5fe8 
 size                      512M 
 status                    normal

First we can see that on this cell, there is a flashlog created of size 512M. It is carved out of each the 16 flash doms in the cell. Consequently this reduces the amount you have for your flashcache, though it’s a very small reduction.

How much use are we getting out of the flashlog? Well, we can look at some metrics:

CellCLI> list metriccurrent where objectType='FLASHLOG'
 FL_ACTUAL_OUTLIERS                 FLASHLOG        0 IO requests 
 FL_BY_KEEP                         FLASHLOG        0 
 FL_DISK_FIRST                      FLASHLOG        6,168,815 IO requests 
 FL_DISK_IO_ERRS                    FLASHLOG        0 IO requests 
 FL_EFFICIENCY_PERCENTAGE           FLASHLOG        100 % 
 FL_EFFICIENCY_PERCENTAGE_HOUR      FLASHLOG        100 % 
 FL_FLASH_FIRST                     FLASHLOG        172,344 IO requests 
 FL_FLASH_IO_ERRS                   FLASHLOG        0 IO requests 
 FL_FLASH_ONLY_OUTLIERS             FLASHLOG        0 IO requests 
 FL_IO_DB_BY_W                      FLASHLOG        286,075 MB 
 FL_IO_DB_BY_W_SEC                  FLASHLOG        13.328 MB/sec 
 FL_IO_FL_BY_W                      FLASHLOG        303,793 MB 
 FL_IO_FL_BY_W_SEC                  FLASHLOG        13.761 MB/sec 
 FL_IO_W                            FLASHLOG        6,341,159 IO requests 
 FL_IO_W_SKIP_BUSY                  FLASHLOG        0 IO requests 
 FL_IO_W_SKIP_BUSY_MIN              FLASHLOG        0.0 IO/sec 
 FL_IO_W_SKIP_LARGE                 FLASHLOG        0 IO requests 
 FL_PREVENTED_OUTLIERS              FLASHLOG        415 IO requests

First off, this is taken on a very busy system:

FL_IO_FL_BY_W_SEC: 13.761 MB/sec

That is saying how much data is being written to flash by smart flash log. Well, that sounds great, but it’s not quite so simple. Remember writes go to both flash and disk.

FL_DISK_FIRST: 6,168,815 IO requests

This metric is actually telling that 6.1M I/O requests were serviced first by disk. While:

FL_FLASH_FIRST: 172,344 IO requests

Is saying this number went to flash first. Oh, that’s not a great improvement! I make that 2.7% writes went to flash first.

Finally a word on the FL_PREVENTED_OUTLIERS, this is saying there was 415 writes on this cell that would have taken more than 0.5 secs if there was no flash logging in place.

I have also checked AWR reports on before and after having flash logging in place. There is very little change. AWR shows an average wait of 3ms for Log File Parallel Write. Have a look at the wait event histogram for this:

We see the vast majority of writes are under a 1ms. This was also the case before the flash logging as well. It has not improved this at all.

This is a busy cpu bound system lets look at the Log file sync wait event histogram:

Eurgh! is the only way to describe this.

I think Kevin Closson has covered this a mere half-decade ago!

6 thoughts on “Exadata Smart Flash Logging

  1. Have you tried this on an IO bound system instead of a CPU bound system? On a CPU bound system have you tried to increase the LGWR priority to real time to see the benefit?

    I think this blog post does not do the feature enough justice because it says that the bottleneck is elsewhere and so a feature doesn’t help. The real issue is that not every system is running at 100% CPU and if it is, they should atleast run LGWR with real time.

    As far as the small percentage of flash first that you are seeing, that is really where the maximum benefit is. The worst case glitches are the ones that produce the most log file sync. So preventing those is where the benefit lies.

    I would encourage you to try either real time on your LGWR for your CPU bound system or atleast try something a lot more realistic where your are not 100% CPU bound but run a smart scan on an OLTP system or something like that while observing log file parallel write and log file sync latencies in your ASH reports.

    • Hello,

      I’ve heard from other exadata users their surprise also at the lack of flash hits.

      I get your point about increasing the lgwr CPU priority if you have log file sync waits and not much spare CPU.

      Thanks for reading.

      Jason

  2. Hello,

    I just want to let you know that I developed a script to extract exadata real-time metric information based on cumulative metrics.

    The main idea is that cumulative, instantaneous, rates and transition exadata metrics are not enough to answer all the basic questions.

    That’s why the script has been created as it provides a better understanding of what’s is going on on the cells right now.

    More details (on how and why) here : http://bdrouvot.wordpress.com/2012/11/27/exadata-real-time-metrics-extracted-from-cumulative-metrics/

    Please don’t hesitate to give your opinion and report any issue you may found with it.

    Thx
    Bertrand

Leave a comment