jarneil

October 27, 2008

ASM Rebalance – I/O Saturation?

Filed under: ASM — jarneil @ 3:56 pm

I really think that the ability to perform online storage reconfigurations is one of the killer features of ASM. Not only can it be done online, it is also relatively trivial: a single alter diskgroup command is enough to modify the configuration of a diskgroup.

It does, however, raise the question: is it practical to do an online reconfiguration? Features are all very nice in theory, but can you afford the additional I/O that a rebalance will necessitate? Sure, you can set a very low asm_power_limit to try to minimise that I/O, and you can even set it to 0, which ensures that no automatic rebalance occurs; then you as the DBA can decide when the best time for a rebalance is and kick it off manually. Does this become a trade-off between a low impact on your “normal” workload and taking a bit longer to do the actual rebalance?
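For reference, the instance-wide setting discussed above can be inspected and changed from the ASM instance. This is a minimal sketch rather than a recommendation; the value 0 is simply the "no automatic rebalance" case described above.

```sql
-- Run while connected to the ASM instance.
-- Check the current instance-wide default rebalance power.
SELECT name, value
FROM   v$parameter
WHERE  name = 'asm_power_limit';

-- Set the default to 0 so that a diskgroup change does not
-- start a rebalance automatically.
ALTER SYSTEM SET asm_power_limit = 0;
```

The parameter only supplies the default: a POWER clause on an individual alter diskgroup statement overrides it for that operation.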

I see someone within Oracle has a sense of humour, as the asm_power_limit goes all the way to 11.

I was doing an (automatic) rebalance and I thought I’d take a look at what kind of I/O load it was generating:

SQL> select * from v$asm_operation;

GROUP_NUMBER OPERA STAT POWER ACTUAL SOFAR EST_WORK EST_RATE EST_MINUTES
------------ ----- ---- ----- ------ ----- -------- -------- -----------
           4 REBAL RUN      1      1 19504    83593      428         149

You can see this rebalance is running with an asm_power_limit of just 1, which is the default value.
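The estimate columns can be sanity-checked against each other. The sketch below assumes that EST_MINUTES is roughly the remaining work divided by the rate, which is my reading of the column meanings rather than a documented formula; calc_minutes is just an alias I made up for the comparison.

```sql
-- Compare the reported estimate with (remaining work / rate).
-- NULLIF avoids a divide-by-zero error if est_rate is still 0.
SELECT group_number,
       operation,
       state,
       power,
       est_work - sofar                                AS remaining_work,
       est_minutes,
       TRUNC((est_work - sofar) / NULLIF(est_rate, 0)) AS calc_minutes
FROM   v$asm_operation;
```

For the row above, (83593 - 19504) / 428 is about 149.7, which is in line with the reported EST_MINUTES of 149.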

A sample of iostat output shows the effect this is having on the I/O subsystem:

The first thing to be aware of is that this is an idle system; it’s just one of my test RAC clusters, so the only workload going on is the ASM rebalance. There are 2 devices involved in the diskgroup that is undergoing the rebalance: sdr, which was the original (only) member of the diskgroup, and sdz, which is the new member. You can see that sdr is having its extents read and some of these are being transferred to the other device within the diskgroup, which is just what you’d expect.

What you might not expect, even with this less than optimised I/O environment, is that the utilisation of the sdr device goes through the roof. This is with power 1; I’d hate to see what (if anything) a higher power limit did, and I’d hate to see what a rebalance would do to a system that was undergoing a real-world workload, particularly one where you were trying to add more disk spindles because of an overburdened I/O subsystem. I’m pretty sure a rebalance does not work out how busy the device is and then throttle up or down; the speed should only be determined by the asm_power_limit.

It looks like the best option for doing a rebalance is to find a “less busy” time and perform a manual rebalance, rather than have the automatic rebalance run as soon as the diskgroup is reconfigured. At least that way you control when to take the I/O hit.
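As a rough illustration of that approach, the reconfiguration and the rebalance can be issued as separate steps. The diskgroup name data and the disk path '/dev/sdz1' are hypothetical placeholders, not the names used on my cluster, and the power values are only examples.

```sql
-- Run while connected to the ASM instance.
-- "data" and '/dev/sdz1' are made-up names for illustration.

-- Add the new disk, but hold off the rebalance for now.
ALTER DISKGROUP data ADD DISK '/dev/sdz1' REBALANCE POWER 0;

-- Later, in a quieter window, start the rebalance by hand
-- at whatever power you are comfortable with.
ALTER DISKGROUP data REBALANCE POWER 1;

-- If the I/O hit turns out to be worse than expected, stop it again.
ALTER DISKGROUP data REBALANCE POWER 0;
```

Until the rebalance has completed, the new disk will not hold its share of the data, so this only postpones the I/O rather than avoiding it.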

10 Comments

  1. I really have a problem with this concept of using the words “reconfiguration” and “rebalance” as synonyms.

    The first is a simple maintenance operation that is very convenient to do online with ASM. And of course: it should be done in idle periods as it must have an impact on IO bandwidth.

    The second is a COMPLETELY different animal. Rebalance requires a priori that the balance target be known or have been measured. That is not the case with ASM.
    If it is done at an idle time, then it cannot possibly be “rebalancing” anything!
    If it is done at a busy period, then it will have a strong impact on the measure of “balance”, by the very nature of the beast.

    This is why the terminology used by Oracle is confusing, inappropriate and inadequate to describe what is really going on. Although of course: it reads well in marketing materials. Problem is: the marketeers never have to do any real work in the trenches, do they?

    Comment by Noons — October 28, 2008 @ 12:36 am | Reply

  2. Jason, I think this is not fair. In my opinion, when your IO subsystem is idle, ASM_POWER_LIMIT does not/should not work. The parameter should give the rebalance a lower or higher priority relative to other IO operations. When the system is idle, even with priority 1 you get all the IO capacity, because there is nothing more important to be done. It’s like the resource manager: even if you are in the LOW group and have defined 1% CPU on level 8, when there’s no one else, you get all the CPU.
    Why don’t you test it in a busy test environment, to see what happens then?

    Comment by yavor — October 28, 2008 @ 8:17 am | Reply

  3. Hi Jason,

    I find that ASM rebalancing operations often do not scale linearly with “power”; that is, I don’t manage to saturate the IO subsystem in production with a rebalance operation (even if I wanted to).
    Empirically I have noticed that there are serialization events that can pop up, especially in RAC. In terms of wait events in ASM for example “enq AD – allocate/deallocate” and also buffer busy events.
    I can typically see values of 3000-5000 in the EST_RATE column, not much higher than that.
    In 10g I normally use power 5 for “rebalance”, as higher power numbers don’t seem to give much more gain.

    Cheers,
    L.

    Comment by Luca — October 28, 2008 @ 8:42 am | Reply

  4. Hi Noons,

    Thanks for dropping by!

    I have a real habit of picking the wrong end of the stick in replying to blog comments, and I suspect this will be no exception!

    I’m not sure I agree with your comments on a rebalance. I may be teaching you to suck eggs here, but the only goal of a rebalance is to ensure all devices within the diskgroup are filled to the same capacity, so an idle period can be a good time to do it.

    The Oracle marketeers are really responsible for pushing the line that rebalance is somehow good for I/O hotspots which is certainly not the case!

    jason.

    Comment by jarneil — October 28, 2008 @ 4:27 pm | Reply

  5. Hi Yavor,

    I think you have made a valid point, that a more interesting test would be to see the effect on a loaded system.

    jason.

    Comment by jarneil — October 28, 2008 @ 9:21 pm | Reply

  6. Hi Luca,

    I’d heard from a number of people that the higher values of the asm_power_limit seem to not give much extra oomph to the rebalance effort.

    I had been assuming though that increasing the asm_power_limit increases the number of slave ARBx processes – I’d thought it was proportional to the power limit.

    jason.

    Comment by jarneil — October 28, 2008 @ 9:26 pm | Reply

  7. [...] of time it takes Oracle to apply logs. Jason Arneil has some very noteworthy work showing how performing an online rebalance of ASM can affect your system with I/O saturation. Richard Foote has some very in-depth details for those who want to know what is going on with [...]

    Pingback by Log Buffer #121: a Carnival of the Vanities for DBAs — October 31, 2008 @ 4:02 pm | Reply

  8. The one issue I see with this is you only had 2 devices. My production system has over 50 devices and when I rebalance (add 10 more devices) the load is spread out more evenly.

    Comment by Tom — December 1, 2008 @ 2:35 pm | Reply

  9. Hello,
    I really like the tools you used to see the I/O detail on the devices.
    I wonder if you could tell me what it is.
    Thank you.

    Comment by Kamal — June 12, 2009 @ 2:44 pm | Reply

    • Hi Kamal,

      This data was gathered using iostat with -kx as arguments.

      jason.

      Comment by jarneil — June 12, 2009 @ 2:54 pm | Reply

