ASM Rebalance – I/O Saturation?

I really think that the ability to perform online storage reconfigurations is one of the killer features of ASM. Not only is it possible online, it is also relatively trivial: a simple alter diskgroup command is all it takes to modify the configuration of a diskgroup.

It does, however, raise the question: is it practical to do an online reconfiguration? Features are all very nice in theory, but can you afford the additional I/O that a rebalance will necessitate? Sure, you can set a very low asm_power_limit to try to minimise that I/O. You can even set it to 0, which ensures that no automatic rebalance occurs, and then you as the DBA can decide when is the best time for a rebalance by running one manually. Does this become a trade-off between a low impact on your “normal” workload and taking a bit longer to do the actual rebalance?

I see someone within Oracle has a sense of humour, as the asm_power_limit goes all the way to 11.

I was doing an (automatic) rebalance and thought I’d take a look at what kind of I/O load it was generating:

SQL> select * from v$asm_operation;

GROUP_NUMBER OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES
------------ ----- ---- ---------- ---------- ---------- ---------- ---------- -----------
           4 REBAL RUN           1          1      19504      83593        428         149

You can see this rebalance is running with an asm_power_limit of just 1, which is the default value.
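
The figures in that row can be cross-checked with a little arithmetic. Here is a minimal sketch in Python, assuming (as the Oracle reference for V$ASM_OPERATION describes) that SOFAR and EST_WORK count allocation units and EST_RATE is allocation units per minute. The function name is made up for illustration; it is not anything Oracle provides.

```python
def remaining_minutes(sofar, est_work, est_rate):
    """Estimate minutes left in a rebalance from v$asm_operation columns.

    Assumes sofar and est_work are allocation-unit counts and est_rate is
    allocation units moved per minute (hypothetical helper, not an Oracle API).
    """
    if est_rate <= 0:
        raise ValueError("est_rate must be positive")
    return (est_work - sofar) / est_rate


# Values copied from the v$asm_operation row in the post.
minutes = remaining_minutes(sofar=19504, est_work=83593, est_rate=428)
print(f"{minutes:.1f} minutes remaining")
```

Truncated to whole minutes this gives 149, which lines up with the EST_MINUTES value reported by the view. That is roughly two and a half hours of rebalance still to go at power 1.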

A sample of iostat output shows the effect this is having on the I/O subsystem:

The first thing to be aware of is that this is an idle system; it’s just one of my test RAC clusters, so the only workload going on is the ASM rebalance. There are 2 devices involved in the diskgroup that is undergoing the rebalance: sdr, which was the original (only) member of the diskgroup, and sdz, which is the new member. You can see that sdr is having its extents read and some of these are being transferred to the other device within the diskgroup, which is just what you’d expect.

What you might not expect, even in this less than optimised I/O environment, is that the utilisation of the sdr device goes through the roof. This is with power 1. I’d hate to see what (if anything) a higher power limit did, and I’d hate to see what a rebalance would do to a system that was under a real-world workload, particularly one where you were trying to add more disk spindles because the I/O subsystem was already overburdened. I’m pretty sure a rebalance does not work out how busy the device is and then throttle up or down; the speed should be determined only by the asm_power_limit.

It looks like the best option for doing a rebalance is to find a “less busy” time and perform a manual rebalance, rather than have the automatic rebalance kick in when the diskgroup is reconfigured. At least that way you can control when to take the I/O hit.

13 Comments

  1. I really have a problem with this concept of using the words “reconfiguration” and “rebalance” as synonyms.

    The first is a simple maintenance operation that is very convenient to do online with ASM. And of course: it should be done in idle periods as it must have an impact on IO bandwidth.

    The second is a COMPLETELY different animal. Rebalance requires a priori that the balance target be known or have been measured. That is not the case with ASM.
    If it is done at an idle time, then it cannot possibly be “rebalancing” anything!
    If it is done at a busy period, then it will have a strong impact on the measure of “balance”, by the very nature of the beast.

    This is why the terminology used by Oracle is confusing, inappropriate and inadequate to describe what is really going on. Although of course it reads well in marketing materials. Problem is: the marketeers never have to do any real work in the trenches, do they?

  2. Jason, I think this is not fair. In my opinion, when your IO subsystem is idle, ASM_POWER_LIMIT does not/should not have any effect. The parameter should give the rebalance lower or higher priority relative to other IO operations. When the system is idle, even with priority 1 you get all the IO capacity, because there is nothing more important to be done. It’s like the resource manager: even if you are in the LOW group and have defined 1% CPU on level 8, when there’s no one else, you get all the CPU.
    Why don’t you test it in a busy test environment, to see what happens then?

  3. Luca

     /  October 28, 2008

    Hi Jason,

    I find that ASM rebalancing operations often do not scale linearly with “power”; that is, I don’t manage to saturate the IO subsystem in production with a rebalance operation (even if I wanted to).
    Empirically I have noticed that serialization events can pop up, especially in RAC: among the ASM wait events, for example, “enq AD – allocate/deallocate” and also buffer busy events.
    I can typically see values of 3000-5000 in the EST_RATE column, not much higher than that.
    In 10g I normally use power 5 for “rebalance”, as higher power numbers don’t seem to give much more gain.

    Cheers,
    L.

  4. jarneil

     /  October 28, 2008

    Hi Noons,

    Thanks for dropping by!

    I have a real habit of picking the wrong end of the stick in replying to blog comments, and I suspect this will be no exception!

    I’m not sure I agree with your comments on a rebalance. I may be teaching you to suck eggs here, but the only goal of a rebalance is to ensure all devices within the diskgroup are filled to the same capacity, so an idle period can be a good time to do it.

    The Oracle marketeers are really responsible for pushing the line that rebalance is somehow good for I/O hotspots which is certainly not the case!

    jason.

  5. jarneil

     /  October 28, 2008

    Hi Yavor,

    I think you have made a valid point, that a more interesting test would be to see the effect on a loaded system.

    jason.

  6. jarneil

     /  October 28, 2008

    Hi Luca,

    I’d heard from a number of people that the higher values of the asm_power_limit seem to not give much extra oomph to the rebalance effort.

    I had been assuming though that increasing the asm_power_limit increases the number of slave ARBx processes – I’d thought it was proportional to the power limit.

    jason.

  7. Tom

     /  December 1, 2008

    The one issue I see with this is you only had 2 devices. My production system has over 50 devices and when I rebalance (add 10 more devices) the load is spread out more evenly.

  8. Kamal

     /  June 12, 2009

    Hello,
    I really like the tool you used to see the I/O detail on the devices.
    Could you tell me what it is?
    Thank you.

    • jarneil

       /  June 12, 2009

      Hi Kamal,

      This data comes from iostat, run with -kx as arguments.

      jason.

  9. radhakrishna

     /  May 29, 2011

    Hi,

    Under what conditions should asm_power_limit be increased? Please let me know.

    best regards.
    Radhakrishna

    • jarneil

       /  May 31, 2011

      I’d say the ASM_POWER_LIMIT is a balance between how fast you want any rebuild to happen and how much resource you can afford to spend on the rebuild relative to your normal I/O workload. If you are already at full capacity with normal processing then you don’t want a high ASM_POWER_LIMIT.

      jason.

  10. radhakrishna

     /  May 29, 2011

    Hi

  1. Log Buffer #121: a Carnival of the Vanities for DBAs
