Monitoring the Progress of an 11gR2 Standby Database

It is the obvious question when running a standby database: how do I know how up-to-date my standby is? I've been working on some Dataguard switchover testing on an 11gR2 system and was investigating this question once more. The first and obvious place to look is V$DATAGUARD_STATS:


SQL> select * from v$dataguard_stats;

NAME                     VALUE              UNIT                           TIME_COMPUTED        DATUM_TIME
------------------------ ------------------ ------------------------------ -------------------- -------------------
transport lag            +00 00:00:00       day(2) to second(0) interval   11/14/2010 20:00:19  11/14/2010 20:00:18
apply lag                +00 00:00:00       day(2) to second(0) interval   11/14/2010 20:00:19  11/14/2010 20:00:18
apply finish time        +00 00:00:00.000   day(2) to second(3) interval   11/14/2010 20:00:19
estimated startup time   28                 second                         11/14/2010 20:00:19

Assuming you have no lag, either transport or apply, you should then check that the TIME_COMPUTED and DATUM_TIME columns are up to date with the wall clock time. If all that checks out, you can be confident that your standby is receiving and applying redo from the primary. However, there is something deeply unsatisfying about relying on a value that is only reporting 0.
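As a quick sanity check, you can pull DATUM_TIME alongside the standby's own clock; a sketch (the TO_CHAR format is just for display):

SQL> select name, value, datum_time,
            to_char(sysdate, 'mm/dd/yyyy hh24:mi:ss') wall_clock
     from v$dataguard_stats
     where name in ('transport lag', 'apply lag');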

The recommended check to determine the state of your standby before performing a failover is to query the V$ARCHIVE_DEST_STATUS view on the primary. In 11gR2 there is a new column on this view called GAP_STATUS; it does not appear before 11gR2 (it is not in the 11gR1 docs):


SQL> select status, gap_status
     from v$archive_dest_status
     where dest_id = 3;

STATUS	  GAP_STATUS
--------- ------------------------
VALID	  NO GAP

So that can give you some confidence that you are going to switchover to a destination that is up-to-date, but still no hard numbers on where exactly the standby has recovered up to.

If you check the CURRENT_SCN from V$DATABASE on the standby, you'll find it far behind the primary. The documentation for CURRENT_SCN has also changed from earlier versions and gives a nice clue: look at V$RECOVERY_PROGRESS to check the SCN of the last applied redo:


SQL> select * from v$recovery_progress;

START_TIME        TYPE             ITEM                      UNITS        SOFAR    TOTAL
----------------- ---------------- ------------------------- ---------- ------- --------
02-11-10 09:52:22 Media Recovery   Log Files                 Files          870        0
02-11-10 09:52:22 Media Recovery   Active Apply Rate         KB/sec         679        0
02-11-10 09:52:22 Media Recovery   Average Apply Rate        KB/sec         114        0
02-11-10 09:52:22 Media Recovery   Maximum Apply Rate        KB/sec       12327        0
02-11-10 09:52:22 Media Recovery   Redo Applied              Megabytes   139335        0
02-11-10 09:52:22 Media Recovery   Last Applied Redo         SCN+Time         0        0  16-11-10 21:17:50 SCN: 7179024793
02-11-10 09:52:22 Media Recovery   Active Time               Seconds     489607        0
02-11-10 09:52:22 Media Recovery   Apply Time per Log        Seconds        560        0
02-11-10 09:52:22 Media Recovery   Checkpoint Time per Log   Seconds          0        0
02-11-10 09:52:22 Media Recovery   Elapsed Time              Seconds    1250728        0

The Last Applied Redo time for this system, which is using real-time apply, tracks the wall clock time exceedingly closely and gives much better feedback on redo application on the standby.
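If you just want that one row, something like this does the trick (ITEM, SOFAR and TIMESTAMP are documented columns of the view):

SQL> select item, units, sofar, timestamp
     from v$recovery_progress
     where item = 'Last Applied Redo';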

With 11.2.0.2 there is the prospect that V$ARCHIVE_DEST on the primary will track which SCNs have been applied on the standby, via the APPLIED_SCN column, which is new in 11.2.0.2 and does not appear in 11.2.0.1. This will give a straightforward mechanism for determining how far your standby is lagging behind your primary.
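Once on 11.2.0.2, something like the following, run on the primary, should then give a hard number for the gap (dest_id 3 being the standby destination from the earlier example):

SQL> select db.current_scn - ad.applied_scn scn_gap
     from v$database db, v$archive_dest ad
     where ad.dest_id = 3;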

Increasing the Dataguard Protection Level

I thought I understood testing. Before I run anything in my production environment, I'm utterly strict that I test in a non-production environment first. It does not matter where the change comes from; it is always run in test first. This naturally includes any changes at the database level, rather than just inside a particular schema.

When I have a set of instructions or steps to take the database from one particular state to another, or to install a particular feature, I don't tend to test just half the steps; generally, if I have a sequence of steps, I test the entire sequence.

Recently, I've been working on a project to increase the protection level of a dataguard environment from Maximum Performance to Maximum Availability. This is a 10gR2 environment, so I pulled up the 10gR2 dataguard documentation. To me, the steps seem pretty clear. Let me highlight step 1:

Step 1 If you are upgrading the protection mode, perform this step.

Perform this step only if you are upgrading the protection mode (for example, from maximum performance to maximum availability mode). Otherwise, go to Step 3.

Assume this example is upgrading the Data Guard configuration from the maximum performance mode to the maximum availability mode. Shut down the primary database and restart it in mounted mode:

SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;

It's clear, right? To upgrade the protection level, you have to shut down the instance and have it in mount mode. I would not just run this in production; I'd always want to test these steps on test infrastructure matching my production setup. The question is, would anyone test upgrading the protection level but skip this step? Would it really occur to someone to wonder whether they could skip this first step and keep the instance up and running?

It did not occur to me, but then I read the (excellent) Oracle Data Guard 11g Handbook by Larry Carpenter et al. It's pretty explicit that you don't need to shut down your instance!

[Screenshot: the protection mode raised to Maximum Availability with the instance open]

The above was run on a 10.2.0.4 instance. I'd already set the log_archive_dest_n parameter to LGWR SYNC mode. One thing to note: you must explicitly set AFFIRM here as well. It's not good enough just to use LGWR SYNC, as NOAFFIRM is the default, and that leaves the protection_level in continual resynchronization.
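For reference, a sketch of the relevant commands (the destination number, and the service name standby, are illustrative):

SQL> alter system set log_archive_dest_2 =
     'SERVICE=standby LGWR SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=standby';

SQL> alter database set standby database to maximize availability;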

Resynchronization occurs when you first increase the protection_mode or when there is a network outage. It means your configuration is effectively running in maximum performance at that point, and while the protection_level is not at maximum availability, the potential for data loss exists.
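Both columns are visible in V$DATABASE on 10gR2, so this state is easy to watch for; while the destination is resynchronizing, PROTECTION_LEVEL reports RESYNCHRONIZATION even though PROTECTION_MODE reads MAXIMUM AVAILABILITY:

SQL> select protection_mode, protection_level from v$database;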

This really does contradict the documentation, so it has been a really useful find for me, as just following the documentation would have meant taking downtime on my RAC cluster. It has always been the case that you can drop the protection level without incurring downtime. Note that going all the way to MAXIMUM PROTECTION still requires the database to be in the mounted state.

Adventures in Dataguard

Here is a presentation on Dataguard I have given a number of times. In particular I go through several issues we have encountered while running a Physical Standby in production.

Be Careful how you open your Physical Standby

I just experienced a really strange issue when opening a standby database read-only. The instructions on how to open your standby database read-only are pretty clear and pretty straightforward:

Cancel Redo Apply:
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

Open the database for read-only access:
SQL> ALTER DATABASE OPEN;

To be clear, and as a verbatim quote from the 10.2 documentation:

You do not need to shut down the instance to open it for read-only access.

I did this on an up-to-date 10.2.0.4 physical standby database (on RHEL 4 U3), but when I tried to log into the Standby using plain old sqlplus this is what I saw:

(jason@jason)$ sqlplus system@STANDBY

SQL*Plus: Release 10.2.0.4.0 - Production on Tue Feb 3 07:57:09 2009

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Enter password: 
ERROR:
ORA-06554: package DBMS_STANDARD must be created before using PL/SQL


ERROR:
ORA-06554: package DBMS_STANDARD must be created before using PL/SQL


Error accessing package DBMS_APPLICATION_INFO

This standby was providing the customer with their business continuity and I was opening it to ensure read-only applications kept running while the customer performed a software upgrade to their internal systems. This was not just an issue related to sqlplus. We hooked up the read-only applications to the standby and they kept logging the exact same ORA-06554 error.

So I started thinking this standby was toast. So to MetaLink and the wonderfully redundant document 5944239.8. It basically says there is a bug, 5944239, that 10.2.0.4 is known to be affected on all/most platforms, and that the issue is fixed in 11.2. And that is it. It has absolutely zero detail: no idea what causes it, no clue as to what to do should you encounter it! Utterly, utterly astonishing.

One thing I can tell you is what NOT to do. While looking on MetaLink there were a few documents about that particular ORA error number that were pretty old, dating back to 2002 and 2001. If you are not on those versions, ignore those notes. They talk about re-running catproc.sql to rebuild the dictionary, which of course should create the DBMS_STANDARD package.

So I basically thought the dictionary of the standby was corrupt. The applications, which receive tens of millions of queries per day, were about to go offline for the customers, so I was sorely tempted to declare the standby shafted, open it read-write, run catproc.sql, and see if we could keep the apps going. Of course, that would have been kissing business continuity for this database goodbye, but then I thought the standby was shafted anyway.

Thankfully, before going for the nuclear option, I decided to shut down the standby instance and then open it read-only again to see if that had any effect. Of course, this did the trick: the applications connected without issue and kept running.
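For the record, the sequence that got things working again was nothing cleverer than a clean bounce before opening read-only:

SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE OPEN READ ONLY;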

But of course, as the documentation says, you do not need to shut down your standby before opening it read-only. I know what I'll be doing in future with this 10.2.0.4 standby.

The Different Dataguard Modes of Protection for Managers

I was recently asked to provide a description of the various modes of operation possible with a standby configuration. This was needed to explain the setup to a group of non-technical managers.

The original setup of this company's dataguard configuration was Maximum Performance, but the company was concerned about the potential data loss associated with running in an asynchronous mode.

Below is what I produced for them. Nothing particularly revelatory here, but perhaps it might be useful if you are having this debate with your own management team.

Introduction

When using an Oracle standby database for business continuity purposes, there are three possible modes of operation determining how data is sent from the primary database (the one currently supporting the business) to the standby database (the failover database to be used upon invocation of business continuity).

There are two synchronous modes and one asynchronous mode.

The various modes of operation of a standby are a real trade-off between data protection and impact on the primary systems.

Modes of Operation

The following table shows the options giving the benefits and drawbacks.

Mode of Operation      SYNC/ASYNC   Benefits                                      Drawbacks
---------------------  -----------  --------------------------------------------  --------------------------------------------
Maximum Protection     SYNC         Never any data loss; no data is ever          Slower response times on the primary.
                                    accepted on the primary without being         Network disruption between primary and
                                    on the standby.                               standby causes downtime of the primary
                                                                                  database.

Maximum Availability   SYNC         Updates must be on the standby before         Slower response times on the primary.
                                    being accepted on the primary; on a           Data loss is possible if the network
                                    network disruption the primary continues      fails and then the site fails.
                                    running, with updates sent to the standby
                                    when it is available again.

Maximum Performance    ASYNC        No performance slowdown for applications      Data loss will happen.
                                    on the primary.

This customer is currently running in Maximum Performance mode, and thus the standby solution has minimal impact on the response times of the various applications.

There are two factors that would impact application response times on a move to a higher level of data protection: the network latency and the network bandwidth. During normal operation, the response time of applications that make data changes would be impacted by the round trip time (RTT) to the standby; for example, a 10 ms round trip adds at least 10 ms to every commit.

During periods of high volume activity, the bandwidth can become congested and this could lead to significant delay in response time.

Conclusion

If a higher level of data protection were deemed a business requirement, I would advise significantly more network bandwidth to the business continuity site.

During normal running we typically see a transport lag (of data between the sites) of around 10 seconds and this would be the typical amount of data we would lose in a disaster situation. During high peaks of activity this can increase.

In deciding whether to increase the protection mode of the configuration, there is a real trade-off between how much you value your data (for this customer it is a reputational issue as much as a financial consideration) and the impact on your normal day-to-day performance.

On Active Dataguard

One of the real advantages of Oracle Open World is that there are swarms of Oracle employees in attendance who have intimate knowledge of various pieces of the software and how the various bits of it actually work internally.

In the exhibition hall there was a huge section with lots of different Oracle stands focusing on various features and on one of the days I dropped by the Active Dataguard booth and had a chat to one of the guys there.

In terms of how you switch on active dataguard, it's really not that hard. You just need to have opened your standby read-only (which you can do without 11g) and then issue the familiar ALTER DATABASE RECOVER MANAGED STANDBY DATABASE command. The difference is that with 11g you can issue this command, and have redo applied, while the database is open read-only and servicing application queries; previously the standby had to be at the mount stage.
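In other words, the 11g sequence is along these lines (using real-time apply; the DISCONNECT clause just returns control of your session):

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
SQL> ALTER DATABASE OPEN READ ONLY;
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;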

That seems like a small change, and such a small change on the surface tends to lead to some lousy presentations, as there is very little, at first glance, to say about active dataguard. However, underneath the covers a lot of work went on to enable this; it was not just a case of allowing the recover command to be run with the database open.

Database changes stored in the redo stream are not ordered, and when performing managed recovery those changes are applied to the database out of order. This is done for performance reasons; trying to reorder the changes into the correct sequence would be too slow.

This is no good for queries, though, as they need a read-consistent view, and with active dataguard the database is a constantly moving target. So one of the challenges with active dataguard was to ensure consistent reads were still implemented.

So there is the concept of a published read SCN, up to which queries are consistent; this trails the point to which the apply process has actually applied redo.

Work was also required around dropped packages: you can't have a package becoming unavailable when a query executing at an earlier read SCN still requires it. This required changes to the redo stream so that package changes are buffered until the read SCN is bumped up.

I also asked a couple of the Oracle guys whether there was any chance of this being backported to 10g. Larry Carpenter just laughed and said that was one of the first questions he asked. It seems the changes to the redo stream are so significant that it will never see the light of day in 10g.

The Dataguard Mind Map

There is definitely a lot more to encompass with dataguard than with the ASM mind map. I hope the Dataguard Mind Map is of use to someone out there! The pdf linked there is a much smaller download than the full jpg available below.

I think dataguard splits itself quite neatly into the following topics:

  • Standby Creation
  • Transport
  • Redo Apply
  • Protection Modes
  • Maintenance
  • Role Transitions
  • The Broker
If there is anything you think is missing in relation to dataguard in this diagram, I'd be interested in hearing what you think.

Note that if you click the diagram below you get a new page with what at first looks like an even smaller diagram, but which you can zoom in to make quite large.

Testing Failover with Maximum Performance Dataguard

Well, that title was a bit of a mouthful, but what this posting is concerned with is how you go about testing a failover in a Dataguard environment when you are running in Maximum Performance mode and sending your redo data via LGWR ASYNC. Testing a switchover in this environment is fine and can be done without incurring any data loss, but how do you go about testing a failover?

Performing a failover while running in Maximum Performance mode will incur data loss. This loss may be acceptable when faced with a real-life disaster but may be unacceptable in a testing scenario. The size of the data-loss window in this type of environment can be minimised by using LGWR with ASYNC to ship the redo data, and this may provide a useful combination of minimal impact on your production instance with only a small data-loss window.

As I see it, there are really two options for testing a failover while you are running in Maximum Performance mode and ensuring you have no data loss:

  • Failover with no active write connections on primary

In this scenario, you announce a service outage and, prior to performing the failover, ensure all applications writing to the database are switched off. Note that you can leave read-only applications up and running while performing the failover.

You will need a method of resyncing your original primary for failback. Of course, this is something you will want to have tested for a real failover anyway.

This has an obvious drawback: how the applications that perform writes behave in a failover scenario is not really being exercised. A sketch of the pre-failover verification follows below.
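With the writers stopped, force out the current log on the primary and confirm the standby has applied up to the same sequence (per thread, if RAC):

On the primary:
SQL> ALTER SYSTEM ARCHIVE LOG CURRENT;
SQL> select max(sequence#) from v$archived_log;

On the standby, this should catch up to the same number:
SQL> select max(sequence#) from v$archived_log where applied = 'YES';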

An alternative scenario is:

  • Activate the standby read-write

This option involves activating the standby read-write and essentially running with two primary databases. Of course, in this scenario you do not allow the end users into the standby, but you can run read-write applications against the standby for testing purposes; a sketch of the activation commands follows below.

This scenario has the added attraction of the lowest impact on the original primary, and indeed you may even get away without any downtime for your end users.
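Activating the standby read-write is, at its simplest, along these lines (a sketch; note that without flashback database or a rebuild there is no easy way back to being a standby afterwards):

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE;
SQL> ALTER DATABASE OPEN;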

I think it is fair to say that the first option is the more rigorous test, while the second is less intrusive. For sure it is important to test your business continuity strategy, but while running in maximum performance mode your options for testing a failover can be limited.

Monitoring your Standby with V$RECOVERY_PROGRESS

I have blogged previously about various ways you can monitor the progress of your physical standby, and I have now come across another way of determining how well your standby is doing at applying the redo coming its way.

The view V$RECOVERY_PROGRESS also gives you an interesting window into the performance of your standby. The official documentation on this is a bit confusing. Basically, when you select from this view you get 9 rows of information for each recovery operation you have started. So if you have only started managed recovery once since the instance started, you will see just 9 rows returned.

With a physical standby, each time you start a managed recovery this series of 9 rows is entered into the view. Each row has a type, but while recovering the standby it is always Media Recovery.

SQL> select start_time, item, units, sofar
     from v$recovery_progress
     order by 1, 2
     /

START_TIME        ITEM                             UNITS           SOFAR
----------------- -------------------------------- ---------- ----------
04-06-08 08:45:54 Active Apply Rate                KB/sec           3368
04-06-08 08:45:54 Active Time                      Seconds          6184
04-06-08 08:45:54 Apply Time per Log               Seconds             1
04-06-08 08:45:54 Average Apply Rate               KB/sec             35
04-06-08 08:45:54 Checkpoint Time per Log          Seconds             1
04-06-08 08:45:54 Elapsed Time                     Seconds       3553772
04-06-08 08:45:54 Last Applied Redo                SCN+Time   1.8447E+19
04-06-08 08:45:54 Log Files                        Files            2146
04-06-08 08:45:54 Redo Applied                     Megabytes      123788

Helpfully, each item comes with the unit of its associated measurement.

The average apply rate includes time spent waiting for redo to arrive. You can see that the active time spent applying redo is a small proportion of the total elapsed time since managed recovery started.

The active apply rate therefore gives a better indication of how fast you can actually apply redo on your standby; if you are generating redo faster than this rate, you may well be falling behind on your standby.
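To judge whether you are keeping up, you can compare that active apply rate with the rate at which the primary generates redo. A rough sketch, run on the primary, averaging over the last hour (dest_id 1 is assumed to be the local archive destination):

SQL> select round(sum(blocks * block_size) / 1024 / 3600) kb_per_sec
     from v$archived_log
     where dest_id = 1
     and first_time > sysdate - 1/24;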

As indicated in the documentation, V$RECOVERY_PROGRESS is actually just a subset of V$SESSION_LONGOPS, and while all the information is available there too, V$RECOVERY_PROGRESS summarises the relevant data for media recovery progress on a standby quite nicely.

FAL failing to resolve archive gaps

I had been having problems with FAL failing to resolve archive gaps. The classic error message on my 10.2.0.3 physical standby looked something like this:

FAL[client]: Failed to request gap sequence
 GAP - thread 1 sequence 11402-11403
 DBID 2878580510 branch 589635294
FAL[client]: All defined FAL servers have been attempted.
-------------------------------------------------------------
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that is sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.
-------------------------------------------------------------

Now, a FAL (Fetch Archive Log) request is triggered by the MRP process detecting a gap in the available archives to apply. The server contacted to fulfil the request is determined by the FAL_SERVER parameter, which you set to an appropriate tnsnames alias. This essentially enables your standby to contact the primary database and get it to ship the missing log using an ARCn process.
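For reference, the parameters involved look something like this (PRIMARY and STANDBY are illustrative tnsnames aliases; FAL_CLIENT tells the FAL server how to connect back to the requesting standby):

SQL> alter system set fal_server = 'PRIMARY' scope = both;
SQL> alter system set fal_client = 'STANDBY' scope = both;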

While the standby was logging the above, no indication of any issue was being logged on the primary. The first thought, of course, was connectivity, but that was easily disproved.

So to get more information I deployed extra logging on both the primary and the standby using the log_archive_trace parameter. This parameter can trace a number of background processes and is thus useful on both the primary and the standby.
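The trace levels are additive bit values; if I remember the level list correctly, 128 covers FAL server activity, so something like:

SQL> alter system set log_archive_trace = 128 scope = both;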

In fact it was the ARCn trace on the primary that really showed what was wrong, as the trace continually contained the following lines:

FAL[server]: Duplicate request detected (dbid 0 branch 127 thread 1 sequence 11402 dest STANDBY)
FAL[server]: Request not queued in this case
FAL[server]: Duplicate request detected (dbid 0 branch 127 thread 1 sequence 11403 dest STANDBY)
FAL[server]: Request not queued in this case
*** 2008-06-03 11:57:18.229
FAL[subfal]: Waiting on response from destination for thread 1 sequence 10890.

So what is this actually telling us? The standby is asking for 11402 and 11403, and the primary knows the standby is after these; however, the primary also thinks the standby wants 10890, which the standby, having already applied it, does not care about receiving.

Essentially this log is blocking the FAL process from working properly.

It turns out this is a bug situation. We were hitting bug 5576816, with an associated MetaLink note, 5576816.8. It seems to be related to having MAX_CONNECTIONS set on the log_archive_dest_n parameter (which we did).

As an attempted workaround I tried setting log_archive_max_processes down from 6 to 1, with the following shown in the primary alert log:

Shutting down archive processes
ALTER SYSTEM SET log_archive_max_processes=1 SCOPE=BOTH;
ARCH shutting down
ARC4: Archival stopped
ARCH shutting down
ARC3: Archival stopped
ARCH shutting down
ARC2: Archival stopped
ARCH shutting down
ARC1: Archival stopped
ARC5: Becoming the 'no FAL' ARCH
ARC5: Becoming the 'no SRL' ARCH
ARC5: Becoming the heartbeat ARCH
ARCH shutting down
ARC5: Archival stopped
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
ARC0: Becoming the heartbeat ARCH

I then killed (an OS kill, that is) the final running ARC background process. Note this is safe according to Oracle support: PMON detects it is dead and restarts the ARC process:

ARCH: Detected ARCH process failure
ARCH: STARTING ARCH PROCESSES
ARC0: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE
ARC0 started with pid=276, OS id=24202
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
ARC0: Becoming the heartbeat ARCH

This did not work. When the ARC0 process restarted, it still thought it should be sending the 10890 archive log. The only solution was to bounce the instance; thankfully, with it being a RAC primary, there was no real service outage.

I won't be caring about FAL next week, as I'm off to the seaside for a week.
