FAL failing to resolve archive gaps

I had been having problems with FAL failing to resolve archive gaps. The classic error message on my 10.2.0.3 physical standby looked something like this:

FAL[client]: Failed to request gap sequence
GAP – thread 1 sequence 11402-11403
DBID 2878580510 branch 589635294
FAL[client]: All defined FAL servers have been attempted.
————————————————————-
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that is sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.
————————————————————-

Now, a FAL (Fetch Archive Log) request is triggered by the MRP process detecting a gap in the available archives to apply. The server to contact to fulfil the request is determined by the FAL_SERVER parameter which you set to an appropriate tnsnames alias. This essentially enables your standby to contact the primary database and get it to ship the appropriate missing log using the arc process.

While the above situation was logging the above on the standby, no indication of any issues were being logged on the primary. First thought of course was connectivity, but that was easily disproved.

So to get more information I deployed extra logging on both the primary and standby, using the log_archive_trace parameter. This parameter has the ability to trace a number of background processes and is thus useful on both the primary and the standby.

In fact it was the arcn process on the primary that really showed what was wrong, as the trace continually had the following lines:

FAL[server]: Duplicate request detected (dbid 0 branch 127 thread 1 sequence 11402 dest STANDBY)
FAL[server]: Request not queued in this case
FAL[server]: Duplicate request detected (dbid 0 branch 127 thread 1 sequence 11403 dest STANDBY)
FAL[server]: Request not queued in this case
*** 2008-06-03 11:57:18.229
FAL[subfal]: Waiting on response from destination for thread 1 sequence 10890.

So, what is this actually telling us? The standby is asking for 11402, and 11403 and the primary knows the standby is after these, however the primary also thinks the standby is after 10890, but the standby having already applied this does not care about receiving this archived log.

Essentially this log is blocking the FAL process from working properly.

Turns out this is a bug situation. We were hitting bug:5576816 with an associated MetaLink Note: 5576816.8. Seems like it’s related to having max_connections set on the log_archive_dest parameter (which we did).

As an attempted workaround I tried setting the log_archive_max_processes down from 6 to 1, with the following shown in the primary alert log:

Shutting down archive processes
ALTER SYSTEM SET log_archive_max_processes=1 SCOPE=BOTH;
ARCH shutting down
ARC4: Archival stopped
ARCH shutting down
ARC3: Archival stopped
ARCH shutting down
ARC2: Archival stopped
ARCH shutting down
ARC1: Archival stopped
ARC5: Becoming the ‘no FAL’ ARCH
ARC5: Becoming the ‘no SRL’ ARCH
ARC5: Becoming the heartbeat ARCH
ARCH shutting down
ARC5: Archival stopped
ARC0: Becoming the ‘no FAL’ ARCH
ARC0: Becoming the ‘no SRL’ ARCH
ARC0: Becoming the heartbeat ARCH

I then killed (OS kill that is) the final running arc background process – note this is safe according Oracle support, and pmon detects it is dead and restarts the arc process:

ARCH: Detected ARCH process failure
ARCH: STARTING ARCH PROCESSES
ARC0: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE
ARC0 started with pid=276, OS id=24202
ARC0: Becoming the ‘no FAL’ ARCH
ARC0: Becoming the ‘no SRL’ ARCH
ARC0: Becoming the heartbeat ARCH

This did not work. When the arc0 process restarted it still was thinking it should be sending the 10890 archive log. The only solution was to bounce the instance, thankfully with it being a RAC primary there was no real service outage.

I won’t be caring about FAL next week, as I’m off to the seaside for a week.

16 thoughts on “FAL failing to resolve archive gaps

  1. Thanks for your analysis. I have 10204, we have not set the max connections and still I am getting the duplicate request message in the trace file. Time for me to log a SR

    Thanks
    Mak

  2. You would not believe what worked for me. Found the suggestion in another forum. You kill the arc_ processes on your primary. They will respawn and when they do gap resolution will work. Amazing. I’ve been trying to fix this for 2 weeks and I happened to stumble across the solution.

    • Besides killing the arc processes, I have been fixing this problem with
      On Primary in SQLPLUS:
      SQL> CREATE OR REPLACE DIRECTORY asm_files
      AS ‘+NFL_DATA_01/csesbp/archivelog/2013_05_03/’ ;
      SQL> CREATE OR REPLACE DIRECTORY DSK_FILES AS ‘/u01/alan’;
      SQL> exec DBMS_FILE_TRANSFER.COPY_FILE ( ‘asm_files’ , ‘thread_2_seq_52097.269.814250165’ , ‘dsk_files’ , ‘thread_2_seq_52097.269.814250165.arc’ );
      On primary OS:
      /u01/alan> gzip *.arc
      On Standby OS:
      /u01/alan>scp 10.196.50.20:/u01/alan/*.gz .
      /u01/alan>gunzip *.arc
      On Standby RMAN:
      RMAN> catalog archivelog ‘/u01/alan/’thread_2_seq_52097.269.814250165.arc’;

      And the alertlog shows that the replication immediately starts up and continues.

  3. Yes, I followed the suggestion by Tim Willams, and I tried. Amazingly, it works.

    Here is what I did on primary side:
    $ ps -ef|grep arc|grep
    $ kill -9

    Then you will see, oracle automatically spawn all those archive processes, and go to your standby side, you should see logs are being received and media recovered.

    Thanks, Tim.

  4. You saved my day! Our standby was out of sync and wasn’t able to resync via FAL. Restarting the primary was impossible. I orakilled ARC0 and the standby caught up within minutes.

    Thank you!
    Olaf

  5. Guys

    I can also recommend that killing the ARC process is the best bet, i too tried it because my Standby was not catching up and Primary was trying to send the Archivelog which Standby had already applied.

    So, as soon as you kill the ARC, Oracle spawned a new ARC process and the FAL automatically fetched all the pending archivelogs.

    • Thanks for your help Jarneil, for windows you can get the sid, serial using following select stmt.
      select * from v$session where type = ‘BACKGROUND’ and program like ‘%ARC0%’;

      confirm and kill the process.

      Earlier I used to FTP the file and give following stmt.
      alter database register or replace logfile ”;

      Thanks
      -Vijay Parmar

  6. whew! thank you so much jarneil. this also saved my day, killing the ARC process on the Primary. we’re 500+ logs behind and been babysitting the Standby DB, manually shipping some missing logs that the FAL cannot request from Prod.

    killing the ARC process rocks.. we have only 2 archiver process, and the ora_arc1, i assumed to be the DEST_2 pair, so i just left the ora_arc0, fearing it might cause unnecessary effect to the Primary.

    thanks again.

  7. Excellent thanks! Saved me a few anxious hours. The following Unix command (for SID=abcprd) worked for me

    ps -ef|grep arc|grep casdrp|grep -v grep|awk ‘{print $2}’|xargs kill -9

    A few minutes later the standby alert.log showed:
    Thu Oct 20 12:54:46 2011
    Fetching gap sequence in thread 1, gap sequence 120862-120882

  8. Wonderful! This works. My standby too was not in sync with primary for the last few days. Thankfully, came accross this thread and killed the primary arc processes. They respawned and the standby started to fetch and apply the gap sequence.

  9. Amazing! It really works. Thank you Jarneil so much. Really don’t know what to do before. And now everything ok. Actually have some fear to kill arc_N processes (I have 2) on the primary db, because it’s a billing system. But take a decision and now happy and ready for my weekend.
    Thanks.

Leave a comment