We have been running an Advanced Queuing solution for a while now, and with a frustrating regularity we were seeing messages building up in our queue table and the dequeue process seeming to think that there was no work for it to do. These were messages that had a time delay set before being ready for dequeuing. Messages queued without a delay were processed at all times.
This was a RAC system running 10.2.0.4
Perhaps the most frustrating thing about this problem was that Oracle Support effectively refused to look into the problem because we had the parameter aq_tm_processes set. Aparently the recomendation in 10g is to have this parameter unset. Of course if this parameter is currently set you can’t actually unset it without a bounce of the database, which is not something you can automatically perform.
While awaiting the chance to bounce the database and alter this parameter we hit on a workaround to get the queue processing again. It basically involved setting the aq_tm_processes value to 0, seeing that all the queue process monitors shutdown and then set the aq_tm_process parameter back to the original value:
SQL> alter system set aq_tm_processes=0; ps -ef|grep qmn ps -ef|grep q00
There should be no background processes found. Once these processes have died it is a simple case of starting them back up again:
SQL> alter system set aq_tm_processes=1;
So fairly straightforward workaround but this was happening with a tedious regularity.
Towards a Permanent Fix
Eventually we took some downtime and unset this parameter with a database bounce. We could then perform some debugging with Oracle Support, as it did not take long for the issue to reappear. Essentially we needed to trace the background processes when the dequeueing process had frozen. This is fairly easy with oradebug:
First find the pid to trace:
[jason@test]$ ps -ef|grep qmn oracle 6669 1 0 Jan22 ? 00:00:00 ora_qmnc_TEST SQL> select pid from v$process where spid=6669; PID ---------- 14 SQL> oradebug setorapid 14 Unix process pid: 6669, image: email@example.com (QMNC) SQL> oradebug unlimit Statement processed. SQL> oradebug Event 10046 trace name context forever, level 12 Statement processed.
Once you have gathered enough debugging information you can turn off the oradebug with the following:
SQL> oradebug Event 10046 trace name context off Statement processed.
In the trace file of one of the queue monitor processes we could see the following:
*** SERVICE NAME:\(SYS$BACKGROUND) 2009-01-14 14:58:37.219 *** SESSION ID:\(1120.5) 2009-01-14 14:58:37.219 Received ORADEBUG command 'unlimit' from process Unix process pid: 26034, image: Received ORADEBUG command 'Event 10046 trace name context forever, level 12' from process Unix process pid: 26034, image: WAIT #0: nam='Streams AQ: waiting for time management or cleanup tasks' ela= 4467314 p1=0 p2=0 p3=0 obj#=-1 tim=1203071408002413 *** 2009-01-14 15:03:31.795 WAIT #0: nam='Streams AQ: waiting for time management or cleanup tasks' ela= 283204298 p1=0 p2=0 p3=0 obj#=-1 tim=1203071691206827 *** 2009-01-14 15:08:31.801 WAIT #0: nam='Streams AQ: waiting for time management or cleanup tasks' ela= 104963843 p1=0 p2=0 p3=0 obj#=-1 tim=1203071984181007 *** 2009-01-14 15:13:31.806 WAIT #0: nam='Streams AQ: waiting for time management or cleanup tasks' ela= 292974027 p1=0 p2=0 p3=0 obj#=-1 tim=1203072277155198 *** 2009-01-14 15:18:31.812 WAIT #0: nam='Streams AQ: waiting for time management or cleanup tasks' ela= 292973898 p1=0 p2=0 p3=0 obj#=-1 tim=1203072570129257 *** 2009-01-14 15:23:31.817 WAIT #0: nam='Streams AQ: waiting for time management or cleanup tasks' ela= 292973852 p1=0 p2=0 p3=0 obj#=-1 tim=1203072863103293 *** 2009-01-14 15:28:31.823
Now be aware this issue had dragged on for quite a few months by now. As soon as Oracle saw the trace file the support engineer managed to diagnose immediately what the issue was – a bug 7150994 . Unbelievably, to be tripped up by this bug you have to have upgraded to 10.2.0.4 AND be running RAC. This bug appears to affect 10.2.0.4 only. Patch 6600051 is available to fix this problem.
What I think was the most disappointing thing off all is that our symptoms, version, and the fact we were running RAC matched so closely to this bug, that Oracle Support should have been able to point this bug out to us without us suffering a few months of delays before alighting on the solution.
I fully suspect that it was a change in support personnel rather than us providing additional information that enabled the solution to be found