One day I opened a standby database to run some heavy queries. Because we don't have an Active Data Guard license, I had to stop redo apply before opening the standby read-only.
After a few hours I finished my queries, put the database back into the mount state, and restarted redo apply. This is a fairly routine way to offload ad-hoc heavy queries to a standby database.
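For context, the round trip on the standby looks roughly like the following. This is only a sketch of the usual SQL*Plus commands, assuming redo apply is started manually rather than managed by the Data Guard broker:
-- On the standby: stop redo apply, then open read-only for the ad-hoc queries
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE OPEN READ ONLY;
-- ... run the heavy queries ...
-- Bounce back into mount state and restart redo apply
SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;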
This time, however, I was not so lucky: when I checked back an hour later, Data Guard was out of sync and redo apply was stuck.
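One quick way to confirm the lag on the standby is to compare what has been received with what has been applied, for example (an illustrative query against the standard views, not the exact one I ran):
-- Transport and apply lag as seen by the standby
SELECT name, value FROM v$dataguard_stats WHERE name IN ('transport lag', 'apply lag');
-- Highest sequence received vs. highest sequence applied on the standby
SELECT MAX(sequence#) AS last_received FROM v$archived_log;
SELECT MAX(sequence#) AS last_applied  FROM v$archived_log WHERE applied = 'YES';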
From the primary's alert log I saw the error “ORA-16038: log 24 sequence# 3985 cannot be archived”:
Errors in file /opt/oracle/diag/rdbms/PROD1/trace/PROD_tt03_26029.trc:
ORA-16038: log 24 sequence# 3985 cannot be archived
2021-07-07T16:11:13.595670-04:00
Errors in file /opt/oracle/diag/rdbms/PROD1/trace/PROD_tt03_26029.trc:
ORA-16038: log 24 sequence# 3985 cannot be archived
The trace file PROD_tt03_26029.trc didn't reveal much more:
*** 2021-07-07T14:58:49.273535-04:00
*** SESSION ID:(2940.103) 2021-07-07T14:58:49.273568-04:00
*** CLIENT ID:() 2021-07-07T14:58:49.273574-04:00
*** SERVICE NAME:(SYS$BACKGROUND) 2021-07-07T14:58:49.273579-04:00
*** MODULE NAME:() 2021-07-07T14:58:49.273585-04:00
*** ACTION NAME:() 2021-07-07T14:58:49.273590-04:00
*** CLIENT DRIVER:() 2021-07-07T14:58:49.273595-04:00
Log read is SYNCHRONOUS though disk_asynch_io is enabled!
krr_seed_readahead:1Async IO is not on. Not seeding any prefetch. krrxflg 0x1893.
*** 2021-07-07 16:11:13.087831 [krsb.c:8012]
*** 2021-07-07T16:11:13.088151-04:00
krsb_stream_reap: Error 3111 reaping buffers
*** 2021-07-07 16:11:13.575882 [krsh.c:6348]
LAD:1 not using SRLs; cannot reconnect
*** 2021-07-07 16:11:13.576186 [krsh.c:6348]
Error 12152 archiving LNO:24 to 'PROD2'
krsb_stream_dispatch: No active stream for remote target:1, returning
krsu_upi_status: Error 1041 detaching RFS from standby instance at host 'PROD2'
*** 2021-07-07 16:11:13.590112 [krsi.c:8744]
krsi_dst_fail_caller: LAD:1 err:12152 force:0 blast:1
krsd_multidest_clear: Clearing Multidest for LAD:1
*** 2021-07-07 16:11:13.594894 [krss.c:1561]
krss_req_xpt_async_dgrd: Downgraded LAD:1 to single destination
<error barrier> at 0x7ffd950eddd0 placed krsw.c@9176
ORA-16038: log 24 sequence# 3985 cannot be archived
<error barrier> at 0x7ffd950eddd0 placed krsw.c@9176
ORA-16038: log 24 sequence# 3985 cannot be archived
From the standby alert log, it looked like some kind of corruption was blocking redo apply:
2021-07-07T17:17:25.250650-04:00
rfs (PID:9543): Opened log for T-1.S-3998 dbid 3561959174 branch 1074475546
2021-07-07T17:19:39.810093-04:00
rfs (PID:75623): Archived Log entry 602479 added for B-1074475546.T-1.S-3985 ID 0xd84220de LAD:1
2021-07-07T17:19:40.003503-04:00
PR00 (PID:80608): Media Recovery Log /dbase/PROD2/archivelog/2021_07_07/o1_mf_1_3985_jgd2np3h_.arc
2021-07-07T17:19:52.855306-04:00
Incomplete read from log member '/dbase/PROD2/archivelog/2021_07_07/o1_mf_1_3985_jgd2np3h_.arc'. Trying next member.
Errors in file /opt/oracle/diag/rdbms/PROD2/trace/PROD_pr00_80608.trc (incident=160500):
ORA-00353: log corruption near block 2799624 change 211541783510 time 07/07/2021 14:57:56
ORA-00334: archived log: '/dbase/PROD2/archivelog/2021_07_07/o1_mf_1_3985_jgd2np3h_.arc'
Incident details in: /opt/oracle/diag/rdbms/PROD2/incident/incdir_160500/PROD_pr00_80608_i160500.trc
2021-07-07T17:19:54.927667-04:00
PR00 (PID:80608): MRP0: Detected read corruption! Retry recovery once log is re-fetched...
2021-07-07T17:19:54.927997-04:00
Errors in file /opt/oracle/diag/rdbms/PROD2/trace/PROD_pr00_80608.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2799624 change 211541783510 time 07/07/2021 14:57:56
ORA-00334: archived log: '/dbase/PROD2/archivelog/2021_07_07/o1_mf_1_3985_jgd2np3h_.arc'
2021-07-07T17:19:59.496316-04:00
rfs (PID:75623): Allowing overwrite of partial archive log for T-1.S-3985
2021-07-07T17:19:59.770587-04:00
rfs (PID:75623): Opened log for T-1.S-3985 dbid 3561959174 branch 1074475546
2021-07-07T17:20:00.611785-04:00
Incomplete read from log member '/dbase/PROD2/archivelog/2021_07_07/o1_mf_1_3985_jgd2np3h_.arc'. Trying next member.
2021-07-07T17:20:00.612922-04:00
Errors in file /opt/oracle/diag/rdbms/PROD2/incident/incdir_160500/PROD_mz00_10084_i160500_a.trc:
ORA-00353: log corruption near block 2799624 change 211541783510 time 07/07/2021 14:57:56
ORA-00334: archived log: '/dbase/PROD2/archivelog/2021_07_07/o1_mf_1_3985_jgd2np3h_.arc'
Checker run found 1 new persistent data failures
2021-07-07T17:20:01.746041-04:00
Recovery interrupted!
2021-07-07T17:20:09.191192-04:00
MRP: Archival for thread 1 sequence 3985 in progress
2021-07-07T17:20:39.242864-04:00
MRP: Lock acquisition timeout for thread 1 sequence 3985
2021-07-07T17:20:39.244439-04:00
Errors in file /opt/oracle/diag/rdbms/PROD2/trace/PROD_pr00_80608.trc:
ORA-16145: archival for thread# 1 sequence# 3985 in progress
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2799624 change 211541783510 time 07/07/2021 14:57:56
ORA-00334: archived log: '/dbase/PROD2/archivelog/2021_07_07/o1_mf_1_3985_jgd2np3h_.arc'
Some recovered datafiles maybe left media fuzzy
Media recovery may continue but open resetlogs may fail
Stopping change tracking
2021-07-07T17:20:39.275513-04:00
Errors in file /opt/oracle/diag/rdbms/PROD2/trace/PROD_pr00_80608.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2799624 change 211541783510 time 07/07/2021 14:57:56
ORA-00334: archived log: '/dbase/PROD2/archivelog/2021_07_07/o1_mf_1_3985_jgd2np3h_.arc'
2021-07-07T17:20:59.317793-04:00
Started logmerger process
2021-07-07T17:20:59.328011-04:00
PR00 (PID:10472): Managed Standby Recovery not using Real Time Apply
2021-07-07T17:21:01.858702-04:00
Parallel Media Recovery started with 80 slaves
2021-07-07T17:21:02.128476-04:00
Stopping change tracking
PR00 (PID:10472): Media Recovery Waiting for T-1.S-3985 (in transit)
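Since the standby kept re-fetching sequence 3985 and detecting the same corruption, my next step was to get a clean copy of that archived log onto the standby myself and make it visible to media recovery. Copying aside, the registration step on the standby looks roughly like this (the file name below is a placeholder for wherever the manual copy lands):
-- Register a manually copied archived log with the standby controlfile
ALTER DATABASE REGISTER LOGFILE '/dbase/PROD2/archivelog/2021_07_07/o1_mf_1_3985_manual.arc';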
I started copying log sequence 3985 over. The file was about 2 GB and usually took about a minute to copy, but this time the transfer was crawling. Apparently something was wrong with the network!
Checking with the network administrator, it turned out he had adjusted some network parameters to address an unrelated issue while I was running my ad-hoc queries, and the change slowed down the network between the primary and standby databases.
After he restored the network settings, the transfer speed recovered and Oracle Data Guard resolved the gap on its own.
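To double-check that automatic gap resolution had really caught up, something like the following on the standby is enough (again, an illustrative query against standard views): no rows from V$ARCHIVE_GAP means no remaining gap.
-- Any remaining archive log gap on the standby (no rows means no gap)
SELECT thread#, low_sequence#, high_sequence# FROM v$archive_gap;
-- Apply lag should shrink back to normal once the gap is resolved
SELECT name, value FROM v$dataguard_stats WHERE name = 'apply lag';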