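I recently hit the error ORA-00600 [kjctr_pbmsg:badbmsg2], which caused a RAC node instance to restart. The problem was eventually confirmed to be caused by an unstable private network (the interconnect).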
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 12379): terminating the instance due to error 484
1. The detailed analysis follows. First, check the logs:
alert log
Mon Aug 11 23:53:10 2014
Errors in file /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc (incident=1104178):
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/cdrdb/orcl/incident/incdir_1104178/orcl_lms1_12379_i1104178.trc
Mon Aug 11 23:53:12 2014
Dumping diagnostic data in directory=[cdmp_20140811235312], requested by (instance=1, osid=12379 (LMS1)), summary=[incident=1104178].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Aug 11 23:53:13 2014
Sweep [inc][1104178]: completed
Sweep [inc2][1104178]: completed
Errors in file /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc:
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 12379): terminating the instance due to error 484
Mon Aug 11 23:53:22 2014
ORA-1092 : opitsk aborting process
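The alert log excerpt above can be summarized mechanically. The following is a minimal sketch, not an Oracle tool: the function name `summarize_alert_log` and the regular expressions are my own assumptions, based only on the line formats visible in this excerpt. It pulls out the first ORA-00600 argument and the process that terminated the instance.

```python
import re

# First bracketed argument of an ORA-00600 line (format taken from the excerpt above).
ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: \[([^\]]*)\]")
# Line written when a background process brings the instance down.
TERMINATE_RE = re.compile(
    r"^(\w+) \(ospid: (\d+)\): terminating the instance due to error (\d+)"
)

def summarize_alert_log(text):
    """Return (distinct ORA-00600 first arguments, instance terminations)."""
    first_args = []
    terminations = []
    for line in text.splitlines():
        m = ORA600_RE.search(line)
        if m and m.group(1) not in first_args:
            first_args.append(m.group(1))
        t = TERMINATE_RE.match(line.strip())
        if t:
            # (process name, OS pid, error number)
            terminations.append((t.group(1), int(t.group(2)), int(t.group(3))))
    return first_args, terminations

# Sample lines copied from the alert log excerpt above.
sample = """\
Mon Aug 11 23:53:10 2014
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 12379): terminating the instance due to error 484
"""

first_args, terminations = summarize_alert_log(sample)
print(first_args)
print(terminations)
```

Repeated ORA-00600 lines are collapsed to one entry, so on a long alert log the output shows at a glance which internal error arguments occurred and which process took the instance down.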
orcl_lms1_12379_i1104178.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /oracle/app/oracle/product/11.2.0/dbhome_1
System name: HP-UX
Node name: h7sd05da
Release: B.11.31
Version: U
Machine: ia64
Instance name: orcl
Redo thread mounted by this instance: 1
Oracle process number: 14
Unix process pid: 12379, image: oracleh7sd05da (LMS1)
Dump continued from file: /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
========= Dump for incident 1104178 (ORA 600 [kjctr_pbmsg:badbmsg2]) ========
*** 2014-08-11 23:53:10.339
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.
----- Call Stack Trace -----
skdstdst <- ksedst <- dbkedDefDump <- ksedmp <- ksfdmp
<- $cold_dbgexPhaseII <- dbgexProcessError <- dbgeExecuteForError <- dbgePostErrorKGE <- 2352
<- dbkePostKGE_kgsf <- 128 <- kgeadse <- kgerinv_internal <- kgerinv
<- kgeasnmierr <- kjctr_pbmsg <- kjctr_rksxp <- kjctrcv <- kjcsrmg
<- kjmsm <- ksbrdp <- opirip <- opidrv <- sou2o
<- opimai_real <- ssthrdmain <- main <- main_opd_entry
--------------------- Binary Stack Dump ---------------------
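The call stack above is printed innermost frame first, with `<-` pointing to the caller. As a rough sketch (the helper names `parse_call_stack` and `caller_of` are hypothetical, and treating `kgeasnmierr` as the routine that raises the internal error is my assumption from the stack order), the frame that raised the assertion can be located like this:

```python
def parse_call_stack(text):
    """Split a '<-'-separated short stack into a list of frames, innermost first."""
    frames = []
    for part in text.replace("\n", " ").split("<-"):
        name = part.strip()
        if name:
            frames.append(name)
    return frames

def caller_of(frames, function):
    """Return the frame listed right after `function` (its caller), or None."""
    if function in frames:
        idx = frames.index(function)
        if idx + 1 < len(frames):
            return frames[idx + 1]
    return None

# Call stack copied from the incident trace above.
stack = """\
skdstdst <- ksedst <- dbkedDefDump <- ksedmp <- ksfdmp
<- $cold_dbgexPhaseII <- dbgexProcessError <- dbgeExecuteForError <- dbgePostErrorKGE <- 2352
<- dbkePostKGE_kgsf <- 128 <- kgeadse <- kgerinv_internal <- kgerinv
<- kgeasnmierr <- kjctr_pbmsg <- kjctr_rksxp <- kjctrcv <- kjcsrmg
<- kjmsm <- ksbrdp <- opirip <- opidrv <- sou2o
<- opimai_real <- ssthrdmain <- main <- main_opd_entry
"""

frames = parse_call_stack(stack)
raiser = caller_of(frames, "kgeasnmierr")
print(raiser)
```

Under that assumption the result matches the first argument of the ORA-00600 error, `kjctr_pbmsg`, which is consistent with the error being raised while LMS1 processed a received batch message.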
2. Check the patch information. The current version is 11.2.0.2.1:
$ opatch lsinventory
Installed Top-level Products (1):
Oracle Database 11g 11.2.0.2.0
Patch 10248523 : applied on Fri Mar 25 09:33:02 GMT+08:00 2011
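When comparing patch levels across both nodes, it can help to reduce the `opatch lsinventory` output to a list of patch numbers. This is only a sketch: `applied_patches` is a hypothetical helper, and the pattern assumes the `Patch <number> : applied on <date>` line format shown above.

```python
import re

# Matches lines such as "Patch 10248523 : applied on Fri Mar 25 09:33:02 GMT+08:00 2011".
PATCH_RE = re.compile(r"^Patch\s+(\d+)\s*:\s*applied on (.+)$")

def applied_patches(lsinventory_output):
    """Return a list of (patch number, applied-on text) from opatch output."""
    patches = []
    for line in lsinventory_output.splitlines():
        m = PATCH_RE.match(line.strip())
        if m:
            patches.append((m.group(1), m.group(2).strip()))
    return patches

# Sample lines copied from the opatch output above.
sample = """\
Installed Top-level Products (1):
Oracle Database 11g 11.2.0.2.0
Patch 10248523 : applied on Fri Mar 25 09:33:02 GMT+08:00 2011
"""

patches = applied_patches(sample)
print(patches)
```

Running the same function on the output from each node and comparing the two lists is a quick way to confirm that the nodes are patched identically.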
3. Search for documents and bugs related to this error. The relevant bugs and their descriptions are listed below:
Bug 18015296 : ORA-600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
The assert is triggered because the batch message is invalid/corrupt. This looks like some form of underlying infrastructure/network issue. Please work with customer to have this checked and tested.
Bug 18771858 : LMS0 TERMINATING THE INSTANCE DUE TO ERROR 484 (ORA-00600 [KJCTR_PBMSG:BADBMSG2]) in 11.2.0.3
From the past bug 16240464 & bug 18015296 , both were closed by dev as not a product defect.
It was suggested that the problem was outside the Oracle stack, at the network level. So please check with CT along the same lines to identify network problems (if any) with help from their OS/Net support. Refer to Doc ID 563566.1, Troubleshooting gc block lost and Poor Network Performance in a RAC Environment.
Bug 16240464 : INSTANCE CRASH WITH ORA-00600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
This looks like some form of underlying infrastructure/network issue, please work with customer to have this checked and tested.
Bug 17452853 : LNX64-12.1-EF,DB INST CRASH WITH LMS4 HIT ORA-600 [KJCTR_PBMSG:BADBMSG2] in 12.1.0.2
Bug 17049773 Diagnostic enhancement to give additional parameter in error ORA-600 [ kjctr_pbmsg:badbmsg2] in 12.1.0.1
Note: This fix will not address the root cause of the error but the additional information may help with diagnosis of the cause.
Bug 13917456 : LNX64-12.1-UD: ASM LMD HIT ORA-00600 KJCTR_PBMSG:BADBMSG2 IN NON-UPGRADED NODES in 12.1.0.0.2
It may occur during the upgrade from 11.2.0.3 to 12.1. Not related to this SR.
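Since the bug notes point to the network and to Doc ID 563566.1 on gc block lost, one possible way to quantify the symptom is to compare two snapshots of the global cache statistics. The sketch below is my own illustration, not taken from that note: `lost_block_ratio` is a hypothetical helper, the snapshot dictionaries are assumed to be filled from V$SYSSTAT by some other means, and the numbers are made up for the example.

```python
def lost_block_ratio(before, after):
    """Fraction of received global cache blocks that were lost between two snapshots.

    `before` and `after` map statistic names to cumulative values.
    Returns None when no blocks were received in the interval.
    """
    def delta(name):
        return after.get(name, 0) - before.get(name, 0)

    lost = delta("gc blocks lost")
    received = delta("gc cr blocks received") + delta("gc current blocks received")
    if received <= 0:
        return None
    return lost / received

# Illustrative values only, not measurements from this system.
before = {
    "gc blocks lost": 100,
    "gc cr blocks received": 50000,
    "gc current blocks received": 30000,
}
after = {
    "gc blocks lost": 180,
    "gc cr blocks received": 70000,
    "gc current blocks received": 50000,
}

ratio = lost_block_ratio(before, after)
print(ratio)
```

A ratio that keeps growing between snapshots would support the private network explanation, while a flat value of zero would suggest looking elsewhere. What counts as an acceptable level is not something this sketch decides.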
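4. At this point, I need to check the AWR report, the OSWatcher data, and all LMS, LMD, LMON, LMHB and DIAG trace files from the time of the problem to see whether more information was recorded.
I also use cluvfy and ORAchk to check the overall RAC environment.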
--. AWR report 22:00~23:00 on Aug 11 from both nodes.
--. Deploy OSWatcher, then collect the current OS information while the database workload is high.
--. All the LMS, LMD, LMON, LMHB and DIAG trace files from both nodes.
--. CVU output:
cluvfy stage -pre crsinst -n
-verbose
--. Please run ORAchk as root.
ORAchk - Health Checks for the Oracle Stack (Doc ID 1268927.2)
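To gather the LMS, LMD, LMON, LMHB and DIAG trace files requested above, the trace directory listing can be filtered by file name. This is a sketch under assumptions: the naming pattern `<sid>_<process>_<pid>.trc` is inferred from `orcl_lms1_12379.trc` in the alert log, `rac_background_traces` is a hypothetical helper, and every file name in the sample other than the LMS1 ones is invented for illustration.

```python
import re

# <sid>_<process>_<pid>.trc, restricted to the RAC background processes of interest.
TRACE_RE = re.compile(
    r"^(?P<sid>\w+?)_(?P<proc>lms\d+|lmd\d+|lmon|lmhb|diag)_(?P<pid>\d+)\.trc$"
)

def rac_background_traces(filenames):
    """Return sorted (process, pid, filename) tuples for matching trace files."""
    selected = []
    for name in filenames:
        m = TRACE_RE.match(name)
        if m:
            selected.append((m.group("proc").upper(), int(m.group("pid")), name))
    return sorted(selected)

# Only the LMS1 names come from the logs above; the rest are hypothetical.
names = [
    "orcl_lms1_12379.trc",
    "orcl_lmon_12301.trc",
    "orcl_ora_4321.trc",
    "orcl_diag_12290.trc",
    "orcl_lms1_12379_i1104178.trc",
]

traces = rac_background_traces(names)
for proc, pid, name in traces:
    print(proc, pid, name)
```

Foreground process traces and incident dump files do not match the pattern, so they are skipped. In practice the list of names would come from something like `os.listdir` on the trace directory of each node.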