Time: 2021-07-01 10:21:17
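Today a disk (EMC DMX4) was added to an ASM disk group on our core system, and both nodes of the database RAC promptly crashed, which left us in a cold sweat.
Checking the logs: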
NOTE: Disk in mode 0x8 marked for de-assignment
ERROR: diskgroup DGIDX1 was not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "16" is missing from group number "4"
ERROR: ALTER DISKGROUP DGIDX1 MOUNT /* asm agent *//* {1:8345:41140} */
Thu Nov 06 15:17:41 2014
Errors in file /oraclelog/grid/diag/asm/+asm/+ASM1/trace/+ASM1_pz99_22545054.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 16: Device busy
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:0 disk:10 AU:0 offset:0 size:4096
Errors in file /oraclelog/grid/diag/asm/+asm/+ASM1/trace/+ASM1_pz99_22545054.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 16: Device busy
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:0 disk:9 AU:0 offset:0 size:4096
Errors in file /oraclelog/grid/diag/asm/+asm/+ASM1/trace/+ASM1_pz99_22545054.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 16: Device bus
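When the alert log is long, it can help to reduce the repeated "Read Failed" warnings to one line per affected disk. The following is only a rough sketch: `failed_disks` is a hypothetical helper name, and it assumes each warning sits on its own line in the format shown in the log excerpt above.

```shell
# failed_disks: reads ASM alert log text on stdin and prints one line per
# distinct "group:N disk:M" pair found in "WARNING: Read Failed." lines,
# followed by how many times that pair appeared.
# Assumption: each warning is on its own line, as in the original log file.
failed_disks() {
    sed -n 's/.*WARNING: Read Failed\. \(group:[0-9]* disk:[0-9]*\).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $3, "failures=" $1 }'
}
```

It would be used as `failed_disks < alert.log`, where `alert.log` stands for whatever alert log file is being inspected.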
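The newly added disk could not be used. Both nodes then attempted ASM and database instance recovery, and the second node did come back up, so the remaining problem was the read failure on the first node. The cause may have been a storage-side lock on this LUN for that host, a SAN link problem, or something similar. Querying the v$asm_operation view from the ASM instance on the second node returned no rows, so the rebalance for this disk appeared to have finished. To get the production system back online as soon as possible, we chose to drop the problematic LUN from ASM through the second instance and restore the original environment. After performing the operation in asmca, we checked the rebalance progress: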
SQL> select * from v$asm_operation;

GROUP_NUMBER OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE
------------ ----- ---- ---------- ---------- ---------- ---------- ----------
EST_MINUTES ERROR_CODE
----------- --------------------------------------------
           4 REBAL RUN           1          1      36659      49690       2899
          4
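The EST_MINUTES value in this output is consistent with dividing the remaining work by the rate: (49690 - 36659) / 2899 is a little under 4.5, which truncates to 4. A minimal sketch of that arithmetic follows. The function name `rebal_eta` is hypothetical, and Oracle's own estimate may be computed differently, so treat this as a sanity check only.

```shell
# rebal_eta SOFAR EST_WORK EST_RATE
# Prints the whole minutes remaining (rounded down) for an ASM rebalance,
# using the columns of the same names from v$asm_operation.
# Assumption: EST_RATE is allocation units per minute and stays constant.
rebal_eta() {
    sofar=$1
    est_work=$2
    est_rate=$3
    if [ "$est_rate" -le 0 ]; then
        # No usable rate yet, so no estimate can be made.
        echo "unknown"
        return 1
    fi
    remaining=$((est_work - sofar))
    if [ "$remaining" -lt 0 ]; then
        remaining=0
    fi
    echo $((remaining / est_rate))
}
```

For the row above, `rebal_eta 36659 49690 2899` prints 4, matching EST_MINUTES.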
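About 49 GB of data in total had to be rebalanced. Once SOFAR reached EST_WORK, the LUN was successfully dropped, and at that point the ASM instance on the first node also started successfully. Lesson learned: before the database actually starts using a disk, it is very important to verify that the device is usable. Oracle ACS also pointed out a tool, kfod (in $GRID_HOME/bin), which can quickly check whether a LUN is valid. 盖总 (Mr. Gai) has also given a brief introduction to it: kfod in oracle_asm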
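# On node 2, the disk's information can be found
$ kfod disk=all |grep 113
 139: 51930 Mb /dev/rhd113 grid asmadmin
# On node 1, the disk's information cannot be found, which shows that Oracle GI cannot correctly identify this LUN.
$ kfod disk=all |grep 113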
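A plain `grep 113` can also match unrelated fields, such as a size or a disk index that happens to contain 113. As a rough sketch, the check above could be wrapped in a small helper that looks for the exact device path in the kfod output. The helper name `disk_listed` is hypothetical, and it assumes the device path appears as its own whitespace-separated field, as in the node 2 output above.

```shell
# disk_listed DEVICE
# Reads `kfod disk=all` output on stdin and exits 0 only if DEVICE appears
# as an exact whitespace-separated field on some line, otherwise exits 1.
# Assumption: kfod prints the device path as its own column.
disk_listed() {
    awk -v dev="$1" '
        { for (i = 1; i <= NF; i++) if ($i == dev) found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Run on every node before adding the disk, for example `kfod disk=all | disk_listed /dev/rhd113 || echo "not visible on $(hostname)"`. Any node that prints the message would be worth investigating before the disk goes into a disk group.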
>o<
Original article: 添加磁盘导致的ASM实例crash (ASM instance crash caused by adding a disk). Thanks to the original author for sharing.