当前位置:Gxlcms > 数据库问题 > 小机上运行ORACLE需要注意的进程调度BUG

小机上运行ORACLE需要注意的进程调度BUG

时间:2021-07-01 10:21:17 帮助过:9人阅读

小y今天要和大家分享的是一个疑难杂症的分析过程。如果大家有耐心读完这个案例,一定会或多或少有些收获,也就没浪费小y的一片苦心。

具体来说是一个应用间歇性局部挂起案例的分析过程,报告中将对Oracle数据库稳定运行的共性风险和隐患作出提醒。

1问题描述

据客户反映,应用会间歇性出现异常,包括insert单条记录在内的操作长时间无法完成,按照客户的说法,数据库内可能有“死锁”现象,希望能够找到问题发生的根因,提出解决方案,以避免问题再次发生。

20151223日,问题再次发生,客户再次联系到小y,y通过远程方式进行了信息收集和故障诊断,最终定位了问题的根本原因。

环境介绍:

操作系统 HPUX IA64 B.11.31

数据库 ORACLE 10.2.0.5,单实例


2分析过程

>>2.1 异常时刻数据库出现异常等待

技术分享

可以看到:

2个会话在等待行锁(拿不到事务锁,需要一直等待),另外有一个会话在等待“undo segment extension”。


>>>2.2 梳理异常等待之间的关系

1)分析行锁等待的阻塞者

技术分享

可以看到:

SID 285/290的两个会话都是被SID=315的会话阻塞了,他们在等待行锁,时间已经超过60000秒。

2)查看阻塞者SID=315在做什么

技术分享

可以看到:

SID=315的会话阻塞了其他两个会话,他本身也处于一个资源的等待上,在等待“undo segment extension”,已经等了70384秒了!该等待事件没有阻塞者。

3)SID=315在执行的SQL语句

INSERT INTO TABLE_NAME(COL1,COL2,COL3,COL4,COL5,COL6,COL7)VALUES(:1,:2,:3,:4,:5,:6,:7)


>>>2.3 什么是“undo segment extension”等待事件

“undo segment extension”即等待回滚段扩展完成。当执行增删改等操作时,数据库需要回滚段来存储前镜像,当回滚段空间不足时,则需要扩展。

具体来说,undo segment的扩展或者回收(extend / shrink)都是前台进程通过通知SMON后台进程来完成的。


发出下列命令查看undo的使用情况,UNDO表有空有1个文件27750M,其中目前活动的值有8M,未过undo retention1892M,过了undo retention的有40M未发现异常。


技术分享


>>>2.4 收集数据库hanganalyzesystemstate信息

发出下列命令异常时候的hanganalyzesystemstate信息

SQL> oradebug setmypid

Statement processed.

SQL> oradebug hanganalyze 3

Hang Analysis in /oracle/admin/xxdb/udump/xxdb_ora_14136.trc

SQL>

SQL> oradebug dump systemstate 266

Statement processed.

SQL> oradebug tracefile_name

/oracle/admin/xxdb/udump/xxdb_ora_14136.trc


>>>2.5 问题在收集信息后自动解决

在发出上述命令收集相关信息后,再次检查,发现数据库异常等待居然已经自动解决掉了。

从原理和经验分析,这是因为oradebug 在收集systemstate dump时,其中会调用dbx等OS命令去做进程堆栈的打印,此时会将进行wakeup(唤醒)。

如下所示


技术分享


>>>2.6 获得SID:315会话的活动会话历史

技术分享

可以看到,确实一直在等待undo segment extension


>>>2.7 分析systemstate dump定位问题根因

其中SID=315的会话在等待”undo segment extension”,该会话对应PROCESS 19 SSD当中的信息如下:

PROCESS 19:

----------------------------------------

SO: c00000003949b948, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00

(process) Oracle pid=19, calls cur/top: c0000000397209b0/c0000000397209b0, flag: (0) -

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 121

last post received-location: kcbzww

last process to post me: c000000039496148 1 22

last post sent: 0 0 121

last post sent-location: kcbzww

last process posted by me: c000000039496148 1 22

(latch info) wait_event=0 bits=0

Process Group: DEFAULT, pseudo proc: c000000039529928

O/S info: user: oracle, term: UNKNOWN, ospid: 11880

OSD pid info: Unix process pid: 11880, image: oracle@ap-machine-

*** 2015-12-22 10:34:53.431

Short stack dump:

ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<kernel><-_pw_wait()+48<-pw_wait()+128<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktugur()+4416<-ktuchg()+1280<-ktbchg2()+704<-kdiins0()+267536<-kdiinsp()+320<-kauxsin()+2960<-insidx()+1744<-insrow()+1440<-insdrv()+960<-inscovexe()+1408<-insExecStmtExecIniEngine()+176<-insexe()+1040<-opiexe()+13776<-kpoal8()+3808<-opiodr()+2144<-ttcpip()+1680<-opitsk()+2368<-opiino()+1664<-opiodr()+2144<-opidrv()+1248<-sou2o()+240<-opimai_real()+496<-main()+240<-main_opd_entry()+80

----------------------------------------

SO: c0000000396d80d8, type: 4, owner: c00000003949b948, flag: INIT/-/-/0x00

(session) sid: 315 trans: c0000000355b2a28, creator: c00000003949b948, flag: (100041) USR/- BSY/-/-/-/-/-

DID: 0001-0013-00000027, short-term DID: 0000-0000-00000000

txn branch: 0000000000000000

oct: 2, prv: 0, sql: c0000000384bad50, psql: c0000000384bad50, user: 33/XXDB

service name: xxdb

O/S info: user: , term: , ospid: 1234, machine: rvwapp2-1

program:

waiting for ‘db file sequential read‘ wait_time=0, seconds since wait started=0

file#=c, block#=2f359, blocks=1

blocking sess=0x0000000000000000 seq=42271

Dumping Session Wait History

for ‘undo segment extension‘ count=1 wait_time=2 min 45 sec

segment#=6, =0, =0

for ‘buffer busy waits‘ count=1 wait_time=0.000009 sec

file#=2, block#=59, class#=1b

for ‘SQL*Net message from client‘ count=1 wait_time=0.000087 sec

driver id=28444553, #bytes=1, =0

for ‘SQL*Net message to client‘ count=1 wait_time=0.000001 sec

driver id=28444553, #bytes=1, =0

for ‘SQL*Net message from client‘ count=1 wait_time=0.000086 sec

driver id=28444553, #bytes=1, =0

for ‘SQL*Net message to client‘ count=1 wait_time=0.000001 sec

driver id=28444553, #bytes=1, =0

for ‘SQL*Net message from client‘ count=1 wait_time=0.000086 sec

driver id=28444553, #bytes=1, =0

for ‘SQL*Net message to client‘ count=1 wait_time=0.000001 sec

driver id=28444553, #bytes=1, =0

for ‘SQL*Net message from client‘ count=1 wait_time=0.000087 sec

driver id=28444553, #bytes=1, =0

for ‘SQL*Net message to client‘ count=1 wait_time=0.000001 sec

driver id=28444553, #bytes=1, =0

Sampled Session History of session 315 serial 4544

---------------------------------------------------

The sampled session history is constructed by sampling

the target session every 1 second. The sampling process

captures at each sample if the session is in a non-idle wait,

an idle wait, or not in a wait. If the session is in a

non-idle wait then one interval is shown for all the samples

the session was in the same non-idle wait. If the

session is in an idle wait or not in a wait for

consecutive samples then one interval is shown for all

the consecutive samples. Though we display these consecutive

samples in a single interval the session may NOT be continuously

idle or not in a wait (the sampling process does not know).

The history is displayed in reverse chronological order.

sample interval: 1 sec, max history 120 sec

---------------------------------------------------

[120 samples, 10:32:52 - 10:34:53]

waited for ‘undo segment extension‘, seq_num: 42270

p1: ‘segment#‘=0x6

p2: ‘‘=0x0

p3: ‘‘=0x0

time_waited: >= 120 sec (still in wait)

---------------------------------------------------

Sampled Session History Summary:

longest_non_idle_wait: ‘undo segment extension‘

[120 samples, 10:32:52 - 10:34:53]

time_waited: >= 120 sec (still in wait)

可以看到:

进程在被ORADEBUG间接唤醒后,不再等待undo segment extension,而是做 db file sequential read

从前面的分析得知,“undo segment extension”即等待回滚段扩展完成。当执行增删改等操作时,数据库需要回滚段来存储前镜像,当回滚段空间不足时,则需要扩展。具体来说,undo segment的扩展或者回收(extend / shrink)都是前台进程通过通知SMON后台进程来完成的。CallStack堆栈的调用中,最后停在“pw_wait”的调用上,即说明在等待SMON返回扩展UNDO SEGMENT成功与否的消息。

因此,我们需要查看SMON进程的状态。

检查SMON的信息如下:

PROCESS 8:

----------------------------------------

SO: c000000039496148, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00

(process) Oracle pid=8, calls cur/top: c00000003971e868/c00000003971e868, flag: (16) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 121

last post received-location: kcbzww

last process to post me: c0000000394a0948 211 0

last post sent: 0 0 24

last post sent-location: ksasnd

last process posted by me: c000000039495148 1 6

(latch info) wait_event=0 bits=0

Process Group: DEFAULT, pseudo proc: c000000039529928

O/S info: user: oracle, term: UNKNOWN, ospid: 10072

OSD pid info: Unix process pid: 10072, image: oracle@ap-machine- (SMON)

Short stack dump:

ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<kernel><-_pw_wait()+48<-pw_wait()+352<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktmmon()+1168<-ktmSmonMain()+64<-ksbrdp()+2368<-opirip()+1184<-opidrv()+1184<-sou2o()+240<-opimai_real()+336<-main()+240<-main_opd_entry()+80

----------------------------------------

----------------------------------------

SO: c0000000396ead68, type: 4, owner: c000000039496148, flag: INIT/-/-/0x00

(session) sid: 329 trans: 0000000000000000, creator: c000000039496148, flag: (100051) USR/- BSY/-/-/-/-/-

DID: 0001-0008-00000002, short-term DID: 0000-0000-00000000

txn branch: 0000000000000000

oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS

service name: SYS$BACKGROUND

waiting for ‘smon timer‘ wait_time=0, seconds since wait started=3109

sleep time=12c, failed=0, =0

blocking sess=0x0000000000000000 seq=7382

Dumping Session Wait History

for ‘smon timer‘ count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for ‘smon timer‘ count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for ‘smon timer‘ count=1 wait_time=4 min 11 sec

sleep time=12c, failed=0, =0

for ‘smon timer‘ count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for ‘smon timer‘ count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for ‘smon timer‘ count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for ‘smon timer‘ count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for ‘smon timer‘ count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for ‘smon timer‘ count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for ‘smon timer‘ count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

Sampled Session History of session 329 serial 1


可以看到,SMON进程在等“SMON TIMER”,即空闲等待。


这说明SMON没有因为阻塞在异常的等待上,导致无法腾出时间来处理前台进程发过来的undo segment扩展请求。


>>>2.8 定位问题的根本原因


综上所有现象:

1) 前台进程SID 315smon发送一条消息,请求对回滚段进行扩展,并在“undo segment extension“事件上等待,进程callStack停在pw_wait,post/wait,等待消息的返回;

2) SMON进程在等待”SMON TIMER”,即空闲等待。此时SMON可能已经完成了UNDO SEGMENT的扩展请求并返回了前台进程;也可能没有收到前台进程发送过来的请求;


3) ORADEBUGSID 315收集信息并间接唤醒了该前台进程后,前台进行可以继续往下工作,这说明,步骤2)的消息已经返回给前台进程,只是由于操作系统调度进制的问题,未能即使将前台进程调度到CPU上,拿到UNDO扩展成功的消息。


结上所述,我们可以判定:造成该故障的根本原因是操作系统调度机制的问题。

环境介绍:

操作系统 HPUX IA64 B.11.31

数据库 ORACLE 10.2.0.5,单实例


因此,我们在
ORACLE
官方网站BUG中以“Hpux pw_wait“做关键字索HPUX已知的调度缺陷,可知命中操作系统缺陷。

The problem is a defect in the pw_wait() O/S system call.
On HP-UX 11.31, this is caused by a problem with the HP-UX scheduler patch PHKL_37456:

PHKL_37456 scheduler cumulative patch

HPUX11.31上,当安装了PHKL_37456这个调度补丁后,操作系统调用pw_wait存在缺陷,导致进程无法从post/wait中被唤醒,解决方案是Solution: on 11.31 install PHKL_38397 or later equivalent.

人气教程排行