时间:2021-07-01 10:21:17 帮助过:20人阅读
Topic: this post is about Linux perf and uprobes for tracing and profiling Oracle workloads for advanced troubleshooting.
Context
The recent progress and maturity of some of the Linux dynamic tracing tools has raised interest in applying these techniques to Oracle troubleshooting and performance investigations. See Brendan Gregg(link is external)‘s web pages for summary and future developments on dynamic traces for Linux. Some recent work on applying these tools and techniques to Oracle can be found in: Tanel Poder(link is external)‘s posts on "RAM is the new disk(link is external)", Frits Hoogland(link is external)‘s post on "Stapflame(link is external)" and in this blog at "Extended stack profiling(link is external)".
Dynamic tracers are proven tools for instrumenting OS calls: an example of interest for OLTP tuning, are probes for measuring I/O latency at microsecond precision. Another topic, more complex but also more exciting, is tracing Oracle userspace. This topic is of growing interest with modern servers hosting large amounts of RAM and workloads that are often CPU-bound.
DTrace(link is external) is still a reference when talking about dynamic tracing. For Oracle userspace tracing, however, this is mostly relevant for the Solaris environment. Basic examples of DTrace probes for Oracle on Linux can be found in this blog at this link(link is external).
SystemTap(link is external) is a tool that offers a mature interface to tracing Linux kernel and the userspace. Notably SystemTap offers the possibility to write probes with advanced aggregations similarly to DTrace. Tracing Oracle userspace and wait events with SystemTap has been covered before in this blog with work done in collaboration with Frits Hoogland(link is external): see this link(link is external) and this link(link is external) and the scripts at thisGithub repository(link is external). An advantage of SystemTap can trace userspace for RHEL/OL 6.x using the utrace(link is external).
Linux Perf(link is external) is another major player in this area. It is a powerful tool, integrated in the kernel and continuously being improved with new kernel releases. Perf is an excellent tool for data gathering for on-CPU flame graphs (see this link(link is external) and references below). Perf has gained a reputation to be stable and relatively low overhead, which makes it useful for troubleshooting production workloads (mileage may vary).
In this post you will find examples, of how Perf can be used to trace the Oracle userspace: this includes tracing calls to userspace functions and also reading from memory, in particular from the Oracle SGA. This creates a link between Perf and Oracleinstrumentation and can be used for tracing Oracle and for building extended profiling tools (i.e. tools that pull together OS and Oracle instrumentation) and for flame graph visualization.
Perf probes for Oracle
If you are not familiar with Linux Perf(link is external), you can profit from reading/testing a few examples from http://www.brendangregg.com/perf.html(link is external). The focus of this post is with perf probes for Oracle userspace functions using uprobes. This functionality is available from kernel version 3.5(link is external), this means that to run the examples shown here you will need a relatively recent Linux version: RHEL 7.x (with 3.10.x kernel) or Oracle Linux with UEK3 (3.8.x kernel) or UEK4 (4.1.x kernel). The examples in this post have been tested for Oracle 11.2.0.4 running on OL7.2 with UEK4.
Perf meets Oracle wait events: probes for the wait event interface are of particular interest as they provide a link between what Perf can measure and the Oracle instrumentation. Some key Oracle userspace functions in this area have been described previously in this blog(link is external) (see also the example scripts in github(link is external)). A short description of three functions of interest before that will be used later in the example scripts:
Function name |
Purpose |
Selected parameters |
---|---|---|
kskthbwt |
Kernel service Kompile thread begin wait. This function is called at the start of an Oracle wait event. kslwtbctx is its parent function call and marks the start of a wait event. The suffix "bwt" most likely stands for "begin wait". |
Parameters and registers: 11.2.0.4 - register r14 -> pointer into X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent) Note for 12.1.0.2 - register r13 -> pointer into X$KSUSE register rsi -> timestamp of the beginning of the wait (in microseconds) |
kskthewt |
Kernel service Kompile thread end wait. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call marking the end of a wait event. The suffix "ewt" most likely stands for "end wait". |
Parameters and registers: register r13 -> pointer into X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent) |
kews_update_wait_time |
Kernel event wait statistics update wait time. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call marking the end of a wait event. |
Parameters and registers: register rsi -> wait time register r13 (relevant for 11.2.0.4, it does not work on 12.1.0.2) -> event number |
You can get started with Perf tracing of Oracle userspace using the following:
# export ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/rdbms
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si
If all worked well you will have created a new probe for Oracle userspace tracing and will see the following output.
Added new event:
probe_oracle:kskthewt (on kskthewt in /u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle with timestamp=%di event=%si)
You can now use it in all perf tools, such as:
perf record -e probe_oracle:kskthewt -aR sleep 1
Note in case of errors, check the OS and Oracle version compatibility discussed elsewhere in this post. Go ahead and create two more probes:
# perf probe -x $ORACLE_HOME/bin/oracle kskthbwt timestamp=%si event=%dx
# perf probe -x $ORACLE_HOME/bin/oracle kews_update_wait_time wait_time=%si event=%r13
You should have now three probes on Oracle, you verify this by listing them with: "perf probe -l". If needed, you can delete probes with this syntax: "perf probe --del <probe_name>". Now you can run perf record to gather data for the events defined by the probes. One last step is to identify an Oracle process to test (it can be a sqlplus session for example). Find the OS pid of the Oracle process to probe and run the following:
# perf record -e probe_oracle:kews_update_wait_time -e probe_oracle:kskthbwt -e probe_oracle:kskthewt -p <pid>
Hit CTRL-C when done tracing. Display the trace output with the perf script command. Here an example from "select count(*) from <testtable>" tun on a test databas:
# perf script
oracle 17270 [003] 4004734.472372: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x6fce81d4 event=0x162
oracle 17270 [003] 4004734.472408: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690de07 event=0x162
oracle 17270 [003] 4004734.472953: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e05c event=0x15e
oracle 17270 [003] 4004734.472977: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x1e event=0x15e
oracle 17270 [003] 4004734.472982: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690e07a event=0x15e
oracle 17270 [003] 4004734.474083: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e4c6 event=0xc6
oracle 17270 [003] 4004734.474398: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x13b event=0xc6
...
The fields displayed in the example output are: process name, pid, CPU number, timestamp since machine startup, probe name and address, probe parameters. Some additional processing is beneficial to make the output more easily readable. In the next example perf script output is filtered by piping it to a series of commands: (1) to reduce the number of fields displayed, (2) convert hexadecimal numbers into decimal, (3) convert wait event numbers in wait event names. For this we use sed and eventsname.sed, a file generated with a sqlplus script(link is external).
# perf script | \
awk ‘{printf "pid=%s TS=%s ",$2,$4} {print $5,$7,$8}‘ | \
perl -p -e ‘s/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge‘ | \
sed s/event=/event#=/g |sed -f eventsname.sed
pid=17270 TS=4004734.472372: probe_oracle:kews_update_wait_time: wait_time=1875804628 event=SQL*Net message from client
pid=17270 TS=4004734.472408: probe_oracle:kskthewt: timestamp=1444447003926023 event=SQL*Net message from client
pid=17270 TS=4004734.472953: probe_oracle:kskthbwt: timestamp=1444447003926620 event=SQL*Net message to client
pid=17270 TS=4004734.472977: probe_oracle:kews_update_wait_time: wait_time=30 event=SQL*Net message to client
pid=17270 TS=4004734.472982: probe_oracle:kskthewt: timestamp=1444447003926650 event=SQL*Net message to client
pid=17270 TS=4004734.474083: probe_oracle:kskthbwt: timestamp=1444447003927750 event=direct path read
pid=17270 TS=4004734.474398: probe_oracle:kews_update_wait_time: wait_time=315 event=direct path read
...
What the examples show is that trace data that you would normally find in a 10046 trace file (or trace for EXEC DBMS_MONITOR.session_trace_enable(waits=>TRUE) if you prefer) can be collected using OS commands alone. Perf with utrace can be used to trace Oracle as an alternative to setting Oracle diagnostic events. What you cannot find in the Perf examples above are information on the parameters of the wait event (usually called p1, p2, p3). This is the subject of the next paragraph.
It only gets better from here: perf with uprobes can read Oracle SGA
Uprobes has received a major improvement with kernel 3.14 to include the possibility of fetching arguments from the memory of the traced process (see also http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt(link is external)). As I write this Oracle has released UEK4 with kernel 4.1.x, which can be used to test this feature. Fedora is another option for a test environment with recent kernels.
Many details and metrics of the Oracle instrumentation coming from V$ views and their underlying X$ tables can be directly accessed by reading memory structures in the SGA (this is not the case for all X$, as many of them are rather implemented as internal functions). X$KSUSE exposes many fields of V$SESSION. The memory location of the fields of interest can be identified using SQL as in this example(link is external).
The link between one of the main wait event userspace functions, kskthewt, and X$KSUSE is found by experimentation. This has been investigated together with Frits(link is external). We find that the value of the CPU register r13 points into the X$KSUSE record with an offset that is version depended. This work has already been done for SystemTap probes and a working example can be found at this link for 11.2.0,4(link is external) and this link for 12.1.0.2(link is external).
This is an example of a probe on the Oracle wait event interface that exposes selected fields from X$KSUSE memory structures:
# perf probe --del probe_oracle:kskthewt
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si p1=‘-2080(%r13)‘ p2=‘-2072(%r13)‘ p3=‘-2064(%r13)‘ sql_hash=‘-1828(%r13)‘
# perf record -e probe_oracle:kskthewt -p <pid>
# perf script | \
awk ‘{printf "pid=%s TS=%s ",$2,$4} {print $8,$9,$10,$11,$12}‘ | \
perl -p -e ‘s/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge‘ | \
sed s/event=/event#=/g |sed -f eventsname.sed
pid=17270 TS=4097944.225385: timestamp=1444540213679034 event=SQL*Net message from client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.225786: timestamp=1444540213679433 event=SQL*Net message to client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.227361: timestamp=1444540213681023 event=direct path read p1=1 p2=193 p3=7 sql_hash=11373786902
pid=17270 TS=4097944.229206: timestamp=1444540213682870 event=direct path read p1=1 p2=3896 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.230532: timestamp=1444540213684197 event=direct path read p1=1 p2=12032 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.232161: timestamp=1444540213685827 event=direct path read p1=1 p2=29464 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.244675: timestamp=1444540213698344 event=direct path read p1=1 p2=177 p3=5 sql_hash=11373786902
...
More fields of interest from X$KSUSE or other areas of the SGA can be extracted as arguments of perf by extending the example above.
Example of tracing Oracle parse functions:
Function name |
Purpose |
Selected parameters |
---|---|---|
opiprs |
Oracle Program Interface - Parse Use to trace hard parsing. This function is called when doing hard parsing
|
Useful information when probing the function: register rdx -> sql statement length |
kksParseCursor |
Use to trace soft parsing. This is one of the functions in the call stack when Oracle parses a cursors already in the library cache |
Useful information when probing the function:
11.2.0.4: register rbx -> pointer to the SQL text string 12.1.0.2: register r9 ->sql statement lenght register r8 -> pointer to the SQL text string |
This is an example of a perf probe to trace hard parsing (tip: flush the shared pool before testing):
# perf probe -x $ORACLE_HOME/bin/oracle opiprs length=%dx sql=‘+0(%si)‘:"string"
# perf record -e probe_oracle:opiprs -p <pid>
# perf script
oracle 17270 [003] 4100899.401507: probe_oracle:opiprs: (27fd2f6) length=0x19 sql="select sysdate from dual"
Similarly for tracing soft parsing (11.2.0.4 version) you can use the following probe:
# perf probe -x $ORACLE_HOME/bin/oracle kksParseCursor length=%si sql=‘+0(%bx)‘:"string"
Many more Oracle functions can be traced:
From the OS point of view, the Oracle executable is a big C program with a large number of functions. You can run "nm -D $ORACLE_HOME/bin/oracle" to get an idea of the scale: it is of the order of 10^5 text (code) sections. Obviously too many details and also undocumented. Some pointers to understand the function names can be found in MOS (see in especially the bug-related documents), from sqlplus running oradebug doc(link is external) and from various blogs and presentations (see references).
Here below a short table of functions of interest for tracing logical and physical I/O taken from this blog post(link is external). The ideas and examples discussed above can be applied to these functions too, if there is a need to trace logical and physical I/O for example. Note that tracing logical I/O on a busy system can have an important overhead because of the large number of events per second.
Function name |
Purpose |
Selected parameters |
---|---|---|
kcbgtcr |
Kernel Cache Buffers Get Consistent Read Note: key function for consistent reads |
tbs#=user_int32(%rdi) rel file n#= user_int32(%rdi+4) >> 22 & 0x003FFFFF block#=user_int32(%rdi+4) & 0x003FFFFF data_object_id#=user_int32(%rdi+8) object_id#=user_int32(%rdi+12)
Note: for bigfile tablespaces: block#=user_int32(%rdi+4) |
kcbgcur |
Kernel Cache Buffers Current Note: key function for current reads |
tbs#=user_int32(%rdi) rel file n#= user_int32(%rdi+4) >> 22 & 0x003FFFFF block#=user_int32(%rdi+4) & 0x003FFFFF data_object_id#=user_int32(%rdi+8) object_id#=user_int32(%rdi+12)
Note: for bigfile tablespaces: block#=user_int32(%rdi+4) |
kcbzib |
kcbZIB should stand for: Kernel Cache Buffers, Z Note: this function is part of the physical I/O stack: it indicates that Oracle will perform physical read(s) into the buffer cache
| |
kcbzgb |
The suffix GB in kcbZGB should stand for:
Oracle allocates space in the buffer cache for a given block (typically before I/O operations).
|
tbs n#=%rsi, rel file n#=%rdx >> 22 & 0x003FFFFF block#=%rdx & 0x003FFFFF data_object_id#=%r8 object_id#=%r9
Note: for bigfile tablespaces: block#=%rdx |
kcbzvb |
Invoked after Oracle has read a given block Note: this function is used both for reads in the buffer cache and for direct reads
|
tbs n#=%rsi, rel file n#=%rdx >> 22 & 0x003FFFFF, block#=%rdx & 0x003FFFFF data_object_id=user_int32(%rdi+24) Note: for bigfile tablespaces: block#=%rdx |
Combining Perf probes and Oracle traces
What you have seen so far can also be described as a way to extract information normally available with Oracle diagnostic events, using perf and uprobes instead. Additional value to these methods comes when you combine them with perf probes for tracing kernel and hardware activity. There are many Perf events available. See also the output of perf list and the references in this post.
Study of Oracle I/O and wait events:
The example below combines perf events on tracepoints for system calls to the I/O interface in Linux and probes on the Oracle wait event interface. The example below records a trace for the Oracle userspace functions marking the beginning and end of a wait and for system calls to the I/O interface:
# perf record -e probe_oracle:kskthewt -e probe_oracle:kskthbwt -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>
The output is processed for improved readability:
# perf script | \
awk ‘{printf "pid=%s TS=%s ", $2,$4} {print $5,$8,$9,$10,$11,$12}‘ | \
perl -p -e ‘s/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge‘ | \
sed s/event=/event#=/g |sed -f eventsname.sed
pid=25314 TS=4173676.363095: probe_oracle:kskthbwt: event=db file sequential read
pid=25314 TS=4173676.363109: syscalls:sys_enter_pread64: buf: 2644443136, count: 8192, pos:
pid=25314 TS=4173676.363667: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.363689: probe_oracle:kskthewt: event=db file sequential read p1=4 p2=147586 p3=1 sql_hash=2246415721
pid=25314 TS=4173676.364030: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364043: syscalls:sys_enter_pread64: buf: 5727326208, count: 40960, pos:
pid=25314 TS=4173676.364576: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.364597: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147587 p3=5 sql_hash=2246415721
pid=25314 TS=4173676.364790: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364801: syscalls:sys_enter_pread64: buf: 5727326208, count: 65536, pos:
pid=25314 TS=4173676.365344: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.365365: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147592 p3=8 sql_hash=2246415721
What this example achieves is a way to port to Perf the work done with SystemTap for tracing Oracle I/O. See alsohttp://externaltable.blogspot.com/2014/11/life-of-oracle-io-tracing-logical-and.html(link is external) for additional details.
Oracle wait events and perf scheduler events
Perf scheduler events (see: perf list sched:*) are another source of very useful information. For example the event sched:sched_stat_runtime traces the amount of time spent executing on CPU. Note however that not all the scheduler probes are available on RHEL7 and UEK kernels, as CONFIG_SCHEDSTATS is not set on those platforms (see: grep CONFIG_SCHED /boot/config-`uname -r`). In particular this prevents the use of the technique described by Brendan Gregg for off-CPU flame graphs(link is external) with perf.
This is an example of tracing the CPU scheduler together with I/O related events with Oracle wait events:
# perf record -e sched:* -e probe_oracle:kskthewt -e probe_oracle:kskthbwt -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>
After capturing the workload you can extract atrace file running "perf script". In the table below you can see a snippet of the trace filtered with a simple scripts to improve the readability. You can see that perf provides valuable information by measuring CPU time, system calls and Oracle wait events.
EVENT NAME BEGIN TS END TS ELAPSED (MicroSec)
event=SQL*Net message to client 4413050521226 4413050521261 35
sched_stat_runtime 4413050521179 4413050521291 112
sched_stat_runtime 4413050521293 4413050521530 237
syscall:pread64 4413050521392 4413050535684 14292
event=db file sequential read 4413050521375 4413050535704 14329
syscall:io_submit 4413050536073 4413050536249 176
sched_stat_runtime 4413050535593 4413050536551 958
syscall:io_submit 4413050536343 4413050536623 280
sched_stat_runtime 4413050536548 4413050536686 138
syscall:io_getevents 4413050536678 4413050537650 972
event=direct path read 4413050536652 4413050537677 1025
syscall:io_submit 4413050537807 4413050537924 117
sched_stat_runtime 4413050537563 4413050537963 400
syscall:io_getevents 4413050537956 4413050539289 1333
event=direct path read 4413050537943 4413050539315 1372
sched_stat_runtime 4413050539134 4413050539553 419
..
Ideas for future development
Stack profiling and flame graphs: An additional step to get more value out of perf traces is to visualize their output. Flame graphs(link is external) and extended stack profiling have proven to be useful in this area. See also previous work on flame graphs for Oracle at link1 (link is external)and link2(link is external) on extended stack profiling with stapflame by Frits at this link(link is external) and with stack profilers at this link(link is external).
This is still work in progress. For a sneak preview follow this link for a proof-of-concept graph.
Perf hardware events: Combining Oracle traces traces with hardware and PMU counters is another area of interest for future investigations. This could be useful for example for studying logical reads and their dependencies on events at HW level on memory and/or processor cache level.
Add and documenting more probes for Oracle userspace function tracing: For example it would be useful to develop a set of perf probes for replicating the 10046 trace event.
Linux and Oracle version compatibility:
The definition of probes on Oracle executable needs uprobes, available from kernel version 3.5(link is external). When testing the examples discussed in post on RHEL/OL, you can use Kernel UEK4 or UEK3 or use RHEL 7.x. Notably, examples discussed in this post will not run on RHEL 6.x kernels.
Issues with uprobes and Oracle versions: uprobes works OK for tracing Oracle 11.2. However, for Oracle 12.1 I find that uprobes works OK on RHEL7.0 (kernel 3.10.0-123) and UEK (kernel 3.8.x), but does not work for kernels that ship with RHEL 7.1,7,2 and anything higher (including UEK4). When testing the easiest is to use Oracle 11.2 or if you want to test Oracle 12.1 use UEK3 or RHEL 7.0 kernel. More investigations are needed on this topic.
Conclusions
Perf events and uprobes are very powerful tools for dynamic tracing of Oracle. They allow to complement and extend Oracle and OS instrumentation for advanced troubleshooting and performance investigations. In particular Perf with uprobes can be used:
Credits and references
Kudos to Brendan Gregg(link is external) for his work on flame graphs and for the articles and presentations he has published on Linux tracers. Many thanks to Frits Hoogland(link is external) for his investigations of Oracle internals, for collaboration on Oracle internals investigations and tracing tools and for proof reading this post. Many thanks to Tanel Poder(link is external) for his work on Oracle internals and performance.
Additional links on topics related to this post:
http://www.brendangregg.com/perf.html(link is external)
https://fritshoogland.wordpress.com/2016/01/12/introducing-stapflame-extended-stack-profiling-using-systemtap-perf-and-flame-graphs/(link is external)
http://externaltable.blogspot.ch/2015/12/extended-stack-profiling-ideas-tools.html(link is external)
http://blog.tanelpoder.com/2015/09/21/ram-is-the-new-disk-and-how-to-measure-its-performance-part-2-tools/(link is external)
http://externaltable.blogspot.ch/2014/11/life-of-oracle-io-tracing-logical-and.html(link is external)
https://fritshoogland.wordpress.com/2014/02/27/investigating-the-wait-interface-via-gdb/(link is external)
http://enkitec.tv/2013/03/14/tanel-poders-hacking-session-how-oracle-sql-plans-are-really-executed-part-2/(link is external)
https://alexanderanokhin.wordpress.com/tools/digger/(link is external)
http://lxr.free-electrons.com/source/tools/perf/Documentation/perf-probe...(link is external)
http://linux.die.net/man/1/perf-probe(link is external)
http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt(link is external)
Tags: Linux Perf performance Tracing
Linux Perf Probes for Oracle Tracing
标签:oltp intro href cache add nes using form syntax