Topic: This post is about tracing logical and physical reads in Oracle using SystemTap. You will find here a few examples illustrating the main mechanisms used by Oracle to perform physical and logical I/O, and you will learn how to build SystemTap scripts to further explore and troubleshoot Oracle I/O.
Introduction: Performing input/output from and to the storage media, i.e. physical I/O, is one of the critical tasks underlying all database engines. When accessing data stored in RAM from the database cache we can talk of logical I/O. Physical I/O is performed by the database using system calls to the O/S kernel. The type of system calls used depends on database settings, the O/S and the type of storage. Logical I/O is performed by the database processes using Oracle kernel functions. Oracle instruments the timing of physical I/O calls with the wait event interface, while logical I/O time is generically accounted for as CPU time. Tracing I/O operations performed by Oracle processes together with the details provided by the Oracle wait event interface is a powerful technique to investigate Oracle I/O internals. With the help of a few real-life examples we investigate what the mechanisms are that Oracle uses to access storage, how the I/O operations correlate to the wait event interface and what the meaning of the wait event time is for the most common wait events associated with I/O.
Building the tracing tool: Investigations of what Oracle and the O/S do when an I/O operation is performed require specialized tools. SystemTap is a dynamic tracing tool specific to Linux that allows tracing of both the O/S and the Oracle functions (Linux kernel functions, system calls and Oracle database kernel functions). In the following you can find a description of a SystemTap script that can be used to trace Oracle logical and physical I/O: it is composed of probes that can attach to the function calls of interest and can read function parameters, memory and CPU registers.
SystemTap requires relatively recent kernels to be able to probe userspace: either the utrace or uprobe_events functionality needs to be available. This is the case, for example, with RHEL/OL 6.5 and higher, but not with RHEL 5. Moreover, for tracing system calls the kernel debuginfo package needs to be installed.
More details on this and on how to build a lab environment can be found in this article(link is external). The script that we use in this blog post to trace Oracle logical and physical I/O and the wait event interface is: trace_oracle_logicalio_wait_events_physicalio_12102.stp(link is external) (for Oracle 12.1.0.2). A version for 11.2.0.4 is also available: trace_oracle_logicalio_wait_events_physicalio_11204.stp(link is external). Before moving to the examples and studies of Oracle I/O, I would like to review the three main parts of the script: tracing physical I/O, logical I/O and wait events.
Tracing physical I/O: Oracle ultimately issues system calls to execute I/O operations. The type of calls depends on database settings, the O/S type and the type of storage. In this article you can find examples for Oracle on Linux using ASM storage on block devices. Notably, direct NFS, local storage and ASM with asmlib are left out and may be addressed in future studies. System calls can also be traced in Linux using the strace(1) utility. However, in this article you can see how SystemTap probes can be built to gather information on the system calls of interest: pread, pwrite, io_submit and io_getevents. Additional useful SystemTap probes to the underlying block I/O layer are: ioblock.request, ioblock_trace.request and ioblock.end (note, these are not system calls, but rather tracepoints for function calls inside the Linux kernel). A minimal probe sketch follows the table and note below.
| Probe name | Purpose | Selected parameters |
|---|---|---|
| syscall.pread/pwrite | Synchronous I/O: read/write from/to a file descriptor | fd (file descriptor number), offset, count. Return value: bytes read/written |
| syscall.io_submit | Asynchronous I/O: submit blocks for processing | nr (number of I/Os). For each I/O: file descriptor, offset, bytes, opcode (0 for read, 1 for write). Return value: number of I/O operations performed |
| syscall.io_getevents | Asynchronous I/O: read events from the completion queue | min_nr (minimum number of I/Os to read), timeout for event waiting. Return value: number of I/Os reaped. For each I/O: file descriptor, offset, bytes |
| ioblock.request, ioblock_trace.request | I/O requests sent to the block device interface kernel layer | devname, sector, size, rw, block_io structure address |
| ioblock.end | Return from the block device interface kernel layer | devname, sector, rw, block_io structure address |
Note added, August 2015: Systemtap probes on kernel.trace("block_rq_issue") and kernel.trace("block_rq_complete") can also be used to investigate the block I/O interface with the added advantage of not requiring kernel debuginfo. You can find examples of such probes in the page with download material or on GitHub(link is external).
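To make these probe points concrete, here is a minimal sketch of the physical I/O part, assuming the script is run with stap -x <pid> so that target() matches the traced Oracle process. It is not the full script linked above, only an illustration of the syscall probe aliases and tapset variables (fd, offset, count, nr, retval) listed in the table.

```
# Minimal sketch (not the full script from this post): trace the physical I/O
# system calls of one Oracle process. Run as: stap -v this_script.stp -x <pid>
probe syscall.pread {
    if (pid() == target())
        printf("pread: fd=%d, offset=%d, count=%d\n", fd, offset, count)
}
probe syscall.pread.return {
    if (pid() == target())
        printf("pread return: bytes read=%d\n", retval)
}
probe syscall.io_submit {
    if (pid() == target())
        printf("io_submit: nr (number of I/Os submitted)=%d\n", nr)
}
probe syscall.io_getevents.return {
    if (pid() == target())
        printf("io_getevents return: number of I/Os reaped=%d\n", retval)
}
```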
Tracing logical I/O: This is about providing the history and details of the logical I/O operations, that is, when Oracle reads a data block from memory (the buffer cache). Logical I/O operations are a super-set of physical I/O: if Oracle does not find data in memory (the buffer cache) it will issue a physical read. You can find a few examples of this mechanism in action later in this post.
Oracle instrumentation provides extensive information on logical reads in several V$ views, such as V$SYSSTAT, V$SESSTAT and V$SYSMETRIC. Logical read data is also available from SQL*Plus with "set autotrace on" and from tkprof reports of 10046 traces. Logical reads come in two main "flavors": consistent reads and current reads. Oracle uses consistent reads when reading data at a given system change number (SCN); current reads are used when performing DML operations on the block. Consistent reads can be traced in Oracle using the events 10200 and 10201. More details can be found in the presentation by Julian Dyke on logical I/O(link is external).
In this article we show how to trace logical I/O by attaching SystemTap probes to a set of Oracle kernel functions. The role of the functions kcbgtcr and kcbgcur for tracing consistent and current reads has already been discussed by Alexander Anokhin in his excellent blog articles on DTrace LIO(link is external). More recently I have investigated this topic and found it useful to also trace the functions kcbzib, kcbzgb and kcbzvb.
A summary of the main findings on these functions is in the table below; a minimal probe sketch for kcbgtcr follows the table. Note: the expressions %rdi, %rsi, %r8 indicate the values of the corresponding CPU registers. With SystemTap the register value can be read with the function register, for example: register("rdi"). The CPU registers are used to extract the function call arguments, because SystemTap does not provide the function call arguments when no debug symbols are available for the executable. Notably, the calling conventions for x86-64 as documented in www.x86-64.org/documentation/abi.pdf(link is external) state that the integer parameters of a function call are passed in the following registers (in order): %rdi, %rsi, %rdx, %rcx, %r8 and %r9.
| Function name | Purpose | Selected parameters |
|---|---|---|
| kcbgtcr | Kernel Cache Buffers Get Consistent Read, used for consistent reads | tbs#=user_int32(%rdi), rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF, block#=user_int32(%rdi+4) & 0x003FFFFF, data_object_id#=user_int32(%rdi+8), object_id#=user_int32(%rdi+12). Note: for bigfile tablespaces block#=user_int32(%rdi+4) |
| kcbgcur | Kernel Cache Buffers Current, used for current reads | tbs#=user_int32(%rdi), rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF, block#=user_int32(%rdi+4) & 0x003FFFFF, data_object_id#=user_int32(%rdi+8), object_id#=user_int32(%rdi+12). Note: for bigfile tablespaces block#=user_int32(%rdi+4) |
| kcbzib | kcbZIB should stand for: Z (kcbz.o is a module of physical I/O helper functions), IB: Input Buffer. Oracle will perform physical read(s) into the buffer cache | |
| kcbzgb | The suffix GB in kcbZGB should stand for: Get (space for) Buffer. Oracle allocates space in the buffer cache for a given block (typically before I/O operations) | tbs n#=%rsi, rel file n#=%rdx >> 22 & 0x003FFFFF, block#=%rdx & 0x003FFFFF, data_object_id#=%r8, object_id#=%r9. Note: for bigfile tablespaces block#=%rdx |
| kcbzvb | Invoked after Oracle has read a given block. It is part of the call chain for checking blocks governed by the db_block_checking parameter. Name guess: Kernel Cache Buffers Verify Block (thanks to @FritsHoogland for pointing this out). This function is used both for reads into the buffer cache and for direct reads | tbs n#=%rsi, rel file n#=%rdx >> 22 & 0x003FFFFF, block#=%rdx & 0x003FFFFF, data_object_id=user_int32(%rdi+24). Note: for bigfile tablespaces block#=%rdx |
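As an illustration of how the register arithmetic in the table translates into a SystemTap probe, here is a minimal sketch for kcbgtcr (consistent reads). The path to the oracle executable is an assumption and depends on your ORACLE_HOME; the block# decoding shown is for smallfile tablespaces, as noted in the table.

```
# Minimal sketch, assuming the oracle executable path below (adjust to your
# ORACLE_HOME) and smallfile tablespaces (see the table above for bigfile).
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
    if (pid() == target()) {
        tbs      = user_int32(register("rdi"))                           # tablespace number
        rfile    = (user_int32(register("rdi") + 4) >> 22) & 0x003FFFFF  # relative file number
        block    = user_int32(register("rdi") + 4) & 0x003FFFFF          # block number
        data_obj = user_int32(register("rdi") + 8)                       # data_object_id#
        obj      = user_int32(register("rdi") + 12)                      # object_id#
        printf("kcbgtcr (consistent read): tbs#=%d, file#=%d, block#=%d, data_object_id#=%d, object_id#=%d\n",
               tbs, rfile, block, data_obj, obj)
    }
}
```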
Note: as I write this, only uprobes in kernel versions 3.10 or higher support return probes for userspace functions. Therefore this functionality is not available with RHEL or OL 6.x kernels, but it is available on RHEL 7.x kernels.
Tracing wait events: The wait event interface is probably one of the best features in Oracle performance tuning as it provides a quantitative method to identify performance issues and bottlenecks. Cary Millsap(link is external) has published inspiring work(link is external) on the subject.
For our purposes it suffices to say that wait events are instrumentation points where selected Oracle kernel actions are timed. In particular Oracle will time the start and end calls to most physical I/O operations, therefore providing a good way to understand how much time is spent during physical I/O (some limitations of this mechanism are discussed later in this post). There are also limitations to using the wait event interface to study I/O latency, in particular in the examples section you can find some interesting challenges related to timing asynchronous I/O events.
Wait event tracing is typically activated in Oracle using the 10046 event(link is external) or by calling DBMS_MONITOR.SESSION_TRACE_ENABLE. In this article we get the wait event details by attaching SystemTap probes to the relevant Oracle kernel functions, in particular kskthbwt and kskthewt, to gather information respectively at the start and at the end of each wait event. Note that enabling wait event tracing, for example with event 10046 at level 8, is a way to externalize the wait event timing information (as well as other SQL processing details); it is not what enables wait timing itself. Wait timing is controlled by the database parameter TIMED_STATISTICS, which is, and should be, set to TRUE in modern Oracle databases. One key piece of information that the Oracle kernel functions kskthbwt and kskthewt give us is a pointer into the X$KSUSE fixed table (or rather to the underlying memory structure in the SGA), which is the table underlying V$SESSION, and therefore opens the way to correlating the wait event with much useful information about the session executing the SQL. Resolving the pointer in register r13 to the base value of X$KSUSE and computing the offsets of the fields of interest requires additional work; this is described in a previous article(link is external) (see also the script: ksuse_find_offsets.sql). A minimal probe sketch follows the table below.
| Function name | Purpose | Selected parameters |
|---|---|---|
| kskthbwt | Kernel service Kompile thread begin wait. This function is called at the start of an Oracle wait event. kslwtbctx is its parent function call and marks the start of a wait event | Useful information when probing the function: |
| kskthewt | Kernel service Kompile thread end wait. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call marking the end of a wait event. The suffix "ewt" most likely stands for "end wait" | Useful information when probing the function: |
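As a minimal illustration of probing these two functions, the sketch below simply times the interval between kskthbwt and kskthewt for the traced process. The full script additionally decodes the wait event number and the X$KSUSE fields from the CPU registers (notably r13), which is omitted here; the oracle executable path is an assumption.

```
# Minimal sketch: time the interval between begin-wait and end-wait.
# Assumptions: oracle executable path below; decoding of the event number and of
# the X$KSUSE pointer (register r13), done in the full script, is omitted here.
global wait_start

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kskthbwt") {
    if (pid() == target())
        wait_start[pid()] = gettimeofday_us()
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kskthewt") {
    if (pid() == target() && pid() in wait_start) {
        printf("wait event ended, elapsed time=%d us\n", gettimeofday_us() - wait_start[pid()])
        delete wait_start[pid()]
    }
}
```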
Running the script:
Below you find a few examples of traces for common Oracle I/O access patterns. The traces have been obtained by running stap -v trace_oracle_logicalio_wait_events_physicalio_12102.stp -x <pid>, where <pid> is the process id of the session being traced. The output is piped into sed -f eventsname.sed, which takes care of translating the event number into the associated event name. The file eventsname.sed is generated by running the script eventsname.sql(link is external). It is important to rerun this script to regenerate the event names before testing, because the wait event numbers can change without warning, for example after migration to a newer version, but potentially also after applying a patch. The system used to collect the traces in the following examples is Oracle 12.1.0.2 running on OEL 7.0 with UEK3 under VirtualBox.
Example 1: single-block random reads
This example is about single-block (random) reads. This access path to data is very important for many OLTP workloads as it is used for table access via an index key. In Figure 1 below you can see a snippet of a trace file for a query of the type: select <col_list> from table <tab> where <indexed_col>=<val>.
Figure 1: Oracle single-block random read and db file sequential read wait event
The key points are:
Example 2: sequential I/O into the buffer cache with synchronous I/O
This example is about multi-block reads into the cache. This is an access path that is used, for example, when Oracle performs a full table scan. There are multiple ways in which Oracle can access the storage to perform full scans; in this example you see the case where Oracle uses synchronous I/O (i.e. the pread call) to perform O/S reads, while Examples 3 and 4 cover sequential reads with asynchronous I/O. In Figure 2 below you can see a snippet from tracing the workload of a query of the type: select <col_list> from table <tab>, where Oracle performs a full scan of <tab>.
Figure 2: Oracle sequential I/O in a case where reads are performed using pread O/S calls
The key points of the trace in Figure 2:
Example 3: sequential I/O into the buffer cache with asynchronous I/O O/S calls
In Figure 3 you can see a snippet from the trace of the SQL: select <col_list> from table <tab>, where Oracle performs a full scan of table <tab>. This is the same query that was used for Example 2 above; the difference is that the trace in Figure 3 has been taken at a later time in the execution of the query, when the Oracle process switched from using synchronous I/O with pread calls to using the asynchronous I/O interface. Asynchronous I/O is implemented in Linux with io_submit and io_getevents calls, to send the I/O requests and then to reap the results. Although Oracle uses the asynchronous interface to perform read operations in this example, the final result is very similar to what is described in Example 2 (i.e. with synchronous I/O): the Oracle wait event is the same, "db file scattered read", the reads are also performed into the buffer cache, and the I/Os are reaped with blocking calls, therefore making the process effectively synchronous. A more aggressive use of the asynchronous I/O interface is found with Oracle direct reads, which are discussed in Example 4 below.
Figure 3: Oracle performing sequential I/O reads into the buffer cache using asynchronous I/O calls
The key points are:
Example 4: full scans with direct reads and asynchronous I/O
In this paragraph you can find two examples of Oracle I/O for the case of direct reads on block storage configured with ASM. The same query that was used in Examples 2 and 3 above is also used here: it is a query that performs a full table scan. In this example, differently from Examples 2 and 3, Oracle uses serial direct reads to access data. Serial direct read is a feature introduced in 11gR2 for optimizing full segment scan operations by bypassing the buffer cache layer, similarly to what happens with parallel queries. Oracle chooses when to use serial direct reads vs. cached reads based on a few parameters. To a first approximation, direct reads are preferred when the table is "big" compared to the size of the buffer cache; the actual mechanism is more complex and can potentially change between Oracle versions. You can read more on this topic in this blog post by Tanel Poder(link is external).
For the particular setup I used to run these tests I had to force the use of serial direct reads before running the query; this was done with: alter session set "_serial_direct_read"=always;
Note that alter session set "_serial_direct_read"=never; can be used to disable serial direct reads and force Oracle to use cached reads (the type of I/O you have seen in Examples 2 and 3 above). The default value of the underscore parameter "_serial_direct_read" is "auto".
For the investigations in this example we use a slightly different SystemTap script from the one used in the previous examples. I have removed the tracing of logical I/O and block I/O to reduce clutter and added a new probe on the userspace function io_getevents_0_4 of libaio (libaio is the Linux library that implements asynchronous I/O). This is because serial direct reads use non-blocking calls to reap I/Os, and these calls are best traced at the libaio interface for the reasons described by Frits Hoogland in this blog post(link is external). The SystemTap script represented in Figures 4a and 4b is: trace_oracle_wait_events_asyncio_libaio_12102.stp(link is external) (a version for 11.2.0.4 is available at trace_oracle_wait_events_asyncio_libaio_11204.stp(link is external)).
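Here is a minimal sketch of such a libaio probe. The path to libaio and the mapping of the call arguments to CPU registers (rsi=min_nr, rdx=nr, r8=pointer to the timeout struct, following the x86-64 calling convention discussed earlier) are assumptions for illustration; the full script linked above is the reference.

```
# Minimal sketch, assuming libaio is installed at /lib64/libaio.so.1 and reading
# the call arguments from CPU registers per the x86-64 calling convention.
probe process("/lib64/libaio.so.1").function("io_getevents_0_4") {
    if (pid() == target())
        printf("libaio io_getevents_0_4: min_nr=%d, nr=%d, timeout struct at 0x%x\n",
               register("rsi"), register("rdx"), register("r8"))
}
```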
Figure 4a: Trace of the wait events and I/O calls of Oracle during a full table scan with serial direct reads and asynchronous I/O; this is what happens in the first phases of the full scan
The key points of Figure 4a are:
The use of serial direct reads in Oracle is more complex than the simple example of Figure 4a. Frits Hoogland has discussed the internals of direct reads in Oracle in more detail in this presentation(link is external). In particular, Oracle uses an adaptive mechanism for direct reads with the goal of driving the storage system to high throughput when possible. This is implemented by increasing the number of outstanding I/Os when calling the asynchronous I/O interface. Internally this makes use of I/O slots. Note, from a comment on Oracle diagnostic event 10353 we have additional indirect information on this topic: "Slots are a unit of I/O and this factor controls the number of outstanding I/Os".
Another finding of Frits' is that there are I/O operations performed for direct reads that are not timed by the wait event interface. This happens because Oracle can issue io_getevents with timeout=0, that is, reaping the I/Os without having to block. You can see these mechanisms in action in the trace of Figure 4b, which is for the same full scan as reported above in Figure 4a, just at a later time, when the mechanism of adaptive serial direct reads has tuned Oracle I/O requests to be "more aggressive" in the use of storage resources.
Figure 4b: Trace of the wait events and I/O calls of Oracle during a full table scan with serial direct reads and asynchronous I/O; this is an example of what happens when the adaptive mechanism increases the number of outstanding I/Os to optimize throughput
The key points of Figure 4b are:
Example 5: asynchronous I/O for random reads
Oracle will optimize single-block reads when it can do prefetching and/or batching of I/O. This is an optimization for random I/O using the asynchronous I/O interface, as Oracle can group multiple I/O requests in a single io_submit call, as opposed to sending them one after the other as in Example 1, where the O/S call pread was used. The example in Figure 5 comes from running Kevin Closson's SLOB(link is external) on a 12c database. In the execution plan please note step 3, where Oracle uses the operation "TABLE ACCESS BY INDEX ROWID BATCHED" to access table data.
Figure 5: execution plan for a SLOB query illustrating batched table access
In Figure 6 below you can see a snippet of the Oracle I/O tracing when running the SLOB workload. You can see that the asynchronous I/O interface is being used to perform random I/O. In the case of Figure 6, 126 requests are submitted in a single io_submit operation and subsequently reaped by io_getevents, waiting for all the requests to finish. The wait event in this case is "db file parallel read" and the calculated wait time is the time for the batch of 126 requests to finish. Therefore the wait time of the db file parallel read wait does not provide an accurate measurement of the I/O latency of single-block random reads.
Example 6: tracing DML
In Figure 7 you can see a snippet of tracing Oracle logical I/O and wait events when inserting a row into a table followed by a commit. No physical I/O was performed in this operation (the table had previously been cached). You can see that most of the block reads have been done in current read mode, as expected for DML. Moreover, the blocks involved belong both to the table being inserted into (you can identify them in Figure 7 as belonging to tbs#=11) and to rollback segments (you can identify them in Figure 7 as belonging to tbs#=2). Note that for simplicity the table in this example has no indexes.
Figure 7: Logical I/O trace of Oracle performing an insert and commit operation
Example 7: tracing log writer
In Figure 8 you can see the trace of the log writer during a commit operation. This is the same commit operation traced in Example 6 (see also Figure 7 above). The wait event is "log file parallel write" and its duration is timed by Oracle from the beginning to the end of the asynchronous I/O operations performed. Therefore the measured wait event time of "log file parallel write" is not an accurate measure of block write latency, as its duration depends on the number of operations performed. In the case of Figure 8 you can see that the log writer writes to two distinct block devices, which is consistent with the use of one logfile member per redo log group, allocated on an ASM diskgroup with normal redundancy: the log writer sends two write I/O requests simultaneously, in order to write the primary and mirror extents of the logfile, and waits for both of them to finish.
Figure 8: log writer tracing during a commit operation
Conclusions
Dynamic tracing and in particular SystemTap provide the opportunity to investigate Oracle physical and logical I/O, and open the way to drill-down investigations of Oracle's I/O through the various layers of the Oracle engine down to the Linux kernel. This post describes some of the methods, tools and examples of how these investigations can be performed.
The examples from real-world Oracle workloads highlight some of the most common cases of Oracle logical and physical I/O and the connections between the wait event interface and the O/S calls issued by Oracle processes. The Oracle wait event interface can provide useful data for measuring I/O latency and in general for troubleshooting I/O problems. This is true for random reads instrumented by the wait event "db file sequential read". The information provided by wait events related to asynchronous I/O, however, requires additional attention: such events typically do not provide an accurate measure of the I/O latency. Moreover, this post provides examples where some of the asynchronous I/Os for Oracle direct reads are not instrumented at all by the wait event interface.
Acknowledgements: I would like to thank Tanel Poder and Frits Hoogland for original ideas that have inspired this work. Special thanks go to Frits Hoogland for providing comments and additional ideas from his deep expertise on the subject.
Download the scripts discussed in this post from the webpage with download material or from GitHub(link is external).
Tags: SystemTap, performance, Oracle, I/O