Trace Event Timeline, Part 3
August 11th, 2009This is part 3 of my series discussing a new technique for analyzing extended SQL trace files that I’m calling a “Trace Event Timeline”. The timeline is presented in a stacked bar chart format based on non-uniform intervals. Each stacked bar is the sum of the time spent in each recorded event during that particular timed interval. The intervals are non-uniform because they are taken only at points in between either WAIT events or recursive depth zero PARSE, EXEC or FETCH statements. A target interval length is derived by dividing the total trace file duration by the number of specified intervals, currently 20. When the running timeline of the trace file exceeds the target interval a new interval is started.
The purpose of the technique is to provide more information about when events occurred during the tracing period. Having just a resource profile of the trace file leaves out information about the event duration skew. An example might be having a large amount of ”SQL*Net message from client time” that exists almost completely at the start and end of the trace period. This might be in contrast to having that relatively evenly distributed across the entire interval. In the first case it might simply be a “scoping” problem with the start and end of the trace whereas the second example might be part of the actual performance issue (i.e. long client side processing separated by row fetches from the database).
As I mentioned in part 1 of this series, the idea for an event timeline occurred to me while I looking at a process that would do a bunch of database work very fast and then pause for a bit and would then speed up again. All sorts of conjecture about database overload were made, but all indications from a workload perspective showed that the database server was barely doing any work at all. The following shows the resource profile from a trace taken during the period that this behavior occurred:
Total Event % Event Total --------------- Cumulative ----------------
Event Name Time (sec) Time Events Avg Time(sec) Min Time(sec) Max Time(sec)
---------------------------- ----------- ------- -------- ------------- ------------- -------------
SQL*Net message from client 255.000799 48.11 152,647 0.001671 0.000138 0.290118
CPU Service 244.962734 46.22 864,046 0.000284 0.000000 0.054992
unaccounted-for 18.943126 3.57
db file sequential read 10.345820 1.95 3,262 0.003172 0.000017 0.295407
SQL*Net message to client 0.364421 0.07 152,647 0.000002 0.000000 0.002573
log file switch completion 0.285168 0.05 4 0.071292 0.003946 0.239785
log file sync 0.130075 0.02 28 0.004646 0.000005 0.017599
latch: library cache 0.002779 0.00 36 0.000077 0.000005 0.001123
latch: cache buffers chains 0.002359 0.00 8 0.000295 0.000005 0.000621
SQL*Net more data from client 0.001583 0.00 201 0.000008 0.000005 0.000051
buffer busy waits 0.000276 0.00 6 0.000046 0.000017 0.000144
latch: session allocation 0.000082 0.00 1 0.000082 0.000082 0.000082
latch: shared pool 0.000062 0.00 1 0.000062 0.000062 0.000062
SQL*Net more data to client 0.000034 0.00 1 0.000034 0.000034 0.000034
----------------------------- ----------- ------- --------- ------------- ------------- -------------
Total 530.039318 100.00 1,172,888
Notice that 94% of the time was spent almost equally in just two events, “SQL*Net message from client” and “CPU service”. One might assume from looking at this resource profile, and even the individual events with a tkprof output, that the “SQL*Net message from client” could just be a scoping problem. Perhaps we started the trace too early or ended it too late (i.e. before the process started or after it ended).[1] However, from the trace event timeline shown below we see that this is not the case at all:

What we do see is that the “EXEC” event (i.e. CPU service for EXEC calls) dominates the timeline and then virtually stops for a short period of time, when we see the “SQL*Net message from client” event accumulate most of the time, and then the “EXEC” event takes over again. In fact, this mirrors the behavior that was observed during the actual running of the process, and from this information we can be pretty sure that the delay or “slowdown” is not happening because of the database. If there were contention or some resource shortage we would see that event and not the “SQL*Net message from client” event.
Hopefully I have provided a reasonably clear picture of what a trace event timeline is, how it can be created, and what it can be used for. We’ll have to see if this becomes a useful technique for those problems that are obscured by the skew of event time across a trace file timeline.
[1] Typically an “idle” database process will accumulate “SQL*Net message from client” time waiting for work to be initiated by the client connection.

Entries