DTrace Introduction
DTrace is Solaris 10's Dynamic Tracing facility. It allows us to peer into the innards of running processes and customize our view to exclude extraneous information and close in on the source of a problem.
DTrace also has capabilities that allow us to examine a crash dump or trace the boot process.
A number of freely available scripts have been made available as the DTrace Toolkit. The toolkit provides both programming examples and also extremely useful tools for different types of system monitoring.
The DTrace facility provides data to a number of consumers,
including commands such as dtrace
and
lockstat
, as well as programs calling libraries
that access DTrace through the dtrace
kernel driver.
Probes
DTrace is built on a foundation of objects called probes. Probes are event handlers that fire when their particular event occurs. DTrace can bind a particular action to the probe to make use of the information.
Probes report on a variety of information about their event. For example, a probe for a kernel function may report on arguments, global variables, timestamps, stack traces, currently running processes or the thread that called the function.
Kernel modules that enable probes are packaged into sets known as providers. In a DTrace context, a module is a kernel module (for kernel probes) or a library name (for applications). A function in DTrace refers to the function associated with a probe, if it belongs to a program location.
Probes may be uniquely addressed by a combination of the provider,
module, function and name. These are frequently organized into a
4-tuple when invoked by the
dtrace
command.
Alternatively, each probe has a unique integer identifier, which can vary depending on Solaris patch level.
These numbers, as well as the provider, module, function and
name, can be listed out through the dtrace -l
command. The list will vary from system to system, depending
on what is installed. Probes can be listed by function, module
or name by specifying it with the -f
, -m
or -n
options, respectively.
Running a dtrace
without a -l
, but
with a -f
, -m
or -n
option,
enables all matching probes. All the probes in a provider can be
enabled by using the -P
option. An individual probe
can be enabled by using its 4-tuple with the -n
option.
(Note: Do not enable more probes than necessary.
If too many probes are enabled, it may adversely impact performance.
This is particularly true of sched
probes.)
Some probes do not list a module or function. These are called "unanchored" probes. Their 4-tuple just omits the nonexistent information.
Providers
Providers are kernel modules that create related groups of probes.
The most commonly referenced providers are:
fbt
: (Function Boundary Tracing) Implements probes at the entry and return points of almost all kernel functions.io
: Implements probes for I/O-related events.pid
: Implements probes for user-level processes at entry, return and instruction.proc
: Implements probes for process creation and life-cycle events.profile
: Implements timer-driven probes.sched
: Implements probes for scheduling-related events.sdt
: (Statistically Defined Tracing) Implements programmer-defined probes at arbitrary locations and names within code. Obviously, the programmer should define names whose meaning is intuitively clear.syscall
: Implements entry and return probes for all system calls.sysinfo
: Probes for updates to the sys kstat.vminfo
: Probes for updates to the vm kstat.
Command Components
The dtrace command has several components:
- A 4-tuple identifier:
provider:module :function:name
Leaving any of these blank is equivalent to using a wildcard match. (If left blank, the left-most members of the 4-tuple are optional.) - A predicate determines whether the action should be taken.
They are enclosed in slashes: /predicate/. The predicate
is a C-style relational expression which must evaluate to an
integer or pointer. If omitted, the action is executed when the
probe fires. Some predicate examples are:
- executable name matches csh:
/execname == "csh"/
- process ID does not match 1234:
/pid != 1234/
arg0
is 1 andarg1
is not 0:
/arg0 == 1 && arg1 !=0/
- executable name matches csh:
- An action (in the D scripting language) to be taken when the probe fires and the predicate is satisfied. Typically, this is listed in curly brackets: {}
Several command examples are provided at the bottom of the page.
D Scripting Language
In order to deal with operations that can become confusing on a single command line, a D script can be saved to a file and run as desired. A D script will have one or more probe clauses, which consist of one or more probe-descriptions, along with the associated predicates and actions:
#!/usr/sbin/dtrace -s
probe-description[, probe-description...]
/
predicate/
{
action; [action; ...]
}
The probe-description section consists of one or more 4-tuple identifiers. If the predicate line is not present, it is the same as a predicate that is always true. The action(s) specified are to be run if the probe fires and the predicate is true.
Each recording action dumps data to a trace buffer. By default, this is the principal buffer.
Several programming examples are provided at the bottom of the page.
D Variables
D specifies both associative arrays and scalar variables. Storage for these variables is not pre-allocated. It is allocated when a non-zero value is assigned and deallocated when a zero value is assigned.
D defines several built-in variables, which are frequently used in creating predicates and actions. The most commonly used built-in variables for D are the following:
args[]
: Theargs[]
array contains the arguments, specified from 0 to the number of arguments less one. These can also be specified byargn
, where this is then+1
th argument.curpsinfo
:psinfo
structure of current process.curthread
: pointer to the current thread'skthread_t
execname
: Current executable namepid
: Current process IDppid
: Parent process IDprobefunc
: function name of the current probeprobemod
: module name of the current probeprobename
: name of the current probetimestamp
: Time since boot in ns
Variable scope can be global, thread-local or clause-local.
Thread-local variables allow separate storage for each thread's
copy of that variable. They are referenced with names of the form:
self->
variable-name
Associative arrays can be indexed by an arbitrary name. There is no pre-defined limit on the number of elements.
Scalar variables hold a single data value.
We can also access Solaris kernel symbols by specifying them in backquotes.
Providers define arguments based on their own requirements. Some of the more useful such arguments are listed below:
Provider-Specific Variables | |
---|---|
io | |
arg[0] |
Pointer to a bufinfo structure. |
arg[0]->b_bcount |
Byte count. |
arg[0]->b_resid |
Bytes not transferred. |
arg[0]->b_iodone |
I/O completion routine. |
arg[0]->b_edev |
Extended device. |
arg[1] |
Pointer to a devinfo structure. |
arg[1]->dev_major |
Major number. |
arg[1]->dev_minor |
Minor number. |
arg[1]->dev_instance |
Instance number. |
arg[1]->dev_name |
Device name. |
arg[1]->dev_pathname |
Device pathname. |
arg[2] |
Pointer to a fileinfo structure. |
arg[2]->fi_name |
File name. |
arg[2]->fi_dirname |
File directory location. |
arg[2]->fi_pathname |
Full path to file. |
arg[2]->fi_offset |
Offset within a file. |
arg[2]->fi_fs |
Filesystem. |
arg[2]->fi_mount |
Filesystem mount point. |
pid | |
arg0-argn (entry) |
For entry probes, arg0-argn represent the
arguments. |
arg0-arg1 (return) |
For return probles, arg0-arg1 represent
the return codes. |
syscall | |
arg0-argn (entry) |
For entry probes, arg0-argn represent the
arguments. |
arg0-arg1 (return) |
For return probles, arg0-arg1 represent
the return codes. |
sysinfo | |
arg0 |
Value of statistic increment |
arg1 |
Pointer to the current value of the statistic before increment. |
arg2 |
Pointer to the cpu_t structure incrementing
the statistic. Defined in sys/cpuvar.h |
vminfo | |
arg0 |
Value of statistic increment |
arg1 |
Pointer to the current value of the statistic before increment. |
Actions
The most commonly used built-in actions are:
breakpoint()
: System stops and transfers control to kernel debugger.chill(
number-nanoseconds)
: DTrace spins for the specified number of nanoseconds.copyinstr(
pointer)
: Returns null terminated string from address space referenced by pointer.copyout(
buffer, address, number-bytes)
: Copies number-bytes from the buffer to a memory address.copyoutstr(
string, address, max-length)
: Copies a string to a memory address.normalize(
aggregation, normalization-factor)
: Divides aggregation values by the normalization-factor.panic()
: Panics the kernel; may be used to generate a core dump.printf(
format, arguments)
: Dumps the arguments to the buffer in the specified format.printa(
aggregation)
does the same thing for aggregation data.raise(
signal)
: Sends the signal to the current process.stack(
number-frames)
: Copies the specified number of frames of the kernel thread's stack to the buffer.stop()
: Stops the process that fired the probe.stringof()
: Converts values to DTrace string values.system(
command)
: Runs a program as if from the shell.trace(
D-expression)
: Dumps the output of D-expression to the trace buffer.tracemem(
address,size_t
number-bytes)
: Dumps the contents from the memory address to the buffer.trunc(
aggregation)
: Truncates or removes the contents of the specified aggregation.ustack(
number-frames)
: Copies the specified number of frames of the user stack to the buffer.
Multiple actions in a probe clause can be combined using a semicolon between them inside the curly brackets.
Aggregations
Aggregating functions allow multiple data points to be combined
and reported. Aggregations take the form:
@name
[ keys ] = aggregating-function(
arguments );
Here, the name is a name assigned to the aggregation, the keys
are a comma-separated list of D expressions which index the output,
the arguments are a comma-separated list and the aggregating
functions may be one of the following:
avg
: Average of the expressions in the arguments.count
: Number of times that the function is called.lquantize
: The arguments are a scalar expression, a lower bound, an upper bound and a step value. This function increments the value in the highest linearly-sized bucket that is less than the expression.max
: The largest value among the arguments.min
: The smallest value among the arguments.quantize
: Increments the value in the highest power of two bucket less than the expression in the argument.sum
: Total value of the expressions in the arguments.
Directives
Directives provide options to a D script that can increase readability of
the output. Each option is enabled by including a line like the
following at the beginning of a D script:
#pragma D option
option-name
The following are the most commonly-used directives.
flowindent
: Indentation increased on function entry; decreased on function return.quiet
: Don't print anything not explicitly specified.
Examples
To report on each active PID for the readch
function:
dtrace -n 'readch {trace(pid)}'
To report on each system call's entry time:
dtrace -n 'syscall:::entry {trace(timestamp)}'
Tracing executable names (stored on a UFS filesystem):
dtrace -m 'ufs {trace(execname)}'
To print all functions from libc
for process 123:
dtrace -n 'pid123:libc::entry'
To count the occurrences of each libc
function for PID 123:
#!/usr/sbin/dtrace -s
pid123:libc::entry
{
@function_count[probefunc]=count();
}
(Once this script is stopped with control-c, the results will be printed
out.)
To identify how much time is spent in each function, we would
add a probe-clause as follows. In the below, the
timestmp[]
array stores the timestamps
for each function as it is entered; the aggregation
function_duration[]
stores the sum total
of durations for invocations of each function. Also
note that the scope of the timestmp[]
array
is limited to a single thread to deal with the possibility
of multiple concurrent copies of the same function:
#!/usr/sbin/dtrace -s
pid123:libc::entry
{
self->timestmp[probefunc] = timestmp;
}
pid123:libc::return
/self->timestmp[probefunc] != 0/
{
@function_duration[probefunc] = sum(timestmp - self->timestmp[probefunc]);
timestmp[probefunc] = 0;
}
The following script prints out the name of every program
that calls the exec
or exece
system calls, as well as the name of the program
being executed:
While this script is straightforward, it may fail if#!/usr/sbin/dtrace -s
syscall::exec*:entry
{
trace(execname);
trace(copyinstr(arg0));
}
arg0
has been written to disk as part of
paging activity. (See Bennett, Part 2.) In that event,
it may make more sense to use a variable to hold the
pointer until return
pulls any pointers
back into memory. The below also uses printf to format
the output and turn off default output with "quiet:"
#!/usr/sbin/dtrace -s
#pragma D option quiet
syscall::exec*:entry
{
self->prog = copyinstr(arg0);
self->exn = execname;
}
syscall::exec*:return
/ self->prog != NULL /
{
printf ("%-20s %s\n", self->exn, self->prog);
self->prog = 0;
self->exn = 0;
}
DTrace Toolkit
A number of freely available scripts have been made available as the DTrace Toolkit. The toolkit provides both programming examples and also extremely useful tools for different types of system monitoring.
DTrace Toolkit | |
---|---|
Script Name | Description |
anonpgpid.d |
Attempts to identify which processes are suffering the most from a system that is hard swapping |
bitesize.d |
Provides graphs of distributions of different I/O sizes |
connections |
Displays server process that accepts each inbound TCP connection. |
cputypes.d |
Reports on types of CPUs on the system. |
cpuwalk.d |
Reports on which CPUs a process runs on. |
cswstat.d |
Reports on context switches and time consumed. |
dapptrace |
Traces user and library function usage. Similar to
apptrace , but also gets elapsed and CPU
times |
dispqlen.d |
Measures dispatcher queue length (CPU saturation). |
dnlcps.d |
Measures DNLC hits and misses by process. |
dnlcsnoop.d |
Real time record of target, process and result of DNLC lookups. |
dtruss |
truss replacement without the performance
hit. |
filebyproc.d |
Snoops files opened by process name. |
fsrw.d |
Traces I/O events at a system call level. |
hotspot.d |
Identifies disk "hot spots." |
inttimes.d |
Reports on time spent servicing interrupts for each device. |
iofile.d |
I/O wait times for each file by process. |
iofileb.d |
I/O size for each file by process. |
iopattern |
System-wide disk I/O usage patterns. |
iosnoop |
Tracks system I/O activity. |
iotop |
Displays processes with highest I/O traffic. |
lockbydist.d |
Lock distribution by process. |
lockbyproc.d |
Lock times by process. |
nfswizard.d |
Identifies top NFS filename requests; reports on access and performance statistics. |
opensnoop |
Snoops open files. |
pfilestat |
I/O statistics for each file descriptor in a process. |
priclass.d |
Reports distribution of thread priorites by class. |
pridist.d |
Reports distribution of thread priorities by class. |
procsystime |
Process system call details; elapsed time, CPU time, counts, etc. |
rfileio.d |
Read size statistics from file systems and physical disks, along with a total miss rate for the file system cache. |
rfsio |
Read size statistics from file systems and physical disks, along with a total miss rate for the file system cache. |
rwsnoop |
Captures read/write activity, including identifying the source processes. |
rwtop |
Displays processes with top read/write activity. |
sampleproc |
Reports which process is on which CPU how much of the time. |
seeksize.d |
Directly measure seek lengths of I/Os. |
swapinfo.d |
Reports a summary of virtual memory use. |
tcpstat.d |
Reports TCP error and traffic statistics. |
tcpsnoop.d |
Snoops TCP packets and associates them with a port and a process. |
tcptop |
Displays to TCP packet-generating processes. |
threaded.d |
Measures effectiveness of thread utilization. |
tcpsyscall |
Reports on busiest system calls. |
udpstat.d |
Reports UDP error and traffic statistics. |
vopstat |
Function-level timings of I/Os. |
xcallsbypid.d |
Provides by-process cross-call information. |
zvmstat |
Zone-specific vmstat. |
Additional Resources
- Bennett, Chip. Learning DTrace-Parts 1-5. SysAdmin Magazine, Sep 2006-Jan 2007. Also at Laurus.
- Gregg, Brendan. DTrace Tools web page.
- McDougall, Richard; Mauro, Jim; Gregg, Brendan. Solaris Performance and Tools. October, 2006. Prentice Hall.
- DTrace User Guide. May, 2006. Sun Microsystems.
- Solaris Dynamic Tracing Guide. 2005. Sun Microsystems.
- Using DTrace from a Solaris 10 System. Sun Microsystems.
No comments:
Post a Comment