Friday, May 10, 2013

dtrace

DTrace Introduction

DTrace is Solaris 10's Dynamic Tracing facility. It allows us to peer into the innards of running processes and customize our view to exclude extraneous information and close in on the source of a problem.

DTrace also has capabilities that allow us to examine a crash dump or trace the boot process.

A number of freely available scripts have been made available as the DTrace Toolkit. The toolkit provides both programming examples and also extremely useful tools for different types of system monitoring.

The DTrace facility provides data to a number of consumers, including commands such as dtrace and lockstat, as well as programs calling libraries that access DTrace through the dtrace kernel driver.

Probes

DTrace is built on a foundation of objects called probes. Probes are event handlers that fire when their particular event occurs. DTrace can bind a particular action to the probe to make use of the information.

Probes report on a variety of information about their event. For example, a probe for a kernel function may report on arguments, global variables, timestamps, stack traces, currently running processes or the thread that called the function.

Kernel modules that enable probes are packaged into sets known as providers. In a DTrace context, a module is a kernel module (for kernel probes) or a library name (for applications). A function in DTrace refers to the function associated with a probe, if it belongs to a program location.

Probes may be uniquely addressed by a combination of the provider, module, function and name. These are frequently organized into a 4-tuple when invoked by the dtrace command.

Alternatively, each probe has a unique integer identifier, which can vary depending on Solaris patch level.

These numbers, as well as the provider, module, function and name, can be listed out through the dtrace -l command. The list will vary from system to system, depending on what is installed. Probes can be listed by function, module or name by specifying it with the -f, -m or -n options, respectively.

Running a dtrace without a -l, but with a -f, -m or -n option, enables all matching probes. All the probes in a provider can be enabled by using the -P option. An individual probe can be enabled by using its 4-tuple with the -n option.

(Note: Do not enable more probes than necessary. If too many probes are enabled, it may adversely impact performance. This is particularly true of sched probes.)

Some probes do not list a module or function. These are called "unanchored" probes. Their 4-tuple just omits the nonexistent information.

Providers

Providers are kernel modules that create related groups of probes. The most commonly referenced providers are:

  • fbt: (Function Boundary Tracing) Implements probes at the entry and return points of almost all kernel functions.
  • io: Implements probes for I/O-related events.
  • pid: Implements probes for user-level processes at entry, return and instruction.
  • proc: Implements probes for process creation and life-cycle events.
  • profile: Implements timer-driven probes.
  • sched: Implements probes for scheduling-related events.
  • sdt: (Statistically Defined Tracing) Implements programmer-defined probes at arbitrary locations and names within code. Obviously, the programmer should define names whose meaning is intuitively clear.
  • syscall: Implements entry and return probes for all system calls.
  • sysinfo: Probes for updates to the sys kstat.
  • vminfo: Probes for updates to the vm kstat.

Command Components

The dtrace command has several components:

  • A 4-tuple identifier:
    provider:module :function:name
    Leaving any of these blank is equivalent to using a wildcard match. (If left blank, the left-most members of the 4-tuple are optional.)
  • A predicate determines whether the action should be taken. They are enclosed in slashes: /predicate/. The predicate is a C-style relational expression which must evaluate to an integer or pointer. If omitted, the action is executed when the probe fires. Some predicate examples are:
    • executable name matches csh:
      /execname == "csh"/
    • process ID does not match 1234:
      /pid != 1234/
    • arg0 is 1 and arg1 is not 0:
      /arg0 == 1 && arg1 !=0/
  • An action (in the D scripting language) to be taken when the probe fires and the predicate is satisfied. Typically, this is listed in curly brackets: {}

Several command examples are provided at the bottom of the page.

D Scripting Language

In order to deal with operations that can become confusing on a single command line, a D script can be saved to a file and run as desired. A D script will have one or more probe clauses, which consist of one or more probe-descriptions, along with the associated predicates and actions:

#!/usr/sbin/dtrace -s
probe-description[, probe-description...]
/predicate/
{
action; [action; ...]
}

The probe-description section consists of one or more 4-tuple identifiers. If the predicate line is not present, it is the same as a predicate that is always true. The action(s) specified are to be run if the probe fires and the predicate is true.

Each recording action dumps data to a trace buffer. By default, this is the principal buffer.

Several programming examples are provided at the bottom of the page.

D Variables

D specifies both associative arrays and scalar variables. Storage for these variables is not pre-allocated. It is allocated when a non-zero value is assigned and deallocated when a zero value is assigned.

D defines several built-in variables, which are frequently used in creating predicates and actions. The most commonly used built-in variables for D are the following:

  • args[]: The args[] array contains the arguments, specified from 0 to the number of arguments less one. These can also be specified by argn, where this is the n+1th argument.
  • curpsinfo: psinfo structure of current process.
  • curthread: pointer to the current thread's kthread_t
  • execname: Current executable name
  • pid: Current process ID
  • ppid: Parent process ID
  • probefunc: function name of the current probe
  • probemod: module name of the current probe
  • probename: name of the current probe
  • timestamp: Time since boot in ns

Variable scope can be global, thread-local or clause-local.

Thread-local variables allow separate storage for each thread's copy of that variable. They are referenced with names of the form:
self->variable-name

Associative arrays can be indexed by an arbitrary name. There is no pre-defined limit on the number of elements.

Scalar variables hold a single data value.

We can also access Solaris kernel symbols by specifying them in backquotes.

Providers define arguments based on their own requirements. Some of the more useful such arguments are listed below:

Provider-Specific Variables
io
arg[0] Pointer to a bufinfo structure.
arg[0]->b_bcount Byte count.
arg[0]->b_resid Bytes not transferred.
arg[0]->b_iodone I/O completion routine.
arg[0]->b_edev Extended device.
arg[1] Pointer to a devinfo structure.
arg[1]->dev_major Major number.
arg[1]->dev_minor Minor number.
arg[1]->dev_instance Instance number.
arg[1]->dev_name Device name.
arg[1]->dev_pathname Device pathname.
arg[2] Pointer to a fileinfo structure.
arg[2]->fi_name File name.
arg[2]->fi_dirname File directory location.
arg[2]->fi_pathname Full path to file.
arg[2]->fi_offset Offset within a file.
arg[2]->fi_fs Filesystem.
arg[2]->fi_mount Filesystem mount point.
pid
arg0-argn (entry) For entry probes, arg0-argn represent the arguments.
arg0-arg1 (return) For return probles, arg0-arg1 represent the return codes.
syscall
arg0-argn (entry) For entry probes, arg0-argn represent the arguments.
arg0-arg1 (return) For return probles, arg0-arg1 represent the return codes.
sysinfo
arg0 Value of statistic increment
arg1 Pointer to the current value of the statistic before increment.
arg2 Pointer to the cpu_t structure incrementing the statistic. Defined in sys/cpuvar.h
vminfo
arg0 Value of statistic increment
arg1 Pointer to the current value of the statistic before increment.
The full list for each provider can be found in the provider's chapter of the Solaris Dynamic Tracing Guide.

Actions

The most commonly used built-in actions are:

  • breakpoint(): System stops and transfers control to kernel debugger.
  • chill(number-nanoseconds): DTrace spins for the specified number of nanoseconds.
  • copyinstr(pointer): Returns null terminated string from address space referenced by pointer.
  • copyout(buffer, address, number-bytes): Copies number-bytes from the buffer to a memory address.
  • copyoutstr(string, address, max-length): Copies a string to a memory address.
  • normalize(aggregation, normalization-factor): Divides aggregation values by the normalization-factor.
  • panic(): Panics the kernel; may be used to generate a core dump.
  • printf(format, arguments): Dumps the arguments to the buffer in the specified format. printa(aggregation) does the same thing for aggregation data.
  • raise(signal): Sends the signal to the current process.
  • stack(number-frames): Copies the specified number of frames of the kernel thread's stack to the buffer.
  • stop(): Stops the process that fired the probe.
  • stringof(): Converts values to DTrace string values.
  • system(command): Runs a program as if from the shell.
  • trace(D-expression): Dumps the output of D-expression to the trace buffer.
  • tracemem(address, size_t number-bytes): Dumps the contents from the memory address to the buffer.
  • trunc(aggregation): Truncates or removes the contents of the specified aggregation.
  • ustack(number-frames): Copies the specified number of frames of the user stack to the buffer.

Multiple actions in a probe clause can be combined using a semicolon between them inside the curly brackets.

Aggregations

Aggregating functions allow multiple data points to be combined and reported. Aggregations take the form:
@name[ keys ] = aggregating-function( arguments );
Here, the name is a name assigned to the aggregation, the keys are a comma-separated list of D expressions which index the output, the arguments are a comma-separated list and the aggregating functions may be one of the following:

  • avg: Average of the expressions in the arguments.
  • count: Number of times that the function is called.
  • lquantize: The arguments are a scalar expression, a lower bound, an upper bound and a step value. This function increments the value in the highest linearly-sized bucket that is less than the expression.
  • max: The largest value among the arguments.
  • min: The smallest value among the arguments.
  • quantize: Increments the value in the highest power of two bucket less than the expression in the argument.
  • sum: Total value of the expressions in the arguments.

Directives

Directives provide options to a D script that can increase readability of the output. Each option is enabled by including a line like the following at the beginning of a D script:
#pragma D option option-name

The following are the most commonly-used directives.

  • flowindent: Indentation increased on function entry; decreased on function return.
  • quiet: Don't print anything not explicitly specified.

Examples

To report on each active PID for the readch function:
dtrace -n 'readch {trace(pid)}'

To report on each system call's entry time:
dtrace -n 'syscall:::entry {trace(timestamp)}'

Tracing executable names (stored on a UFS filesystem):
dtrace -m 'ufs {trace(execname)}'

To print all functions from libc for process 123:
dtrace -n 'pid123:libc::entry'

To count the occurrences of each libc function for PID 123:

#!/usr/sbin/dtrace -s
pid123:libc::entry
{
@function_count[probefunc]=count();
}
(Once this script is stopped with control-c, the results will be printed out.)

To identify how much time is spent in each function, we would add a probe-clause as follows. In the below, the timestmp[] array stores the timestamps for each function as it is entered; the aggregation function_duration[] stores the sum total of durations for invocations of each function. Also note that the scope of the timestmp[] array is limited to a single thread to deal with the possibility of multiple concurrent copies of the same function:

#!/usr/sbin/dtrace -s
pid123:libc::entry
{
self->timestmp[probefunc] = timestmp;
}
pid123:libc::return
/self->timestmp[probefunc] != 0/
{
@function_duration[probefunc] = sum(timestmp - self->timestmp[probefunc]);
timestmp[probefunc] = 0;
}

The following script prints out the name of every program that calls the exec or exece system calls, as well as the name of the program being executed:

#!/usr/sbin/dtrace -s
syscall::exec*:entry
{
trace(execname);
trace(copyinstr(arg0));
}

While this script is straightforward, it may fail if arg0 has been written to disk as part of paging activity. (See Bennett, Part 2.) In that event, it may make more sense to use a variable to hold the pointer until return pulls any pointers back into memory. The below also uses printf to format the output and turn off default output with "quiet:"
#!/usr/sbin/dtrace -s
#pragma D option quiet
syscall::exec*:entry
{
self->prog = copyinstr(arg0);
self->exn = execname;
}
syscall::exec*:return
/ self->prog != NULL /
{
printf ("%-20s %s\n", self->exn, self->prog);
self->prog = 0;
self->exn = 0;
}

DTrace Toolkit

A number of freely available scripts have been made available as the DTrace Toolkit. The toolkit provides both programming examples and also extremely useful tools for different types of system monitoring.

DTrace Toolkit
Script Name Description
anonpgpid.d Attempts to identify which processes are suffering the most from a system that is hard swapping
bitesize.d Provides graphs of distributions of different I/O sizes
connections Displays server process that accepts each inbound TCP connection.
cputypes.d Reports on types of CPUs on the system.
cpuwalk.d Reports on which CPUs a process runs on.
cswstat.d Reports on context switches and time consumed.
dapptrace Traces user and library function usage. Similar to apptrace, but also gets elapsed and CPU times
dispqlen.d Measures dispatcher queue length (CPU saturation).
dnlcps.d Measures DNLC hits and misses by process.
dnlcsnoop.d Real time record of target, process and result of DNLC lookups.
dtruss truss replacement without the performance hit.
filebyproc.d Snoops files opened by process name.
fsrw.d Traces I/O events at a system call level.
hotspot.d Identifies disk "hot spots."
inttimes.d Reports on time spent servicing interrupts for each device.
iofile.d I/O wait times for each file by process.
iofileb.d I/O size for each file by process.
iopattern System-wide disk I/O usage patterns.
iosnoop Tracks system I/O activity.
iotop Displays processes with highest I/O traffic.
lockbydist.d Lock distribution by process.
lockbyproc.d Lock times by process.
nfswizard.d Identifies top NFS filename requests; reports on access and performance statistics.
opensnoop Snoops open files.
pfilestat I/O statistics for each file descriptor in a process.
priclass.d Reports distribution of thread priorites by class.
pridist.d Reports distribution of thread priorities by class.
procsystime Process system call details; elapsed time, CPU time, counts, etc.
rfileio.d Read size statistics from file systems and physical disks, along with a total miss rate for the file system cache.
rfsio Read size statistics from file systems and physical disks, along with a total miss rate for the file system cache.
rwsnoop Captures read/write activity, including identifying the source processes.
rwtop Displays processes with top read/write activity.
sampleproc Reports which process is on which CPU how much of the time.
seeksize.d Directly measure seek lengths of I/Os.
swapinfo.d Reports a summary of virtual memory use.
tcpstat.d Reports TCP error and traffic statistics.
tcpsnoop.d Snoops TCP packets and associates them with a port and a process.
tcptop Displays to TCP packet-generating processes.
threaded.d Measures effectiveness of thread utilization.
tcpsyscall Reports on busiest system calls.
udpstat.d Reports UDP error and traffic statistics.
vopstat Function-level timings of I/Os.
xcallsbypid.d Provides by-process cross-call information.
zvmstat Zone-specific vmstat.

Additional Resources

No comments: