Friday, May 24, 2013

Paging

Solaris uses both common types of paging in its virtual memory system. These types are swapping (swaps out all memory associated with a user process) and demand paging (swaps out the not recently used pages). Which method is used is determined by comparing the amount of available memory with several key parameters:

physmem: physmem is the total page count of physical memory.
lotsfree: The page scanner is woken up when available memory falls below lotsfree. The default value for this is physmem/64 (or 512 KB, whichever is greater); it can be tuned in the /etc/system file if necessary. The page scanner runs in demand paging mode by default. The initial scan rate is set by the kernel parameter slowscan (which is 100 by default).
minfree: Between lotsfree and minfree, the scan rate increases linearly between slowscan and fastscan. (fastscan is determined experimentally by the system as the maximum scan rate that can be supported by the system hardware. minfree is set to desfree/2, and desfree is set to lotsfree/2 by default.) Each page scanner will run for desscan pages. This parameter is dynamically set based on the scan rate.
maxpgio: maxpgio (default 40 or 60) limits the rate at which I/O is queued to the swap devices. It is set to 40 for x86 architectures and 60 for SPARC architectures. With modern hard drives, maxpgio can safely be set to 100 times the number of swap disks.
throttlefree: When free memory falls below throttlefree (default minfree), the page_create routines force the calling process to wait until free pages are available.
pageout_reserve: When free memory falls below this value (default throttlefree/2), only the page daemon and the scheduler are allowed memory allocations.

The page scanner operates by first freeing a usage flag on each page at a rate reported as "scan rate" in vmstat and sar -g. After handspreadpages additional pages have been read, the page scanner checks to see whether the usage flag has been reset. If not, the page is swapped out. (handspreadpages is set dynamically in current versions of Solaris. Its maximum value is pageout_new_spread.)

Solaris 8 introduced an improved algorithm for handling file system page caching (for file systems other than ZFS). This new architecture is known as the cyclical page cache. It is designed to remove most of the problems with virtual memory that were previously caused by the file system page cache.

In the new algorithm, the cache of unmapped/inactive file pages is located on a cachelist which functions as part of the freelist.

When a file page is mapped, it is mapped to the relevant page on the cachelist if it is already in memory. If the referenced page is not on the cachelist, it is mapped to a page on the freelist and the file page is read (or “paged”) into memory. Either way, mapped pages are moved to the segmap file cache. Once all other freelist pages are consumed, additional allocations are taken from the cachelist on a least recently accessed basis. With the new algorithm, file system cache only competes with itself for memory. It does not force applications to be swapped out of primary memory as sometimes happened with the earlier OS versions. As a result of these changes, vmstat reports statistics that are more in line with our intuition. In particular, scan rates will be near zero unless there is a systemwide shortage of available memory. (In the past, scan rates would reflect file caching activity, which is not really relevant to memory shortfalls.)

Every active memory page in Solaris is associated with a vnode (which is a mapping to a file) and an offset (the location within that file). This references the backing store for the memory location, and may represent an area on the swap device, or it may represent a location in a file system. All pages that are associated with a valid vnode and offset are placed on the global page hash list.

vmstat -p reports paging activity details for applications (executables), data (anonymous) and file system activity.

The parameters listed above can be viewed and set dynamically via mdb, as below:

# mdb -kw Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba fcp fctl nca lofs zfs random logindmux ptm cpc fcip sppp crypto nfs ] > physmem/E physmem: physmem: 258887 > lotsfree/E lotsfree: lotsfree: 3984 > desfree/E desfree: desfree: 1992 > minfree/E minfree: minfree: 996 > throttlefree/E throttlefree: throttlefree: 996 > fastscan/E fastscan: fastscan: 127499 > slowscan/E slowscan: slowscan: 100 > handspreadpages/E handspreadpages: handspreadpages:127499 > pageout_new_spread/E pageout_new_spread: pageout_new_spread: 161760 > lotsfree/Z fa0 lotsfree: 0xf90 = 0xfa0 > lotsfree/E lotsfree: lotsfree: 4000

Wednesday, May 22, 2013

Segmentation Violations

Segmentation violations occur when a process references a memory address not mapped by any segment. The resulting SIGSEGV signal originates as a major page fault hardware exception identified by the processor and is translated by as_fault() in the address space layer.

When a process overflows its stack, a segmentation violation fault results. The kernel recognizes the violation and can extend the stack size, up to a configurable limit. In a multithreaded environment, the kernel does not keep track of each user thread's stack, so it cannot perform this function. The thread itself is responsible for stack SIGSEGV (stack overflow signal) handling.

(The SIGSEGV signal is sent by the threads library when an attempt is made to write to a write-protected page just beyond the end of the stack. This page is allocated as part of the stack creation request.)

It is often the case that segmentation faults occur because of resource restrictions on the size of a process's stack. See “Resource Management” for information about how to increase these limits.

See “Process Virtual Memory” for a more detailed description of the structure of a process's address space.

Monday, May 20, 2013

Measuring Memory Shortfalls

In the real world, memory shortfalls are much more devastating than having a CPU bottleneck. Two primary indicators of a RAM shortage are the scan rate and swap device activity. Here are some useful commands for monitoring both types of activity:

Memory Saturation: Scan Rate

sar -g
vmstat

Memory Saturation: Swap Space Usage and Paging Rates

In both cases, the high activity rate can be due to something that does not have a consistently large impact on performance. The processes running on the system have to be examined to see how frequently they are run and what their impact is. It may be possible to re-work the program or run the process differently to reduce the amount of new data being read into memory.

(Virtual memory takes two shapes in a Unix system: physical memory and swap space. Physical memory usually comes in DIMM modules and is frequently called RAM. Swap space is a dedicated area of disk space that the operating system addresses almost as if it were physical memory. Since disk I/O is much slower than I/O to and from memory, we would prefer to use swap space as infrequently as possible. Memory address space refers to the range of addresses that can be assigned, or mapped, to virtual memory on the system. The bulk of an address space is not mapped at any given point in time.)

We have to weigh the costs and benefits of upgrading physical memory, especially to accommodate an infrequently scheduled process. If the cost is more important than the performance, we can use swap space to provide enough virtual memory space for the application to run. If adequate total virtual memory space is not provided, new processes will not be able to open. (The system may report "Not enough space" or "WARNING: /tmp: File system full, swap space limit exceeded.")

Swap space is usually only used when physical memory is too small to accommodate the system's memory requirements. At that time, space is freed in physical memory by paging (moving) it out to swap space. (See “Paging” below for a more complete discussion of the process.)

If inadequate physical memory is provided, the system will be so busy paging to swap that it will be unable to keep up with demand. (This state is known as "thrashing" and is characterized by heavy I/O on the swap device and horrendous performance. In this state, the scanner can use up to 80% of CPU.)

When this happens, we can use the vmstat -p command to examine whether the stress on the system is coming from executables, application data or file system traffic. This command displays the number of paging operations for each type of data.

Scan Rate

When available memory falls below certain thresholds, the system attempts to reclaim memory that is being used for other purposes. The page scanner is the program that runs through memory to see which pages can be made available by placing them on the free list. The scan rate is the number of times per second that the page scanner makes a pass through memory. (The “Paging” section later in this chapter discusses some details of the page scanner's operation.) The page scanning rate is the main tipoff that a system does not have enough physical memory. We can use sar -g or vmstat to look at the scan rate. vmstat 30 checks memory usage every 30 seconds. (Ignore the summary statistics on the first line.) If page/sr is much above zero for an extended time, your system may be running short of physical memory. (Shorter sampling periods may be used to get a feel for what is happening on a smaller time scale.)

A very low scan rate is a sure indicator that the system is not running short of physical memory. On the other hand, a high scan rate can be caused by transient issues, such as a process reading large amounts of uncached data. The processes on the system should be examined to see how much of a long-term impact they have on performance. Historical trends need to be examined with sar -g to make sure that the page scanner has not come on for a transient, non-recurring reason.

A nonzero scan rate is not necessarily an indication of a problem. Over time, memory is allocated for caching and other activities. Eventually, the amount of memory will reach the lotsfree memory level, and the pageout scanner will be invoked. For a more thorough discussion of the paging algorithm, see “Paging” below.

Swap Device Activity

The amount of disk activity on the swap device can be measured using iostat. iostat -xPnce provides information on disk activity on a partition-by-partition basis. sar -d provides similar information on a per-physical-device basis, and vmstat provides some usage information as well. Where Veritas Volume Manager is used, vxstat provides per-volume performance information.

If there are I/O's queued for the swap device, application paging is occurring. If there is significant, persistent, heavy I/O to the swap device, a RAM upgrade may be in order.

Process Memory Usage

The /usr/proc/bin/pmap command can help pin down which process is the memory hog. /usr/proc/bin/pmap -x PID prints out details of memory use by a process.

Summary statistics regarding process size can be found in the RSS column of ps -ly or top.

dbx, the debugging utility in the SunPro package, has extensive memory leak detection built in. The source code will need to be compiled with the -g flag by the appropriate SunPro compiler.

ipcs -mb shows memory statistics for shared memory. This may be useful when attempting to size memory to fit expected traffic.

Friday, May 17, 2013

vmstat

The first line of vmstat represents a summary of information since boot time. To obtain useful real-time statistics, run vmstat with a time step (eg vmstat 30).

The vmstat output columns are as follows use the pagesize command to determine the size of the pages):

procs or kthr/r: Run queue length.
procs or kthr/b: Processes blocked while waiting for I/O.
procs or kthr/w: Idle processes which have been swapped.
memory/swap: Free, unreserved swap space (Kb).
memory/free: Free memory (Kb). (Note that this will grow until it reaches lotsfree, at which point the page scanner is started. See "Paging" for more details.)
page/re: Pages reclaimed from the free list. (If a page on the free list still contains data needed for a new request, it can be remapped.)
page/mf: Minor faults (page in memory, but not mapped). (If the page is still in memory, a minor fault remaps the page. It is comparable to the vflts value reported by sar -p.)
page/pi: Paged in from swap (Kb/s). (When a page is brought back from the swap device, the process will stop execution and wait. This may affect performance.)
page/po: Paged out to swap (Kb/s). (The page has been written and freed. This can be the result of activity by the pageout scanner, a file close, or fsflush.)
page/fr: Freed or destroyed (Kb/s). (This column reports the activity of the page scanner.)
page/de: Freed after writes (Kb/s). (These pages have been freed due to a pageout.)
page/sr: Scan rate (pages). Note that this number is not reported as a "rate," but as a total number of pages scanned.
disk/s#: Disk activity for disk # (I/O's per second).
faults/in: Interrupts (per second).
faults/sy: System calls (per second).
faults/cs: Context switches (per second).
cpu/us: User CPU time (%).
cpu/sy: Kernel CPU time (%).
cpu/id: Idle + I/O wait CPU time (%).

vmstat -i reports on hardware interrupts.

vmstat -s provides a summary of memory statistics, including statistics related to the DNLC, inode and rnode caches.

vmstat -S reports on swap-related statistics such as:

si: Swapped in (Kb/s).
so: Swap outs (Kb/s).

(Note that the man page for vmstat -s incorrectly describes the swap queue length. In Solaris 2, the swap queue length is the number of idle swapped-out processes. (In SunOS 4, this referred to the number of active swapped-out processes.)

Solaris 8


vmstat

under Solaris 8 will report different statistics than would be expected under an earlier version of Solaris due to a different paging algorithm:

Page Reclaim rate higher.
Higher reported Free Memory: A large component of the filesystem cache is reported as free memory.
Low Scan Rates: Scan rates will be near zero unless there is a systemwide shortage of available memory.

vmstat -p reports paging activity details for applications (executables), data (anonymous) and filesystem activity.

Thursday, May 16, 2013

sar

The word "sar" is used to refer to two related items:

The system activity report package
The system activity reporter

System Activity Report Package

This facility stores a great deal of performance data about a system. This information is invaluable when attempting to identify the source of a performance problem.

The Report Package can be enabled by uncommenting the appropriate lines in the sys crontab. The sa1 program stores performance data in the /var/adm/sa directory. sa2 writes reports from this data, and sadc is a more general version of sa1.

In practice, I do not find that the sa2-produced reports are terribly useful in most cases. Depending on the issue being examined, it may be sufficient to run sa1 at intervals that can be set in the sys crontab.

Alternatively, sar can be used on the command line to look at performance over different time slices or over a constricted period of time:

sar -A -o outfile 5 2000

(Here, "5" represents the time slice and "2000" represents the number of samples to be taken. "outfile" is the output file where the data will be stored.)

The data from this file can be read by using the "-f" option (see below).

System Activity Reporter

sar has several options that allow it to process the data collected by sa1 in different ways:

-a: Reports file system access statistics. Can be used to look at issues related to the DNLC.

iget/s: Rate of requests for inodes not in the DNLC. An iget will be issued for each path component of the file's path.

namei/s: Rate of file system path searches. (If the directory name is not in the DNLC, iget calls are made.)

dirbk/s: Rate of directory block reads.

-A: Reports all data.

-b: Buffer activity reporter:

bread/s, bwrit/s: Transfer rates (per second) between system buffers and block devices (such as disks).

lread/s, lwrit/s: System buffer access rates (per second).

%rcache, %wcache: Cache hit rates (%).

pread/s, pwrit/s: Transfer rates between system buffers and character devices.

-c: System call reporter:

scall/s: System call rate (per second).

sread/s, swrit/s, fork/s, exec/s: Call rate for these calls (per second).

rchar/s, wchar/s: Transfer rate (characters per second).

-d: Disk activity (actually, block device activity):

%busy: % of time servicing a transfer request.

avque: Average number of outstanding requests.

r+w/s: Rate of reads+writes (transfers per second).

blks/s: Rate of 512-byte blocks transferred (per second).

avwait: Average wait time (ms).

avserv: Average service time (ms). (For block devices, this includes seek rotation and data transfer times. Note that the iostat svc_t is equivalent to the avwait+avserv.)

-e HH:MM: CPU useage up to time specified.

-f filename: Use filename as the source for the binary sar data. The default is to use today's file from /var/adm/sa.

-g: Paging activity (see "Paging" for more details):
- pgout/s: Page-outs (requests per second).
- ppgout/s: Page-outs (pages per second).
- pgfree/s: Pages freed by the page scanner (pages per second).
- pgscan/s: Scan rate (pages per second).
- %ufs_ipf: Percentage of UFS inodes removed from the free list while still pointing at reuseable memory pages. This is the same as the percentage of igets that force page flushes.
-i sec: Set the data collection interval to i seconds.

-k: Kernel memory allocation:

sml_mem: Amount of virtual memory available for the small pool (bytes). (Small requests are less than 256 bytes)

lg_mem: Amount of virtual memory available for the large pool (bytes). (512 bytes-4 Kb)

ovsz_alloc: Memory allocated to oversize requests (bytes). Oversize requests are dynamically allocated, so there is no pool. (Oversize requests are larger than 4 Kb)

alloc: Amount of memory allocated to a pool (bytes). The total KMA useage is the sum of these columns.

fail: Number of requests that failed.

-m: Message and semaphore activities.

msg/s, sema/s: Message and semaphore statistics (operations per second).

-o filename: Saves output to filename.

-p: Paging activities.

atch/s: Attaches (per second). (This is the number of page faults that are filled by reclaiming a page already in memory.)

pgin/s: Page-in requests (per second) to file systems.

ppgin/s: Page-ins (per second). (Multiple pages may be affected by a single request.)

pflt/s: Page faults from protection errors (per second).

vflts/s: Address translation page faults (per second). (This happens when a valid page is not in memory. It is comparable to the vmstat-reported page/mf value.)

slock/s: Faults caused by software lock requests that require physical I/O (per second).

-q: Run queue length and percentage of the time that the run queue is occupied.

-r: Unused memory pages and disk blocks.

freemem: Pages available for use (Use pagesize to determine the size of the pages).

freeswap: Disk blocks available in swap (512-byte blocks).

-s time: Start looking at data from time onward.

-u: CPU utilization.

%usr: User time.

%sys: System time.

%wio: Waiting for I/O (does not include time when another process could be schedule to the CPU).

%idle: Idle time.

-v: Status of process, inode, file tables.

proc-sz: Number of process entries (proc structures) currently in use, compared with max_nprocs.

inod-sz: Number of inodes in memory compared with the number currently allocated in the kernel.

file-sz: Number of entries in and size of the open file table in the kernel.

lock-sz: Shared memory record table entries currently used/allocated in the kernel. This size is reported as 0 for standards compliance (space is allocated dynamically for this purpose).

ov: Overflows between sampling points.

-w: System swapping and switching activity.

swpin/s, swpot/s, bswin/s, bswot/s: Number of LWP transfers or 512-byte blocks per second.

pswch/s: Process switches (per second).

-y: TTY device activity.

rawch/s, canch/s, outch/s: Input character rate, character rate processed by canonical queue, output character rate.

rcvin/s, xmtin/s, mdmin/s: Receive, transmit and modem interrupt rates.

Wednesday, May 15, 2013

nfsstat

nfsstat can be used to examine NFS performance.

nfsstat -s reports server-side statistics. In particular, the following are important:

calls: Total RPC calls received.
badcalls: Total number of calls rejected by the RPC layer.
nullrecv: Number of times an RPC call was not available even though it was believed to have been received.
badlen: Number of RPC calls with a length shorter than that allowed for RPC calls.
xdrcall: Number of RPC calls whose header could not be decoded by XDR (External Data Representation).
readlink: Number of times a symbolic link was read.
getattr: Number of attribute requests.
null: Null calls are made by the automounter when looking for a server for a filesystem.
writes: Data written to an exported filesystem.

Sun recommends the following tuning actions for some common conditions:

writes > 10%: Write caching (either array-based or host-based, such as a Prestoserv card) would speed up operation.
badcalls >> 0: The network may be overloaded and should be checked out. The rsize and wsize mount options can be set on the client side to reduce the effect of a noisy network, but this should only be considered a temporary workaround.
readlink > 10%: Replace symbolic links with directories on the server.
getattr > 40%: The client attribute cache can be increased by setting the actimeo mount option. Note that this is not appropriate where the attributes change frequently, such as on a mail spool. In these cases, mount the filesystems with the noac option.

nfsstat -c reports client-side statistics. The following statistics are of particular interest:

calls: Total number of calls made.
badcalls: Total number of calls rejected by RPC.
retrans: Total number of retransmissions. If this number is larger than 5%, the requests are not reaching the server consistently. This may indicate a network or routing problem.
badxid: Number of times a duplicate acknowledgement was received for a single request. If this number is roughly the same as badcalls, the network is congested. The rsize and wsize mount options can be set on the client side to reduce the effect of a noisy network, but this should only be considered a temporary workaround.
If on the other hand, badxid=0, this can be an indication of a slow network connection.
timeout: Number of calls that timed out. If this is roughly equal to badxid, the requests are reaching the server, but the server is slow.
wait: Number of times a call had to wait because a client handle was not available.
newcred: Number of times the authentication was refreshed.
null: A large number of null calls indicates that the automounter is retrying the mount frequently. The timeo parameter should be changed in the automounter configuration.

nfsstat -m (from the client) provides server-based performance data.

srtt: Smoothed round-trip time. If this number is larger than 50ms, the mount point is slow.
dev: Estimated deviation.
cur: Current backed-off timeout value.
Lookups: If cur>80 ms, the requests are taking too long.
Reads: If cur>150 ms, the requests are taking too long.
Writes: If cur>250 ms, the requests are taking too long.

Tuesday, May 14, 2013

p-commands

p-Commands

In Unix, every object is either a file or a process. With the /proc virtual file system, even processes may be treated like files.

/proc (or procfs) is a virtual file system that allows us to examine processes like files. This means that /proc allows us to use file-like operations and intuitions when looking at processes. /proc does not occupy disk space; it is located in working memory. This structure was originally designed as a programming interface for writing debuggers, but it has grown considerably since then.

To avoid confusion, we will refer to the virtual file system as /proc or procfs. The man page for procfs is proc(4). proc, on the other hand, will be used to refer to the process data structure discussed in the Process Structure page.

Under /proc is a list of numbers, each of which is a Process ID (PID) for a process on our system. Under these directories are subdirectories referring to the different components of interest of each process. This directory structure can be examined directly, but we usually prefer to use commands written to extract information from this structure. These are known as the p-commands.

pcred: Display process credentials (eg EUID/EGID, RUID/RGID, saved UIDs/GIDs)
pfiles: Reports fstat() and fcntl() information for all open files. This includes information on the inode number, file system, ownership and size.
pflags: Prints the tracing flags, pending and held signals and other /proc status information for each LWP.
pgrep: Finds processes matching certain criteria.
pkill: Kills specified processes.
pldd: Lists dynamic libraries linked to the process.
pmap: Prints process address space map.
prun: Starts stopped processes.
prstat: Display process performance-related statistics.
ps: List process information.
psig: Lists signal actions.
pstack: Prints a stack trace for each LWP in the process.
pstop: Stops the process.
ptime: Times the command; does not time children.
ptree: Prints process genealogy.
pwait: Wait for specified processes to complete.
pwdx: Prints process working directory.

prstat Example 1

CPU Saturation is can be directly measured via prstat. (Saturation refers to a situation where there is not enough CPU capacity to adequately handle requests for processing resources.) Saturation can be measured directly by looking at the CPU latency time for each thread reported by prstat -mL. (LAT is reported as a percentage of the time that a thread is waiting to use a processor.)

This example shows the prstat -mL output from a single-CPU system that has been overloaded. Notice the load average and LAT numbers.

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 2724 root 24 0.2 0.0 0.0 0.0 0.0 2.2 74 284 423 361 0 gzip/1 2729 root 21 0.3 0.0 0.0 0.0 0.0 3.3 75 396 564 518 0 gzip/1 2733 root 20 0.3 0.0 0.0 0.0 0.0 5.3 75 391 514 484 0 gzip/1 2737 root 14 0.2 0.0 0.0 0.0 0.0 4.1 81 176 415 383 0 gzip/1 2730 root 3.3 0.3 0.0 0.0 0.0 0.0 96 0.7 602 258 505 0 gunzip/1 2734 root 2.9 0.3 0.0 0.0 0.0 0.0 92 4.5 522 280 457 0 gunzip/1 2738 root 2.7 0.2 0.0 0.0 0.0 0.0 93 3.9 377 147 370 0 gunzip/1 2725 root 2.4 0.2 0.0 0.0 0.0 0.0 95 2.4 495 179 355 0 gunzip/1 2728 root 0.1 1.4 0.0 0.0 0.0 0.0 97 1.7 769 11 2K 0 tar/1 2732 root 0.1 1.3 0.0 0.0 0.0 0.0 99 0.2 762 14 2K 0 tar/1 2723 root 0.0 1.1 0.0 0.0 0.0 0.0 99 0.1 564 7 1K 0 tar/1 2731 root 0.3 0.4 0.0 0.0 0.0 0.0 98 1.2 754 3 1K 0 tar/1 2735 root 0.3 0.4 0.0 0.0 0.0 0.0 98 0.9 722 0 1K 0 tar/1 2736 root 0.0 0.6 0.0 0.0 0.0 0.0 99 0.0 341 2 1K 0 tar/1 2726 root 0.3 0.3 0.0 0.0 0.0 0.0 98 1.0 473 145 1K 0 tar/1 2739 root 0.2 0.2 0.0 0.0 0.0 0.0 99 0.3 335 1 664 0 tar/1 2749 scromar 0.0 0.1 0.0 0.0 0.0 0.0 100 0.0 23 0 194 0 prstat/1 337 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 6 0 36 6 xntpd/1 2716 scromar 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 3 1 21 0 sshd/1 124 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 3 0 17 0 picld/4 119 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 21 0 63 0 nscd/26 Total: 51 processes, 164 lwps, load averages: 4.12, 2.13, 0.88

prstat Example 2

In this case, we sort prstat output to look for the processes with heavy memory utilization:

# prstat -s rss PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 471 juser 125M 58M sleep 59 0 4:26:46 0.6% java/17 200 daemon 62M 55M sleep 59 0 0:01:21 0.0% nfsmapid/4 18296 juser 116M 39M sleep 26 11 0:05:36 0.1% java/23 ... 254 root 3968K 1016K sleep 59 0 0:00:03 0.0% sshd/1 Total: 47 processes, 221 lwps, load averages: 0.20, 0.21, 0.20

Other Usage Examples

# ps -ef | grep more | grep -v grep root 18494 8025 0 08:53:09 pts/3 0:00 more # pgrep more 18494 # pmap -x 18494 18494: more Address Kbytes RSS Anon Locked Mode Mapped File 00010000 32 32 - - r-x-- more 00028000 8 8 8 - rwx-- more 0002A000 16 16 16 - rwx-- [ heap ] FF200000 864 824 - - r-x-- libc.so.1 FF2E8000 32 32 32 - rwx-- libc.so.1 FF2F0000 8 8 8 - rwx-- libc.so.1 FF300000 16 16 - - r-x-- en_US.ISO8859-1.so.3 FF312000 16 16 16 - rwx-- en_US.ISO8859-1.so.3 FF330000 8 8 - - r-x-- libc_psr.so.1 FF340000 8 8 8 - rwx-- [ anon ] FF350000 168 104 - - r-x-- libcurses.so.1 FF38A000 32 32 24 - rwx-- libcurses.so.1 FF392000 8 8 8 - rwx-- libcurses.so.1 FF3A0000 24 16 16 - rwx-- [ anon ] FF3B0000 184 184 - - r-x-- ld.so.1 FF3EE000 8 8 8 - rwx-- ld.so.1 FF3F0000 8 8 8 - rwx-- ld.so.1 FFBFC000 16 16 16 - rw--- [ stack ] -------- ------- ------- ------- ------- total Kb 1456 1344 168 - # pstack 18494 18494: more ff2c0c7c read (2, ffbff697, 1) 00015684 ???????? (0, 1, 43858, ff369ad4, 0, 28b20) 000149a4 ???????? (ffbff82f, 28400, 15000000, 28af6, 0, 28498) 00013ad8 ???????? (0, 28b10, 28c00, 400b0, ff2a4a74, 0) 00012780 ???????? (2a078, ff393050, 0, 28b00, 2a077, 6b) 00011c68 main (28b10, ffffffff, 28c00, 0, 0, 1) + 684 000115cc _start (0, 0, 0, 0, 0, 0) + 108 # pfiles 18494 18494: more Current rlimit: 256 file descriptors 0: S_IFIFO mode:0000 dev:292,0 ino:2083873 uid:0 gid:0 size:0 O_RDWR 1: S_IFCHR mode:0620 dev:284,0 ino:12582922 uid:1000 gid:7 rdev:24,3 O_RDWR|O_NOCTTY|O_LARGEFILE /devices/pseudo/pts@0:3 2: S_IFCHR mode:0620 dev:284,0 ino:12582922 uid:1000 gid:7 rdev:24,3 O_RDWR|O_NOCTTY|O_LARGEFILE /devices/pseudo/pts@0:3 # pcred 18494 18494: e/r/suid=0 e/r/sgid=0 groups: 0 1 2 3 4 5 6 7 8 9 12

Monday, May 13, 2013

netstat

netstat provides useful information regarding traffic flow.

In particular, netstat -i lists statistics for each interface, netstat -s provides a full listing of several counters, and netstat -rs provides routing table statistics. netstat -an reports all open ports.

netstat -k provides a useful summary of several network-related statistics up through Solaris 9, but this option was removed in Solaris 10 in favor of the /bin/kstat command. Through Solaris 9, netstat -k provides a listing of several component kstat statistics.

Here are some of the issues that can be revealed with netstat:

netstat -i: (Collis+Ierrs+Oerrs)/(Ipkts+Opkts) > 2%: This may indicate a network hardware issue.

netstat -i: (Collis/Opkts) > 10%: The interface is overloaded. Traffic will need to be reduced or redistributed to other interfaces or servers.

netstat -i: (Ierrs/Ipkts) > 25%: Packets are probably being dropped by the host, indicating an overloaded network (and/or server). Retransmissions can be dropped by reducing the rsize and wsize mount parameters to 2048 on the clients. Note that this is a temporary workaround, since this has the net effect of reducing maximum NFS throughput on the segment.

netstat -s: If significant numbers of packets arrive with bad headers, bad data length or bad checksums, check the network hardware.

netstat -i: If there are more than 120 collisions/second, the network is overloaded. See the suggestions above.

netstat -i: If the sum of input and output packets is higher than about 600 for a 10Mbs interface or 6000 for a 100Mbs interface, the network segment is too busy. See the suggestions above.
netstat -r: This form of the command provides the routing table. Make sure that the routes are as you expect them to be.

Sunday, May 12, 2013

Happy Mother's Day!

mpstat

mpstat reports information which is useful in understanding lock contention and CPU loading issues.

mpstat reports the following:

CPU: Processor ID
minf: Minor faults
mjf: Major faults
xcal: Processor cross-calls (when one CPU wakes up another by interrupting it). If this exceeds 200/second, the application in question may need to be examined.
intr: Interrupts.
ithr: Interrupts as threads (except clock).
csw: Context switches
icsw: Involuntary context switches (this is probably the more relevant statistic when examining performance issues.)
migr: Thread migrations to another processor. If the migr measurement of mpstat is greater than 500, rechoose_interval should be sent longer in the kernel.
smtx: Number of times a CPU failed to obtain a mutex.
srw: Number of times a CPU failed to obtain a read/write lock on the first try.
syscl: Number of system calls.
usr/sys/wt/idl: User/system/wait/idle CPU percentages.

Saturday, May 11, 2013

UFS File System Troubleshooting

Filesystem corruption can be detected and often repaired by the

format

and fsck commands. If the filesystem corruption is not due to an improper system shutdown, the hard drive hardware may need to be replaced.

ufs filesystems contain the following types of blocks:

boot block: This stores information used to boot the system.
superblock: Much of the filesystems internal information is stored in these.
inode: Stores location information about a file--everything except for the file name. The number of inodes in a filesystem can be changed from the default if newfs -i is used to create the filesystem.
data block: The file's data is stored in these.

fsck

The fsck command is run on each filesystem at boot time. This utility checks the internal consistency of the filesystem, and can make simple repairs on its own. More complex repairs require feedback from the root user, either in terms of a "y" keyboard response to queries, or invocation with the -y option.

If fsck cannot determine where a file belongs, the file may be renamed to its inode number and placed in the filesystem's lost+found directory. If a file is missing after a noisy fsck session, it may still be intact in the lost+found directory.

Sometimes the fsck command complains that it cannot find the superblock. Alternative superblock locations were created by newfs at the time that the filesystem was created. The newfs -N command can be invoked to nondestructively discover the superblock locations for the filesystem.

ufs filesystems can carry "state flags" that have the value of fsclean, fsstable, fsactive or fsbad (unknown). These can be used by fsck during boot time to skip past filesystems that are believed to be okay.

format

The analyze option of format can be used to examine the hard drive for flaws in a nondestructive fashion.

df

df can be used to check a filesystem's available space. Of particular interest is df -kl, which checks available space for all local filesystems and prints out the statistics in kilobytes. Solaris 10 also allows us to use df -h, which presents the statistics in a more human-friendly form that doesn't require counting digits to decide whether a file is 100M or 1G in size.

du

du can be used to check space used by a directory. In particular, du -dsk will report useage in kilobytes of a directory and its descendants, without including space totals from other filesystems.

Filesystem Tuning

Filesystem performance can be improved by looking at filesystem caching issues.

The following tuning parameters may be valuable in tuning filesystem performance with tunefs or mkfs/newfs:

inode count: The default is based upon an assumption of average file sizes of 2 KB. This can be set with mkfs/newfs at the time of filesystem creation.
time/space optimization: Optimization can be set to allow for fastest performance or most efficient space useage.
minfree: In Solaris 2.6+, this is set to (64 MB / filesystem size) x 100. Filesystems in earlier OS versions reserved 10%. This parameter specifies how much space is to be left empty in order to preserve filesystem performance.
maxbpg: This is the maximum number of blocks a file can leave in a single cylinder group. Increasing this limit can improve large file performance, but may have a negative impact on small file performance.

Filesystem Performance Monitoring

McDougall, Mauro and Gregg suggest that the best way to see if I/O is a problem at all is to look at the amount of time spent on POSIX read() and write() system calls via DTrace. If so, we need to look at the raw disk I/O performance.

Friday, May 10, 2013

dtrace

DTrace Introduction

DTrace is Solaris 10's Dynamic Tracing facility. It allows us to peer into the innards of running processes and customize our view to exclude extraneous information and close in on the source of a problem.

DTrace also has capabilities that allow us to examine a crash dump or trace the boot process.

A number of freely available scripts have been made available as the DTrace Toolkit. The toolkit provides both programming examples and also extremely useful tools for different types of system monitoring.

The DTrace facility provides data to a number of consumers, including commands such as dtrace and lockstat, as well as programs calling libraries that access DTrace through the dtrace kernel driver.

Probes

DTrace is built on a foundation of objects called probes. Probes are event handlers that fire when their particular event occurs. DTrace can bind a particular action to the probe to make use of the information.

Probes report on a variety of information about their event. For example, a probe for a kernel function may report on arguments, global variables, timestamps, stack traces, currently running processes or the thread that called the function.

Kernel modules that enable probes are packaged into sets known as providers. In a DTrace context, a module is a kernel module (for kernel probes) or a library name (for applications). A function in DTrace refers to the function associated with a probe, if it belongs to a program location.

Probes may be uniquely addressed by a combination of the provider, module, function and name. These are frequently organized into a 4-tuple when invoked by the dtrace command.

Alternatively, each probe has a unique integer identifier, which can vary depending on Solaris patch level.

These numbers, as well as the provider, module, function and name, can be listed out through the dtrace -l command. The list will vary from system to system, depending on what is installed. Probes can be listed by function, module or name by specifying it with the -f, -m or -n options, respectively.

Running a dtrace without a -l, but with a -f, -m or -n option, enables all matching probes. All the probes in a provider can be enabled by using the -P option. An individual probe can be enabled by using its 4-tuple with the -n option.

(Note: Do not enable more probes than necessary. If too many probes are enabled, it may adversely impact performance. This is particularly true of sched probes.)

Some probes do not list a module or function. These are called "unanchored" probes. Their 4-tuple just omits the nonexistent information.

Providers

Providers are kernel modules that create related groups of probes. The most commonly referenced providers are:

fbt: (Function Boundary Tracing) Implements probes at the entry and return points of almost all kernel functions.
io: Implements probes for I/O-related events.
pid: Implements probes for user-level processes at entry, return and instruction.
proc: Implements probes for process creation and life-cycle events.
profile: Implements timer-driven probes.
sched: Implements probes for scheduling-related events.
sdt: (Statistically Defined Tracing) Implements programmer-defined probes at arbitrary locations and names within code. Obviously, the programmer should define names whose meaning is intuitively clear.
syscall: Implements entry and return probes for all system calls.
sysinfo: Probes for updates to the sys kstat.
vminfo: Probes for updates to the vm kstat.

Command Components

The dtrace command has several components:

A 4-tuple identifier:
provider:module :function:name
Leaving any of these blank is equivalent to using a wildcard match. (If left blank, the left-most members of the 4-tuple are optional.)
A predicate determines whether the action should be taken. They are enclosed in slashes: /predicate/. The predicate is a C-style relational expression which must evaluate to an integer or pointer. If omitted, the action is executed when the probe fires. Some predicate examples are:
- executable name matches csh:
  /execname == "csh"/
- process ID does not match 1234:
  /pid != 1234/
- arg0 is 1 and arg1 is not 0:
  /arg0 == 1 && arg1 !=0/
An action (in the D scripting language) to be taken when the probe fires and the predicate is satisfied. Typically, this is listed in curly brackets: {}

Several command examples are provided at the bottom of the page.

D Scripting Language

In order to deal with operations that can become confusing on a single command line, a D script can be saved to a file and run as desired. A D script will have one or more probe clauses, which consist of one or more probe-descriptions, along with the associated predicates and actions:

#!/usr/sbin/dtrace -s
probe-description[, probe-description...]
/predicate/
{
action; [action; ...]
}

The probe-description section consists of one or more 4-tuple identifiers. If the predicate line is not present, it is the same as a predicate that is always true. The action(s) specified are to be run if the probe fires and the predicate is true.

Each recording action dumps data to a trace buffer. By default, this is the principal buffer.

Several programming examples are provided at the bottom of the page.

D Variables

D specifies both associative arrays and scalar variables. Storage for these variables is not pre-allocated. It is allocated when a non-zero value is assigned and deallocated when a zero value is assigned.

D defines several built-in variables, which are frequently used in creating predicates and actions. The most commonly used built-in variables for D are the following:

args[]: The args[] array contains the arguments, specified from 0 to the number of arguments less one. These can also be specified by argn, where this is the n+1th argument.
curpsinfo: psinfo structure of current process.
curthread: pointer to the current thread's kthread_t
execname: Current executable name
pid: Current process ID
ppid: Parent process ID
probefunc: function name of the current probe
probemod: module name of the current probe
probename: name of the current probe
timestamp: Time since boot in ns

Variable scope can be global, thread-local or clause-local.

Thread-local variables allow separate storage for each thread's copy of that variable. They are referenced with names of the form:
self->variable-name

Associative arrays can be indexed by an arbitrary name. There is no pre-defined limit on the number of elements.

Scalar variables hold a single data value.

We can also access Solaris kernel symbols by specifying them in backquotes.

Providers define arguments based on their own requirements. Some of the more useful such arguments are listed below:

Provider-Specific Variables
io
`arg[0]`	Pointer to a `bufinfo` structure.
`arg[0]->b_bcount`	Byte count.
`arg[0]->b_resid`	Bytes not transferred.
`arg[0]->b_iodone`	I/O completion routine.
`arg[0]->b_edev`	Extended device.
`arg[1]`	Pointer to a `devinfo` structure.
`arg[1]->dev_major`	Major number.
`arg[1]->dev_minor`	Minor number.
`arg[1]->dev_instance`	Instance number.
`arg[1]->dev_name`	Device name.
`arg[1]->dev_pathname`	Device pathname.
`arg[2]`	Pointer to a `fileinfo` structure.
`arg[2]->fi_name`	File name.
`arg[2]->fi_dirname`	File directory location.
`arg[2]->fi_pathname`	Full path to file.
`arg[2]->fi_offset`	Offset within a file.
`arg[2]->fi_fs`	Filesystem.
`arg[2]->fi_mount`	Filesystem mount point.
pid
`arg0-argn` (entry)	For entry probes, `arg0-argn` represent the arguments.
`arg0-arg1` (return)	For return probles, `arg0-arg1` represent the return codes.
syscall
`arg0-argn` (entry)	For entry probes, `arg0-argn` represent the arguments.
`arg0-arg1` (return)	For return probles, `arg0-arg1` represent the return codes.
sysinfo
`arg0`	Value of statistic increment
`arg1`	Pointer to the current value of the statistic before increment.
`arg2`	Pointer to the `cpu_t` structure incrementing the statistic. Defined in `sys/cpuvar.h`
vminfo
`arg0`	Value of statistic increment
`arg1`	Pointer to the current value of the statistic before increment.

The full list for each provider can be found in the provider's chapter of the Solaris Dynamic Tracing Guide.

Actions

The most commonly used built-in actions are:

breakpoint(): System stops and transfers control to kernel debugger.
chill(number-nanoseconds): DTrace spins for the specified number of nanoseconds.
copyinstr(pointer): Returns null terminated string from address space referenced by pointer.
copyout(buffer, address, number-bytes): Copies number-bytes from the buffer to a memory address.
copyoutstr(string, address, max-length): Copies a string to a memory address.
normalize(aggregation, normalization-factor): Divides aggregation values by the normalization-factor.
panic(): Panics the kernel; may be used to generate a core dump.
printf(format, arguments): Dumps the arguments to the buffer in the specified format. printa(aggregation) does the same thing for aggregation data.
raise(signal): Sends the signal to the current process.
stack(number-frames): Copies the specified number of frames of the kernel thread's stack to the buffer.
stop(): Stops the process that fired the probe.
stringof(): Converts values to DTrace string values.
system(command): Runs a program as if from the shell.
trace(D-expression): Dumps the output of D-expression to the trace buffer.
tracemem(address, size_t number-bytes): Dumps the contents from the memory address to the buffer.
trunc(aggregation): Truncates or removes the contents of the specified aggregation.
ustack(number-frames): Copies the specified number of frames of the user stack to the buffer.

Multiple actions in a probe clause can be combined using a semicolon between them inside the curly brackets.

Aggregations

Aggregating functions allow multiple data points to be combined and reported. Aggregations take the form:
@name[ keys ] = aggregating-function( arguments );
Here, the name is a name assigned to the aggregation, the keys are a comma-separated list of D expressions which index the output, the arguments are a comma-separated list and the aggregating functions may be one of the following:

avg: Average of the expressions in the arguments.
count: Number of times that the function is called.
lquantize: The arguments are a scalar expression, a lower bound, an upper bound and a step value. This function increments the value in the highest linearly-sized bucket that is less than the expression.
max: The largest value among the arguments.
min: The smallest value among the arguments.
quantize: Increments the value in the highest power of two bucket less than the expression in the argument.
sum: Total value of the expressions in the arguments.

Directives

Directives provide options to a D script that can increase readability of the output. Each option is enabled by including a line like the following at the beginning of a D script:
#pragma D option option-name

The following are the most commonly-used directives.

flowindent: Indentation increased on function entry; decreased on function return.
quiet: Don't print anything not explicitly specified.

Examples

To report on each active PID for the readch function:
dtrace -n 'readch {trace(pid)}'

To report on each system call's entry time:
dtrace -n 'syscall:::entry {trace(timestamp)}'

Tracing executable names (stored on a UFS filesystem):
dtrace -m 'ufs {trace(execname)}'

To print all functions from libc for process 123:
dtrace -n 'pid123:libc::entry'

To count the occurrences of each libc function for PID 123:

#!/usr/sbin/dtrace -s pid123:libc::entry { @function_count[probefunc]=count(); }

(Once this script is stopped with control-c, the results will be printed out.)

To identify how much time is spent in each function, we would add a probe-clause as follows. In the below, the timestmp[] array stores the timestamps for each function as it is entered; the aggregation function_duration[] stores the sum total of durations for invocations of each function. Also note that the scope of the timestmp[] array is limited to a single thread to deal with the possibility of multiple concurrent copies of the same function:

#!/usr/sbin/dtrace -s pid123:libc::entry { self->timestmp[probefunc] = timestmp; } pid123:libc::return /self->timestmp[probefunc] != 0/ { @function_duration[probefunc] = sum(timestmp - self->timestmp[probefunc]); timestmp[probefunc] = 0; }

The following script prints out the name of every program that calls the exec or exece system calls, as well as the name of the program being executed:

#!/usr/sbin/dtrace -s syscall::exec*:entry { trace(execname); trace(copyinstr(arg0)); }

While this script is straightforward, it may fail if arg0 has been written to disk as part of paging activity. (See Bennett, Part 2.) In that event, it may make more sense to use a variable to hold the pointer until return pulls any pointers back into memory. The below also uses printf to format the output and turn off default output with "quiet:"

#!/usr/sbin/dtrace -s #pragma D option quiet syscall::exec*:entry { self->prog = copyinstr(arg0); self->exn = execname; } syscall::exec*:return / self->prog != NULL / { printf ("%-20s %s\n", self->exn, self->prog); self->prog = 0; self->exn = 0; }

DTrace Toolkit

DTrace Toolkit
Script Name	Description
`anonpgpid.d`	Attempts to identify which processes are suffering the most from a system that is hard swapping
`bitesize.d`	Provides graphs of distributions of different I/O sizes
`connections`	Displays server process that accepts each inbound TCP connection.
`cputypes.d`	Reports on types of CPUs on the system.
`cpuwalk.d`	Reports on which CPUs a process runs on.
`cswstat.d`	Reports on context switches and time consumed.
`dapptrace`	Traces user and library function usage. Similar to `apptrace`, but also gets elapsed and CPU times
`dispqlen.d`	Measures dispatcher queue length (CPU saturation).
`dnlcps.d`	Measures DNLC hits and misses by process.
`dnlcsnoop.d`	Real time record of target, process and result of DNLC lookups.
`dtruss`	`truss` replacement without the performance hit.
`filebyproc.d`	Snoops files opened by process name.
`fsrw.d`	Traces I/O events at a system call level.
`hotspot.d`	Identifies disk "hot spots."
`inttimes.d`	Reports on time spent servicing interrupts for each device.
`iofile.d`	I/O wait times for each file by process.
`iofileb.d`	I/O size for each file by process.
`iopattern`	System-wide disk I/O usage patterns.
`iosnoop`	Tracks system I/O activity.
`iotop`	Displays processes with highest I/O traffic.
`lockbydist.d`	Lock distribution by process.
`lockbyproc.d`	Lock times by process.
`nfswizard.d`	Identifies top NFS filename requests; reports on access and performance statistics.
`opensnoop`	Snoops open files.
`pfilestat`	I/O statistics for each file descriptor in a process.
`priclass.d`	Reports distribution of thread priorites by class.
`pridist.d`	Reports distribution of thread priorities by class.
`procsystime`	Process system call details; elapsed time, CPU time, counts, etc.
`rfileio.d`	Read size statistics from file systems and physical disks, along with a total miss rate for the file system cache.
`rfsio`	Read size statistics from file systems and physical disks, along with a total miss rate for the file system cache.
`rwsnoop`	Captures read/write activity, including identifying the source processes.
`rwtop`	Displays processes with top read/write activity.
`sampleproc`	Reports which process is on which CPU how much of the time.
`seeksize.d`	Directly measure seek lengths of I/Os.
`swapinfo.d`	Reports a summary of virtual memory use.
`tcpstat.d`	Reports TCP error and traffic statistics.
`tcpsnoop.d`	Snoops TCP packets and associates them with a port and a process.
`tcptop`	Displays to TCP packet-generating processes.
`threaded.d`	Measures effectiveness of thread utilization.
`tcpsyscall`	Reports on busiest system calls.
`udpstat.d`	Reports UDP error and traffic statistics.
`vopstat`	Function-level timings of I/Os.
`xcallsbypid.d`	Provides by-process cross-call information.
`zvmstat`	Zone-specific vmstat.

Additional Resources

Bennett, Chip. Learning DTrace-Parts 1-5. SysAdmin Magazine, Sep 2006-Jan 2007. Also at Laurus.
Gregg, Brendan. DTrace Tools web page.
McDougall, Richard; Mauro, Jim; Gregg, Brendan. Solaris Performance and Tools. October, 2006. Prentice Hall.
DTrace User Guide. May, 2006. Sun Microsystems.
Solaris Dynamic Tracing Guide. 2005. Sun Microsystems.
Using DTrace from a Solaris 10 System. Sun Microsystems.

Thursday, May 09, 2013

iostat

iostat is the front-line command for examining disk performance issues. dtrace allows a more detailed examination of I/O operations.

As with most of the monitoring commands, the first line of iostat reflects a summary of statistics since boot time. To look at meaningful real-time data, run iostat with a time step (eg iostat 30) and look at the lines that report summaries over the time step intervals.

For Solaris 2.6 and later, use iostat -xPnce 30 to get information including the common device names of the disk partitions, CPU statistics, error statistics, and extended disk statistics.

For Solaris 2.5.1 and earlier, or for more compact output, use iostat -xc 30 to get the extended disk and CPU statistics.

In either case, the information reported is:

disk: Disk device name.
r/s, w/s: Average reads/writes per second.
Kr/s, Kw/s: Average Kb read/written per second.
wait: Time spent by a process while waiting for block

Notes on Odd Behavior

actv: Number of active requests in the hardware queue.
%w: Occupancy of the wait queue.
%b: Occupancy of the active queue with the device busy.
svc_t: Service time (ms). Includes everything: wait time, active queue time, seek rotation, transfer time.
us/sy: User/system CPU time (%).
wt: Wait for I/O (%).
id: Idle time (%).

Notes on Odd Behavior

The "wait" time reported by iostat refers to time spent by a process while waiting for block device (such as disk) I/O to finish. In Solaris 2.6 and earlier, the calculation algorithm sometimes overstates the problem on multi-processor machines, since it does not take into account that an I/O wait on one CPU does not mean that I/O is blocked for processes on the other CPUs. Solaris 7 has corrected this problem.

iostat also sometimes reports excessive svc_t (service time) readings for disks that are very inactive. This is due to the action of fsflush keeping the data in memory and on the disk up-to-date. Since many writes are specified over a very short period of time to random parts of the disk, a queue forms briefly, and the average service time goes up. svc_t should only be taken seriously on a disk that is showing 5% or more activity.

Wednesday, May 08, 2013

truss

truss traces library and system calls and signal activity for a given process. This can be very useful in seein where a program is choking.

In order to be able to use truss output, it is important to understand some basic vocabulary:

brk() requests memory during execution.
exec() opens a program.
fcntl() performs control functions on open files.
fstat() obtains information about open files.
getdents64() reads directory information.
ioctl() performs terminal I/O.
lstat() obtains file attributes.
mmap() maps the program image into memory.
open() opens a file. It returns a number which is referenced when the file is used.
write() writes to an open file/device.

Obviously, this is only a subset of the system calls seen in truss output, but this subset is usually sufficient to figure out what is going on. Other system calls can be looked up in the vendor web pages or man pages.

In most cases we will be looking for error messages in the truss output. These entries will contain the string "Err" in the last column of the output.

These errors can be categorized as system call errors or missing file errors. Many missing file errors are the result of a library not being in a directory early in the LD_LIBRARY_PATH or something of the sort. If the truss output shows a successful open() (or whatever) of the file name later in the process, that is probably not your culprit.

System call errors can be interpreted by looking at the man page of the specific system call or examining the file /usr/include/sys/errno.h.

Sometimes you need more information than truss offers. Dtrace offers a finer-grained method for examining every aspect of a process's functioning.

Tuesday, May 07, 2013

Sun Cluster

Introduction

Sun Cluster 3.2 has the following features and limitations:

Support for 2-16 nodes.
Global device capability--devices can be shared across the cluster.
Global file system --allows a file system to be accessed simultaneously by all cluster nodes.
Tight implementation with Solaris--The cluster framework services have been implemented in the kernel.
Application agent support.
Tight integration with zones.
Each node must run the same revision and update of the Solaris OS.
Two node clusters must have at least one quorum device.
Each cluster needs at least two separate private networks. (Supported hardware, such as ce and bge may use tagged VLANs to run private and public networks on the same physical connection.)
Each node's boot disk should include a 500M partition mounted at /globaldevices prior to cluster installation. At least 750M of swap is also required.
Attached storage must be multiply connected to the nodes.
ZFS is a supported file system and volume manager. Veritas Volume Manager (VxVM) and Solaris Volume Manager (SVM) are also supported volume managers.
Veritas multipathing (vxdmp) is not supported. Since vxdmp must be enabled for current VxVM versions, it must be used in conjunction with mpxio or another similar solution like EMC's Powerpath.
SMF services can be integrated into the cluster, and all framework daemons are defined as SMF services
PCI and SBus based systems cannot be mixed in the same cluster.
Boot devices cannot be on a disk that is shared with other cluster nodes. Doing this may lead to a locked-up cluster due to data fencing.

The overall health of the cluster may be monitored using the cluster status or scstat -v commands. Other useful options include:

scstat -g: Resource group status
scstat -D: Device group status
scstat -W: Heartbeat status
scstat -i: IPMP status
scstat -n: Node status

Failover applications (also known as "cluster-unaware" applications in the Sun Cluster documentation) are controlled by rgmd (the resource group manager daemon). Each application has a data service agent, which is the way that the cluster controls application startups, shutdowns, and monitoring. Each application is typically paired with an IP address, which will follow the application to the new node when a failover occurs.

"Scalable" applications are able to run on several nodes concurrently. The clustering software provides load balancing and makes a single service IP address available for outside entities to query the application.

"Cluster aware" applications take this one step further, and have cluster awareness programmed into the application. Oracle RAC is a good example of such an application.

All the nodes in the cluster may be shut down with cluster shutdown -y -g0. To boot a node outside of the cluster (for troubleshooting or recovery operations, run boot -x

clsetup is a menu-based utility that can be used to perform a broad variety of configuration tasks, including configuration of resources and resource groups.

Cluster Configuration

The cluster's configuration information is stored in global files known as the "cluster configuration repository" (CCR). The cluster framework files in /etc/cluster/ccr should not be edited manually; they should be managed via the administrative commands.

The cluster show command displays the cluster configuration in a nicely-formatted report.

The CCR contains:

Names of the cluster and the nodes.
The configuration of the cluster transport.
Device group configuration.
Nodes that can master each device group.
NAS device information (if relevant).
Data service parameter values and callback method paths.
Disk ID (DID) configuration.
Cluster status.

Some commands to directly maintain the CCR are:

ccradm: Allows (among other things) a checksum re-configuration of files in /etc/cluster/ccr after manual edits. (Do NOT edit these files manually unless there is no other option. Even then, back up the original files.) ccradm -i /etc/cluster/ccr/filename -o
scgdefs: Brings new devices under cluster control after they have been discovered by devfsadm.

The scinstall and clsetup commands may

We have observed that the installation process may disrupt a previously installed NTP configuration (even though the installation notes promise that this will not happen). It may be worth using ntpq to verify that NTP is still working properly after a cluster installation.

Resource Groups

Resource groups are collections of resources, including data services. Examples of resources include disk sets, virtual IP addresses, or server processes like httpd.

Resource groups may either be failover or scalable resource groups. Failover resource groups allow groups of services to be started on a node together if the active node fails. Scalable resource groups run on several nodes at once.

The rgmd is the Resource Group Management Daemon. It is responsible for monitoring, stopping, and starting the resources within the different resource groups.

Some common resource types are:

SUNW.LogicalHostname: Logical IP address associated with a failover service.
SUNW.SharedAddress: Logical IP address shared between nodes running a scalable resource group.
SUNW.HAStoragePlus: Manages global raw devices, global file systems, non-ZFS failover file systems, and failover ZFS zpools.

Resource groups also handle resource and resource group dependencies. Sun Cluster allows services to start or stop in a particular order. Dependencies are a particular type of resource property. The r_properties man page contains a list of resource properties and their meanings. The rg_properties man page has similar information for resource groups. In particular, the Resource_dependencies property specifies something on which the resource is dependent.

Some resource group cluster commands are:

clrt register resource-type: Register a resource type.
clrt register -n node1name,node2name resource-type: Register a resource type to specific nodes.
clrt unregister resource-type: Unregister a resource type.
clrt list -v: List all resource types and their associated node lists.
clrt show resource-type: Display all information for a resource type.
clrg create -n node1name,node2name rgname: Create a resource group.
clrg delete rgname: Delete a resource group.
clrg set -p property-name rgname: Set a property.
clrg show -v rgname: Show resource group information.
clrs create -t HAStoragePlus -g rgname -p AffinityOn=true -p FilesystemMountPoints=/mountpoint resource-name
clrg online -M rgname
clrg switch -M -n nodename rgname
clrg offline rgname: Offline the resource, but leave it in a managed state.
clrg restart rgname
clrs disable resource-name: Disable a resource and its fault monitor.
clrs enable resource-name: Re-enable a resource and its fault monitor.
clrs clear -n nodename -f STOP_FAILED resource-name
clrs unmonitor resource-name: Disable the fault monitor, but leave resource running.
clrs monitor resource-name: Re-enable the fault monitor for a resource that is currently enabled.
clrg suspend rgname: Preserves online status of group, but does not continue monitoring.
clrg resume rgname: Resumes monitoring of a suspended group
clrg status: List status of resource groups.
clrs status -g rgname

Data Services

A data service agent is a set of components that allow a data service to be monitored and fail over within the cluster. The agent includes methods for starting, stopping, monitoring, or failing the data service. It also includes a registration information file allowing the CCR to store the information about these methods in the CCR. This information is encapsulated as a resource type.

The fault monitors for a data sevice place the daemons under the control of the process monitoring facility (rpc.pmfd), and the service, using client commands.

Public Network

The public network uses pnmd (Public Network Management Daemon) and the IPMP in.mpathd daemon to monitor and control the public network addresses.

IPMP should be used to provide failovers for the public network paths. The health of the IPMP elements can be monitored with scstat -i

The clrslh and clrssa commands are used to configure logical and shared hostnames, respectively.

clrslh create -g rgname logical-hostname

Private Network

The "private," or "cluster transport" network is used to provide a heartbeat between the nodes so that they can determine which nodes are available. The cluster transport network is also used for traffic related to global devices.

While a 2-node cluster may use crossover cables to construct a private network, switches should be used for anything more than two nodes. (Ideally, separate switching equipment should be used for each path so that there is no single point of failure.)

The default base IP address is 172.16.0.0, and private networks are assigned subnets based on the results of the cluster setup.

Available network interfaces can be identified by using a combination of dladm show-dev and ifconfig.

Private networks should be installed and configured using the scinstall command during cluster configuration. Make sure that the interfaces in question are connected, but down and unplumbed before configuration. The clsetup command also has menu options to guide you through the private network setup process.

Alternatively, something like the following command string can be used to establish a private network:

clintr add nodename1:ifname
clintr add nodename2:ifname2
clintr add switchname
clintr add nodename1:ifname1,switchname
clintr add nodename2:ifname2,switchname
clintr status

The health of the heartbeat networks can be checked with the scstat -W command. The physical paths may be checked with clintr status or cluster status -t intr.

Quorum

Sun Cluster uses a quorum voting system to prevent split-brain and cluster amnesia. The Sun Cluster documentation refers to "failure fencing" as the mechanism to prevent split-brain (where two nodes run the same service at the same time, leading to potential data corruption).

"Amnesia" occurs when a change is made to the cluster while a node is down, then that node attempts to bring up the cluster. This can result in the changes being forgotten, hence the use of the word "amnesia."

One result of this is that the last node to leave a cluster when it is shut down must be the first node to re-enter the cluster. Later in this section, we will discuss ways of circumventing this protection.

Quorum voting is defined by allowing each device one vote. A quorum device may be a cluster node, a specified external server running quorum software, or a disk or NAS device. A majority of all defined quorum votes is required in order to form a cluster. At least half of the quorum votes must be present in order for cluster services to remain in operation. (If a node cannot contact at least half of the quorum votes, it will panic. During the reboot, if a majority cannot be contacted, the boot process will be frozen. Nodes that are removed from the cluster due to a quorum problem also lose access to any shared file systems. This is called "data fencing" in the Sun Cluster documentation.)

Quorum devices must be available to at least two nodes in the cluster.
Disk quorum devices may also contain user data. (Note that if a ZFS disk is used as a quorum device, it should be brought into the zpool before being specified as a quorum device.)
Sun recommends configuring n-1 quorum devices (the number of nodes minus 1). Two node clusters must contain at least one quorum device.
Disk quorum devices must be specified using the DID names.
Quorum disk devices should be at least as available as the storage underlying the cluster resource groups.

Quorum status and configuration may be investigating using:

scstat -q
clq status

These commands report on the configured quorum votes, whether they are present, and how many are required for a majority.

Quorum devices can be manipulated through the following commands:

clq add did-device-name
clq remove did-device-name: (Only removes the device from the quorum configuration. No data on the device is affected.)
clq enable did-device-name
clq disable did-device-name: (Removes the quorum device from the total list of available quorum votes. This might be valuable if the device is down for maintenance.)
clq reset: (Resets the configuration to the default.)

By default, doubly-connected disk quorum devices use SCSI-2 locking. Devices connected to more than two nodes use SCSI-3 locking. SCSI-3 offers persistent reservations, but SCSI-2 requires the use of emulation software. The emulation software uses a 64-bit reservation key written to a private area on the disk.

In either case, the cluster node that wins a race to the quorum device attempts to remove the keys of any node that it is unable to contact, which cuts that node off from the quorum device. As noted before, any group of nodes that cannot communicate with at least half of the quorum devices will panic, which prevents a cluster partition (split-brain).

In order to add nodes to a 2-node cluster, it may be necessary to change the default fencing with scdidadm -G prefer3 or cluster set -p global_fencing=prefer3, create a SCSI-3 quorum device with clq add, then remove the SCSI-2 quorum device with clq remove.

NetApp filers and systems running the scqsd daemon may also be selected as quorum devices. NetApp filers use SCSI-3 locking over the iSCSI protocol to perform their quorum functions.

The claccess deny-all command may be used to deny all other nodes access to the cluster. claccess allow nodename re-enables access for a node.

Purging Quorum Keys

CAUTION: Purging the keys from a quorum device may result in amnesia. It should only be done after careful diagnostics have been done to verify why the cluster is not coming up. This should never be done as long as the cluster is able to come up. It may need to be done if the last node to leave the cluster is unable to boot, leaving everyone else fenced out. In that case, boot one of the other nodes to single-user mode, identify the quorum device, and:

For SCSI 2 disk reservations, the relevant command is pgre, which is located in /usr/cluster/lib/sc:

pgre -c pgre_inkeys -d /dev/did/rdks/d#s2 (List the keys in the quorum device.)
pgre -c pgre_scrub -d /dev/did/rdks/d#s2 (Remove the keys from the quorum device.)

Similarly, for SCSI 3 disk reservations, the relevant command is scsi:

scsi -c inkeys -d /dev/did/rdks/d#s2 (List the keys in the quorum device.)
scsi -c scrub -d /dev/did/rdks/d#s2 (Remove the keys from the quorum device.)

Global Storage

Sun Cluster provides a unique global device name for every disk, CD, and tape drive in the cluster. The format of these global device names is /dev/did/device-type. (eg /dev/did/dsk/d2s3) (Note that the DIDs are a global naming system, which is separate from the global device or global file system functionality.)

DIDs are componentsof SVM volumes, though VxVM does not recognize DID device names as components of VxVM volumes.

DID disk devices, CD-ROM drives, tape drives, SVM volumes, and VxVM volumes may be used as global devices. A global device is physically accessed by just one node at a time, but all other nodes may access the device by communicating across the global transport network.

The file systems in /global/.devices store the device files for global devices on each node. These are mounted on mount points of the form /global/.devices/node@nodeid, where nodeid is the identification number assigned to the node. These are visible on all nodes. Symbolic links may be set up to the contents of these file systems, if they are desired. Sun Cluster sets up some such links in the /dev/global directory.

Global file systems may be ufs, VxFS, or hsfs. To mount a file system as a global file system, add a "global" mount option to the file system's vfstab entry and remount. Alternatively, run a mount -o global... command.

(Note that all nodes in the cluster should have the same vfstab entry for all cluster file systems. This is true for both global and failover file systems, though ZFS file systems do not use the vfstab at all.)

In the Sun Cluster documentation, global file systems are also known as "cluster file systems" or "proxy file systems."

Note that global file systems are different from failover file systems. The former are accessible from all nodes; the latter are only accessible from the active node.

Maintaining Devices

New devices need to be read into the cluster configuration as well as the OS. As usual, we should run something like devfsadm or drvconfig; disks to create the /device and /dev links across the cluster. Then we use the scgdevs or scdidadm command to add more disk devices to the cluster configuration.

Some useful options for scdidadm are:

scdidadm -l: Show local DIDs
scdidadm -L: Show all cluster DIDs
scdidadm -r: Rebuild DIDs

We should also clean up unused links from time to time with devfsadm -C and scdidadm -C

The status of device groups can be checked with scstat -D. Devices may be listed with cldev list -v. They can be switched to a different node via a cldg switch -n target-node dgname command.

Monitoring for devices can be enabled and disabled by using commands like:

cldev monitor all
cldev unmonitor d#
cldev unmonitor -n nodename d#
cldev status -s Unmonitored

Parameters may be set on device groups using the cldg set command, for example:

cldg set -p failback=false dgname

A device group can be taken offline or placed online with:

cldg offline dgname
cldg online dgname

VxVM-Specific Issues

Since vxdmp cannot be disabled, we need to make sure that VxVM can only see one path to each disk. This is usually done by implementing mpxio or a third party product like Powerpath. The order of installation for such an environment would be:

Install Solaris and patches.
Install and configure multipathing software.
Install and configure Sun Cluster.
Install and configure VxVM

If VxVM disk groups are used by the cluster, all nodes attached to the shared storage must have VxVM installed. Each vxio number in /etc/name_to_major must also be the same on each node. This can be checked (and fixed, if necessary) with the clvxvm initialize command. (A reboot may be necessary if the /etc/name_to_major file is changed.)

The clvxvm encapsulate command should be used if the boot drive is encapsulated (and mirrored) by VxVM. That way the /global/.devices information is set up properly.

The clsetup "Device Groups" menu contains items to register a VxVM disk group, unregister a device group, or synchronize volume information for a disk group. We can also re-synchronize with the cldg sync dgname command.

Solaris Volume Manager-Specific Issues

Sun Cluster allows us to add metadb or partition information in the /dev/did format or in the usual format. In general:

Use local format for boot drive mirroring in case we need to boot outside the cluster framework.
Use cluster format for shared disksets because otherwise we will need to assume the same controller numbers on each node.

Configuration information is kept in the metadatabase replicas. At least three local replicas are required to boot a node; these should be put on their own partitions on the local disks. They should be spread across controllers and disks to the degree possible. Multiple replicas may be placed on each partition; they should be spread out so that if any one disk fails, there will still be at least three replicas left over, constituting at least half of the total local replicas.

When disks are added to a shared diskset, database replicas are automatically added. These will always be added to slice 7, where they need to remain. If a disk containing replicas is removed, the replicas must be removed using metadb.

If fewer than 50% of the replicas in a diskset are available, the diskset ceases to operate. If exactly 50% of the replicas are available, the diskset will continue to operate, but will not be able to be enabled or switched on another node.

A mediator can be assigned to a shared diskset. The mediator data is contained within a Solaris process on each node and counts for two votes in the diskset quorum voting.

Standard c#t#d#s# naming should be used when creating local metadb replicas, since it will make recovery easier if we need to boot the node outside of a cluster context. On the other hand, /dev/did/rdsk/d#s# naming should be used for shared disksets, since otherwise the paths will need to be identical on all nodes.

Creating a new shared diskset involves the following steps:
(Create an empty diskset.)
metaset -s set-name -a -h node1-name node2-name
(Create a mediator.)
metaset -s set-name -a -m node1-name node2-name
(Add disks to the diskset.)
metaset -s set-name -a /dev/did/rdsk/d# /dev/did/rdsk/d#
(Check that the diskset is present in the cluster configuration.)
cldev list -v
cldg status
cldg show set-name

ZFS-Specific Issues

ZFS is only available as a Sun Cluster failover file system, not as a global file system. No vfstab entries are required, since that information is contained in the zpools. No synchronization commands are required like in VxVM; Sun Cluster takes care of the synchronization automatically.

Zones

Non-global zones may be treated as virtual nodes. Keep in mind that some services, such as NFS, will not run in non-global zones.

Services can be failed over between zones, even zones on the same server. Where possible, it is best to use full rather than sparse zones. Certain types of failures within the non-global zone can cause a crash in the global zone.

Configuration of cluster resources and resource groups must be performed in the global zone. The rgmd runs in the global zone.

To specify a non-global zone as a node, use the form
nodename:zonename
or specify
-n nodename -z zonename

Additional Reading

Sun Cluster Command Cheat Sheet