ZFS was first publicly released in the 6/2006 distribution of
Solaris 10. Previous versions of Solaris 10 did not include ZFS.
ZFS is flexible, scalable and reliable. It is a POSIX-compliant
filesystem with several important features:
No separate filesystem creation step is required. The mount of the
filesystem is automatic and does not require vfstab maintenance.
Mounts are controlled via the mountpoint attribute of each file system.
Pool Management
Members of a storage pool may either be hard drives
or slices of at least 128MB in size.
To create a mirrored pool:
zpool create -f pool-name mirror c#t#d# c#t#d#
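For example, a two-way mirror built from two disks would be created as follows (the pool name and device names here are purely illustrative):
zpool create -f datapool mirror c1t0d0 c1t1d0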
To check a pool's status, run:
zpool status -v pool-name
To list existing pools:
zpool list
To remove a pool and free its resources:
zpool destroy pool-name
A destroyed pool can sometimes be recovered as follows:
zpool import -D
Additional disks can be added to an existing pool. When this happens in a mirrored or RAID-Z pool, the pool is resilvered to redistribute the data.
To add storage to an existing mirrored pool:
zpool add -f pool-name mirror c#t#d# c#t#d#
Pools can be exported and imported to transfer them between hosts.
zpool export pool-name
zpool import pool-name
Without a specified pool, the import
command lists available pools.
zpool import
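As a sketch, moving a hypothetical pool named datapool to another host would look like this:
zpool export datapool
(Move or re-cable the disks so the new host can see them.)
zpool import datapool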
To clear a pool's error count, run:
zpool clear pool-name
Although virtual volumes (such as those from DiskSuite or VxVM) can be used as base devices, this is not recommended for performance reasons.
Filesystem Management
Similar filesystems should be grouped together
in hierarchies to make management easier. Naming
schemes should be thought out as well to make
it easier to group administrative commands for
similarly managed filesystems.
When a new pool is created, a new filesystem is
mounted at /pool-name.
To create another filesystem:
zfs create pool-name/fs-name
To delete a filesystem:
zfs destroy filesystem-name
To rename a ZFS filesystem:
zfs rename old-name new-name
Properties are set via the zfs set command.
To turn on compression:
zfs set compression=on pool-name/filesystem-name
To share the filesystem via NFS:
zfs set sharenfs=on pool-name/fs-name
zfs set sharenfs="mount-options" pool-name/fs-name
Rather than editing /etc/vfstab, set the mount point directly:
zfs set mountpoint=mountpoint-name pool-name/filesystem-name
Quotas are also set via the same command:
zfs set quota=#gigG pool-name/filesystem-name
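As an illustration, assuming a hypothetical filesystem datapool/home, the property commands above could be applied as:
zfs set compression=on datapool/home
zfs set sharenfs=on datapool/home
zfs set quota=10G datapool/home
zfs set mountpoint=/export/home datapool/home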
RAID Levels
ZFS filesystems automatically stripe across all
top-level disk devices. (Mirrors and RAID-Z
devices are considered to be top-level devices.)
It is not recommended that RAID types be mixed in a pool. (zpool tries to prevent this, but it can be forced with the -f flag.)
The following RAID levels are supported:
- RAID-0 (striping)
- RAID-1 (mirror)
- RAID-Z (similar to RAID 5, but with variable-width
stripes to avoid the RAID 5 write hole)
- RAID-Z2
The zfs man page recommends 3-9 disks for RAID-Z
pools.
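For example, a single RAID-Z vdev within the recommended 3-9 disk range might be created as follows (pool and disk names are hypothetical):
zpool create -f datapool raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0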
Performance Monitoring
ZFS performance management is handled differently
than with older generation file systems. In ZFS,
I/Os are scheduled similarly to
how jobs are scheduled on CPUs.
The ZFS I/O scheduler tracks a priority and a deadline for
each I/O. Within each deadline group, the I/Os are scheduled
in order of logical block address.
Writes are assigned lower priorities than reads,
which can help to avoid traffic jams where reads
are unable to be serviced because they are queued
behind writes. (If a read is issued for a write
that is still underway, the read will be executed
against the in-memory image and will not hit the
hard drive.)
In addition to scheduling, ZFS attempts to intelligently
prefetch information into memory. The algorithm tries
to pick information that is likely to be needed.
Any forward or backward linear access patterns are
picked up and used to perform the prefetch.
The zpool iostat command can monitor performance on ZFS objects:
- USED CAPACITY: Data currently stored
- AVAILABLE CAPACITY: Space available
- READ OPERATIONS: Number of operations
- WRITE OPERATIONS: Number of operations
- READ BANDWIDTH: Bandwidth of all read operations
- WRITE BANDWIDTH: Bandwidth of all
write operations
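For example, per-vdev statistics for a hypothetical pool can be sampled every 5 seconds with:
zpool iostat -v datapool 5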
The health of an object can be monitored with zpool status.
Snapshots and Clones
To create a snapshot:
zfs snapshot pool-name/filesystem-name@snapshot-name
To clone a snapshot:
zfs clone snapshot-name filesystem-name
To roll back to a snapshot:
zfs rollback pool-name/filesystem-name@snapshot-name
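A minimal worked example, using a hypothetical filesystem datapool/home:
zfs snapshot datapool/home@monday
zfs clone datapool/home@monday datapool/home-dev
zfs rollback datapool/home@monday
(The rollback assumes monday is the most recent snapshot of the filesystem.)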
zfs send and zfs receive allow clones of filesystems to be sent to a development environment.
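As a sketch, a snapshot of a hypothetical filesystem could be copied to a development host like this (host and pool names are illustrative only):
zfs send datapool/home@monday | ssh devhost zfs receive devpool/home-copy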
The difference between a snapshot and a clone is that a
clone is a writable, mountable copy of the file system.
This capability allows us to store multiple copies of
mostly-shared data in a very space-efficient way.
Each snapshot is accessible through the .zfs/snapshot directory in the /pool-name directory. This can allow end users to recover their files without system administrator intervention.
Zones
If the filesystem is created in the global zone and added to the local zone via zonecfg, it may be assigned to more than one zone unless the mountpoint is set to legacy.
zfs set mountpoint=legacy pool-name/filesystem-name
To import a ZFS filesystem within a zone:
zonecfg -z zone-name
add fs
set dir=mount-point
set special=pool-name/filesystem-name
set type=zfs
end
verify
commit
exit
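Filled in with hypothetical values (a zone named webzone and a filesystem datapool/webdata mounted at /export/data), the session would look like:
zonecfg -z webzone
add fs
set dir=/export/data
set special=datapool/webdata
set type=zfs
end
verify
commit
exit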
Administrative rights for a filesystem can be granted to a local zone:
zonecfg -z zone-name
add dataset
set name=pool-name/filesystem-name
end
commit
exit
Data Protection
ZFS is a transactional file system.
Data consistency is protected via
Copy-On-Write (COW). For each write request, a copy is
made of the specified block. All changes are made to
the copy. When the write is complete, all pointers are
changed to point to the new block.
Checksums are used to validate data during reads and writes.
The checksum algorithm is user-selectable. Checksumming and data recovery are done at the filesystem level; they are not visible to applications. If a block becomes corrupted on a pool protected by mirroring or RAID, ZFS will identify the correct data value and fix the corrupted value. RAID protection is also part of ZFS.
Scrubbing is an additional type of data protection available on ZFS. This is a mechanism that performs regular validation of all data. Manual scrubbing can be performed by:
zpool scrub pool-name
The results can be viewed via:
zpool status
Any issues should be cleared with:
zpool clear pool-name
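For a hypothetical pool named datapool, the full sequence would be:
zpool scrub datapool
zpool status -v datapool
zpool clear datapool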
The scrubbing operation walks through the pool
metadata to read each copy of each block. Each copy
is validated against its checksum and corrected if
it has become corrupted.
Hardware Maintenance
To replace a hard drive with another device, run:
zpool replace pool-name old-disk new-disk
To offline a failing drive, run:
zpool offline pool-name disk-name
(A -t flag allows the disk to come back online after a reboot.)
Once the drive has been physically replaced, run the replace command against the device:
zpool replace pool-name device-name
After an offlined drive has been replaced, it can be brought back online:
zpool online pool-name disk-name
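Putting the steps together for a hypothetical failing disk c1t3d0 in a pool named datapool:
zpool offline -t datapool c1t3d0
(Physically swap the drive.)
zpool replace datapool c1t3d0
zpool online datapool c1t3d0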
Firmware upgrades may cause the disk device ID to change.
ZFS should be able to update the device ID automatically,
assuming that the disk was not physically moved during the update.
If necessary, the pool can be exported and re-imported to
update the device IDs.
Troubleshooting ZFS
The three categories of errors experienced by ZFS are:
- missing devices: Missing devices are placed in a "faulted" state.
- damaged devices: Caused by things like transient errors from the disk or controller, driver bugs, or accidental overwrites (usually on misconfigured devices).
- data corruption: Data damage to top-level devices; usually requires a restore. Since ZFS is transactional, this only happens as a result of driver bugs, hardware failure, or filesystem misconfiguration.
It is important to check for all three categories of errors.
One type of problem is often connected to a problem from a different
family. Fixing a single problem is usually not sufficient.
Data integrity can be checked by running a manual scrubbing:
zpool scrub pool-name
zpool status -v pool-name checks the status after the scrubbing is complete.
The status command also reports on recovery suggestions for any errors it finds. These are reported in the action section.
To diagnose a problem, use the output of the status command and the fmd messages in /var/adm/messages.
The config section of the status output reports the state of each device. The state can be:
- ONLINE: Normal
- FAULTED: Missing, damaged, or mis-seated device
- DEGRADED: Device being resilvered
- UNAVAILABLE: Device cannot be opened
- OFFLINE: Administrative action
The status command also reports READ, WRITE, or CHKSUM errors.
To check if any problem pools exist, use:
zpool status -x
This command only reports problem pools.
If a ZFS configuration becomes damaged, it can be fixed by running export and import.
Devices can fail for any of several reasons:
- "Bit rot:" Corruption caused by random environmental
effects.
- Misdirected Reads/Writes: Firmware or hardware
faults cause reads or writes to be addressed to the
wrong part of the disk.
- Administrative Error
- Intermittent, Sporadic or Temporary Outages: Caused
by flaky hardware or administrator error.
- Device Offline: Usually caused by administrative
action.
Once the problems have been fixed, transient errors should be cleared:
zpool clear pool-name
In the event of a panic-reboot loop caused by a
ZFS software bug, the system can be instructed to
boot without the ZFS filesystems:
boot -m milestone=none
When the system is up, remount / as rw and remove the file /etc/zfs/zpool.cache.
The remainder of the boot can proceed with the svcadm milestone all command. At that point, import the good pools. The damaged pools may need to be re-initialized.
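A sketch of that recovery sequence, assuming a hypothetical good pool named datapool:
mount -o rw,remount /
rm /etc/zfs/zpool.cache
svcadm milestone all
zpool import datapool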
Scalability
The filesystem is 128-bit; 256 quadrillion zettabytes of information is addressable. Directories can have up to 256 trillion entries. No limit exists on the number of filesystems or files within a filesystem.
ZFS Recommendations
Because ZFS uses kernel addressable memory, we need to
make sure to allow enough system resources to take advantage
of its capabilities. We should run on a system with a
64-bit kernel, at least 1GB of physical memory, and adequate
swap space.
While slices are supported for creating storage pools, their
performance will not be adequate for production uses.
Mirrored configurations should be set up across multiple
controllers where possible to maximize performance and
redundancy.
Scrubbing should be
scheduled on a regular basis to identify problems before they
become serious.
When workloads have distinct latency or other requirements, it makes sense to separate them onto different pools with distinct hard drives. For example, database log files should be on separate pools from the data files.
Root pools are not yet supported in the Solaris 10 6/2006 release, though they are anticipated in a future release. When they become available, it is best to keep the root filesystem in a pool separate from the other filesystems.
On filesystems with many file creations and deletions,
utilization should be kept under 80% to protect performance.
The recordsize parameter can be tuned on ZFS filesystems. When it is changed, it only affects new files. zfs set recordsize=size tuning can help where large files (like database files) are accessed via small, random reads and writes. The default is 128KB; it can be set to any power of two between 512B and 128KB. Where the database uses a fixed block or record size, the recordsize should be set to match. This should only be done for the filesystems actually containing heavily-used database files.
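For example, a hypothetical filesystem holding database files with an 8KB block size could be tuned and verified with:
zfs set recordsize=8K datapool/oracle
zfs get recordsize datapool/oracle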
In general, recordsize should be reduced when iostat regularly shows a throughput near
the maximum for the I/O channel. As with any tuning,
make a minimal change to a working system, monitor it
for long enough to understand the impact of the change,
and repeat the process if the improvement was not good
enough or reverse it if the effects were bad.
The ZFS Evil Tuning Guide contains a number of tuning methods that may or may not be appropriate to a particular installation. As the document suggests, these tuning mechanisms will have to be used carefully, since they are not appropriate to all installations. For example, the Evil Tuning Guide provides instructions for the following (a sample /etc/system fragment appears after the list):
- Turning off file system checksums to reduce CPU usage. This is done on a per-filesystem basis:
zfs set checksum=off filesystem
zfs set checksum='on | fletcher2 | fletcher4 | sha256' filesystem
- Limiting the ARC size by setting set zfs:zfs_arc_max in /etc/system on 8/07 and later.
- If the I/O includes multiple small reads, the file prefetch can be turned off by setting zfs:zfs_prefetch_disable on 8/07 and later.
- If the I/O channel becomes saturated, the device level prefetch can be turned off with set zfs:zfs_vdev_cache_bshift = 13 in /etc/system for 8/07 and later.
- I/O concurrency can be tuned by setting set zfs:zfs_vdev_max_pending = 10 in /etc/system in 8/07 and later.
- If storage with an NVRAM cache is used, cache flushes may be disabled with set zfs:zfs_nocacheflush = 1 in /etc/system for 11/06 and later.
- The ZIL (ZFS Intent Log) can be disabled. (WARNING: Don't do this.)
- Metadata compression can be disabled. (Read this section of the Evil Tuning Guide first; you probably do not need to do this.)
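As a sample only (the ARC cap value here is illustrative, and each parameter applies only to the releases noted above), an /etc/system fragment using some of these settings might look like:
set zfs:zfs_arc_max = 0x40000000
set zfs:zfs_prefetch_disable = 1
set zfs:zfs_nocacheflush = 1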
Sun Cluster Integration
ZFS can be used as a failover-only file system with Sun Cluster
installations.
If it is deployed on disks also used by Sun Cluster,
do not deploy it on any Sun Cluster quorum disks. (A ZFS-owned disk
may be promoted to be a quorum disk on current Sun Cluster versions,
but adding a disk to a ZFS pool may result in quorum keys being overwritten.)
ZFS Internals
Max Bruning wrote an
excellent paper on how to examine the internals of a ZFS data structure.
(Look for the article on the ZFS On-Disk Data Walk.) The structure is defined in
ZFS On-Disk Specification.
Some key structures:
- uberblock_t: The starting point when examining a ZFS file system. A 128KB array of 1KB uberblock_t structures starts at 0x20000 bytes within a vdev label. Defined in uts/common/fs/zfs/sys/uberblock_impl.h. Only one uberblock is active at a time; the active uberblock can be found with zdb -uuu zpool-name.
- blkptr_t: Locates, describes, and verifies blocks on a disk. Defined in uts/common/fs/zfs/sys/spa.h.
- dnode_phys_t: Describes an object. Defined by uts/common/fs/zfs/sys/dmu.h.
- objset_phys_t: Describes a group of objects. Defined by uts/common/fs/zfs/sys/dmu_objset.h.
- ZAP Objects: Blocks containing name/value pair attributes. ZAP stands for ZFS Attribute Processor. Defined by uts/common/fs/zfs/sys/zap_leaf.h.
- Bonus Buffer Objects:
  dsl_dir_phys_t: Contained in a DSL directory dnode_phys_t; contains the object ID for a DSL dataset dnode_phys_t.
  dsl_dataset_phys_t: Contained in a DSL dataset dnode_phys_t; contains a blkptr_t pointing indirectly at a second array of dnode_phys_t for objects within a ZFS file system.
  znode_phys_t: In the bonus buffer of dnode_phys_t structures for files and directories; contains attributes of the file or directory. Similar to a UFS inode in a ZFS context.