Solaris Troubleshooting: March 2013

Sunday, March 31, 2013

Using tip for Serial Terminal Access

The tip command can be used to allow one Unix workstation to act as a serial terminal for another Unix system. The following must be in place to allow this to work between two Sun systems:

The system to be observed should be halted. If a keyboard needs to be removed from the system, the system should be powered off. (Some older models will blow a keyboard fuse if the keyboard is removed while the system is powered up.)
The /etc/remote file on the observing machine needs to have the hardwire line pointing to the correct serial port.
- By default, the file points at port b. In this case, the line should look like:
  :dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D:
- If serial port a is to be used, change the line to look like:
  :dv=/dev/term/a:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D:
A null modem cable should be run between serial port a on the system that is under observation and the serial port configured in the /etc/remote file's hardwire line on the observing system. (A null modem cable interchanges wires 2 and 3 on one end.)
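Before cabling things up, it can help to confirm which port and baud rate the hardwire line actually specifies. The following is a minimal Python sketch that splits a capability line like the ones above into its fields. The function name parse_remote_entry is my own, and the parsing assumes simple colon-delimited fields with no escaped colons, which may not hold for every /etc/remote entry.

```python
def parse_remote_entry(line):
    """Split a colon-delimited /etc/remote capability line into a dict.

    Assumes no escaped colons inside values (an assumption of this sketch).
    """
    caps = {}
    for field in line.strip().strip(":").split(":"):
        if not field:
            continue
        if "=" in field:
            # String capability, e.g. dv=/dev/term/b
            key, value = field.split("=", 1)
            caps[key] = value
        elif "#" in field:
            # Numeric capability, e.g. br#9600
            key, value = field.split("#", 1)
            caps[key] = int(value)
        else:
            # Boolean capability (present with no value)
            caps[field] = True
    return caps


entry = ":dv=/dev/term/b:br#9600:el=^C^S^Q^U^D:ie=%$:oe=^D:"
caps = parse_remote_entry(entry)
print(caps["dv"], caps["br"])
```

The dv field is the device that needs to match the serial port the null modem cable is plugged into on the observing system.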

On the observer system type "tip hardwire" in a window. (It is best to use a windowed environment so that control of the system can be regained in case of a session hang.) A "connected" message should be echoed to the window. If not, use admintool or another utility to see if the serial port is already in use.

A tip session should not be closed by killing the process or the shell, or by rebooting the observer machine. In these cases a /var/spool/locks/LCK file may not be cleaned up properly, which may prevent further tip sessions.
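If tip refuses to connect after an unclean exit, a quick look at the lock directory shows whether a lock file was left behind. This is a rough Python sketch that only lists candidates. The helper name find_lock_files is hypothetical, and it makes no attempt to decide whether a lock is stale, so check that no tip session is still running before removing anything.

```python
import glob
import os


def find_lock_files(lock_dir="/var/spool/locks"):
    """Return the LCK* files found in lock_dir, sorted by name."""
    return sorted(glob.glob(os.path.join(lock_dir, "LCK*")))


for path in find_lock_files():
    print(path)
```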

Some common tip commands are:

~. (end session)
~# (break--same as STOP-A)
~? (list all tip commands)

(Other commands may be found on the tip man page.)

The system to be observed/controlled can be powered up. If the diag-switch? PROM environment variable is set to true, hardware diagnostic data will be displayed to the tip window. (See the Hardware Diagnostics page for further information.)

Saturday, March 30, 2013

Book Review: Essential System Administration

This is the latest edition of the book that taught me to be a Unix administrator. Frisch's descriptions were understandable, and her procedures were well-explained. And throughout the book, she explained the mindset that every professional system administrator needs to bring to the job.

The book has expanded with each edition, and there is increased information about different Unix-like Operating System options. In particular, the coverage of Linux and AIX has increased in successive editions of the book.

But the beating heart of the book has not changed. This is the book that encapsulates the essence of what it means to be a professional system administrator.

Friday, March 29, 2013

Sun POST-Based Hardware Diagnostics

The POST-based hardware diagnostics only check out the devices and buses required to access I/O devices; they do not check the devices themselves. Even so, the onboard hardware diagnostics can often pinpoint the source of a hardware failure.

To run Sun hardware diagnostics, perform the following at the ok> prompt:

ok> setenv auto-boot? false
ok> setenv diag-switch? true
ok> setenv diag-level max
ok> setenv diag-device disk net (if appropriate)
ok> reset (watch results of diagnostic tests)

If devices appear to be missing, you can also run the following tests:

ok> probe-scsi-all
ok> probe-sbus
ok> show-sbus
ok> show-disks
ok> show-tapes
ok> show-nets
ok> show-devs

In addition, the following commands can be used to examine the CPUs or switch to another CPU:

ok> module-info
ok> processor_number switch-cpu

Sometimes additional information can be obtained by navigating the PROM device tree. You can also try Sun's web site for additional information on PROM monitor diagnostics.

At the end of this process, reset your PROM parameters:
ok> setenv auto-boot? true (if appropriate)
ok> setenv diag-switch? false (if appropriate)

(Note that the diagnostics can take a substantial amount of time to run, depending on your hardware configuration. Most admins prefer to turn them off unless they are diagnosing a problem.)

For sun4u (Ultra) systems, you can get some of this information by running /usr/platform/`arch -k`/sbin/prtdiag -v on a running system.

Results from the above should be compared to log entries in /var/adm/messages or console error messages.

Additional PROM Diagnostics

Some additional PROM diagnostics are available at the ok> prompt. To discover what additional diagnostics are available for your hardware, type help diag at the ok> prompt. The output will include the appropriate syntax for all available PROM diagnostic functions. Note that reset should be run as above before running the tests. It is also possible that test-all might hang the system, requiring a power cycle.

Thursday, March 28, 2013

Navigating the PROM Hardware Tree

For versions 2.x and higher (SPARCstation 2 and newer), the OpenBoot firmware provides two command line interfaces:

Restricted Monitor: This interface is signalled by the > prompt. It provides for execution of the b (boot), c (continue), and n (new command mode) commands. The Restricted Monitor is used to implement PROM security via the security-mode PROM environment variable.

PROM Monitor: (Also known as the "Forth Monitor" or the "New Monitor.") This interface provides additional control, including a Forth command interpreter. The PROM Monitor is signalled by the ok> prompt.

Once in the ok> PROM monitor mode, it is possible to examine the tree of hardware devices known by the system. The following are the crucial commands to remember:

cd changes location in the device tree
ls lists the contents of the present node
pwd gives the current location in the device tree
dev device_pathname selects a particular node of the tree for examination.
.properties shows the properties for a particular node.
device-end unselects a node.

The device names are cryptic, but are closely related to the names of devices in the Operating System's /devices directory. The /etc/driver_aliases file may also be useful when trying to identify a device.

A full device path node name has the following form:

name@address:arguments

One example of such a name represents slice 0 of a sun4m boot disk:

/sbus@1,f8000000/esp@0,40000/sd@3,0:a
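To make the name@address:arguments form concrete, here is a short Python sketch that breaks a full device path into its per-node pieces. The function name parse_device_path is mine, and the sketch assumes that "/" separates nodes and that each node has at most one "@" and one ":" separator.

```python
def parse_device_path(path):
    """Split an OpenBoot device path into (name, address, arguments) tuples.

    Missing parts are returned as None.
    """
    nodes = []
    for component in path.strip("/").split("/"):
        # Everything after the first ':' is the arguments field.
        node, _, arguments = component.partition(":")
        # Everything after the first '@' is the address field.
        name, _, address = node.partition("@")
        nodes.append((name, address or None, arguments or None))
    return nodes


for node in parse_device_path("/sbus@1,f8000000/esp@0,40000/sd@3,0:a"):
    print(node)
```

For the sun4m example above, the last node comes back as the sd driver at address 3,0 with the argument a, which is the slice designation.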

The following commands may come in handy when trying to identify the location of a device:

ok> probe-scsi-all
ok> probe-sbus
ok> show-sbus
ok> show-disks
ok> show-tapes
ok> show-nets
ok> show-devs

The ability to navigate the device tree on such a primitive level is useful for troubleshooting. If the device in question is not present, we have a physical connectivity issue. At that point, we might try a reset or power cycle, then check cables and terminators, then examine the device itself.

If the device shows up on the PROM hardware tree but not in the Operating System, we would try a boot -r, examine the /dev and /devices directories, and look at the relevant driver files.

Wednesday, March 27, 2013

Error Message Interpretation

See below for a list of common error messages.

Traps and interrupts can be blocked by a kernel thread's signal mask, or they can trigger an exception handling routine. In the absence of such a routine or mask, the process is terminated.
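The difference between a handler and a mask is easy to see from user space. The following Python sketch is an illustration only, not how the Solaris kernel implements it. It installs a handler for SIGUSR1, raises the signal, then blocks it with the thread's signal mask and raises it again. The function name demo is hypothetical, and the sketch assumes a Unix system, Python 3.8 or later, and that it is run from the main thread.

```python
import signal


def demo():
    """Show a signal being handled, then held pending by the signal mask."""
    received = []

    def handler(signum, frame):
        # Record the signal instead of letting the default action run.
        received.append(signum)

    old_handler = signal.signal(signal.SIGUSR1, handler)
    try:
        # 1. Handler installed, signal not blocked: the handler runs.
        signal.raise_signal(signal.SIGUSR1)
        handled_first = len(received) == 1

        # 2. Signal blocked by the mask: it stays pending, handler not called.
        signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
        signal.raise_signal(signal.SIGUSR1)
        pending = signal.SIGUSR1 in signal.sigpending()
        held_back = len(received) == 1

        # 3. Unblocking delivers the pending signal to the handler.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
        handled_after = len(received) == 2
    finally:
        signal.signal(signal.SIGUSR1, old_handler)
    return handled_first, pending, held_back, handled_after


print(demo())
```

With neither the handler nor the mask in place, the default action for SIGUSR1 would terminate the process, which is the third case described above.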

Traps

Traps are synchronous messages generated by the process or its underlying kernel thread. Examples include SIGSEGV, SIGPIPE and SIGSYS. They are delivered to the process that caused the signal.

Trap messages can be discovered in a number of places, including error logs, adb output, and console messages. Sun provides a couple of files that can help determine the type of trap encountered:

/usr/include/sys/trap.h (software traps)
/usr/include/v7/sys/machtrap.h (hardware traps, 32 bit)
/usr/include/v9/sys/machtrap.h (hardware traps, 64 bit)

ECC (Error Checking and Correcting) interrupts are reported as traps when a bit error is corrected. These, while they do not crash the system, are usually a signal that the memory chip in question needs to be replaced.

Critical errors include things like fan/temperature warnings or power loss that require immediate attention and shutdown.

Fatal errors are hardware errors where proper system function cannot be guaranteed. These result in a watchdog reset.

Bus Errors

A bus error is issued to the processor when it references a location that cannot be accessed.

Illegal address: (usually a software failure)
Instruction fetch/Data load: (device driver bug)
DVMA: (on an Sbus system)
Synchronous/asynchronous data store
MMU: (Memory Management Unit: can be hardware or software, but frequently are system board problems.)

Interrupts

These notify the CPU of external device conditions that are asynchronous with normal operation. They can be delivered to the responsible process or kernel thread.

In Solaris, interrupts are handled by dedicated interrupt-handling kernel threads, which use mutex locks and semaphores. The kernel will block interrupts in a few exceptional circumstances, such as during the process of acquiring a mutex lock protecting a sleep queue.

Device done or ready.
Error detected.
Power on/off.

Watchdog Reset

Watchdog resets can be caused by hardware or software issues. See the watchdog reset page for information on how to troubleshoot watchdog resets.

Error Message List

A complete (or even reasonably complete) listing of error messages on Solaris is beyond the scope of this site. For that matter, the nature of an evolving operating system may put it beyond the scope of any reasonably sized page. Maybe a wiki? If someone has such a resource, let me know and I will link to it.

Having said that, this page contains a list of several of the most common error messages. Where I have been able to identify a usual cause for an error message, I have included that.

There are several sources that contain listings of error messages that are useful for debugging purposes.

One of the best resources is the Solaris Common Messages and Troubleshooting Guide released by Sun with Solaris 8. Since this is a better resource than I could provide for Solaris up through 8, I have focused on Solaris 10. (There is obviously a lot of overlap.)

The SunSolve web site is available to anyone with a Sun service contract. Its search feature can be used to look up key words in an error message to look for current bug reports and patches that may resolve them. This page does not provide a listing of bug reports or patches to apply for given error messages in certain conditions. This page is intended as a supplement to SunSolve, not a replacement.

The Intro(2) man page contains an introduction to system calls and error numbers. The information comes from the errno.h include file. Several include files contain at least basic information about different kinds of error messages:

/usr/include/sys/errno.h (error messages, including abbreviations and numbers seen in truss output.)
/usr/include/sys/trap.h (software traps)
/usr/include/v7/sys/machtrap.h (hardware traps, 32 bit)
/usr/include/v9/sys/machtrap.h (hardware traps, 64 bit)
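When a truss trace or log shows only the abbreviation, the number and message can also be looked up without opening the header. This is a small Python sketch using the standard errno module. The helper name describe is mine, and the numbers and wording it prints are those of whatever platform it runs on, so they may differ from the Solaris values and messages listed on this page.

```python
import errno
import os


def describe(name):
    """Return (number, message) for a symbolic errno name such as 'EPIPE'."""
    number = getattr(errno, name)
    return number, os.strerror(number)


for name in ("E2BIG", "EFAULT", "EPIPE", "EXDEV"):
    number, message = describe(name)
    print(f"{name:8} {number:3} {message}")
```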

These messages are alphabetized by the first non-variable portion of the message. Wording may vary slightly between Solaris versions or even patch levels. If you run across common messages not on this list, feel free to make a comment to the Solaris Troubleshooting blog.

Accessing a corrupted shared library (ELIBBAD): exec(2) was unable to load a required static shared library. The most common cause for this is a corrupted library.
Address already in use (EADDRINUSE): The protocol does not permit using an address that is already in use. This error indicates a software programming bug.
Address family not supported by protocol family (EAFNOSUPPORT): The protocol does not support the requested address. This indicates a software programming bug.
Arg list too long (E2BIG): The argument list includes both the argument list and the environment variable settings. The most common cause for this problem is that so many environment variables are set that it exceeds the size of the argument buffer used by exec(2). The easiest solution may be to unset some environment variables in the calling shell.
Argument out of domain (EDOM): This error appears when an improper argument is submitted to a math package programming function. (For example, an attempt to take a square root of a negative number would probably yield this error.) It may be helpful to use matherr(3M) to diagnose the problem, or the programmer may need to implement argument-checking before the function is called.
Arguments too long: This is a C shell message indicating that more than 1706 arguments follow a command. This may happen if globbing is applied to a large number of objects (e.g., rm * in a directory of more than 1706 objects). Temporarily switching to Bourne shell may resolve the problem, since Bourne shells dynamically allocate space for arguments.
Assertion failed: This is a result of an assert(3C) debugging command that the programmer inserted into the program. The output will include an expression, a source file number and a code line number. The information may be useful in examining the source code.
Attachment point not found: Use cfgadm to list available attachment points. Check the physical connection to the desired device.
Attempting to link in more shared libraries than system limit (ELIBMAX): The executable requires more static libraries than the current system limit.
authentication receive failed: Initiator unable to receive authentication information. Verify network connectivity to storage device and authentication server.
authentication transmit failed: Initiator unable to transmit authentication information. Verify network connectivity to storage device and authentication server.
Bad address (EFAULT): A function taking pointer argument has been passed an invalid address. This may result from supplying the wrong device or option to a command, or it may be the result of a programming bug.
Bad file number (EBADF): The file descriptor references a file that is either not open or is open for a conflicting purpose. (For example, a read(2) is specified against a file that is open for write(2) or vice-versa.) This is a programming bug.
Bad module/chip: This error message usually indicates a memory module or chip that is associated with parity errors. This is a hardware fault.
BAD SUPER BLOCK: Check the Trap 3E entry below to see if there are possible hardware or SCSI configuration causes for this problem. It may be possible to boot from alternate super blocks. If there is no current backup, boot from a CD and back up the raw partition with ufsdump or another similar utility. Solaris 10's 6/06 release includes enhancements to fsck to automatically find and repair bad superblocks. This option should only be used to repair filesystems that were created with mkfs or newfs. For older systems, an alternate superblock can frequently be found with a
newfs -N /dev/rdsk/c#t#d#s#
command while booted from a CD. (Note the -N option. Running this command without it creates a new filesystem on the partition and destroys the existing data.) fsck can be run against an alternate superblock with
fsck -o b=superblock /dev/rdsk/c#t#d#s#
If there is a lot of output, it may be necessary to choose the -y option to avoid having to answer a ton of prompts. We may need to try several alternate superblocks before finding a working one. Once we are done, we need to re-install the bootblock:
cd /usr/platform/`arch -k`/lib/fs/ufs
/usr/sbin/installboot ./bootblk /dev/rdsk/c#t#d#s#
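Since several alternate superblocks may need to be tried, it can be handy to generate the fsck command lines up front. The Python sketch below only builds and prints the commands and does not run them. The function name fsck_attempts is hypothetical, and the superblock numbers and device name in the example are placeholders, so substitute the values reported by newfs -N for the real partition.

```python
def fsck_attempts(raw_device, superblocks, assume_yes=False):
    """Build one fsck argument list per candidate alternate superblock."""
    commands = []
    for superblock in superblocks:
        command = ["fsck"]
        if assume_yes:
            # -y answers yes to every prompt.
            command.append("-y")
        command += ["-o", f"b={superblock}", raw_device]
        commands.append(command)
    return commands


# Placeholder superblock numbers and device name, for illustration only.
for command in fsck_attempts("/dev/rdsk/c#t#d#s#", [32, 5264], assume_yes=True):
    print(" ".join(command))
```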
BAD TRAP: The causes for bad traps include system text errors, data access faults, data alignment errors or some types of user software traps. These can indicate either a hardware fault or a mismatch between the hardware and its software configuration. They may also indicate a CPU with obsolete firmware. Bad traps usually result in a panic, sync, dump, reboot cycle. The kernel traceback message on the console will frequently indicate the hardware component that generated the bad trap. If the configuration for this component is correct, it will need to be replaced (or at least reseated).
/bin/sh: ... too big: This Bourne shell message is a variant of Not enough space. Check that message for steps to take.
Block device required (ENOTBLK): A raw device was specified where a block device is required.
Broken pipe (EPIPE): No reading process was available to accept a write on the other end of a pipe. This can happen when the reading process (the process after the pipe) exits suddenly.
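The EPIPE condition is simple to reproduce. The following Python sketch opens a pipe, closes the reading end, and then writes to it. The function name write_to_closed_pipe is mine. It relies on the Python interpreter ignoring SIGPIPE by default, so the failed write surfaces as an error number rather than killing the process.

```python
import errno
import os


def write_to_closed_pipe():
    """Write to a pipe whose read end is closed and return the errno raised."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the reading side goes away
    try:
        os.write(write_fd, b"data")
    except BrokenPipeError as exc:  # Python's exception for EPIPE
        return exc.errno
    finally:
        os.close(write_fd)
    return None


print(errno.errorcode[write_to_closed_pipe()])
```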
Bus Error: I/O was attempted to a device that is unavailable or does not exist. See Bus Error above.
Cannot access a needed shared library (ELIBACC): Either the library does not exist, the LD_LIBRARY_PATH variable does not include the library, or the user does not have permission to use it. The library in question can usually be pinned down with truss.
Cannot assign requested address (EADDRNOTAVAIL): The requested address is not on the current machine.
Cannot exec a shared library directly (ELIBEXEC): You can't execute shared libraries directly. This error indicates a software bug.
Cannot install bootblock: On an x86 system, this error typically appears when a newfs and restore operation was carried out without performing an installboot before installing the OS. It may be possible to install the bootblock from the CD drive in single-user mode (note that Sun does not guarantee this procedure):
cd /usr/platform/`arch -k`/lib/fs/ufs
installboot ./pboot ./bootblk /dev/rdsk/c#t#d#s#
Cannot send after transport endpoint shutdown (ESHUTDOWN): The transport endpoint has been shut down, so data was unable to be sent. The solution is usually to restore the endpoint and re-run the transfer. (We may need to troubleshoot why the remote endpoint became unavailable.)
can't accept: Initiator does not accept the specified data of the given format. Consult storage device documentation to look for compatibility information for the server hardware and OS.
can't accept ... in security stage: Device responded with unsupported login information during login security phase. Verify storage device authentication settings. Consult storage device documentation to look for compatibility information for the server hardware and OS.
can't find environment variable: The specified environment variable has not been set. Check for a typo and/or verify that the variable has been set.
Can't invoke /etc/init: The init binary is missing or corrupted during a reboot. We may be able to complete the boot by copying init from a CDROM during a CDROM reboot.
capacity of this LUN is too large: SCSI partitions must be less than 2TB.
Channel number out of range (ECHRNG): A stream head attempted to open a minor device that is in use or does not exist. We need to make sure that the stream device exists, along with an appropriate number of minor devices, and that it matches the hardware configuration. It may be necessary to schedule jobs differently to allow for limited system resources.
check boot archive content: If SMF does not start up on its own, this message in response to svcs -x may indicate a failure of svc:/system/boot-archive:default. To resolve this problem, select the Solaris failsafe archive option in the GRUB boot menu during the next reboot. The failsafe boot option provides instructions for rebuilding the boot archive. Once that is complete, the boot can be continued by clearing the SMF boot archive with the svcadm clear boot-archive command.
Command not found: This is a C shell error message that means exactly what it says. It typically means that the command was misspelled or does not live on the PATH.
Communication error on send (ECOMM): The link between machines breaks after data is sent, but before the confirmation is received.
Component system is busy, try again: failed to offline: cfgadm attempted to remove or replace a device with a mounted file system, swap area or configured dump device. Unmount the file system, remove the swap and/or disable the dump device, then retry the cfgadm command. See the cfgadm(1M) man page.
Configuration operation invalid: invalid transition: The incorrect device may have been specified, or there may be a problem with the device or its seating. Use cfgadm to check the receptacle and its state. The card may need to be reseated.
Connection refused (ECONNREFUSED): The target machine actively refused the connection. The service may not be active, or there may be restrictions on connections (such as the hosts.allow and hosts.deny in TCP wrappers).
Connection reset (ECONNRESET): The target system forcibly closed an existing connection. This typically happens as a result of a reboot or a timeout.
Connection timed out (ETIMEDOUT): The target host is unreachable due to network problems or the system being down.
Core dumped: A core file (image of software memory at the time of failure) has been taken. See Core File Management.
Corrupt label: This happens if cylinder 0 has been overwritten, usually by a database using a raw partition including cylinder 0. The best solution is to back everything up and repartition the disk with cylinder 0 either not in any partition or at least in a partition with a filesystem (such as UFS) that respects cylinder 0.
cpio: Bad magic number/header: The cpio archive has become corrupted. We can try to recover whatever we can by using the cpio -k command.
Cross-device link (EXDEV): Hard links are not permitted across different filesystems. Use a soft link instead.
Data access exception: Mismatch between the operating system and disk storage hardware. This can be due to mis-seated DIMMs or disk problems, so it makes sense to try to identify any hardware problems. Usually, the operating system (and perhaps filesystem) will need to be upgraded to deal with the newer hardware.
DataDigest=... is required, can't accept: Device returned an improperly processed DataDigest. Verify that storage device digest settings are compatible with the initiator.
Data Fault: This is a particular type of bad trap that indicates a configuration text or data access fault. See BAD TRAP above.
Deadlock situation detected/avoided (EDEADLK): A potential deadlock over a system resource (usually a lock) was detected and avoided. The software should be examined to see if it can be made more resilient.
Destination address required (EDESTADDRREQ): An address was omitted from an operation that requires one.
/dev/fd/#: cannot open: Indicates that the file descriptor file system (fdfs) is not mounted correctly. In most cases, the problem is that it is mounted either nosuid or not at all. The file descriptor file system should have the following options in the vfstab:
fd - /dev/fd fd - no -
Device busy (EBUSY): A hard drive or removable media failed to unmount or eject due to an active process using them. The fuser command allows us to see what processes are using the filesystem or even kill them with a command like:
fuser -ck /mountpoint
(Make sure that you know what processes are running on a filesystem before killing them.)
DIMMs Manufacturer Mismatch: DIMMs in the system are not on the hardware compatibility list.
Directory not empty: This is an error from rmdir which means exactly what it says. Non-empty directories cannot be removed. (If a process is holding a file open, it is possible to track down the culprit by looking for the inode of the file in question (ls -i filename) in pfiles output.)
Disc quota exceeded (EDQUOT): A user's disk quota has been exceeded. Some of the user's files can be removed or the quotas can be increased with edquota.
Disk# not unique: This error is displayed if there are multiple EEPROM devalias entries for a disk. At the ok> prompt, the values of the aliases can be shown with
ok> printenv
The aliases can be reset with
ok> nvunalias disk#
ok> nvalias disk# device-path
dquot table full: The UFS quota table needs to be increased in size. This is done by increasing ndquot in /etc/system and rebooting. ndquot defaults to (maxusers x 40)/4 + max_nprocs.
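To get a feel for the default before raising it, the formula can be worked through for a given system. This is a small Python sketch of the arithmetic as stated above. The function name default_ndquot is hypothetical, the use of integer division is my assumption, and the input values in the example are made up for illustration.

```python
def default_ndquot(maxusers, max_nprocs):
    """Default quota table size: (maxusers x 40)/4 + max_nprocs."""
    # Integer division is an assumption of this sketch.
    return (maxusers * 40) // 4 + max_nprocs


# Hypothetical tunable values, for illustration only.
print(default_ndquot(maxusers=100, max_nprocs=1000))
```

With those made-up inputs the result is 2000, so a value above that in /etc/system would be an increase over the default.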
dr in progress: This error may occur if a SCSI unconfigure operation fails while only partially completed. The controller may need to be reconfigured with cfgadm.
driver not attached: No driver currently attached to the specified device because no device exists at the node or the device is not in use. This may or may not mean that a proper driver is not installed. Make sure that the driver is installed and properly configured.
empty RADIUS shared secret: The RADIUS shared secret needs to be set.
Error 88 (EILSEQ): This is an illegal byte sequence error. Multiple characters have been provided where only one is expected.
Error code 2: access violation: This error is due to a permissions or path error on a tftp get.
Error: missing file arg (cm3): A filename was not included in an sccs command that requires one.
error opening dir: The specified path may not be a directory.
error writing name when booting: /etc/nodename must contain exactly one line with the name of the system and no blanks or returns.
esp0: data transfer overrun: This error appears when we attempt to mount a CD drive with an 8192 block size as opposed to the Sun-standard 512 block size. Check with the drive manufacturer to see if the block size can be switched.
ether_hostton: Bad file number/Resource temporarily unavailable: These messages may be a result of a mis-matched nodename file. Make sure that the /etc/nodename entry matches the corresponding /etc/hostname.interface and /etc/inet/hosts files.
Event not found: The shell reports that a command matching the request cannot be found in the history buffer for the shell session. The history command shows the current contents of the history buffer.
Exec format error (ENOEXEC): This error usually means that the software was compiled for an architecture other than the one on which it finds itself. This may also happen if an expected binary compatibility package is not installed. The file command displays the expected architecture for the binary.
Failed to initialize adapter: If the adapter has been correctly identified, this means that the configuration of the adapter is incorrect. In particular, make sure to check the DMA settings.
Failed to receive login response: Initiator failed to receive a login Payload Data Unit (PDU) across the network. Verify that the network connection is working.
Failed to transfer login: Initiator failed to transfer a login Payload Data Unit (PDU) across the network. Verify that the network connection is working.
Fast access mmu miss: This is usually due to a hardware problem. Memory is a possible culprit, as are the system board and CPU. Check PROM Monitor Diagnostics for hardware diagnostics on OBP/Sparc systems.
File descriptor in bad state (EBADFD): The requested file descriptor does not refer to an open file or it refers to a file descriptor that is restricted to another purpose. (For example, a read request is made to a file descriptor that is open for writing only.)
File exists (EEXIST): An existing file was targeted for a command that would have overwritten it improperly. For example, there may have been a request to overwrite a file while the csh noclobber option is set, or there may have been a request to set a link to the name of an existing file.
File locking deadlock (EDEADLOCK): Two processes deadlocked over a resource, such as a lock. This is a software programming bug.
File name too long (ENAMETOOLONG): The referenced file name is longer than the limit specified in /usr/include/limits.h.
File system full: The file system is full. (Error messages sometimes mean what they say.) If the message occurs during a login, the problem is likely the filesystem that includes the utmpx file (usually /var).
File too large (EFBIG): The file size has grown past what is allowed by the protocol or filesystem in question, or exceeds the resource limit (rlimit) for file size. The resource limit can be checked by running ulimit -a in Bourne or Korn shells or limit in C shell. Check the Resource Management page for additional information on managing resource limits.
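The same file size limit that ulimit -a reports can be read from inside a program, which is useful when the failing process runs under a different environment than the login shell. This is a minimal Python sketch using the standard resource module. The function name file_size_limit is mine, and the sketch assumes a Unix system where that module is available.

```python
import resource


def file_size_limit():
    """Return the (soft, hard) file size limits for the current process."""
    def show(value):
        # RLIM_INFINITY means no limit is imposed.
        return "unlimited" if value == resource.RLIM_INFINITY else value

    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return show(soft), show(hard)


soft, hard = file_size_limit()
print("soft:", soft, "hard:", hard)
```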
Giving up: In the context of a SCSI command, this means that the timeout has been exceeded. This is usually due to a hardware or connection problem, but it can be caused by contention on the SCSI channel, or even a mis-match in timeout settings between the OS and the device in question.
Hardware address trying to be our address: Either we have two systems on our network with the same IP address, or we have snooping enabled on a device on the network.
Host is down (EHOSTDOWN): A connection attempt failed because the target system was unavailable.
HeaderDigest=... is required, can't accept: Device returned an improperly processed HeaderDigest. Verify that storage device digest settings are compatible with the initiator.
Host name local configuration error: sendmail wants to have a fully qualified domain name for the local host. It is good practice to include a fully qualified domain name in the hosts file entry for the local server.
Hypertransport Sync Flood occurred on last boot: Uncorrectable ECC error caused the last reboot. For x64 systems, check the service processor's System Event Log and BIOS log to identify the culprit.
Identifier removed (EIDRM): There is a problem accessing a file associated with messaging, semaphores or shared memory. Check the msgctl(2), semctl(2) or shmctl(2) man page for more details.
ieN Ethernet jammed: The number of successive failed transmission attempts has exceeded the threshold. Check whether the network is saturated or check for other network problems.
ieN no carrier: The carrier detect pin died during a packet transmission, resulting in a dropped packet. Check for loose connections and otherwise check the network.
If pipe/FIFO, don't sleep in stream head (ESTRPIPE): There is a problem with the STREAMS connection.
ifconfig: bad address: Check /etc/hostname.* to make sure that the entries match the hosts file. When this error occurs early in the boot process, make sure that the filesystem containing hostname.* and hosts is online at that stage of the boot process. If “files” is not the first entry in the “hosts” line of /etc/nsswitch.conf, the hostname lookup will not be possible until the interface comes online.
ifconfig: no such interface: Make sure that the /etc/hostname.interface file exists.
Illegal instruction: This error message means exactly what it says. This may come about because the binary is not compiled for this architecture (see “Exec format error” above), or it may come as a result of trying to run a data file as a program. If this appears during a boot, it means that the system is trying to boot from a non-boot device, that the boot information has become corrupted, or that the boot information is meant for a different architecture.
Illegal seek (ESPIPE): There is a problem with a pipe in the statement. A workaround suggested by Sun is to redirect the output of the source command to a scratch file, then process the file.
Initiator could not be successfully authenticated: Verify CHAP and/or RADIUS settings, as appropriate.
Initiator is not allowed access to the given target: Verify initiator name, masking and provisioning.
initiator name is required: The initiator name is improperly configured.
Interrupted system call (EINTR): A signal (like an interrupt or quit) was received before the system call had completed. (If we try to resume, we may error out as a result of this condition.)
Invalid argument (EINVAL): System cannot interpret a supplied parameter. Depending on the context, this may be an indication that the object named by the parameter is not set up properly.
Invalid null command: This may indicate that there were two pipes in a row (“||”) in the referenced command.
I/O error (EIO): This references a physical I/O fault. Depending on the context, it makes sense to replace the removable media, check all connections, run diagnostics on the referenced hardware or fsck the filesystem. If this error occurs during a write, we must assume that the data is corrupt.
Is a directory (EISDIR): We tried to treat a directory like a file.
iSCSI service or target is not currently operational: Run diagnostics on the storage device hardware; check storage device software configuration.
Kernel read error: savecore is unable to read the kernel data structures to produce a crash dump. This may indicate a hardware problem, especially a memory problem. This problem may accompany a BAD TRAP error.
Killed: This may happen as a result of a memory allocation attempt where either there is insufficient swap space or the stack and data segment size are in conflict. A “Killed” message may also appear when a program is sent a SIGKILL by other means, such as a kill command.
kmem_free block already free: This is a software programming bug, probably in a device driver.
ld.so.1 fatal: can't set protection on segment: Sun reports a case where this error occurred due to a lack of swap space. ld.so.1 complained because there was no segment on which to set protections.
ld.so.1 fatal: open failed: No such file or directory: The linker was unable to find the shared library in question. Make sure that LD_LIBRARY_PATH is set properly.
ld.so.1 fatal: relocation error: referenced symbol not found: The symbol referenced by the specified application was not found. This error most frequently occurs after installations or upgrades of shared libraries. ldd -d on the application will show its dependencies. Depending on the nature of the conflict, it may be resolvable by changing the LD_LIBRARY_PATH or installing an appropriate version of the shared library.
Link has been severed (ENOLINK): The connection to a remote machine has been severed, either by the remote process dying or a network problem.
Login incorrect: This error means that an appropriate username and password pair was not entered. This may be due to a problem with the passwd and shadow files, the naming service, or the user forgetting login credentials.
login redirection failed: Storage device attempted to redirect initiator to an invalid destination. Verify storage device redirection settings.
Memory Configuration Mismatch: Can be caused by damaged or unsupported DIMMs, or by running non-identical DIMMs within the same bank.
Message too long (EMSGSIZE): A message was sent that was larger than the internal message buffer.
Miscellaneous iSCSI initiator errors: Check the initiator.
Missing parameters (e.g., iSCSI initiator and/or target name): Verify that the initiator and target name are properly specified.
mount: ...already mounted... (EBUSY): Either the filesystem is mounted elsewhere, an active process has its working directory inside the mount point or the maximum number of mounts has been exceeded.
mount: giving up on...: The remote mount request was unsuccessful for more than the threshold number of retries. Check the network connection and make sure that the NFS server is sharing the directory to the client as expected.
mount: mount-point...does not exist: The directory specified as the mount point does not exist.
mount: the state of /dev/dsk/... is not okay: The filesystem should either be mounted read-only or fsck-ed.
Network dropped connection because of reset (ENETRESET): The remote host crashed or rebooted.
Network is down (ENETDOWN): A transport connection failed due to a dead network.
Network is unreachable (ENETUNREACH): Either there is no route to the network, or negative status information was received from intermediate network devices.
NFS getattr failed for server...RPC: Timed out: The NFS server has failing hardware. (For a server that is slow to respond, the NFS server not responding message would appear instead.)
nfs mount: Couldn't bind to reserved port: The NFS server has multiple network cards bound to the same subnet.
nfs mount: mount:...Device busy: An active process has a working directory inside the mount point.
NFS mount:...mounted OK: A backgrounded mount completed successfully. This may be an indication that the server response is poor, since otherwise the mount would have completed immediately and not required backgrounding.
NFS read failed for server: This is a permissions problem error message. In addition to checking the permissions on the NFS server, make sure that the permissions underneath the mount are acceptable. (Mount points should have 755 permissions to avoid odd permissioning behavior on mounted filesystems.)
nfs_server: bad getargs: The arguments are unrecognized or incorrect. This may be an indication of a network problem, or it may indicate a software configuration problem on the client.
NFS server ... not responding: The network connection to the NFS server is either slow or broken.
NFS server ... ok: The network connection to the NFS server has been restored. This is a followup to NFS server ... not responding.
nfs umount: ... is busy: An active process has a working directory inside the specified NFS mount. See the Device busy error message.
NFS write error on host ... No space left on device: If an NFS mount runs out of space, attempts to write to files on the share may corrupt or zero out those files.
NFS write failed for server ... RPC: Timed out: The filesystem is soft mounted, and response time is inadequate. Sun recommends that writable filesystems not be soft-mounted, as it can lead to data corruption.
No carrier-cable disconnected or hub disabled?: This error may manifest due to a physical networking problem or a configuration issue.
No child processes (ECHILD): An application attempted to communicate with a cooperating process that does not exist. Either the child exited improperly or failed to start.
No default media available: Drives contain no floppy or CD media to eject.
No directory! Logging in with home=/: The home directory either does not exist or is not permissioned such that the user can use it. If home directories are automounted, it may be necessary to troubleshoot the automounter.
no driver found for device: A driver has been disabled while the device is still attached. Depending on the type of device, cfgadm, drvconfig, devfsadm or a reconfiguration reboot (boot -r) may be required. Check the System Administration Guide: Devices and File Systems document.
No message of desired type (ENOMSG): Something attempted to receive a message of a type that does not exist on the message queue. See the msgsnd(2) and msgrcv(2) man pages.
No record locks available (ENOLCK): Any of several different locking subsystems, including fcntl(2), NFS lockd and mail, may yield this message when no more locks are available.
No route to host (EHOSTUNREACH): In practice, this message is not distinguishable from Network is unreachable.
No shell Connection closed: The shell specified for the user is either unavailable or illegal. Make sure it is listed in /etc/shells and that it exists. It may be necessary to change the passwd entry for this user to assign a valid shell.
No space left on device (ENOSPC): The disk, tape or diskette is full.
No such device (ENODEV): An operation was attempted on an inappropriate or nonexistent device. Make sure that it exists in /devices and /dev. The drvconfig or boot -r commands can be used to regenerate many /devices entries.
No such device or address (ENXIO): I/O has been attempted to a device that does not exist or that exists beyond the limits of the device. Make sure that the device in question is powered up and connected properly, including the correct SCSI ID.
No such file or directory (ENOENT): The file or path name does not exist on the system. Make sure that the relevant filesystems are mounted and that the expected files and/or directories exist.
No such process (ESRCH): The process does not exist on the system. It may have finished prior to the attempt to reference it.
No such user ... cron entries not created: Even though a file exists in /var/spool/cron/crontabs for this username, the username is not present in the passwd database.
No utmpx entry: The filesystem containing the utmpx file is full. This may need to be resolved in single-user mode, since logins will not be permitted.
Not a data message (EBADMSG): Data has come to the head of a STREAMS queue that cannot be processed. See the man pages for read(2), getmsg(2) and ioctl(2).
Not a directory (ENOTDIR): A non-directory was specified as an argument where a directory is required.
Not a stream device (ENOTSTR): The file descriptor used as a target for the putmsg(2) or getmsg(2) is not a STREAMS device.
Not a UFS filesystem: The boot device is improperly defined. For x86, boot the system with the Configuration Assistant/boot CD and identify the disk from which to boot. For PROM-based systems, set the boot-device properly in the PROM environment variables.
Not enough space (ENOMEM): Insufficient swap space available.
Not found: The specified command could not be found. Check the spelling and the PATH.
Not login shell: Use exit to get out of non-login shells. (The logout command can only be used from login shells.)
Not on system console: Direct root logins are only permitted on the system console unless otherwise specified in /etc/default/login.
Not owner (EPERM): Action attempted that can only be performed by object owner or the superuser.
Not supported (ENOTSUP): A requested application feature is not available in the current version, though it may be expected in a future release.
Object is remote (EREMOTE): We tried to share a resource not on the local machine.
Operation already in progress (EALREADY): An operation was already in progress on a non-blocking object.
Operation canceled (ECANCELED): The asynchronous operation was canceled before completion.
Operation not applicable (ENOSYS): No system support exists for this operation.
Operation not supported on transport endpoint (EOPNOTSUPP): Tried to accept a connection on a datagram transport endpoint.
Operation now in progress (EINPROGRESS): Operation in progress on a non-blocking object.
Option not supported by protocol (ENOPROTOOPT): A bad option or level was specified.
Out of memory: System is running out of virtual memory (including swap space). See “Not enough space” as well.
Out of stream resources (ENOSR): No STREAMS queues or no STREAMS head data structures available during a STREAMS open.
Overlapping swap volume: Make sure that the additional swap volumes have unique names.
Package not installed (ENOPKG): The attempted system call belongs to a package that is not installed on this system.
Paired DIMMs Mismatch: Checksum mismatch between two DIMMs in a pair. Can be caused by damaged or non-identical DIMMs.
Panic – boot: Could not mount filesystem: (During a Jumpstart) The Jumpstart boot process is unable to get to the install image. Make sure that the Jumpstart configurations and file shares are correct.
Panic ... valloc'd past tmpptes: May occur if maxusers is set to an absurdly high number. It should not be set past the number of MB of RAM or 4096, whichever is smaller.
Permission denied (EACCES): The attempted file access is forbidden due to filesystem permissions.
Protocol family not supported (EPFNOSUPPORT): The protocol has not been implemented on this system.
Protocol not supported (EPROTONOSUPPORT): The protocol has not been configured for this system. Check the protocols database (/etc/inet/protocols by default).
Protocol wrong type for socket (EPROTOTYPE): Application programming error or misconfigured protocols. The requested protocol does not support the requested socket type. Make sure that the protocols database matches with the corresponding entries in /usr/include/sys/socket.h.
quotactl: open Is a directory: A directory named “quota” can cause edquota to fail. Such directories should be renamed.
RADIUS packet authentication failed: Re-set the RADIUS shared secret.
Read error from network: Connection reset by peer: The remote system crashed or rebooted during an rsh or rlogin session.
Read-only file system (EROFS): We can't change stuff on filesystems that are mounted read-only.
received invalid login response: Storage device response was unexpected. Verify initiator authentication settings.
Requested iSCSI version range is not supported by the target: The initiator's iSCSI version is not supported by the target storage device. Check the compatibility lists. See if firmware or driver upgrades would be sufficient.
Requested ITN does not exist at this address: The iSCSI target name (ITN) is not accessible. Verify the initiator discovery information and storage device configuration.
Requested ITN has been removed and no forwarding address is provided: The requested iSCSI target name is no longer accessible. Verify the initiator discovery information and storage device configuration.
Resource temporarily unavailable (EAGAIN): fork(2) cannot create a new process due to a lack of resources. These resources may include limits on active processes (see the Resource Management page) or a lack of swap space.
Restartable system call (ESTART): The system call has been interrupted in a restartable state.
Result too large (ERANGE): This is a programming or data input error. The result of a calculation is not representable in the defined data type. The matherr(3M) facility may be helpful in debugging the problem.
ROOT LOGIN ...: Someone has just logged in as root or su-ed to root.
RPC: Program not registered: Make sure that the requested service is available.
rx framing error: This error usually indicates a problem with the network hardware. Framing errors are types of CRC errors, which are usually caused by physical media problems.
SCSI bus DATA IN phase parity error: This is a problem related to SCSI hardware or connections. It may have to do with hardware that is not qualified for attachment to Sun servers, connections with cables that are flaky or too long (total length more than 6 meters), bad terminators or flaky power supplies. See the SCSI transport failed: reason 'reset' message as well.
SCSI transport failed: reason 'reset': The system sent data that was never received due to a SCSI reset. This may occur due to conflicting SCSI IDs, hardware that is not qualified for attachment to Sun servers, connections with cables that are flaky or too long (total length more than 6 meters), bad terminators or flaky power supplies. These issues have also been observed on systems where the highest capacity DIMMs are not in the lowest numbered slots. Disk arrays with read-ahead caches can sometimes also cause this problem; turn off the caching to see if the problem goes away. Non-obvious SCSI ID conflicts may be diagnosed using the PROM monitor probe-scsi-all command. (See OBP Command Line Diagnostics for more details.) These errors may also happen when the SCSI device and the server are set to different SCSI timeout thresholds.
Segmentation Fault: These can be produced as a result of programming errors or improperly set rlimit resource settings. (See Resource Management for how to check and adjust resource settings.) Segmentation faults are an indication that the program has attempted to access an area of memory that is protected or does not exist. Programming causes for segmentation faults include dereferencing a null pointer and indexing past the bounds of an array.
setmnt: Cannot open /etc/mnttab for writing: The system is unable to write to /etc/mnttab. This may be caused by the /etc directory being mounted read-only (which can happen during certain types of boot problems).
share_nfs: /home: Operation not applicable: A local filesystem is mounted on /home, which is usually reserved for use by the automounter.
skipping LIST command – no active base: A LIST command is present without an associated BASE command. (cachefspack)
Socket type not supported (ESOCKTNOSUPPORT): The socket type's support has not been configured for this system.
Soft error rate ... during writing was too high: The number of soft errors on a tape device has exceeded the threshold. It may be due to a dirty head, bad media or a faulty tape drive.
Software caused connection abort (ECONNABORTED): The connection was aborted within the local host machine.
Stale NFS file handle (ESTALE): The file or directory on the NFS server is no longer available. It may have been removed or replaced. A remount may be needed to force a renegotiation of file handles.
statd: cannot talk to statd: statd has left remnants in the /var/statmon/sm and /var/statmon/sm.bak directories. Files named after inactive hosts should be removed, and statd and lockd should be restarted.
su: No shell: The default shell for root is improper. It may have been set to a nonexistent program or an illegal shell. This problem has been known to occur when an extra space is appended to the “root” line of the passwd file. The passwd file will need to be repaired while booted from CDROM or network.
syncing file system: The kernel is updating the superblocks before taking the system down or in the wake of a panic.
System booting after fatal error FATAL: This can be caused by UPA address parity errors, Master queue overflows or DTAG parity errors. This is going to be due to a bad CPU or possibly a bad system board.
tar: ...: No such file or directory: The specified target (which defaults to TAPE) is not available. This may be due to a hardware problem with the tape drive or connections, or to a misspecified target.
tar: directory checksum error: The checksum of the files read from tape does not match the checksum in the header block. This may be due to an incorrectly specified block size or a bad piece of tape media.
tar: tape write error: A physical write error has occurred on the tar target.
Target hardware or software error: Run diagnostics on the storage device hardware; check storage device software configuration.
Target has insufficient session, connection or other resources: Check storage device settings. Check with storage device vendor to see if resource settings can be increased or capacity can be otherwise increased.
target protocol group tag mismatch: Initiator and target had a Target Portal Group Tag (TPGT) mismatch. Verify TPGT discovery settings on initiator and storage device.
Text file busy (ETXTBSY): An attempt was made to execute a file that was open for writing.
The SCSI bus is hung: The likely cause is a conflict in SCSI target numbers. See the SCSI transport failed: reason 'reset' message as well.
Timeout waiting for ARP/RARP packet: Indicates a network connection problem while booting from the network. This problem can sometimes be observed on subnets containing multiple servers willing to answer a RARP request, which can result in a server without a bootparams file receiving a request. (We have had good luck moving Jumpstart targets to an isolated subnet for initial installations.)
Timer expired: The timer for a STREAMS ioctl has expired. The cause is device specific, and may be related to flaky hardware, driver failure or an inappropriately short timeout threshold.
Too many links (EMLINK): A file has too many hard links associated with it. Use soft links instead.
Too many open files (EMFILE): A process has exceeded the limit on the number of open files per process. (See the Resource Management page for methods to monitor and manage these limits.)
Transport endpoint is already connected (EISCONN): Connection request made on an already connected transport endpoint.
Transport endpoint is not connected (ENOTCONN): The endpoint is not connected and/or an address was not specified.
Trap 3E: These are caused by a bad boot disk superblock. This may have been caused by a failing disk, faulty disk connections, software misconfiguration or duplicate SCSI addresses. Check the possible hardware and SCSI configuration issues before attempting to recover the superblock using the methods listed under BAD SUPER BLOCK above.
Too Many Arguments: This is a variant of the C shell's Arguments too long message, except that this time the problem may be the number rather than the length of arguments.
unable to connect to target: Initiator unable to establish a network connection. This message is typically accompanied by an error number from /usr/include/sys/errno.h.
unable to get shared objects: The executable may be corrupt or in an unrecognized format.
unable to initialize authentication: Verify that initiator authentication settings are properly configured.
unable to make login pdu: Initiator could not make a login Payload Data Unit (PDU) based on the initiator and storage device settings. Reset target login parameters and other settings as required.
unable to schedule enumeration: Initiator unable to enumerate the LUNs on the target. LUN enumeration can be forced via the devfsadm -i iscsi command.
unable to set [authentication|ipsec|password|remote authentication|username]: Verify that initiator authentication settings are properly configured.
uname: error writing name when booting: /etc/nodename must contain exactly one line with the name of the system and no blanks or returns.
Unknown service: Either the service is not listed in the services database (/etc/services by default), or the permissions for the services database are set so that the user cannot read it.
Value too large for defined data type (EOVERFLOW): Argument improperly formatted for the structure allocated to it.
WARNING: /tmp: File system full, swap space limit exceeded: Virtual memory has filled up. A reboot is recommended after we have figured out which process is hogging all the memory and/or swap, since the system may be in an unstable state.
WARNING: TOD clock not initialized: It is likely that the system clock's battery is dead.
Watchdog Reset: This usually indicates a hardware problem. (See the Watchdog Resets page for a complete discussion.)
Window Underflow: These errors sometimes accompany a trap, especially at boot time. Some program attempted access of a register window that was not accessible from that processor. These errors may occur when differently sized DIMMs are improperly used together, or when cache memory has gone bad. If mismatched memory is not the problem, the CPU or system board will need to be replaced.
wrong magic number: See “Corrupt label” above.
you are not authorized to use: A configuration file (e.g., at.deny or cron.deny) forbids access to this service.
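
Two of the entries above (No shell Connection closed and su: No shell) come down to a bad shell field in the passwd file. The following is a minimal sketch of a check for that condition. It assumes a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris rather than the old /usr/bin/awk) and a non-empty shells file. The name check_shells is a hypothetical helper, not a Solaris utility.

```sh
# check_shells PASSWD_FILE SHELLS_FILE   (hypothetical helper)
# Print the login name of every passwd entry whose shell field is
# non-empty and is not an exact match for a line in SHELLS_FILE.
check_shells() {
    passwd_file=$1
    shells_file=$2
    # Field 7 of a passwd entry is the login shell. A trailing space is
    # part of the field, so "/sbin/sh " will not match "/sbin/sh".
    # Entries with an empty shell field are skipped, since the system
    # falls back to a default shell for them.
    awk -F: '
        NR == FNR { ok[$0] = 1; next }        # first file: allowed shells
        $7 != "" && !($7 in ok) { print $1 }  # second file: passwd entries
    ' "$shells_file" "$passwd_file"
}
```

A typical call would be check_shells /etc/passwd /etc/shells. Because the comparison is exact, a stray space appended to the root line is reported along with shells that are simply not listed.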

Tuesday, March 26, 2013

Solaris SPARC Boot Sequence

The following represents a summary of the boot process for a Solaris 2.x system on SPARC hardware.

Power On: Depending on the system involved, you may see some output on a serial terminal immediately after power on. This may take the form of a Hardware Power ON message on a large Enterprise server, or a "'" or "," in the case of an older Ultra system. These indications will not be present on a monitor connected directly to the server.

POST: If the PROM diag-switch? parameter is set to true, output from the POST (Power On Self Test) will be viewable on a serial terminal. The PROM diag-level parameter determines the extent of the POST tests. (See the Hardware Diagnostics page for more information on these settings.) If a serial terminal is not connected, a prtdiag -v will show the results of the POST once the system has booted. If a keyboard is connected, it will beep and the keyboard lights will flash during POST. If the POST fails, an error indication may be displayed following the failure.

Init System: The "Init System" process can be broken down into several discrete parts:

OBP: If diag-switch? is set, an Entering OBP message will be seen on a serial terminal. The MMU (memory management unit) is enabled.
NVRAM: If use-nvramrc? is set to true, read the NVRAMRC. This may contain information about boot devices, especially where the boot disk has been encapsulated with VxVM or DiskSuite.
Probe All: This includes checking for SCSI or other disk drives and devices.
Install Console: At this point, a directly connected monitor and keyboard will become active, or the serial port will become the system console access. If a keyboard is connected to the system, the lights will flash again during this step.
Banner: The PROM banner will be displayed. This banner includes a logo, system type, PROM revision level, the ethernet address, and the hostid.
Create Device Tree: The hardware device tree will be built. This device tree can be explored using PROM monitor commands at the ok> prompt, or by using prtconf once the system has been booted.

Extended Diagnostics: If diag-switch? and diag-level are set, additional diagnostics will appear on the system console.

auto-boot?: If the auto-boot? PROM parameter is set, the boot process will begin. Otherwise, the system will drop to the ok> PROM monitor prompt, or (if sunmon-compat? and security-mode are set) the > security prompt.

The boot process will use the boot-device and boot-file PROM parameters unless diag-switch? is set. In this case, the boot process will use the diag-device and diag-file.

bootblk: The OBP (Open Boot PROM) program loads the bootblk primary boot program from the boot-device (or diag-device, if diag-switch? is set). If the bootblk is not present or needs to be regenerated, it can be installed by running the installboot command after booting from a CDROM or the network. A copy of the bootblk is available at /usr/platform/`arch -k`/lib/fs/ufs/bootblk
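
As a sketch of the installboot step, the helper below only prints the command instead of running it, so that the device can be double-checked first. The function name installboot_cmd and the device /dev/rdsk/c0t0d0s0 used in the example are illustrative assumptions; substitute the raw device of the actual root slice.

```sh
# installboot_cmd RAW_DEVICE [PLATFORM]   (hypothetical helper)
# Print, but do not run, the installboot command that would write the
# UFS bootblk to RAW_DEVICE. PLATFORM defaults to the output of arch -k.
installboot_cmd() {
    rawdev=$1
    # arch -k is only invoked when no platform argument is supplied.
    platform=${2:-$(arch -k)}
    echo "installboot /usr/platform/$platform/lib/fs/ufs/bootblk $rawdev"
}
```

For example, installboot_cmd /dev/rdsk/c0t0d0s0 sun4u prints the command for a sun4u system. When booted from CDROM or the network, the bootblk path would normally be taken from the mounted root filesystem or the install media, so the printed path may need a prefix such as the mount point.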

ufsboot: The secondary boot program, /platform/`arch -k`/ufsboot is run. This program loads the kernel core image files. If this file is corrupted or missing, a bootblk: can't find the boot program or similar error message will be returned.

kernel: The kernel is loaded and run. For 32-bit Solaris systems, the relevant files are:

/platform/`arch -k`/kernel/unix
/kernel/genunix

For 64-bit Solaris systems, the files are:

/platform/`arch -k`/kernel/sparcv9/unix
/kernel/sparcv9/genunix

As part of the kernel loading process, the kernel banner is displayed to the screen. This includes the kernel version number (including patch level, if appropriate) and the copyright notice.

The kernel initializes itself and begins loading modules, reading the files with the ufsboot program until it has loaded enough modules to mount the root filesystem itself. At that point, ufsboot is unmapped and the kernel uses its own drivers. If the system complains about not being able to write to the root filesystem, it is stuck in this part of the boot process.

The boot -a command single-steps through this portion of the boot process. This can be a useful diagnostic procedure if the kernel is not loading properly.

/etc/system: The /etc/system file is read by the kernel, and the system parameters are set.

The following types of customization are available in the /etc/system file:

moddir: Changes path of kernel modules.
forceload: Forces loading of a kernel module.
exclude: Excludes a particular kernel module.
rootfs: Specify the filesystem type for the root file system. (ufs is the default.)
rootdev: Specify the physical device path for root.
set: Set the value of a tunable system parameter.

If the /etc/system file is edited, it is strongly recommended that a copy of the working file be made to a well-known location. In the event that the new /etc/system file renders the system unbootable, it might be possible to bring the system up with a boot -a command that specifies the old file. If this has not been done, the system may need to be booted from CD or network so that the file can be mounted and edited.
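
The backup recommendation can be sketched as a small helper. The name save_system_file and the destination /etc/system.good are assumptions chosen for illustration; any well-known path on the root filesystem would do.

```sh
# save_system_file SRC DEST   (hypothetical helper)
# Copy SRC to DEST, preserving mode and timestamps, and refuse to
# overwrite an existing DEST so that a known-good copy is never clobbered
# by a later, possibly broken, version.
save_system_file() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        echo "save_system_file: $dest already exists, not overwriting" >&2
        return 1
    fi
    cp -p "$src" "$dest"
}
```

Running save_system_file /etc/system /etc/system.good before editing leaves a copy that can be named when boot -a asks which system file to use.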

kernel initialized: The kernel creates PID 0 (sched). The sched process is sometimes called the "swapper."

init: The kernel starts PID 1 (init).

init: The init process reads the /etc/inittab and /etc/default/init and follows the instructions in those files.

Some of the entries in the /etc/inittab are:

fs: sysinit (usually /etc/rcS)
is: default init level (usually 3, sometimes 2)
s#: script associated with a run level (usually /sbin/rc#)
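
To see which default run level a given inittab requests, the initdefault entry can be pulled out with awk. This is a minimal sketch that assumes the usual id:rstate:action:process layout of inittab lines; default_runlevel is a hypothetical helper name.

```sh
# default_runlevel [INITTAB]   (hypothetical helper)
# Print the run level named by the first initdefault entry.
# INITTAB defaults to /etc/inittab.
default_runlevel() {
    inittab=${1:-/etc/inittab}
    # Fields are id:rstate:action:process. Comment lines are skipped.
    awk -F: '$1 !~ /^#/ && $3 == "initdefault" { print $2; exit }' "$inittab"
}
```

Passing a path makes it possible to inspect the inittab of a root filesystem mounted elsewhere, for example default_runlevel /a/etc/inittab after booting from CDROM (the /a mount point is an assumption).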

rc scripts: The rc scripts execute the files in the /etc/rc#.d directories. They are run by the /sbin/rc# scripts, each of which corresponds to a run level.

Debugging can often be done on these scripts by adding echo lines that print either an "I got this far" message or the value of a problematic variable.
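
A sketch of what such echo lines might look like follows. The trace function, the rc-debug prefix and the CONFIG variable are all made up for illustration; they are not part of any stock rc script.

```sh
# Hypothetical fragment to paste near the top of an rc script.
# trace prints its arguments with a fixed prefix so the debugging
# output is easy to pick out of the boot messages.
trace() {
    echo "rc-debug: $*"
}

trace "got this far: before the service is started"

CONFIG=/etc/example.conf    # hypothetical variable under suspicion
trace "CONFIG is '$CONFIG'"
```

Quoting the variable inside the message makes an empty value or stray whitespace visible. The trace lines should be removed once the problem has been found.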

Relocation

I'm seeing enough complaints about format incompatibilities with the old Princeton University pages that I am going to start relocating them to this blog over the next few weeks.

I don't have any control over the old templates, and I really don't have any control over the way that IE keeps re-interpreting how it will render what was formerly perfectly fine HTML. At least on the blog I can control the horizontal and the vertical.

Monday, March 25, 2013

From Techie to Boss: Transitioning to Leadership

If you liked Solaris Troubleshooting Handbook, please try my latest book, "From Techie to Boss."

This book is aimed at people who have recently been promoted to team lead or technical manager. I've tried to provide a good summary of advice for new managers. I welcome your comments!