Monday, June 24, 2013

Solaris Volume Manager (DiskSuite)

Solaris Volume Manager (formerly known as DiskSuite) provides a way to mirror, stripe or RAID-5 local disks. New functionality is constantly being added to the base software. A full discussion is beyond the scope of this article, so we will focus on the most common cases, how to set them up, how to manage them and how to maintain them. Additional information is available in the Solaris Volume Manager Administration Guide.

State Database

Solaris Volume Manager uses a state database to store its configuration and state information. (State information refers to the condition of the devices.) Multiple replicas are required for redundancy. At least four should be created on at least two different physical disk devices. It is much better to have at least six replicas on at least three different physical disks, spread across multiple controller channels, if possible.

In the event that the state databases disagree, a majority of configured state databases determines which version of reality is correct. This is why it is important to configure multiple replicas. A minimum of three database replicas must be available in order to boot without human assistance, so it makes sense to create database replicas liberally. They don't take up much space, and there is very little overhead associated with their maintenance. On JBOD (Just a Bunch Of Disks) arrays, I recommend at least two replicas on each disk device.

State database replicas consume between 4 and 16 MB of space, and should ideally be placed on a partition specifically set aside for that purpose. In the event that state database information is lost, it is possible to lose the data stored on the managed disks, so the database replicas should be spread over as much of the disk infrastructure as possible.

State database locations are recorded in /etc/opt/SUNWmd/mddb.cf. Depending on their condition, repair may or may not be possible. Metadevices (the objects which Solaris Volume Manager manipulates) may be placed on a partition with a state database if the state database is there first. The initial state databases can be created by specifying the slices on which they will live as follows:
metadb -a -f -c 2 slice-name1 slice-name2
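
To confirm that the replicas were created, list them; metadb's -i option also prints a legend explaining the status flags:
metadb -i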

Because database replicas cannot be placed on slices that are already in use, it is common to steal space from swap to create a small partition for the replicas. To do so, we boot to single-user mode, use swap -d to remove the swap area from use, and use format to re-partition the swap slice, freeing up space for a separate replica partition. Since the replicas are small, very few cylinders are required.
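
A minimal sketch of that procedure, assuming the swap slice is c0t0d0s1 and the freed space becomes a new slice 7 (device names are purely illustrative):

swap -d /dev/dsk/c0t0d0s1       (release the swap slice)
format                          (shrink slice 1 and create a small slice 7)
swap -a /dev/dsk/c0t0d0s1       (re-enable the resized swap slice)
metadb -a -f -c 2 c0t0d0s7      (place the initial replicas on the new slice)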

Metadevice Management

The basic types of metadevices are:
  • Simple: Stripes or concatenations--consist only of physical slices.
  • Mirror: Multiple copies on simple metadevices (submirrors).
  • RAID5: Composed of multiple slices; includes distributed parity.
  • Trans: Master metadevice plus logging device.

Solaris Volume Manager can build metadevices either by using partitions as the basic building blocks, or by dividing a single large partition into soft partitions. Soft partitions are a way that SVM allows us to carve a single disk into more than 8 slices. We can either build soft partitions directly on a disk slice, or we can mirror (or RAID) slices, then carve up the resulting metadevice into soft partitions to build volumes.
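
As a sketch (metadevice numbers and disk names are illustrative), a mirrored metadevice can be carved into soft partitions as follows:

metainit d11 1 1 c1t0d0s0       (submirror on the first disk)
metainit d12 1 1 c1t1d0s0       (submirror on the second disk)
metainit d10 -m d11             (one-way mirror)
metattach d10 d12               (attach the second submirror)
metainit d100 -p d10 2g         (2 GB soft partition carved from the mirror)
metainit d101 -p d10 5g         (5 GB soft partition carved from the mirror)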

Disksets are collections of disks that are managed together, in the same way that a Veritas Volume Manager (VxVM) disk group is managed together. Unlike in VxVM, SVM does not require us to explicitly specify a disk group. If Disksets are configured, we need to specify the set name for monitoring or management commands with a -s setname option. Disksets may be created as shared disksets, where multiple servers may be able to access them. (This is useful in an environment like Sun Cluster, for example.) In that case, we specify some hosts as mediators who determine who owns the diskset. (Note that disks added to shared disksets are re-partitioned in the expectation that we will use soft partitions.)

When metadevices need to be addressed by OS commands (like mkfs), we can reference them with device links of the form /dev/md/rdsk/d# or /dev/md/disksetname/rdsk/d#.

Here are the main command line commands within SVM:

Command        Description
metaclear      Deletes active metadevices and hot spare pools.
metadb         Manages state database replicas.
metadetach     Detaches a metadevice from a mirror or a logging device from a trans-metadevice.
metahs         Manages hot spares and hot spare pools.
metainit       Configures metadevices.
metaoffline    Takes submirrors offline.
metaonline     Places submirrors online.
metaparam      Modifies metadevice parameters.
metarename     Renames and switches metadevice names.
metareplace    Replaces slices of submirrors and RAID5 metadevices.
metaroot       Sets up system files for mirroring root.
metaset        Administers disksets.
metastat       Checks metadevice health and state.
metattach      Attaches a metadevice to a mirror or a log to a trans-metadevice.

Here is how to perform several common types of operations in Solaris Volume Manager:

Create state database replicas:
    metadb -a -f -c 2 c#t0d#s# c#t1d#s#

Mirror the root partition:
    Create a metadevice for the root partition:
        metainit -f d0 1 1 c#t0d#s#
    Create a metadevice for the root mirror partition:
        metainit d1 1 1 c#t1d#s#
    Set up a one-sided mirror:
        metainit d2 -m d0
    Edit the vfstab and system files:
        metaroot d2
        lockfs -fa
        reboot
    Attach the root mirror:
        metattach d2 d1

Mirror the swap partition:
    Create metadevices for the swap partition and its mirror:
        metainit -f d5 1 1 c#t0d#s#
        metainit -f d6 1 1 c#t1d#s#
    Set up a one-sided mirror:
        metainit d7 -m d5
    Attach the submirror to the mirror:
        metattach d7 d6
    Edit vfstab to mount the swap mirror as a swap device (use the root entry as a template).

Create a striped metadevice:
    metainit d# #stripes #slices-per-stripe c#t#d#s#...

Create a striped metadevice with a non-default interlace size:
    Add an -i interlace-size option (for example, -i 32k) to the metainit command.

Concatenate slices:
    metainit d# #slices 1 c#t#d#s# 1 c#t#d#s#...

Create a soft partition metadevice:
    metainit dnew# -p dsource# size

Create a RAID5 metadevice:
    metainit d# -r c#t#d#s# c#t#d#s# c#t#d#s#...

Manage hot spares:
    Create a hot spare pool:
        metainit hsp001 c#t#d#s#...
    Add a slice to a pool:
        metahs -a hsp### /dev/dsk/c#t#d#s#
    Add a slice to all pools:
        metahs -a all /dev/dsk/c#t#d#s#

Diskset management:
    Deport a diskset:
        metaset -s setname -r
    Import (take ownership of) a diskset:
        metaset -s setname -t -f
    Add hosts to a shared diskset:
        metaset -s setname -a -h hostname1 hostname2
    Add mediators to a shared diskset:
        metaset -s setname -a -m hostname1 hostname2
    Add devices to a shared diskset:
        metaset -s setname -a /dev/did/rdsk/d# /dev/did/rdsk/d#
    Check diskset status:
        metaset

Solaris Volume Manager Monitoring

Solaris Volume Manager provides facilities for monitoring its metadevices. In particular, the metadb command monitors the database replicas, and the metastat command monitors the metadevices and hot spares.

Status messages that may be reported by metastat for a disk mirror include:

  • Okay: No errors, functioning correctly.
  • Resyncing: Actively being resynced following error detection or maintenance.
  • Maintenance: I/O or open error; all reads and writes have been discontinued.
  • Last Erred: I/O or open errors encountered, but no other copies available.

Hot spare status messages reported by metastat are:

  • Available: Ready to accept failover.
  • In-Use: Other slices have failed onto this device.
  • Attention: Problem with hot spare or pool.

Solaris Volume Manager Maintenance

Solaris Volume Manager is very reliable. As long as it is not misconfigured, there should be relatively little maintenance to be performed on Volume Manager itself. If the Volume Manager database is lost, however, it may need to be rebuilt in order to recover access to the data. To recover a system configuration:
  1. Make a backup copy of /etc/opt/SUNWmd/md.cf.
  2. Re-create the state databases:
     metadb -a -f -c 2 c#t#d#s# c#t#d#s#
  3. Copy md.cf to md.tab.
  4. Edit the md.tab so that all mirrors are one-way mirrors and RAID5 devices are recreated with -k (to prevent re-initialization).
  5. Verify the md.tab configuration validity:
     metainit -n -a
  6. Re-create the configuration:
     metainit -a
  7. Re-attach any mirrors:
     metattach dmirror# dsubmirror#
  8. Verify that things are okay:
     metastat
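
As an illustration (metadevice and slice names are hypothetical), a one-way root mirror would appear in md.tab as:

d0 -m d1
d1 1 1 c0t0d0s0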

More frequently, Solaris Volume Manager will be needed to deal with replacing a failed piece of hardware.

To replace a disk which is reporting errors but has not yet failed (as in the "Replacing a Failing Disk" example below):

  1. Add database replicas to unaffected disks until at least three exist outside of the failing disk.
  2. Remove any replicas from the failing disk:
     metadb -d c#t#d#s#
  3. Detach and remove submirrors and hot spares on the failing disk from their mirrors and pools:
     metadetach dmirror# dsubmirror#
     metaclear -r dsubmirror#
     metahs -d hsp# c#t#d#s#
  4. If the boot disk is being replaced, find the /devices name of the boot disk mirror:
     ls -l /dev/rdsk/c#t#d#s0
  5. If the removed disk is a fibre channel disk, remove the /dev/dsk and /dev/rdsk links for the device.
  6. Physically replace the disk. This may involve shutting down the system if the disk is not hot-swappable.
  7. Re-build any /dev and /devices links with drvconfig; disks or boot -r.
  8. Format and re-partition the disk appropriately.
  9. Re-add any removed database replicas:
     metadb -a -c #databases c#t#d#s#
  10. Re-create and re-attach any removed submirrors:
      metainit dsubmirror# 1 1 c#t#d#s#
      metattach dmirror# dsubmirror#
  11. Re-create any removed hot spares.

Replacing a disk that has already failed (as in the "Replacing a Failed Disk" example below) is a similar procedure. The differences are:

  1. Remove database replicas and hot spares as above; the submirrors will not be removable.
  2. After replacing the disk as above, replace the submirrors with metareplace:
     metareplace -e dmirror# c#t#d#s#

Barring a misconfiguration, Solaris Volume Manager is a tremendous tool for increasing the reliability and redundancy of a server. More important, it allows us to postpone maintenance for a hard drive failure until the next maintenance window. The metastat tool is quite useful for identifying and diagnosing problems. Along with iostat -Ee, we can often catch problems before they reach the point where the disk has actually failed.

The first example below shows how to replace a failing (but not yet failed) mirrored disk. (In this case, we were able to hot-swap the disk, so no reboot was necessary. Since the disks were SCSI, we also did not need to remove or rebuild any /dev links.)

Replacing a Failing Disk with Solaris Volume Manager

# metastat
d0: Mirror
Submirror 0: d1
State: Okay
Submirror 1: d2
State: Okay
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 20484288 blocks

d1: Submirror of d0
State: Okay
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t0d0s0 0 No Okay


d2: Submirror of d0
State: Okay
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t1d0s0 0 No Okay
...
# iostat -E
sd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0707 Serial No: 3HZ...
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd1 Soft Errors: 593 Hard Errors: 28 Transport Errors: 1
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0707 Serial No: 3HZ...
Size: 73.40GB <73400057856 bytes>
Media Error: 24 Device Not Ready: 0 No Device: 1 Recoverable: 593
Illegal Request: 0 Predictive Failure Analysis: 1
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t0d0s3
a p luo 1050 1034 /dev/dsk/c0t0d0s3
a p luo 2084 1034 /dev/dsk/c0t0d0s3
a p luo 16 1034 /dev/dsk/c0t1d0s3
a p luo 1050 1034 /dev/dsk/c0t1d0s3
a p luo 2084 1034 /dev/dsk/c0t1d0s3

# metadb -d c0t1d0s3
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t0d0s3
a p luo 1050 1034 /dev/dsk/c0t0d0s3
a p luo 2084 1034 /dev/dsk/c0t0d0s3
# metadetach d40 d42
d40: submirror d42 is detached
# metaclear -r d42
d42: Concat/Stripe is cleared
...
# metadetach d0 d2
d0: submirror d2 is detached
# metaclear -r d2
d2: Concat/Stripe is cleared
...
[Disk hot-swapped. No reboot or device reconfiguration necessary for this replacement]
...
# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c0t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c0t1d0
/pci@1c,600000/scsi@2/sd@1,0
Specify disk (enter its number): 0
selecting c0t0d0
[disk formatted]


FORMAT MENU:
...
format> part


PARTITION MENU:
0 - change `0' partition
...
print - display the current table
label - write partition map and label to the disk
!<cmd> - execute <cmd>, then return
quit
partition> pr
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
...
partition> q


FORMAT MENU:
...
format> di


AVAILABLE DISK SELECTIONS:
0. c0t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c0t1d0
/pci@1c,600000/scsi@2/sd@1,0
Specify disk (enter its number)[0]: 1
selecting c0t1d0
[disk formatted]
format> part
...
[sd1 partitioned to match sd0's layout]
...
partition> 7
Part Tag Flag Cylinders Size Blocks
7 unassigned wm 0 0 (0/0/0) 0

Enter partition id tag[unassigned]:
Enter partition permission flags[wm]:
Enter new starting cyl[0]: 4835
Enter partition size[0b, 0c, 0.00mb, 0.00gb]: 9252c
partition> la
Ready to label disk, continue? y

partition> pr
Current partition table (unnamed):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
...
partition> q
...
# metadb -a -c 3 c0t1d0s3
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t0d0s3
a p luo 1050 1034 /dev/dsk/c0t0d0s3
a p luo 2084 1034 /dev/dsk/c0t0d0s3
a u 16 1034 /dev/dsk/c0t1d0s3
a u 1050 1034 /dev/dsk/c0t1d0s3
a u 2084 1034 /dev/dsk/c0t1d0s3
# metainit d2 1 1 c0t1d0s0
d2: Concat/Stripe is setup
# metattach d0 d2
d0: submirror d2 is attached
[Re-create and attach the remainder of the submirrors.]
...
# metastat
d0: Mirror
Submirror 0: d1
State: Okay
Submirror 1: d2
State: Resyncing
Resync in progress: 10 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 20484288 blocks

d1: Submirror of d0
State: Okay
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t0d0s0 0 No Okay


d2: Submirror of d0
State: Resyncing
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t1d0s0 0 No Okay

It is important to format the replacement disk to match the cylinder layout of the disk that is being replaced. If this is not done, mirrors and stripes will not rebuild properly.
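
A common way to do this is to copy the surviving disk's label onto the replacement with prtvtoc and fmthard. A sketch, assuming c0t0d0 is the surviving mirror and c0t1d0 is the new disk:

prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2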

When you replace a disk that has already failed, there is no ability to remove the submirrors. Instead, the metareplace -e command is used to re-sync the mirror onto the new disk.

Replacing a Failed Disk with Solaris Volume Manager

# iostat -E
...
sd1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 5
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7Z3CJ00007505
Size: 73.40GB <73400057856 bytes>
...
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c1t0d0s3
a p luo 1050 1034 /dev/dsk/c1t0d0s3
W p l 16 1034 /dev/dsk/c1t1d0s3
W p l 1050 1034 /dev/dsk/c1t1d0s3
a p luo 16 1034 /dev/dsk/c1t2d0s3
a p luo 1050 1034 /dev/dsk/c1t2d0s3
a p luo 16 1034 /dev/dsk/c1t3d0s3
a p luo 1050 1034 /dev/dsk/c1t3d0s3
# metadb -d /dev/dsk/c1t1d0s3
# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c1t1d0
/pci@1c,600000/scsi@2/sd@1,0
2. c1t2d0
/pci@1c,600000/scsi@2/sd@2,0
3. c1t3d0
/pci@1c,600000/scsi@2/sd@3,0
Specify disk (enter its number): 0
selecting c1t0d0
[disk formatted]


FORMAT MENU:
...
partition - select (define) a partition table
...
format> part


PARTITION MENU:
...
print - display the current table
...
partition> pr
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
1 swap wu 2013 - 2214 1003.69MB (202/0/0) 2055552
2 backup wm 0 - 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 2215 - 2217 14.91MB (3/0/0) 30528
4 unassigned wm 2218 - 5035 13.67GB (2818/0/0) 28675968
5 unassigned wm 5036 - 12080 34.18GB (7045/0/0) 71689920
6 var wm 12081 - 12684 2.93GB (604/0/0) 6146304
7 home wm 12685 - 14086 6.80GB (1402/0/0) 14266752

partition> q


FORMAT MENU:
disk - select a disk
...
format> di


AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c1t1d0
/pci@1c,600000/scsi@2/sd@1,0
2. c1t2d0
/pci@1c,600000/scsi@2/sd@2,0
3. c1t3d0
/pci@1c,600000/scsi@2/sd@3,0
Specify disk (enter its number)[0]: 1
format> part


PARTITION MENU:
...
partition> pr
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 25 129.19MB (26/0/0) 264576
1 swap wu 26 - 51 129.19MB (26/0/0) 264576
2 backup wu 0 - 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 usr wm 52 - 14086 68.10GB (14035/0/0) 142820160
7 unassigned wm 0 0 (0/0/0) 0

...
partition> 7
Part Tag Flag Cylinders Size Blocks
7 unassigned wm 0 0 (0/0/0) 0

Enter partition id tag[unassigned]: home
Enter partition permission flags[wm]:
Enter new starting cyl[0]: 12685
Enter partition size[0b, 0c, 0.00mb, 0.00gb]: 1402c
partition> pr
Current partition table (unnamed):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
1 swap wu 2013 - 2214 1003.69MB (202/0/0) 2055552
2 backup wu 0 - 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 2215 - 2217 14.91MB (3/0/0) 30528
4 unassigned wm 2218 - 5035 13.67GB (2818/0/0) 28675968
5 unassigned wm 5036 - 12080 34.18GB (7045/0/0) 71689920
6 var wm 12081 - 12684 2.93GB (604/0/0) 6146304
7 home wm 12685 - 14086 6.80GB (1402/0/0) 14266752

partition> la
Ready to label disk, continue? y

partition> q
...
# metastat
...
d19: Mirror
Submirror 0: d17
State: Okay
Submirror 1: d18
State: Needs maintenance
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 14266752 blocks

d17: Submirror of d19
State: Okay
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t0d0s7 0 No Okay


d18: Submirror of d19
State: Needs maintenance
Invoke: metareplace d19 c1t1d0s7
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t1d0s7 0 No Maintenance
...
# metareplace -e d19 c1t1d0s7
d19: device c1t1d0s7 is enabled
# metareplace -e d16 c1t1d0s6
d16: device c1t1d0s6 is enabled
# metareplace -e d13 c1t1d0s5
d13: device c1t1d0s5 is enabled
# metareplace -e d10 c1t1d0s4
d10: device c1t1d0s4 is enabled
# metareplace -e d2 c1t1d0s0
d2: device c1t1d0s0 is enabled
# metastat
...
d19: Mirror
Submirror 0: d17
State: Okay
Submirror 1: d18
State: Resyncing
Resync in progress: 10 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 14266752 blocks

d17: Submirror of d19
State: Okay
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t0d0s7 0 No Okay


d18: Submirror of d19
State: Resyncing
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t1d0s7 0 No Resyncing
...
# metadb -a -c 2 c1t1d0s3
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c1t0d0s3
a p luo 1050 1034 /dev/dsk/c1t0d0s3
a u 16 1034 /dev/dsk/c1t1d0s3
a u 1050 1034 /dev/dsk/c1t1d0s3
a p luo 16 1034 /dev/dsk/c1t2d0s3
a p luo 1050 1034 /dev/dsk/c1t2d0s3
a p luo 16 1034 /dev/dsk/c1t3d0s3
a p luo 1050 1034 /dev/dsk/c1t3d0s3

Thursday, June 20, 2013

inittab

The /etc/inittab file plays a crucial role in the boot sequence.

For versions of Solaris prior to version 10, the /etc/inittab was edited manually. Solaris 10+ manages the /etc/inittab through SMF. The Solaris 10 inittab should not be edited directly.

The default Solaris 10 inittab contains the following:

  1. ap::sysinit:/sbin/autopush -f /etc/iu.ap
  2. sp::sysinit:/sbin/soconfig -f /etc/sock2path
  3. smf::sysinit:/lib/svc/bin/svc.startd >/dev/msglog 2<>/dev/msglog
  4. p3:s1234:powerfail:/usr/sbin/shutdown -y -i5 -g0 >/dev/msglog 2<>/dev/...

The lines accomplish the following:

  1. Initializes Streams
  2. Configures socket transport providers
  3. Initializes SMF master restarter
  4. Describes a power fail shutdown

In particular, the initdefault keyword is no longer used in Solaris 10. Instead, the default run level is determined within the SMF profile.

When the init process is started, it first sets the environment variables defined in the /etc/default/init file; by default, only TIMEZONE is set. Then init executes process entries from the inittab that have sysinit set, and transfers control of the startup process to svc.startd.
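
As a sketch, the SMF equivalent of initdefault is the default boot milestone, which can be set persistently with svcadm (the FMRI shown is the standard multi-user-server milestone):

svcadm milestone -d svc:/milestone/multi-user-server:default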

Solaris 8 and 9

The line entries in the inittab file have the following format:

id:runlevel:action:process

Here the id is a two-character unique identifier, runlevel indicates the run level involved, action indicates how the process is to be run, and process is the command to be executed.

At boot time, all entries with runlevel "sysinit" are run. Once these processes are run, the system moves towards the init level indicated by the "initdefault" line. For a default inittab, the line is:

is:3:initdefault:

(This indicates a default runlevel of 3.)

By default, the first script run from the inittab file is /sbin/bcheckrc, which checks the state of the root and /usr filesystems. The line controlling this script has the following form:

fs::sysinit:/sbin/bcheckrc >/dev/console 2>&1 </dev/console

The inittab also controls what happens at each runlevel. For example, the default entry for runlevel 2 is:

s2:23:wait:/sbin/rc2 >/dev/console 2>&1 </dev/console

The action field of each entry will contain one of the following keywords:

  • powerfail: The system has received a "powerfail" signal.
  • wait: Wait for the command to be completed before proceeding.
  • respawn: Restart the command whenever it terminates.

Wednesday, June 19, 2013

Veritas Volume Manager Notes

Veritas has long since been purchased by Symantec, but its products continue to be sold under the Veritas name. Over time, we can expect that some of the products will have name changes to reflect the new ownership.

Veritas produces volume and file system software that allows for extremely flexible and straightforward management of a system's disk storage resources. Now that ZFS is providing much of this same functionality from inside the OS, it will be interesting to see how well Veritas is able to hold on to its installed base.

In Veritas Volume Manager (VxVM) terminology, physical disks are assigned a diskname and imported into collections known as disk groups. Physical disks are divided into a potentially large number of arbitrarily sized, contiguous chunks of disk space known as subdisks. These subdisks are combined into volumes, which are presented to the operating system in the same way as a slice of a physical disk is.

Volumes can be striped, mirrored or RAID-5'ed. Mirrored volumes are made up of equally-sized collections of subdisks known as plexes. Each plex is a mirror copy of the data in the volume. The Veritas File System (VxFS) is an extent-based file system with advanced logging, snapshotting, and performance features.

VxVM provides dynamic multipathing (DMP) support, which means that it takes care of path redundancy where it is available. If new paths or disk devices are added, one of the steps to be taken is to run vxdctl enable to scan the devices, update the VxVM device list, and update the DMP database. In cases where we need to override DMP support (usually in favor of an alternate multipathing software like EMC Powerpath), we can run vxddladm addforeign.

Here are some procedures to carry out several common VxVM operations. VxVM has a Java-based GUI interface as well, but I always find it easiest to use the command line.

Standard VxVM Operations

Create a volume (length specified in sectors, KB, MB or GB):
    vxassist -g dg-name make vol-name length[s|k|m|g]

Create a striped volume (add layout options to the vxassist command line):
    layout=stripe diskname1 diskname2 ...

Remove a volume (after unmounting it and removing it from vfstab):
    vxvol stop vol-name
    then
    vxassist -g dg-name remove volume vol-name
    or
    vxedit -rf rm vol-name

Create a VxFS file system:
    mkfs -F vxfs -o largefiles /dev/vx/rdsk/dg-name/vol-name

Snapshot a VxFS file system to an empty volume:
    mount -F vxfs -o snapof=orig-vol empty-vol mount-point

Display disk group free space:
    vxdg -g dg-name free

Display the maximum size volume that can be created:
    vxassist -g dg-name maxsize [attributes]

List physical disks:
    vxdisk list

Print the VxVM configuration:
    vxprint -ht

Add a disk to VxVM:
    vxdiskadm (follow the menu prompts)
    or
    vxdiskadd disk-name

Bring newly attached disks under VxVM control (it may be necessary to use format or fmthard to label the disk before running vxdiskconfig):
    drvconfig; disks
    vxdiskconfig
    vxdctl enable

Scan devices, update the VxVM device list, and reconfigure DMP:
    vxdctl enable

Scan devices in the OS device tree and initiate dynamic reconfiguration of multipathed disks:
    vxdisk scandisks

Reset a disabled vxconfigd daemon:
    vxconfigd -kr reset

Manage hot spares:
    vxdiskadm (follow the menu options and prompts)
    vxedit set spare=[off|on] vxvm-disk-name

Rename disks:
    vxedit rename old-disk-name new-disk-name

Rename subdisks:
    vxsd mv old-subdisk-name new-subdisk-name

Monitor volume performance:
    vxstat

Re-size a volume (but not the file system):
    vxassist growto|growby|shrinkto|shrinkby volume-name length[s|m|k|g]

Resize a volume, including the file system:
    vxresize -F vxfs volume-name new-size[s|m|k|g]

Change a volume's layout:
    vxassist relayout volume-name layout=layout

The progress of many VxVM tasks can be tracked by setting the -t flag at the time the command is run: utility -t tasktag. If a task tag is set, we can use vxtask to list, monitor, pause, resume, abort or set the task labeled by that tasktag.

Physical disks which are added to VxVM control can either be initialized (made into a native VxVM disk) or encapsulated (disk slice/partition structure is preserved). In general, disks should only be encapsulated if there is data on the slices that needs to be preserved, or if it is the boot disk. (Boot disks must be encapsulated.) Even if there is data currently on a non-boot disk, it is best to back up the data, initialize the disk, create the file systems, and restore the data.

When a disk is initialized, the VxVM-specific information is placed in a reserved location on the disk known as a private region. The public region is the portion of the disk where the data will reside.

VxVM disks can be added as one of several different categories of disks:

  • sliced: Public and private regions are on separate physical partitions. (Usually s3 is the private region and s4 is the public region, but encapsulated boot disks are the reverse.)
  • simple: Public and private regions are on the same disk area.
  • cdsdisk: (Cross-Platform Data Sharing) This is the default, and allows disks to be shared across OS platforms. This type is not suitable for boot, swap or root disks.
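
For example (the disk name is illustrative), a disk can be initialized in a particular format with vxdisksetup:

/etc/vx/bin/vxdisksetup -i c1t4d0 format=sliced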

If there is a VxFS license for the system, as many file systems as possible should be created as VxFS file systems to take advantage of VxFS's logging, performance and reliability features.

At the time of this writing, ZFS is not an appropriate file system for use on top of VxVM volumes. Sun warns that running ZFS on VxVM volumes can cause severe performance penalties, and that it is possible that ZFS mirrors and RAID sets would be laid out in a way that compromises reliability.

VxVM Maintenance

The first step in any VxVM maintenance session is to run vxprint -ht to check the state of the devices and configurations for all VxVM objects. (A specific volume can be specified with vxprint -ht volume-name.) This section includes a list of procedures for dealing with some of the most common problems. (Depending on the naming scheme of a VxVM installation, many of the below commands may require a -g dg-name option to specify the disk group.)

  • Volumes which are not starting up properly will be listed as DISABLED or DETACHED. A volume recovery can be attempted with the vxrecover -s volume-name command.
  • If all plexes of a mirror volume are listed as STALE, place the volume in maintenance mode, view the plexes and decide which plex to use for the recovery:
    vxvol maint volume-name (The volume state will be DETACHED.)
    vxprint -ht volume-name
    vxinfo volume-name (Display additional information about unstartable plexes.)
    vxmend off plex-name (Offline bad plexes.)
    vxmend on plex-name (Online a plex as STALE rather than DISABLED.)
    vxvol start volume-name (Revive stale plexes.)
    vxplex att volume-name plex-name (Recover a stale plex.)
  • If, after the above procedure, the volume still is not started, we can force a plex to a “clean” state. If the plex is in a RECOVER state and the volume will not start, use a -f option on the vxvol command:
    vxmend fix clean plex-name
    vxvol start volume-name
    vxplex att volume-name plex-name
  • If a subdisk status is listed as NDEV even when the disk is shown as available by vxdisk list, the problem can sometimes be resolved by running
    vxdg deport dgname; vxdg import dgname
    to re-initialize the disk group.
  • To remove a disk:
    Copy the data elsewhere if possible.
    Unmount file systems from the disk or unmirror plexes that use the disk.
    vxvol stop volume-name (Stop volumes on the disk.)
    vxdg -g dg-name rmdisk disk-name (Remove disk from its disk group.)
    vxdisk offline disk-name (Offline the disk.)
    vxdiskunsetup c#t#d# (Remove the disk from VxVM control.)
  • To replace a failed disk other than the boot disk:
    In vxdiskadm, choose option 4: Remove a disk for replacement. When prompted for a replacement disk, choose “none” so that the replacement can be deferred.
    Physically remove and replace the disk. (A reboot may be necessary if the disk is not hot-swappable.) In the case of a fibre channel disk, it may be necessary to remove the /dev/dsk and /dev/rdsk links and rebuild them with
    drvconfig; disks
    or a reconfiguration reboot.
    In vxdiskadm, choose option 5: Replace a failed or removed disk. Follow the prompts and replace the disk with the appropriate disk.
  • To replace a failed boot disk:
    Use the eeprom command at the root prompt or the printenv command at the ok> prompt to make sure that the device aliases (in nvramrc) and the boot-device parameter are set to allow a boot from the mirror of the boot disk. If the boot paths are not set up properly for both mirrors of the boot disk, it may be necessary to move the mirror disk physically to the boot disk's location. Alternatively, the devalias command at the ok> prompt can set the mirror disk path correctly, with nvstore writing the change to the nvram. (It is sometimes necessary to nvunalias aliasname to remove an alias from the nvramrc, then
    nvalias aliasname devicepath
    to set the new alias, then
    nvstore
    to write the changes to nvram.)
    In short, set up the system so that it will boot from the boot disk's mirror.
    Repeat the steps above to replace the failed disk.
  • Clearing a "Failing" Flag from a Disk:
    First make sure that there really is not a hardware problem, or that the problem has been resolved. Then,
    vxedit set failing=off disk-name
  • Clearing an IOFAIL state from a Plex:
    First make sure that the hardware problem with the plex has been resolved. Then,
    vxmend -g dgname -o force off plexname
    vxmend -g dgname on plexname
    vxmend -g dgname fix clean plexname
    vxrecover -s volname

VxVM Resetting Plex State

soltest/etc/vx > vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
SC NAME PLEX CACHE DISKOFFS LENGTH [COL/]OFF DEVICE MODE
DC NAME PARENTVOL LOGVOL
SP NAME SNAPVOL DCO
EX NAME ASSOC VC PERMS MODE STATE
SR NAME KSTATE
v vol53 - DISABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 DISABLED IOFAIL 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA
soltest/etc/vx > vxmend -g testdg -o force off vol53-01
soltest/etc/vx > vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
v vol53 - DISABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 DISABLED OFFLINE 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA
soltest/etc/vx > vxmend -g testdg on vol53-01
soltest/etc/vx > vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
v vol53 - DISABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 DISABLED STALE 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA
soltest/etc/vx > vxmend -g testdg fix clean vol53-01
soltest/etc/vx > !vxprint
vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
v vol53 - DISABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 DISABLED CLEAN 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA
soltest/etc/vx > vxrecover -s vol53
soltest/etc/vx > !vxprint
vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
v vol53 - ENABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 ENABLED ACTIVE 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA

VxVM Mirroring

Most volume manager availability configuration is centered around mirroring. While RAID-5 is a possible option, it is infrequently used due to the parity calculation overhead and the relatively low cost of hardware-based RAID-5 devices.

In particular, the boot device must be mirrored; it cannot be part of a RAID-5 configuration. To mirror the boot disk:

  • eeprom use-nvramrc?=true
    Before mirroring the boot disk, set use-nvramrc? to true in the EEPROM settings. If you forget, you will have to go in and manually set up the boot path for your boot mirror disk. (See “To replace a failed boot disk” in the “VxVM Maintenance” section for the procedure.) It is much easier if you set the parameter properly before mirroring the disk!
  • The boot disk must be encapsulated, preferably in the bootdg disk group. (The bootdg disk group membership used to be required for the boot disk. It is still a standard, and there is no real reason to violate it.)
  • If possible, the boot mirror should be cylinder-aligned with the boot disk. (This means that the partition layout should be the same as that for the boot disk.) It is preferred that 1-2MB of unpartitioned space be left at either the very beginning or the very end of the cylinder list for the VxVM private region. Ideally, slices 3 and 4 should be left unconfigured for VxVM's use as its public and private region. (If the cylinders are aligned, it will make OS and VxVM upgrades easier in the future.)
  • (Before bringing the boot mirror into the bootdg disk group, I usually run an installboot command on that disk to install the boot block in slice 0; a sample invocation appears after this list. This should no longer be necessary; vxrootmir should take care of this for us. I have run into circumstances in the past where vxrootmir has not set up the boot block properly; Veritas reports that those bugs have long since been fixed.)
  • Mirrors of the root disk must be configured with "sliced" format and should live in the bootdg disk group. They cannot be configured with cdsdisk format. If necessary, remove the disk and re-add it in vxdiskadm.
  • In vxdiskadm, choose option 6: Mirror Volumes on a Disk. Follow the prompts from the utility. It will call vxrootmir under the covers to take care of the boot disk setup portion of the operation.
  • When the process is done, attempt to boot from the boot mirror. (Check the EEPROM devalias settings to see which device alias has been assigned to the boot mirror, and run boot device-alias from the ok> prompt.)
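
A sample installboot invocation for a SPARC boot mirror (the disk name is illustrative):

installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0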

Procedure to create a Mirrored-Stripe Volume (a mirrored-stripe volume mirrors several striped plexes; it is generally better to set up a Striped-Mirror Volume instead):

  • vxassist -g dg-name make volume length layout=mirror-stripe

Creating a Striped-Mirror Volume (striped-mirror volumes are layered volumes which stripe across underlying mirror volumes):

  • vxassist -g dg-name make volume length layout=stripe-mirror

Removing a plex from a mirror:

  • vxplex -g dg-name -o rm dis plex-name

Removing a mirror from a volume:

  • vxassist -g dg-name remove mirror volume-name

Removing a mirror and all associated subdisks:

  • vxplex -o rm dis plex-name

Dissociating a plex from a mirror (to provide a snapshot):

  • vxplex dis plex-name
  • vxmake -U gen vol new-volume-name plex=plex-name (Creating a new volume with a dissociated plex.)
  • vxvol start new-volume-name
  • vxvol stop new-volume-name (To re-associate this plex with the old volume.)
  • vxplex dis plex-name
  • vxplex att old-volume-name plex-name
  • vxedit rm new-volume-name

Removing a Root Disk Mirror:

  • vxplex -o rm dis rootvol-02 swapvol-02 [other root disk volumes]
  • /etc/vx/bin/vxunroot

Tuesday, June 18, 2013

Recovery Strategies

Besides cost, the key business continuity drivers for a recovery solution are the Recovery Point Objective and the Recovery Time Objective.

Recovery Point Objective

The Recovery Point Objective (RPO) refers to the recovery point in time. Another way to think of this is that the RPO specifies the maximum allowable time delay between a data commit on the production side and the replication of this data to the recovery site.

It is probably easiest to think of RPO in terms of the amount of allowable data loss. The RPO is frequently expressed in terms of its relation to the time at which replication stops, as in “less than 5 minutes of data loss.”

Recovery Time Objective

The second major business driver is the Recovery Time Objective (RTO). This is the amount of time it will take us to recover from a disaster. Depending on the context, this may refer only to the technical steps required to bring up services on the recovery system. Usually, however, it refers to the amount of time that the service will be unavailable, including time to discover that an outage has occurred, the time required to decide to fail over, the time to get staff in place to perform the recovery, and then the amount of time to bring up services at the recovery site.

The costs associated with different RPO and RTO values will be determined by the type of application and its business purpose. Some applications may be able to tolerate unplanned outages of up to days without incurring substantial costs. Other applications may cause significant business-side problems with even minor amounts of unscheduled downtime.

Different applications and environments have different tolerances for RPO and RTO. Some applications might be able to tolerate a potential data loss of days or even weeks; some may not be able to tolerate any data loss at all. Some applications can remain unavailable long enough for us to purchase a new system and restore from tape; some cannot.

Recovery Strategies

There are several different strategies for recovering an application. Choosing a strategy will almost always involve an investment in hardware, software, and implementation time. If a strategy is chosen that does not support the business RPO and RTO requirements, an expensive re-tooling may be necessary.

Many types of replication solutions can be implemented at a server, disk storage, or storage network level. Each has unique advantages and disadvantages. Server replication tends to be cheapest, but also involves using server cycles to manage the replication. Storage network replication is extremely flexible, but can be more difficult to configure. Disk storage replication tends to be rock solid, but is usually limited in terms of supported hardware for the replication target.

Regardless of where we choose to implement our data replication solution, we will still face a lot of the same issues. One issue that needs to be addressed is re-silvering of a replication solution that has been partitioned for some amount of time. Ideally, only the changed sections of the disks will need to be re-replicated. Some less sophisticated solutions require a re-silvering of the entire storage area, which can take a long time and soak up a lot of bandwidth. Re-silvering is an issue that needs to be investigated during the product evaluation.

Continuity Planning

Continuity planning should be done during the initial architecture and design phases for each service. If the service is not designed to accommodate a natural recovery, it will be expensive and difficult to retrofit a recovery mechanism.

The type of recovery that is appropriate for each service will depend on the importance of the service and what the tolerance for downtime is for that service.

There are five generally-recognized approaches to recovery architecture:

  • Server Replacement: Some services are run on standard server images with very little local customization. Such servers may most easily be recovered by replacing them with standard hardware and standard server images.
  • Backup and Restore: Where there is a fair amount of tolerance for downtime on a service, it may be acceptable to rely on hardware replacement combined with restores from backups.
  • Shared Nothing Failover: Some services are largely data-independent and do not require frequent data replication. In such cases, it might make sense to have an appropriately configured replacement at a recovery site. (One example may be an application server that pulls its data from a database. Aside from copying configuration changes, replication of the main server may not be necessary.)
  • Replication and Failover: Several different replication technologies exist, each with different strengths and weaknesses. Array-based, SAN-based, file system-based or file-based technologies allow replication of data on a targeted basis. Synchronous replication techniques prevent data loss at the cost of performance and geographic dispersion. Asynchronous replication techniques permit relatively small amounts of data loss in order to preserve performance or allow replication across large distances. Failover techniques range from nearly instantaneous automated solutions to administrator-invoked scripts to involved manual checklists.
  • Live Active-Active Stretch Clusters: Some services can be provided by active servers in multiple locations, where failover happens by client configurations. Some examples include DNS services (failover by resolv.conf lists), SMTP gateway servers (failover by MX record), web servers (failover by DNS load balancing), and some market data services (failover by client configuration). Such services should almost never be down. (Stretch clusters are clusters where the members are located at geographically dispersed locations.)

Which of these recovery approaches is appropriate to a given situation will depend on the cost of downtime on the service, as well as the particular characteristics of the service's architecture.

Causes of Recovery Failure

Janco released a study outlining the most frequent causes of a recovery failure:
  • Failure of the backup or replication solution. If a copy of the data is not available, we will not be able to recover.
  • Unidentified failure modes. The recovery plan does not cover a type of failure.
  • Failure to train staff in recovery procedure. If people don't know how to carry out the plan, the work is wasted.
  • Lack of a communication plan. How do you communicate when your usual infrastructure is not available?
  • Insufficient backup power. Do you have enough capacity? How long will it run?
  • Failure to prioritize. What needs to be restored first? If you don't lay that out in advance, you will waste valuable time on recovering less critical services.
  • Unavailable disaster documentation. If your documentation is only available on the systems that have failed, you are stuck. Keep physical copies available in recovery locations.
  • Inadequate testing. Tests reveal weaknesses in the plan and also train staff to deal with a recovery situation in a timely way.
  • Unavailable passwords or access. If the recovery team does not have the permissions necessary to carry out the recovery, it will fail.
  • Plan is out of date. If the plan is not updated to reflect changes in the environment, the recovery will not succeed.

Recovery Business Practices

Janco also suggested several key business practices to improve the likelihood that you will survive a recovery:
  • Eliminate single points of failure.
  • Regularly update staff contact information, including assigned responsibilities.
  • Stay abreast of current events, such as weather and other emergency situations.
  • Plan for the worst case.
  • Document your plans and keep updated copies available in well-known, available locations.
  • Script what you can, and test your scripts.
  • Define priorities and thresholds.
  • Perform regular tests and make sure you can meet your RTO and RPO requirements.

Monday, June 17, 2013

PROM Environment Variables

PROM environment variables can be set at either the root user prompt or the ok> prompt.

Most (but not all) PROM environment variables can be set with the /usr/sbin/eeprom command. When invoked by itself, it prints out the current environment variables. To use eeprom to set a variable, use the syntax:
/usr/sbin/eeprom variable_name=value

All PROM environment variables can be set at the ok> prompt. The printenv command prints out the current settings. The syntax for setting a variable is:
setenv variable_name value
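
For example, using the standard auto-boot? variable (the value shown is only an illustration):

/usr/sbin/eeprom 'auto-boot?=false'     (from a root shell)
setenv auto-boot? false                 (at the ok> prompt)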

Friday, June 14, 2013

System Memory Usage

mdb can be used to provide significant information about system memory usage. In particular, the ::memstat dcmd, and the leak and leakbuf walkers may be useful.

  • ::memstat displays a memory usage summary.
  • walk leak finds leaks with the same stack trace as a leaked bufctl or vmem_seg.
  • walk leakbuf walks buffers for leaks with the same stack trace as a leaked bufctl or vmem_seg.

memstat

> ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 31563 246 12%
Anon 1523 11 1%
Exec and libs 416 3 0%
Page cache 70 0 0%
Free (cachelist) 78487 613 30%
Free (freelist) 146828 1147 57%
Total 258887 2022
Physical 254998 1992

In addition, there are several functions of interest that can be monitored by DTrace:

Memory Functions

Function                Description
page_exists() Tests for a page with a given vnode and offset.
page_find() Searches the hash list for a locked page that is known to have a given vnode and offset.
page_first() Finds the first page on the global page hash list.
page_free() Frees a page. If it has a vnode and offset, sent to the cachelist, otherwise sent to the freelist.
page_ismod() Checks whether a page has been modified.
page_isref() Checks whether a page has been referenced.
page_lock() Lock a page structure.
page_lookup() Find a page with the specified vnode and offset. If found on a free list, it will be moved from the freelist.
page_lookup_nowait() Finds a page representing the specified vnode and offset that is not locked and is not on the freelist.
page_needfree() Notifies the VM system that pages need to be freed.
page_next() Next page on the global hash list.
page_release() Unlock a page structure after unmapping it. Place it back on the cachelist if appropriate.
page_unlock() Unlock a page structure.
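
As a sketch of how such monitoring might look (this assumes the fbt provider exposes page_free() on your kernel build), the following one-liner counts calls to page_free() by process name:

dtrace -n 'fbt::page_free:entry { @[execname] = count(); }'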

Kernel Memory Usage

Solaris kernel memory is used to provide space for kernel text, data and data structures. Most of the kernel's memory is nailed down and cannot be swapped.

For UltraSPARC and x64 systems, Solaris locks a translation mapping into the MMU's translation lookaside buffer (TLB) for the first 4MB of the kernel's text and data segments. By using large pages in this way, the number of kernel-related TLB entries is reduced, leaving more buffer resources for user code. This has resulted in tremendously improved performance for these environments.

When memory is allocated by the kernel, it is typically not released to the freelist unless a severe system memory shortfall occurs. If this happens, the kernel relinquishes any unused memory.

The kernel allocates memory to itself via the slab/kmem and vmem allocators. (A discussion of the internals of the allocators is beyond the scope of this book, but Chapter 11 of McDougall and Mauro discusses the allocators in detail.)

The kernel memory statistics can be tracked using sar -k, and probed using mdb's ::kmastat dcmd for an overall view of kernel memory allocation. The kstat utility allows us to examine a particular cache. Truncated versions of ::kmastat and kstat output are demonstrated here:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba fcp fctl nca lofs zfs random logindmux ptm cpc fcip sppp crypto nfs ]
> ::kmastat
cache                      buf    buf    buf    memory     alloc alloc
name                      size in use  total    in use   succeed  fail
------------------------- ------ ------ ------ --------- --------- -----
kmem_magazine_1 16 274 1016 16384 4569 0
...
bp_map_131072 131072 0 0 0 0 0
memseg_cache 112 0 0 0 0 0
mod_hash_entries 24 187 678 16384 408634 0
...
thread_cache 792 157 170 139264 75907 0
lwp_cache 904 157 171 155648 11537 0
turnstile_cache 64 299 381 24576 86758 0
cred_cache 148 50 106 16384 42752 0
rctl_cache 40 586 812 32768 541859 0
rctl_val_cache 64 1137 1651 106496 1148726 0
...
ufs_inode_cache 368 18526 102740 38256640 275296 0
...
process_cache 3040 38 56 172032 38758 0
...
zfs_znode_cache 192 0 0 0 0 0
------------------------- ------ ------ ------ --------- --------- -----
Total [static] 221184 150707 0
Total [hat_memload] 7397376 8417187 0
Total [kmem_msb] 1236992 362278 0
Total [kmem_va] 42991616 8893 0
Total [kmem_default] 152576000 112494417 0
Total [bp_map] 524288 3387 0
Total [kmem_tsb_default] 319488 83391 0
Total [hat_memload1] 245760 229486 0
Total [segkmem_ppa] 16384 127 0
Total [umem_np] 1048576 11204 0
Total [segkp] 11010048 30423 0
Total [pcisch2_dvma] 458752 8891868 0
Total [pcisch1_dvma] 98304 11 0
Total [ip_minor_arena] 64 13299 0
Total [spdsock] 64 1 0
Total [namefs_inodes] 64 21 0
------------------------- ------ ------ ------ --------- --------- -----
vmem                    memory    memory    memory     alloc alloc
name                    in use     total    import   succeed  fail
------------------------- --------- ---------- --------- --------- -----
heap 1099614298112 4398046511104 0 20207 0
vmem_metadata 6619136 6815744 6815744 752 0
vmem_seg 5578752 5578752 5578752 681 0
vmem_hash 722560 729088 729088 46 0
vmem_vmem 295800 346096 311296 106 0
...
ibcm_local_sid 0 4294967295 0 0 0
------------------------- --------- ---------- --------- --------- -----
> $Q
# kstat -n process_cache
module: unix instance: 0
name: process_cache class: kmem_cache
align 8
alloc 38785
alloc_fail 0
buf_avail 18
buf_constructed 12
buf_inuse 38
buf_max 64
buf_size 3040
buf_total 56
chunk_size 3040
crtime 28.796560304
depot_alloc 2955
depot_contention 0
depot_free 2965
empty_magazines 0
free 38811
full_magazines 3
hash_lookup_depth 1
hash_rescale 0
hash_size 64
magazine_size 3
slab_alloc 104
slab_create 9
slab_destroy 2
slab_free 54
slab_size 24576
snaptime 1233645.2648315
vmem_source 23

Enabling Kernel Memory Allocator Debug Flag

Certain aspects of kernel memory allocator debugging only become available if the debug flags are enabled in kmdb at boot time, as demonstrated below:
ok boot kmdb -d
Loading kmdb...
Welcome to kmdb
[0]> kmem_flags/W 0x1f
kmem_flags: 0x0 = 0x1f
[0]> :c

If the system crashes while kmdb is loaded, it will drop to the kmdb prompt rather than the PROM monitor prompt. (This is intended to allow debugging to continue in the wake of a crash.) This is probably not the desired state for a production system, so it is recommended that kmdb be unloaded once debugging is complete.

0x1f sets all KMA flags. Individual flags can be set instead by using different values, but I have never run across a situation when it wasn't better to just have them all enabled.
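
To confirm the setting took effect after boot, the variable can be read back from the live kernel (a quick sketch using mdb):

echo "kmem_flags/X" | mdb -k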

Tuesday, June 11, 2013

Swap

The Solaris virtual memory system combines physical memory with available swap space via swapfs. If insufficient total virtual memory space is provided, new processes will be unable to start.

Swap space can be added, deleted or examined with the swap command. swap -l reports total and free space for each of the swap partitions or files that are available to the system. Note that this number does not reflect total available virtual memory space, since physical memory is not reflected in the output. swap -s reports the total available amount of virtual memory, as does sar -r.
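
For example (the file path and size are illustrative), swap space can be added from a file and later removed:

mkfile 1g /export/swapfile      (create the swap file)
swap -a /export/swapfile        (add it as a swap area)
swap -l                         (verify)
swap -d /export/swapfile        (remove it when no longer needed)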

If swap is mounted on /tmp via tmpfs, df -k /tmp will report on total available virtual memory space, both swap and physical. As large memory allocations are made, the amount of space available to tmpfs will decrease, meaning that the utilization percentages reported by df will be of limited use.

The DTrace Toolkit's swapinfo.d program prints out a summary of how virtual memory is currently being used:

Virtual Memory Summary

# /opt/DTT/Bin/swapinfo.d
RAM _______Total 2048 MB
RAM Unusable 25 MB
RAM Kernel 564 MB
RAM Locked 2 MB
RAM Used 189 MB
RAM Free 1266 MB

Disk _______Total 4004 MB
Disk Resv 69 MB
Disk Avail 3935 MB

Swap _______Total 5207 MB
Swap Resv 69 MB
Swap Avail 5138 MB
Swap (Minfree) 252 MB

Swapping

If the system is consistently below desfree of free memory (over a 30 second average), the memory scheduler will start to swap out processes. (That is, if both avefree and avefree30 are less than desfree, the swapper begins to look at processes.) Initially, the scheduler will look for processes that have been idle for maxslp seconds. (maxslp defaults to 20 seconds and can be tuned in /etc/system.) This swapping mode is known as soft swapping.

Swapping priorities are calculated for an LWP by the following formula:
epri = swapin_time - rss/(maxpgio/2) - pri
where swapin_time is the time since the thread was last swapped, rss is the amount of memory used by the LWPs process, and pri is the thread's priority.

If, in addition to being below desfree of free memory, there are two processes in the run queue and paging activity exceeds maxpgio, the system will commence hard swapping. In this state, the kernel unloads all modules and cache memory that is not currently active and starts swapping out processes sequentially until desfree of free memory is available.

Processes are not eligible for swapping if they are:

  • In the SYS or RT scheduling class.
  • Being executed or stopped by a signal.
  • Exiting.
  • Zombie.
  • A system thread.
  • Blocking a higher priority thread.

The DTrace Toolkit provides the anonpgpid.d script to attempt to identify the processes which are suffering the most when the system is hard swapping. While this may be interesting, if we are hard-swapping, we need to kill the culprit, not identify the victims. We are better off identifying which processes are consuming how much memory. prstat -s rss does a nice job of ranking processes by memory usage. (RSS stands for “resident set size,” which is the amount of physical memory allocated to a process.)

Ranking Processes by Memory Usage

# prstat -s rss
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
213 daemon 19M 18M sleep 59 0 0:00:12 0.0% nfsmapid/4
7 root 9336K 8328K sleep 59 0 0:00:04 0.0% svc.startd/14
9 root 9248K 8188K sleep 59 0 0:00:07 0.0% svc.configd/15
517 root 9020K 5916K sleep 59 0 0:00:02 0.0% snmpd/1
321 root 9364K 5676K sleep 59 0 0:00:02 0.0% fmd/14
...
Total: 39 processes, 159 lwps, load averages: 0.00, 0.00, 0.00

We may also find ourselves swapping if we are running tmpfs and someone places a large file in /tmp. It takes some effort, but we have to educate our user community that /tmp is not scratch space. It is literally part of the virtual memory space. It may help matters to set up a directory called /scratch to allow people to unpack files or manipulate data.
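
One way to limit the exposure (the size value is arbitrary) is to cap tmpfs in /etc/vfstab so that a runaway /tmp cannot consume all of virtual memory:

swap    -    /tmp    tmpfs    -    yes    size=512m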

Friday, May 24, 2013

Paging

Solaris uses both common types of paging in its virtual memory system: swapping (which swaps out all memory associated with a user process) and demand paging (which pages out individual pages that have not been recently used). Which method is used is determined by comparing the amount of available memory with several key parameters:
  • physmem: physmem is the total page count of physical memory.
  • lotsfree: The page scanner is woken up when available memory falls below lotsfree. The default value for this is physmem/64 (or 512 KB, whichever is greater); it can be tuned in the /etc/system file if necessary (see the sketch after this list). The page scanner runs in demand paging mode by default. The initial scan rate is set by the kernel parameter slowscan (which is 100 by default).
  • minfree: Between lotsfree and minfree, the scan rate increases linearly between slowscan and fastscan. (fastscan is determined experimentally by the system as the maximum scan rate that can be supported by the system hardware. minfree is set to desfree/2, and desfree is set to lotsfree/2 by default.) Each page scanner will run for desscan pages. This parameter is dynamically set based on the scan rate.
  • maxpgio: maxpgio (default 40 or 60) limits the rate at which I/O is queued to the swap devices. It is set to 40 for x86 architectures and 60 for SPARC architectures. With modern hard drives, maxpgio can safely be set to 100 times the number of swap disks.
  • throttlefree: When free memory falls below throttlefree (default minfree), the page_create routines force the calling process to wait until free pages are available.
  • pageout_reserve: When free memory falls below this value (default throttlefree/2), only the page daemon and the scheduler are allowed memory allocations.
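If one of these parameters does need to be changed persistently, the setting belongs in /etc/system and takes effect at the next reboot. This is a minimal sketch (the value of 4000 pages is purely illustrative, and the file should always be backed up first); the current value can be verified with mdb as shown below.

cp /etc/system /etc/system.`date +%Y%m%d`
echo "set lotsfree=4000" >> /etc/system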

The page scanner operates by first clearing a usage flag on each page at a rate reported as "scan rate" in vmstat and sar -g. After handspreadpages additional pages have been examined, the page scanner checks whether the usage flag has been set again. If it has not, the page is reclaimed and paged out to its backing store if it has been modified. (handspreadpages is set dynamically in current versions of Solaris. Its maximum value is pageout_new_spread.)

Solaris 8 introduced an improved algorithm for handling file system page caching (for file systems other than ZFS). This new architecture is known as the cyclical page cache. It is designed to remove most of the problems with virtual memory that were previously caused by the file system page cache.

In the new algorithm, the cache of unmapped/inactive file pages is located on a cachelist which functions as part of the freelist.

When a file page is mapped, it is mapped to the relevant page on the cachelist if it is already in memory. If the referenced page is not on the cachelist, it is mapped to a page on the freelist and the file page is read (or “paged”) into memory. Either way, mapped pages are moved to the segmap file cache. Once all other freelist pages are consumed, additional allocations are taken from the cachelist on a least recently accessed basis. With the new algorithm, file system cache only competes with itself for memory. It does not force applications to be swapped out of primary memory as sometimes happened with the earlier OS versions. As a result of these changes, vmstat reports statistics that are more in line with our intuition. In particular, scan rates will be near zero unless there is a systemwide shortage of available memory. (In the past, scan rates would reflect file caching activity, which is not really relevant to memory shortfalls.)

Every active memory page in Solaris is associated with a vnode (which is a mapping to a file) and an offset (the location within that file). This references the backing store for the memory location, and may represent an area on the swap device, or it may represent a location in a file system. All pages that are associated with a valid vnode and offset are placed on the global page hash list.

vmstat -p reports paging activity details for applications (executables), data (anonymous) and file system activity.
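For example, to watch these classes over 30-second intervals (epi/epo/epf are executable pages, api/apo/apf are anonymous pages backed by swap, and fpi/fpo/fpf are file system pages):

vmstat -p 30 5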

The parameters listed above can be viewed and set dynamically via mdb, as below:

# mdb -kw
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba fcp fctl nca lofs zfs random logindmux ptm cpc fcip sppp crypto nfs ]
> physmem/E
physmem:
physmem: 258887
> lotsfree/E
lotsfree:
lotsfree: 3984
> desfree/E
desfree:
desfree: 1992
> minfree/E
minfree:
minfree: 996
> throttlefree/E
throttlefree:
throttlefree: 996
> fastscan/E
fastscan:
fastscan: 127499
> slowscan/E
slowscan:
slowscan: 100
> handspreadpages/E
handspreadpages:
handspreadpages: 127499
> pageout_new_spread/E
pageout_new_spread:
pageout_new_spread: 161760
> lotsfree/Z fa0
lotsfree: 0xf90 = 0xfa0
> lotsfree/E
lotsfree:
lotsfree: 4000

Wednesday, May 22, 2013

Segmentation Violations

Segmentation violations occur when a process references a memory address not mapped by any segment. The resulting SIGSEGV signal originates as a major page fault hardware exception identified by the processor and is translated by as_fault() in the address space layer.

When a process overflows its stack, a segmentation violation fault results. The kernel recognizes the violation and can extend the stack size, up to a configurable limit. In a multithreaded environment, the kernel does not keep track of each user thread's stack, so it cannot perform this function. The thread itself is responsible for stack SIGSEGV (stack overflow signal) handling.

(The SIGSEGV signal is sent by the threads library when an attempt is made to write to a write-protected page just beyond the end of the stack. This page is allocated as part of the stack creation request.)

It is often the case that segmentation faults occur because of resource restrictions on the size of a process's stack. See “Resource Management” for information about how to increase these limits.
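Stack limits can be checked, and raised within the hard limit, from the shell that launches the application. A minimal ksh/bash sketch (the 32768 KB figure is arbitrary):

ulimit -s        # current soft stack limit, in Kbytes
ulimit -Hs       # hard stack limit; the soft limit cannot exceed this
ulimit -s 32768  # raise the soft limit for processes started from this shell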

See “Process Virtual Memory” for a more detailed description of the structure of a process's address space.

Monday, May 20, 2013

Measuring Memory Shortfalls

In the real world, memory shortfalls are much more devastating than a CPU bottleneck. Two primary indicators of a RAM shortage are the scan rate and swap device activity; the sections below list useful commands for monitoring both types of activity.

In both cases, the high activity rate can be due to something that does not have a consistently large impact on performance. The processes running on the system have to be examined to see how frequently they are run and what their impact is. It may be possible to re-work the program or run the process differently to reduce the amount of new data being read into memory.

(Virtual memory takes two shapes in a Unix system: physical memory and swap space. Physical memory usually comes in DIMM modules and is frequently called RAM. Swap space is a dedicated area of disk space that the operating system addresses almost as if it were physical memory. Since disk I/O is much slower than I/O to and from memory, we would prefer to use swap space as infrequently as possible. Memory address space refers to the range of addresses that can be assigned, or mapped, to virtual memory on the system. The bulk of an address space is not mapped at any given point in time.)

We have to weigh the costs and benefits of upgrading physical memory, especially to accommodate an infrequently scheduled process. If cost matters more than performance, we can use swap space to provide enough virtual memory space for the application to run. If adequate total virtual memory space is not provided, new processes will not be able to start. (The system may report "Not enough space" or "WARNING: /tmp: File system full, swap space limit exceeded.")
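If a memory upgrade is not practical, swap can be extended on the fly. The sketch below assumes a file system with a gigabyte to spare at a hypothetical path:

mkfile 1024m /export/swapfile
swap -a /export/swapfile
swap -l    # confirm the new swap file is in use

To preserve the addition across reboots, a corresponding swap entry can be added to /etc/vfstab.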

Swap space is usually only used when physical memory is too small to accommodate the system's memory requirements. At that time, space is freed in physical memory by paging (moving) it out to swap space. (See “Paging” below for a more complete discussion of the process.)

If inadequate physical memory is provided, the system will be so busy paging to swap that it will be unable to keep up with demand. (This state is known as "thrashing" and is characterized by heavy I/O on the swap device and horrendous performance. In this state, the scanner can use up to 80% of CPU.)

When this happens, we can use the vmstat -p command to examine whether the stress on the system is coming from executables, application data or file system traffic. This command displays the number of paging operations for each type of data.

Scan Rate


When available memory falls below certain thresholds, the system attempts to reclaim memory that is being used for other purposes. The page scanner is the program that runs through memory to see which pages can be made available by placing them on the free list. The scan rate is the number of times per second that the page scanner makes a pass through memory. (The “Paging” section later in this chapter discusses some details of the page scanner's operation.) The page scanning rate is the main tipoff that a system does not have enough physical memory. We can use sar -g or vmstat to look at the scan rate. vmstat 30 checks memory usage every 30 seconds. (Ignore the summary statistics on the first line.) If page/sr is much above zero for an extended time, your system may be running short of physical memory. (Shorter sampling periods may be used to get a feel for what is happening on a smaller time scale.)

A very low scan rate is a sure indicator that the system is not running short of physical memory. On the other hand, a high scan rate can be caused by transient issues, such as a process reading large amounts of uncached data. The processes on the system should be examined to see how much of a long-term impact they have on performance. Historical trends need to be examined with sar -g to make sure that the page scanner has not come on for a transient, non-recurring reason.

A nonzero scan rate is not necessarily an indication of a problem. Over time, memory is allocated for caching and other activities. Eventually, the amount of memory will reach the lotsfree memory level, and the pageout scanner will be invoked. For a more thorough discussion of the paging algorithm, see “Paging” below.
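The following sketch keeps an eye on the scan rate over an hour using sar -g (whose pgscan/s value is the fifth field of each data line); the threshold of 200 pages per second is an arbitrary illustration, not a magic number:

sar -g 60 60 | awk '$1 ~ /:/ && $5+0 > 200 { print "high scan rate:", $0 }'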

Swap Device Activity

The amount of disk activity on the swap device can be measured using iostat. iostat -xPnce provides information on disk activity on a partition-by-partition basis. sar -d provides similar information on a per-physical-device basis, and vmstat provides some usage information as well. Where Veritas Volume Manager is used, vxstat provides per-volume performance information.

If there are I/O's queued for the swap device, application paging is occurring. If there is significant, persistent, heavy I/O to the swap device, a RAM upgrade may be in order.
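For example, to focus on the swap slice itself (c0t0d0s1 here is a hypothetical swap partition; swap -l reports the real device names):

swap -l
iostat -xPnce 30 | grep c0t0d0s1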

Process Memory Usage

The /usr/proc/bin/pmap command can help pin down which process is the memory hog. /usr/proc/bin/pmap -x PID prints out details of memory use by a process.

Summary statistics regarding process size can be found in the RSS column of ps -ly or top.

dbx, the debugging utility in the SunPro package, has extensive memory leak detection built in. The source code will need to be compiled with the -g flag by the appropriate SunPro compiler.

ipcs -mb shows memory statistics for shared memory. This may be useful when attempting to size memory to fit expected traffic.
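Putting these together, a short sketch for drilling down from the heaviest consumers to per-segment detail (the PID of 517 is illustrative; substitute whatever prstat reports):

prstat -s rss 5 1             # one 5-second sample, sorted by resident set size
/usr/proc/bin/pmap -x 517     # per-segment memory map for the process of interest
ipcs -mb                      # shared memory segment sizes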

Friday, May 17, 2013

vmstat

The first line of vmstat output represents a summary of activity since boot time. To obtain useful real-time statistics, run vmstat with a time step (e.g., vmstat 30).

The vmstat output columns are as follows (use the pagesize command to determine the size of the pages):

  • procs or kthr/r: Run queue length.
  • procs or kthr/b: Processes blocked while waiting for I/O.
  • procs or kthr/w: Idle processes which have been swapped.
  • memory/swap: Free, unreserved swap space (Kb).
  • memory/free: Free memory (Kb). (Note that this will grow until it reaches lotsfree, at which point the page scanner is started. See "Paging" for more details.)
  • page/re: Pages reclaimed from the free list. (If a page on the free list still contains data needed for a new request, it can be remapped.)
  • page/mf: Minor faults (page in memory, but not mapped). (If the page is still in memory, a minor fault remaps the page. It is comparable to the vflts value reported by sar -p.)
  • page/pi: Paged in from swap (Kb/s). (When a page is brought back from the swap device, the process will stop execution and wait. This may affect performance.)
  • page/po: Paged out to swap (Kb/s). (The page has been written and freed. This can be the result of activity by the pageout scanner, a file close, or fsflush.)
  • page/fr: Freed or destroyed (Kb/s). (This column reports the activity of the page scanner.)
  • page/de: Anticipated short-term memory shortfall (Kb).
  • page/sr: Scan rate (pages). Note that this number is not reported as a "rate," but as a total number of pages scanned.
  • disk/s#: Disk activity for disk # (I/O's per second).
  • faults/in: Interrupts (per second).
  • faults/sy: System calls (per second).
  • faults/cs: Context switches (per second).
  • cpu/us: User CPU time (%).
  • cpu/sy: Kernel CPU time (%).
  • cpu/id: Idle + I/O wait CPU time (%).

vmstat -i reports on hardware interrupts.

vmstat -s provides a summary of memory statistics, including statistics related to the DNLC, inode and rnode caches.

vmstat -S reports on swap-related statistics such as:

  • si: Swap-ins (Kb/s).
  • so: Swap-outs (Kb/s).

(Note that the man page for vmstat -s incorrectly describes the swap queue length. In Solaris 2, the swap queue length is the number of idle swapped-out processes; in SunOS 4, it referred to the number of active swapped-out processes.)

Solaris 8

vmstat under Solaris 8 will report different statistics than would be expected under an earlier version of Solaris due to a different paging algorithm:
  • Higher page reclaim rate.
  • Higher reported free memory: A large component of the file system cache is reported as free memory.
  • Low scan rates: Scan rates will be near zero unless there is a systemwide shortage of available memory.

vmstat -p reports paging activity details for applications (executables), data (anonymous) and filesystem activity.

Thursday, May 16, 2013

sar

The word "sar" is used to refer to two related items:

  1. The system activity report package
  2. The system activity reporter

System Activity Report Package

This facility stores a great deal of performance data about a system. This information is invaluable when attempting to identify the source of a performance problem.

The Report Package can be enabled by uncommenting the appropriate lines in the sys crontab. The sa1 program stores performance data in the /var/adm/sa directory. sa2 writes reports from this data, and sadc is the underlying data collector that sa1 invokes.
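The commented-out sys crontab entries typically look something like the following (intervals vary between releases, so treat these as illustrative rather than canonical):

0 * * * 0-6 /usr/lib/sa/sa1
20,40 8-17 * * 1-5 /usr/lib/sa/sa1
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A

As root, they can be viewed with crontab -l sys and edited with crontab -e sys.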

In practice, I do not find that the sa2-produced reports are terribly useful in most cases. Depending on the issue being examined, it may be sufficient to run sa1 at intervals that can be set in the sys crontab.

Alternatively, sar can be used on the command line to look at performance over different time slices or over a constricted period of time:

sar -A -o outfile 5 2000

(Here, "5" represents the time slice and "2000" represents the number of samples to be taken. "outfile" is the output file where the data will be stored.)

The data from this file can be read by using the "-f" option (see below).
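For example, to pull the morning's paging statistics out of a saved daily file (the file name is illustrative; daily files in /var/adm/sa are named sadd, where dd is the day of the month):

sar -g -f /var/adm/sa/sa24 -s 09:00 -e 12:00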


System Activity Reporter

sar has several options that allow it to process the data collected by sa1 in different ways (a few combined invocations are sketched after this list):
  • -a: Reports file system access statistics. Can be used to look at issues related to the DNLC.

    • iget/s: Rate of requests for inodes not in the DNLC. An iget will be issued for each path component of the file's path.

    • namei/s: Rate of file system path searches. (If the directory name is not in the DNLC, iget calls are made.)

    • dirbk/s: Rate of directory block reads.

  • -A: Reports all data.

  • -b: Buffer activity reporter:

    • bread/s, bwrit/s: Transfer rates (per second) between system buffers and block devices (such as disks).

    • lread/s, lwrit/s: System buffer access rates (per second).

    • %rcache, %wcache: Cache hit rates (%).

    • pread/s, pwrit/s: Transfer rates between system buffers and character devices.

  • -c: System call reporter:

    • scall/s: System call rate (per second).

    • sread/s, swrit/s, fork/s, exec/s: Call rate for these calls (per second).

    • rchar/s, wchar/s: Transfer rate (characters per second).

  • -d: Disk activity (actually, block device activity):

    • %busy: % of time servicing a transfer request.

    • avque: Average number of outstanding requests.

    • r+w/s: Rate of reads+writes (transfers per second).

    • blks/s: Rate of 512-byte blocks transferred (per second).

    • avwait: Average wait time (ms).

    • avserv: Average service time (ms). (For block devices, this includes seek, rotation and data transfer times. Note that the iostat svc_t value is equivalent to avwait+avserv.)

  • -e HH:MM: Examine data up to the specified time.

  • -f filename: Use filename as the source for the binary sar data. The default is to use today's file from /var/adm/sa.

  • -g: Paging activity (see "Paging" for more details):

    • pgout/s: Page-outs (requests per second).

    • ppgout/s: Page-outs (pages per second).

    • pgfree/s: Pages freed by the page scanner (pages per second).

    • pgscan/s: Scan rate (pages per second).

    • %ufs_ipf: Percentage of UFS inodes removed from the free list while still pointing at reusable memory pages. This is the same as the percentage of igets that force page flushes.


  • -i sec: Set the data collection interval to sec seconds.

  • -k: Kernel memory allocation:

    • sml_mem: Amount of virtual memory available for the small pool (bytes). (Small requests are less than 256 bytes)

    • lg_mem: Amount of virtual memory available for the large pool (bytes). (512 bytes-4 Kb)

    • ovsz_alloc: Memory allocated to oversize requests (bytes). Oversize requests are dynamically allocated, so there is no pool. (Oversize requests are larger than 4 Kb)

    • alloc: Amount of memory allocated to a pool (bytes). The total KMA usage is the sum of these columns.

    • fail: Number of requests that failed.

  • -m: Message and semaphore activities.

    • msg/s, sema/s: Message and semaphore statistics (operations per second).

  • -o filename: Saves output to filename.

  • -p: Paging activities.

    • atch/s: Attaches (per second). (This is the number of page faults that are filled by reclaiming a page already in memory.)

    • pgin/s: Page-in requests (per second) to file systems.

    • ppgin/s: Page-ins (per second). (Multiple pages may be affected by a single request.)

    • pflt/s: Page faults from protection errors (per second).

    • vflts/s: Address translation page faults (per second). (This happens when a valid page is not in memory. It is comparable to the vmstat-reported page/mf value.)

    • slock/s: Faults caused by software lock requests that require physical I/O (per second).

  • -q: Run queue length and percentage of the time that the run queue is occupied.

  • -r: Unused memory pages and disk blocks.

    • freemem: Pages available for use (Use pagesize to determine the size of the pages).

    • freeswap: Disk blocks available in swap (512-byte blocks).

  • -s time: Start looking at data from time onward.

  • -u: CPU utilization.

    • %usr: User time.

    • %sys: System time.

    • %wio: Waiting for I/O (does not include time when another process could be scheduled onto the CPU).

    • %idle: Idle time.

  • -v: Status of process, inode, file tables.

    • proc-sz: Number of process entries (proc structures) currently in use, compared with max_nprocs.

    • inod-sz: Number of inodes in memory compared with the number currently allocated in the kernel.

    • file-sz: Number of entries in and size of the open file table in the kernel.

    • lock-sz: Shared memory record table entries currently used/allocated in the kernel. This size is reported as 0 for standards compliance (space is allocated dynamically for this purpose).

    • ov: Overflows between sampling points.

  • -w: System swapping and switching activity.

    • swpin/s, swpot/s: LWPs swapped in and out (transfers per second).
    • bswin/s, bswot/s: Blocks transferred for swap-ins and swap-outs (512-byte blocks per second).

    • pswch/s: Process switches (per second).

  • -y: TTY device activity.

    • rawch/s, canch/s, outch/s: Input character rate, character rate processed by canonical queue, output character rate.

    • rcvin/s, xmtin/s, mdmin/s: Receive, transmit and modem interrupt rates.
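As a closing sketch, a few common interactive combinations of these options (intervals, counts and file names are arbitrary):

sar -u 5 12                  # CPU utilization, twelve 5-second samples
sar -q 5 12                  # run queue length and occupancy
sar -g -s 09:00 -e 12:00     # paging history from today's daily file
sar -d -f /var/adm/sa/sa24   # block device activity from a saved daily file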