My class, "Technology Manager's Survival Guide" is on the Friday afternoon training schedule at LOPSA-East on May 2 in New Brunswick, NJ.
And I'll be presenting a two-part class, "Leader's Survival Guide," in Jacksonville, FL on July 16 and 23.
A common mistake by new monitoring administrators is to alert on everything. This is an ineffective strategy for several reasons. For starters, it may result in higher telecom charges for passing large numbers of alerts. Passing tons of irrelevant alerts will impact team morale. And, no matter how dedicated your team is, you are guaranteed to reach a state where alerts will start being ignored because "they're all garbage anyway."
For example, it is common for non-technical managers to want to send alerts to the systems team when system CPU hits 100%. But, from a technical perspective, this is absurd: a busy CPU is not a problem in itself, and many healthy workloads (batch processing, for example) are designed to drive the CPU to 100% for long stretches.
In order to be effective, a monitoring strategy needs to be thought out. You may end up monitoring a lot of things just to establish baselines or to view growth over time. Some things you monitor will need to be checked out right away. It is important to know which is which.
Historical information should be logged and retained for examination on an as-needed basis. It is wise to set up automated regular reports (distributed via email or web) to keep an eye on historical system trends, but there is no reason to send alerts on this sort of information.
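As a minimal sketch of such a report, a single root crontab entry can mail out each evening's disk activity summary. This assumes that sar data collection is already enabled in the sys crontab, that sar lives in /usr/sbin, and that the recipient address is adapted locally:

0 18 * * * /usr/sbin/sar -d | mailx -s "daily sar -d summary" admin@example.com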
Availability information should be characterized and handled in an appropriate way, probably through a tiered system of notifications. Depending on the urgency, it may show up on a monitoring console, be rolled up in a daily summary report, or paged out to the on-call person. Some common types of information in this category include:
(If possible, configure escalations into your alerting system, so that you are not dependent on a single person's cell phone for the availability of your entire enterprise. A typical escalation procedure would be for an unacknowledged alert to be sent up defined chain of escalation. For example, if the on-call person does not respond in 15 minutes, an alert may go to the entire group. If the alert is not acknowledged 15 minutes after that, the alert may go to the manager.)
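For example, a tiered escalation in a Nagios-style monitoring system might be sketched as follows. (Nagios is assumed purely for illustration, and the host, service, and contact group names are invented.)

define serviceescalation {
        host_name               web01
        service_description     HTTP
        first_notification      2    ; escalate starting with the second notification
        last_notification       0    ; 0 = keep escalating until acknowledged
        notification_interval   15   ; minutes between repeat notifications
        contact_groups          oncall-group,oncall-manager
        }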
In some environments, alerts are handled by a round-the-clock team that is sometimes called the Network Operations Center (NOC). The NOC will coordinate response to the issue, including an evaluation of the alert and any necessary escalations.
Before an alert is configured, the monitoring group should first make sure that the alert meets three important criteria. Checks that typically qualify include:

- By verifying that the uptime command reports a time larger than the interval between monitoring sweeps, you can keep an eye on sudden, unexpected reboots. (A cron-able sketch of this check follows the list.)
- avserv (in Solaris; svctm in Linux) greater than 20 ms for disk devices with more than 100 (r+w)/s, including NFS disk devices. This is a measure of I/O channel exhaustion. 20 ms is a very long time, so you will also want to keep an eye on trends in regular summary reports of sar -d data.
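Here is a minimal sketch of the reboot check described above, meant to be run from cron once per sweep. The 300-second interval, the alert address, and the use of kstat and perl to compute the uptime are assumptions to adapt locally:

#!/bin/sh
# Alert if the system has rebooted since the last monitoring sweep.
SWEEP=300                                    # sweep interval in seconds (assumed)
# On Solaris, kstat reports the boot time as seconds since the epoch.
BOOT=`kstat -p unix:0:system_misc:boot_time | awk '{print $2}'`
NOW=`perl -e 'print time'`
if [ `expr $NOW - $BOOT` -lt $SWEEP ]; then
        echo "`hostname`: unexpected reboot detected" | mailx -s "reboot alert" oncall@example.com
fi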
In the event that the state databases disagree, a majority of configured state databases determines which version of reality is correct. This is why it is important to configure multiple replicas. A minimum of three database replicas must be available in order to boot without human assistance, so it makes sense to create database replicas liberally. They don't take up much space, and there is very little overhead associated with their maintenance. On JBOD (Just a Bunch Of Disks) arrays, I recommend at least two replicas on each disk device.
State database replicas consume between 4 and 16 MB of space, and should ideally be placed on a partition specifically set aside for that purpose. In the event that state database information is lost, it is possible to lose the data stored on the managed disks, so the database replicas should be spread over as much of the disk infrastructure as possible.
State database locations are recorded in /etc/opt/SUNWmd/mddb.cf. Depending on their condition, repair of damaged replicas may or may not be possible.
Metadevices (the objects which Solaris Volume Manager manipulates) may be placed on a partition with a state database if the state database is there first.
The initial state databases can be created by specifying the slices on which they will live as follows:
metadb -a -f -c 2 slice-name1 slice-name2
Because pre-existing partitions are not usable for creating database replicas, it is frequently the case that we will steal space from swap to create a small partition for the replicas. To do so, we boot to single-user mode, use swap -d to remove all swap devices, and use format to re-partition the swap slice, freeing up space for a separate partition for the database replicas. Since the replicas are small, very few cylinders are required. (A sketch of this sequence follows.)
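Here is a minimal sketch of that sequence; the device names and slice numbers are examples only:

swap -l                        # list the active swap devices
swap -d /dev/dsk/c0t0d0s1      # remove the swap area
format                         # shrink slice 1; create a small slice (s4) for the replicas
swap -a /dev/dsk/c0t0d0s1      # re-add the (now smaller) swap slice
metadb -a -f -c 2 c0t0d0s4     # create replicas on the freed-up slice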
Solaris Volume Manager can build metadevices either by using partitions as the basic building blocks, or by dividing a single large partition into soft partitions. Soft partitions are a way that SVM allows us to carve a single disk into more than 8 slices. We can either build soft partitions directly on a disk slice, or we can mirror (or RAID) slices, then carve up the resulting metadevice into soft partitions to build volumes.
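For example, to carve soft partitions out of an existing mirror metadevice (device names and sizes are invented for illustration):

metainit d100 -p d10 2g        # 2 GB soft partition on top of mirror d10
metainit d101 -p d10 4g        # a second soft partition on the same mirror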
Disksets are collections of disks that are managed together, in the same way that a Veritas Volume Manager (VxVM) disk group is managed together. Unlike in VxVM, SVM does not require us to explicitly specify a disk group. If Disksets are configured, we need to specify the set name for monitoring or management commands with a -s setname option. Disksets may be created as shared disksets, where multiple servers may be able to access them. (This is useful in an environment like Sun Cluster, for example.) In that case, we specify some hosts as mediators who determine who owns the diskset. (Note that disks added to shared disksets are re-partitioned in the expectation that we will use soft partitions.)
When metadevices need to be addressed by OS commands (like mkfs), we can reference them with device links of the form /dev/md/rdsk/d# or /dev/md/disksetname/rdsk/d#.
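For example, to build and mount a UFS file system on a metadevice d10 (an example name):

newfs /dev/md/rdsk/d10
mount /dev/md/dsk/d10 /mnt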
Here are the main command line commands within SVM:
Command | Description |
---|---|
metaclear | Deletes active metadevices and hot spare pools. |
metadb | Manages state database replicas. |
metadetach | Detaches a metadevice from a mirror or a logging device from a trans-metadevice. |
metahs | Manages hot spares and hot spare pools. |
metainit | Configures metadevices. |
metaoffline | Takes submirrors offline. |
metaonline | Places submirrors online. |
metaparam | Modifies metadevice parameters. |
metarename | Renames and switches metadevice names. |
metareplace | Replaces slices of submirrors and RAID5 metadevices. |
metaroot | Sets up system files for mirroring root. |
metaset | Administers disksets. |
metastat | Checks metadevice health and state. |
metattach | Attaches a metadevice to a mirror or a log to a trans-metadevice. |
Here is how to perform several common types of operations in Solaris Volume Manager:
Operation | Procedure |
---|---|
Create state database replicas. | metadb -a -f -c 2 c#t0d#s# c#t1d#s# |
Mirror the root partition. | |
Create a metadevice for the root partition: | metainit -f d0 1 1 c#t0d#s# |
Create a metadevice for the root mirror partition. | metainit d1 1 1 c#t1d#s# |
Set up a 1-sided mirror | metainit d2 -m d0 |
Edit the vfstab and system files. | metaroot d2 |
Attach the root mirror. | metattach d2 d1 |
Mirror the swap partition. | |
Create metadevices for the swap partition and mirror. | metainit -f d5 1 1 c#t0d#s# ; metainit d6 1 1 c#t1d#s# ; metainit d7 -m d5 |
Attach submirror to mirror. | metattach d7 d6 |
Edit vfstab to mount swap mirror as a swap device. | Use root entry as a template. |
Create a striped metadevice. | metainit d# stripes slices c#t#d#s#... |
Create a striped metadevice with a non-default interlace size. | Add an -i interlace-size option (for example, -i 32k) |
Concatenate slices. | metainit d# #slices 1 c#t#d#s# 1 c#t#d#s#... |
Create a soft partition metadevice. | metainit dnew# -p dsource# size |
Create a RAID5 metadevice. | metainit d# -r c#t#d#s# c#t#d#s# c#t#d#s#... |
Manage Hot Spares | |
Create a hot spare pool. | metainit hsp001 c#t#d#s#... |
Add a slice to a pool. | metahs -a hsp### /dev/dsk/c#t#d#s# |
Add a slice to all pools. | metahs -a all /dev/dsk/c#t#d#s# |
Diskset Management | |
Deport a diskset. | metaset -s setname -r |
Import a diskset. | metaset -s setname -t -f |
Add hosts to a shared diskset. | metaset -s setname -a -h hostname1 hostname2 |
Add mediators to a shared diskset. | metaset -s setname -a -m hostname1 hostname2 |
Add devices to a shared diskset. | metaset -s setname -a /dev/did/rdsk/d# /dev/did/rdsk/d# |
Check diskset status. | metaset |
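As a worked example, the root-mirroring rows above combine into the following sequence (device names are examples; the lockfs and reboot between metaroot and metattach follow the standard practice of rebooting onto the metadevice before attaching the second submirror):

metadb -a -f -c 2 c0t0d0s3 c0t1d0s3    # state database replicas on both disks
metainit -f d0 1 1 c0t0d0s0            # submirror on the existing root slice
metainit d1 1 1 c0t1d0s0               # submirror on the mirror disk
metainit d2 -m d0                      # one-sided mirror
metaroot d2                            # edit /etc/vfstab and /etc/system
lockfs -fa ; init 6                    # flush file systems and reboot
metattach d2 d1                        # attach the second submirror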
The metadb command monitors the database replicas, and the metastat command monitors the metadevices and hot spares.
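A quick health check suitable for a cron job might look like this (the assumption being that a W flag on a replica or a maintenance state on a metadevice always merits attention):

metadb | grep -w W > /dev/null && echo "`hostname`: state database replica errors"
metastat | egrep -i "maintenance|unavailable" > /dev/null && echo "`hostname`: metadevice needs attention"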
Status messages that may be reported by metastat for a disk mirror include Okay, Resyncing, and Needs maintenance. Hot spare status messages reported by metastat are Available, In-use, and Broken.
A copy of the SVM configuration is preserved in /etc/opt/SUNWmd/md.cf. If the configuration is lost, it can be rebuilt from that file:

1. Re-create the state database replicas: metadb -a -f -c 2 c#t#d#s# c#t#d#s#
2. Copy md.cf to md.tab.
3. Edit md.tab so that all mirrors are one-way mirrors and RAID5 devices are recreated with -k (to prevent re-initialization).
4. Verify the md.tab syntax without making changes: metainit -n -a
5. Re-create the configuration: metainit -a
6. Re-attach the remaining submirrors: metattach dmirror# dsubmirror#
7. Verify the result: metastat
More frequently, Solaris Volume Manager will be needed to deal with replacing a failed piece of hardware. To replace a disk which is spitting errors, but has not failed yet (as in Example 10-2):

1. Add database replicas to unaffected disks until at least three exist outside of the failing disk.
2. Remove any replicas from the failing disk: metadb ; metadb -d c#t#d#s#
3. Detach and remove submirrors and hot spares on the failing disk from their mirrors and pools: metadetach dmirror# dsubmirror# ; metaclear -r dsubmirror# ; metahs -d hsp# c#t#d#s#
4. If the boot disk is being replaced, find the /devices name of the boot disk mirror: ls -l /dev/rdsk/c#t#d#s0
5. If the removed disk is a fibre channel disk, remove the /dev/dsk and /dev/rdsk links for the device.
6. Physically replace the disk. This may involve shutting down the system if the disk is not hot-swappable.
7. Re-build any /dev and /devices links: drvconfig; disks (or boot -r)
8. Format and re-partition the disk appropriately.
9. Re-add any removed database replicas: metadb -a -c #databases c#t#d#s#
10. Re-create and re-attach any removed submirrors: metainit dsubmirror# 1 1 c#t#d#s# ; metattach dmirror# dsubmirror#
11. Re-create any removed hot spares.

Replacing a failed disk (as in Example 10-3) is a similar procedure. The differences are:

- Remove database replicas and hot spares as above; submirrors will not be removable.
- After replacing the disk as above, replace the submirrors with metareplace: metareplace -e dmirror# c#t#d#s#

Barring a misconfiguration, Solaris Volume Manager is a tremendous tool for increasing the reliability and redundancy of a server. More important, it allows us to postpone maintenance for a hard drive failure until the next maintenance window. The metastat tool is quite useful for identifying and diagnosing problems. Along with iostat -Ee, we can often catch problems before they reach a point where the disk has actually failed.

Example 10-2 shows how to replace a failing (but not yet failed) mirrored disk. (In this case, we were able to hot-swap the disk, so no reboot was necessary. Since the disks were SCSI, we also did not need to remove or rebuild any /dev links.)
# metastat
d0: Mirror
Submirror 0: d1
State: Okay
Submirror 1: d2
State: Okay
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 20484288 blocks
d1: Submirror of d0
State: Okay
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t0d0s0 0 No Okay
d2: Submirror of d0
State: Okay
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t1d0s0 0 No Okay
...
# iostat -E
sd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0707 Serial No: 3HZ...
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd1 Soft Errors: 593 Hard Errors: 28 Transport Errors: 1
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0707 Serial No: 3HZ...
Size: 73.40GB <73400057856 bytes>
Media Error: 24 Device Not Ready: 0 No Device: 1 Recoverable: 593
Illegal Request: 0 Predictive Failure Analysis: 1
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t0d0s3
a p luo 1050 1034 /dev/dsk/c0t0d0s3
a p luo 2084 1034 /dev/dsk/c0t0d0s3
a p luo 16 1034 /dev/dsk/c0t1d0s3
a p luo 1050 1034 /dev/dsk/c0t1d0s3
a p luo 2084 1034 /dev/dsk/c0t1d0s3
# metadb -d c0t1d0s3
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t0d0s3
a p luo 1050 1034 /dev/dsk/c0t0d0s3
a p luo 2084 1034 /dev/dsk/c0t0d0s3
# metadetach d40 d42
d40: submirror d42 is detached
# metaclear -r d42
d42: Concat/Stripe is cleared
...
# metadetach d0 d2
d0: submirror d2 is detached
# metaclear -r d2
d2: Concat/Stripe is cleared
...
[Disk hot-swapped. No reboot or device reconfiguration necessary for this replacement]
...
# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c0t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c0t1d0
/pci@1c,600000/scsi@2/sd@1,0
Specify disk (enter its number): 0
selecting c0t0d0
[disk formatted]
FORMAT MENU:
...
format> part
PARTITION MENU:
0 - change `0' partition
...
print - display the current table
label - write partition map and label to the disk
! - execute , then return
quit
partition> pr
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
...
partition> q
FORMAT MENU:
...
format> di
AVAILABLE DISK SELECTIONS:
0. c0t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c0t1d0
/pci@1c,600000/scsi@2/sd@1,0
Specify disk (enter its number)[0]: 1
selecting c0t1d0
[disk formatted]
format> part
...
[sd1 partitioned to match sd0's layout]
...
partition> 7
Part Tag Flag Cylinders Size Blocks
7 unassigned wm 0 0 (0/0/0) 0
Enter partition id tag[unassigned]:
Enter partition permission flags[wm]:
Enter new starting cyl[0]: 4835
Enter partition size[0b, 0c, 0.00mb, 0.00gb]: 9252c
partition> la
Ready to label disk, continue? y
partition> pr
Current partition table (unnamed):
Total disk cylinders available: 14087 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
...
partition> q
...
# metadb -a -c 3 c0t1d0s3
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t0d0s3
a p luo 1050 1034 /dev/dsk/c0t0d0s3
a p luo 2084 1034 /dev/dsk/c0t0d0s3
a u 16 1034 /dev/dsk/c0t1d0s3
a u 1050 1034 /dev/dsk/c0t1d0s3
a u 2084 1034 /dev/dsk/c0t1d0s3
# metainit d2 1 1 c0t1d0s0
d2: Concat/Stripe is setup
# metattach d0 d2
d0: submirror d2 is attached
[Re-create and attach the remainder of the submirrors.]
...
# metastat
d0: Mirror
Submirror 0: d1
State: Okay
Submirror 1: d2
State: Resyncing
Resync in progress: 10 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 20484288 blocks
d1: Submirror of d0
State: Okay
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t0d0s0 0 No Okay
d2: Submirror of d0
State: Resyncing
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t1d0s0 0 No Okay
It is important to format the replacement disk to match the cylinder layout of the disk that is being replaced. If this is not done, mirrors and stripes will not rebuild properly.
When you replace a disk that has already failed, the submirrors cannot be detached and removed. Instead, the metareplace -e command is used to re-sync the mirror onto the new disk.
# iostat -E
...
sd1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 5
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7Z3CJ00007505
Size: 73.40GB <73400057856 bytes>
...
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c1t0d0s3
a p luo 1050 1034 /dev/dsk/c1t0d0s3
W p l 16 1034 /dev/dsk/c1t1d0s3
W p l 1050 1034 /dev/dsk/c1t1d0s3
a p luo 16 1034 /dev/dsk/c1t2d0s3
a p luo 1050 1034 /dev/dsk/c1t2d0s3
a p luo 16 1034 /dev/dsk/c1t3d0s3
a p luo 1050 1034 /dev/dsk/c1t3d0s3
# metadb -d /dev/dsk/c1t1d0s3
# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c1t1d0
/pci@1c,600000/scsi@2/sd@1,0
2. c1t2d0
/pci@1c,600000/scsi@2/sd@2,0
3. c1t3d0
/pci@1c,600000/scsi@2/sd@3,0
Specify disk (enter its number): 0
selecting c1t0d0
[disk formatted]
FORMAT MENU:
...
partition - select (define) a partition table
...
format> part
PARTITION MENU:
...
print - display the current table
...
partition> pr
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
1 swap wu 2013 - 2214 1003.69MB (202/0/0) 2055552
2 backup wm 0 - 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 2215 - 2217 14.91MB (3/0/0) 30528
4 unassigned wm 2218 - 5035 13.67GB (2818/0/0) 28675968
5 unassigned wm 5036 - 12080 34.18GB (7045/0/0) 71689920
6 var wm 12081 - 12684 2.93GB (604/0/0) 6146304
7 home wm 12685 - 14086 6.80GB (1402/0/0) 14266752
partition> q
FORMAT MENU:
disk - select a disk
...
format> di
AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c1t1d0
/pci@1c,600000/scsi@2/sd@1,0
2. c1t2d0
/pci@1c,600000/scsi@2/sd@2,0
3. c1t3d0
/pci@1c,600000/scsi@2/sd@3,0
Specify disk (enter its number)[0]: 1
format> part
PARTITION MENU:
...
partition> pr
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 25 129.19MB (26/0/0) 264576
1 swap wu 26 - 51 129.19MB (26/0/0) 264576
2 backup wu 0 - 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 usr wm 52 - 14086 68.10GB (14035/0/0) 142820160
7 unassigned wm 0 0 (0/0/0) 0
...
partition> 7
Part Tag Flag Cylinders Size Blocks
7 unassigned wm 0 0 (0/0/0) 0
Enter partition id tag[unassigned]: home
Enter partition permission flags[wm]:
Enter new starting cyl[0]: 12685
Enter partition size[0b, 0c, 0.00mb, 0.00gb]: 1402c
partition> pr
Current partition table (unnamed):
Total disk cylinders available: 14087 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
1 swap wu 2013 - 2214 1003.69MB (202/0/0) 2055552
2 backup wu 0 - 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 2215 - 2217 14.91MB (3/0/0) 30528
4 unassigned wm 2218 - 5035 13.67GB (2818/0/0) 28675968
5 unassigned wm 5036 - 12080 34.18GB (7045/0/0) 71689920
6 var wm 12081 - 12684 2.93GB (604/0/0) 6146304
7 home wm 12685 - 14086 6.80GB (1402/0/0) 14266752
partition> la
Ready to label disk, continue? y
partition> q
...
# metastat
...
d19: Mirror
Submirror 0: d17
State: Okay
Submirror 1: d18
State: Needs maintenance
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 14266752 blocks
d17: Submirror of d19
State: Okay
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t0d0s7 0 No Okay
d18: Submirror of d19
State: Needs maintenance
Invoke: metareplace d19 c1t1d0s7
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t1d0s7 0 No Maintenance
...
# metareplace -e d19 c1t1d0s7
d19: device c1t1d0s7 is enabled
# metareplace -e d16 c1t1d0s6
d16: device c1t1d0s6 is enabled
# metareplace -e d13 c1t1d0s5
d13: device c1t1d0s5 is enabled
# metareplace -e d10 c1t1d0s4
d10: device c1t1d0s4 is enabled
# metareplace -e d2 c1t1d0s0
d2: device c1t1d0s0 is enabled
# metastat
...
d19: Mirror
Submirror 0: d17
State: Okay
Submirror 1: d18
State: Resyncing
Resync in progress: 10 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 14266752 blocks
d17: Submirror of d19
State: Okay
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t0d0s7 0 No Okay
d18: Submirror of d19
State: Resyncing
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t1d0s7 0 No Resyncing
...
# metadb -a -c 2 c1t1d0s3
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c1t0d0s3
a p luo 1050 1034 /dev/dsk/c1t0d0s3
a u 16 1034 /dev/dsk/c1t1d0s3
a u 1050 1034 /dev/dsk/c1t1d0s3
a p luo 16 1034 /dev/dsk/c1t2d0s3
a p luo 1050 1034 /dev/dsk/c1t2d0s3
a p luo 16 1034 /dev/dsk/c1t3d0s3
a p luo 1050 1034 /dev/dsk/c1t3d0s3
The /etc/inittab file plays a crucial role in the boot sequence.
For versions of Solaris prior to version 10, the /etc/inittab was edited manually. Solaris 10+ manages the /etc/inittab through SMF, and the Solaris 10 inittab should not be edited directly. The default Solaris 10 inittab contains only a handful of entries: they set up STREAMS autopush and socket configuration, hand control of the boot process to svc.startd, and define the response to a power failure.
In particular, the initdefault keyword is not used any more in Solaris 10. Instead, the default run level is determined within the SMF profile.
When the init process is started, it first sets the environment variables defined in the /etc/default/init file; by default, only TIMEZONE is set. Then init executes process entries from the inittab that have sysinit set, and transfers control of the startup process to svc.startd.
The line entries in the inittab file have the following format:
id:runlevel:action:process
Here the id is a two-character unique identifier, runlevel indicates the run level involved, action indicates how the process is to be run, and process is the command to be executed.
At boot time, all entries whose action field is sysinit are run. Once these processes are run, the system moves towards the init level indicated by the initdefault line. For a default inittab, the line is:

is:3:initdefault:

(This indicates a default runlevel of 3.)
By default, the first script run from the inittab file is /sbin/bcheckrc, which checks the state of the root and /usr filesystems. The line controlling this script has the following form:
fs::sysinit:/sbin/bcheckrc >/dev/console 2>&1 </dev/console
The inittab also controls what happens at each runlevel.
For example, the default entry for runlevel 2 is:
s2:23:wait:/sbin/rc2 >/dev/console 2>&1 </dev/console
The action field of each entry contains a keyword (such as initdefault, sysinit, boot, bootwait, wait, once, respawn, powerfail, or off) that tells init how and when to run the process.
Veritas has long since been purchased by Symantec, but its products continue to be sold under the Veritas name. Over time, we can expect that some of the products will have name changes to reflect the new ownership.
Veritas produces volume and file system software that allows for extremely flexible and straightforward management of a system's disk storage resources. Now that ZFS is providing much of this same functionality from inside the OS, it will be interesting to see how well Veritas is able to hold on to its installed base.
In Veritas Volume Manager (VxVM) terminology, physical disks are assigned a diskname and imported into collections known as disk groups. Physical disks are divided into a potentially large number of arbitrarily sized, contiguous chunks of disk space known as subdisks. These subdisks are combined into volumes, which are presented to the operating system in the same way as a slice of a physical disk is.
Volumes can be striped, mirrored or RAID-5'ed. Mirrored volumes are made up of equally-sized collections of subdisks known as plexes. Each plex is a mirror copy of the data in the volume. The Veritas File System (VxFS) is an extent-based file system with advanced logging, snapshotting, and performance features.
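To make the terminology concrete, here is a minimal sketch that builds a mirrored VxFS volume from two disks; the disk, disk group, volume, and mount point names are all invented:

vxdisksetup -i c2t0d0                              # initialize the disks for VxVM use
vxdisksetup -i c2t1d0
vxdg init datadg datadg01=c2t0d0 datadg02=c2t1d0   # create a disk group
vxassist -g datadg make vol01 2g layout=mirror     # mirrored volume (two plexes)
mkfs -F vxfs /dev/vx/rdsk/datadg/vol01             # VxFS file system on the volume
mount -F vxfs /dev/vx/dsk/datadg/vol01 /data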
VxVM provides dynamic multipathing (DMP) support, which means that it takes care of path redundancy where it is available. If new paths or disk devices are added, one of the steps to be taken is to run vxdctl enable to scan the devices, update the VxVM device list, and update the DMP database. In cases where we need to override DMP support (usually in favor of an alternate multipathing product like EMC PowerPath), we can run vxddladm addforeign.
Here are some procedures to carry out several common VxVM operations. VxVM has a Java-based GUI interface as well, but I always find it easiest to use the command line.
Operation | Procedure |
---|---|
Create a volume: (length specified in sectors, KB, MB or GB) | vxassist -g dg-name make vol-name length(skmg) |
Create a striped volume (add options for a stripe layout): | layout=stripe diskname1 diskname2 ... |
Remove a volume (after unmounting and removing from vfstab): | vxvol stop vol-name, then vxassist -g dg-name remove volume vol-name (or vxedit -rf rm vol-name) |
Create a VxFS file system: | mkfs -F vxfs -o largefiles /dev/vx/rdsk/dg-name/vol-name |
Snapshot a VxFS file system to an empty volume: | mount -F vxfs -o snapof=orig-vol empty-vol mount-point |
Display disk group free space: | vxdg -g dg-name free |
Display the maximum size volume that can be created: | vxassist -g dg-name maxsize [attributes] |
List physical disks: | vxdisk list |
Print VxVM configuration: | vxprint -ht |
Add a disk to VxVM: | vxdiskadm (follow menu prompts) or vxdiskadd disk-name |
Bring newly attached disks under VxVM control (it may be necessary to use format or fmthard to label the disk before running vxdiskconfig): | drvconfig; disks |
Scan devices, update VxVM device list, reconfigure DMP: | vxdctl enable |
Scan devices on OS device tree, initiate dynamic reconfig of multipathed disks. | vxdisk scandisks |
Reset a disabled vxconfigd daemon: | vxconfigd -kr reset |
Manage hot spares: | vxdiskadm (follow menu options and prompts) or vxedit set spare=[on|off] vxvm-disk-name |
Rename disks: | vxedit rename old-disk-name new-disk-name |
Rename subdisks: | vxsd mv old-subdisk-name new-subdisk-name |
Monitor volume performance: | vxstat |
Re-size a volume (but not the file system): | vxassist growto|growby|shrinkto|shrinkby volume-name length[s|m|k|g] |
Resize a volume, including the file system: | vxresize -F vxfs volume-name new-size[s|m|k|g] |
Change a volume's layout: | vxassist relayout volume-name layout= layout |
The progress of many VxVM tasks can be tracked by setting the -t flag at the time the command is run: utility -t tasktag. If the task tag is set, we can use vxtask to list, monitor, pause, resume, or abort the tagged task.
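For example (disk group, volume, and tag names invented):

vxassist -b -g datadg -t relayout01 relayout vol01 layout=stripe-mirror
vxtask list                    # show active tasks and percent complete
vxtask monitor relayout01      # watch the tagged task's progress
vxtask pause relayout01        # suspend it temporarily
vxtask resume relayout01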
Physical disks which are added to VxVM control can either be initialized (made into a native VxVM disk) or encapsulated (disk slice/partition structure is preserved). In general, disks should only be encapsulated if there is data on the slices that needs to be preserved, or if it is the boot disk. (Boot disks must be encapsulated.) Even if there is data currently on a non-boot disk, it is best to back up the data, initialize the disk, create the file systems, and restore the data.
When a disk is initialized, the VxVM-specific information is placed in a reserved location on the disk known as a private region. The public region is the portion of the disk where the data will reside.
VxVM disks can be added as one of several different categories of disks:
If there is a VxFS license for the system, as many file systems as possible should be created as VxFS file systems to take advantage of VxFS's logging, performance and reliability features.
At the time of this writing, ZFS is not an appropriate file system for use on top of VxVM volumes. Sun warns that running ZFS on VxVM volumes can cause severe performance penalties, and that it is possible that ZFS mirrors and RAID sets would be laid out in a way that compromises reliability.
After a failure, a volume's plexes may be left in a DISABLED or DETACHED state. A volume recovery can be attempted with the vxrecover -s volume-name command.

If the plexes are STALE, place the volume in maintenance mode, view the plexes, and decide which plex to use for the recovery:

- vxvol maint volume-name (The volume state will be DETACHED.)
- vxprint -ht volume-name
- vxinfo volume-name (Display additional information about unstartable plexes.)
- vxmend off plex-name (Offline bad plexes.)
- vxmend on plex-name (Online a plex as STALE rather than DISABLED.)
- vxvol start volume-name (Revive stale plexes.)
- vxplex att volume-name plex-name (Recover a stale plex.)

If the plexes are in a RECOVER state and the volume will not start, even with a -f option on the vxvol command:

- vxmend fix clean plex-name
- vxvol start volume-name
- vxplex att volume-name plex-name

Sometimes a deport and re-import of the disk group will clear lingering errors: vxdg deport dgname; vxdg import dgname

To remove a disk from VxVM control:

- vxvol stop volume-name (Stop volumes on the disk.)
- vxdg -g dg-name rmdisk disk-name (Remove disk from its disk group.)
- vxdisk offline disk-name (Offline the disk.)
- vxdiskunsetup c#t#d# (Remove the disk from VxVM control.)

To replace a failed or removed disk:

- Run vxdiskadm and choose option 4: Remove a disk for replacement. When prompted, choose "none" for the disk to replace it.
- Physically replace the disk, then rebuild the device links: drvconfig; disks
- Run vxdiskadm and choose option 5: Replace a failed or removed disk. Follow the prompts and replace the disk with the appropriate disk.

To replace a failed boot disk, use the eeprom command at the root prompt or the printenv command at the ok> prompt to make sure that the nvramrc devalias entries and the boot-device parameter are set to allow a boot from the mirror of the boot disk. If the boot paths are not set up properly for both mirrors of the boot disk, it may be necessary to move the mirror disk physically to the boot disk's location. Alternatively, the devalias command at the ok> prompt can set the mirror disk path correctly; then use nvstore to write the change to the nvram. (It is sometimes necessary to nvunalias aliasname to remove an alias from the nvramrc, then nvalias aliasname devicepath and nvstore to add the new one.)

To mark a failing disk healthy again once its problems have been addressed: vxedit set failing=off disk-name

To recover a plex stuck in an IOFAIL state (demonstrated in the session below):

- vxmend -g dgname -o force off plexname
- vxmend -g dgname on plexname
- vxmend -g dgname fix clean plexname
- vxrecover -s volname
soltest/etc/vx > vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
SC NAME PLEX CACHE DISKOFFS LENGTH [COL/]OFF DEVICE MODE
DC NAME PARENTVOL LOGVOL
SP NAME SNAPVOL DCO
EX NAME ASSOC VC PERMS MODE STATE
SR NAME KSTATE
v vol53 - DISABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 DISABLED IOFAIL 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA
soltest/etc/vx > vxmend -g testdg -o force off vol53-01
soltest/etc/vx > vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
v vol53 - DISABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 DISABLED OFFLINE 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA
soltest/etc/vx > vxmend -g testdg on vol53-01
soltest/etc/vx > vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
v vol53 - DISABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 DISABLED STALE 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA
soltest/etc/vx > vxmend -g testdg fix clean vol53-01
soltest/etc/vx > !vxprint
vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
v vol53 - DISABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 DISABLED CLEAN 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA
soltest/etc/vx > vxrecover -s vol53
soltest/etc/vx > !vxprint
vxprint -ht vol53
Disk group: testdg
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
v vol53 - ENABLED ACTIVE 20971520 SELECT - fsgen
pl vol53-01 vol53 ENABLED ACTIVE 20971520 CONCAT - RW
sd disk141-21 vol53-01 disk141 423624704 20971520 0 EMC0_2 ENA
In particular, the boot device must be mirrored; it cannot be part of a RAID-5 configuration. To mirror the boot disk:

- Set use-nvramrc? to true in the EEPROM settings: eeprom use-nvramrc?=true. If you forget, you will have to go in and manually set up the boot path for your boot mirror disk. (See "To replace a failed boot disk" in the "VxVM Maintenance" section for the procedure.) It is much easier if you set the parameter properly before mirroring the disk!
- Run vxdiskadm and choose option 6: Mirror Volumes on a Disk. Follow the prompts from the utility. It will call vxrootmir under the covers to take care of the boot disk setup portion of the operation.
Creating a mirrored-stripe volume (a mirrored-stripe volume mirrors several striped plexes; it is generally better to set up a striped-mirror volume instead):

vxassist -g dg-name make volume length layout=mirror-stripe

Creating a striped-mirror volume (striped-mirror volumes are layered volumes that stripe across underlying mirror volumes):

vxassist -g dg-name make volume length layout=stripe-mirror

Removing a plex from a mirror:

vxplex -g dg-name -o rm dis plex-name

Removing a mirror from a volume:

vxassist -g dg-name remove mirror volume-name

Removing a mirror and all associated subdisks:

vxplex -o rm dis plex-name

Dissociating a plex from a mirror (to provide a snapshot):

vxplex dis plex-name
vxmake -U gen vol new-volume-name plex=plex-name (Create a new volume with the dissociated plex.)
vxvol start new-volume-name

To re-associate the plex with the old volume:

vxvol stop new-volume-name
vxplex dis plex-name
vxplex att old-volume-name plex-name
vxedit rm new-volume-name

Removing a root disk mirror:

vxplex -o rm dis rootvol-02 swapvol-02 [other root disk volumes]
/etc/vx/bin/vxunroot
It is probably easiest to think of RPO in terms of the amount of allowable data loss. The RPO is frequently expressed in terms of its relation to the time at which replication stops, as in "less than 5 minutes of data loss." For example, if we replicate asynchronously every 15 minutes, the best RPO we can promise is 15 minutes of data loss.
The costs associated with different RPO and RTO values will be determined by the type of application and its business purpose. Some applications may be able to tolerate unplanned outages of up to days without incurring substantial costs. Other applications may cause significant business-side problems with even minor amounts of unscheduled downtime.
Different applications and environments have different tolerances for RPO and RTO. Some applications might be able to tolerate a potential data loss of days or even weeks; some may not be able to tolerate any data loss at all. Some applications can remain unavailable long enough for us to purchase a new system and restore from tape; some cannot.
Many types of replication solutions can be implemented at a server, disk storage, or storage network level. Each has unique advantages and disadvantages. Server replication tends to be cheapest, but also involves using server cycles to manage the replication. Storage network replication is extremely flexible, but can be more difficult to configure. Disk storage replication tends to be rock solid, but is usually limited in terms of supported hardware for the replication target.
Regardless of where we choose to implement our data replication solution, we will still face a lot of the same issues. One issue that needs to be addressed is re-silvering of a replication solution that has been partitioned for some amount of time. Ideally, only the changed sections of the disks will need to be re-replicated. Some less sophisticated solutions require a re-silvering of the entire storage area, which can take a long time and soak up a lot of bandwidth. Re-silvering is an issue that needs to be investigated during the product evaluation.
The type of recovery that is appropriate for each service will depend on the importance of the service and what the tolerance for downtime is for that service.
There are five generally-recognized approaches to recovery architecture.
Most (but not all) PROM environment variables can be set with the /usr/sbin/eeprom command. When invoked by itself, it prints out the current environment variables. To use eeprom to set a variable, use the syntax:

/usr/sbin/eeprom variable_name=value
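For example, to view and then change the boot device from a running system (mirrordisk is an invented device alias):

/usr/sbin/eeprom boot-device
/usr/sbin/eeprom boot-device="disk mirrordisk"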
All PROM environment variables can be set at the ok> prompt. The printenv command prints out the current settings. The syntax for setting a variable is:

setenv variable_name value
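For example, to examine and reset the boot device list at the ok> prompt (the values shown are typical defaults, not a recommendation):

ok> printenv boot-device
ok> setenv boot-device disk net

Most changes made with setenv take effect at the next system reset.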