Monday, June 24, 2013

Solaris Volume Manager (DiskSuite)

Solaris Volume Manager (formerly known as DiskSuite) provides a way to mirror, stripe or RAID-5 local disks. New functionality is constantly being added to the base software. A full discussion is beyond the scope of this article, so we will focus on the most common cases, how to set them up, how to manage them and how to maintain them. Additional information is available in the Solaris Volume Manager Administration Guide.

State Database

Solaris Volume Manager uses a state database to store its configuration and state information. (State information refers to the condition of the devices.) Multiple replicas are required for redundancy. At least four should be created on at least two different physical disk devices. It is much better to have at least six replicas on at least three different physical disks, spread across multiple controller channels, if possible.

In the event that the state databases disagree, a majority of configured state databases determines which version of reality is correct. This is why it is important to configure multiple replicas. A minimum of three database replicas must be available in order to boot without human assistance, so it makes sense to create database replicas liberally. They don't take up much space, and there is very little overhead associated with their maintenance. On JBOD (Just a Bunch Of Disks) arrays, I recommend at least two replicas on each disk device.

State database replicas consume between 4 and 16 MB of space, and should ideally be placed on a partition specifically set aside for that purpose. In the event that state database information is lost, it is possible to lose the data stored on the managed disks, so the database replicas should be spread over as much of the disk infrastructure as possible.

State database locations are recorded in /etc/opt/SUNWmd/mddb.cf. Depending on their condition, repair may or may not be possible. Metadevices (the objects which Solaris Volume Manager manipulates) may be placed on a partition with a state database if the state database is there first. The initial state databases can be created by specifying the slices on which they will live as follows:
metadb -a -f -c 2 slice-name1 slice-name2

Because database replicas cannot be created on a slice that is already in use, it is frequently the case that we will steal space from swap to create a small partition for the replicas. To do so, we need to boot to single-user mode, use swap -d to remove all swap devices, and use format to re-partition the swap slice, freeing up space for a separate partition for the database replicas. Since the replicas are small, very few cylinders will be required.
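The procedure above can be sketched as a dry run. The device names (c0t0d0s1 for swap, c0t0d0s7 for the new replica slice) are hypothetical, and the commands are echoed rather than executed, since they are destructive and Solaris-specific:

```shell
#!/bin/sh
# Dry-run sketch of freeing swap space for replica slices.
# Device names are hypothetical; commands are printed, not run.
swapdev=/dev/dsk/c0t0d0s1   # hypothetical swap slice
cmds="swap -d $swapdev           # single-user mode: release the swap device
format                          # shrink the swap slice; create s7 for replicas
metadb -a -f -c 2 c0t0d0s7     # place two replicas on the new slice
swap -a $swapdev                # re-add the now-smaller swap slice"
echo "$cmds"
```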

Metadevice Management

The basic types of metadevices are:
  • Simple: Stripes or concatenations--consist only of physical slices.
  • Mirror: Multiple copies of data on simple metadevices (submirrors).
  • RAID5: Composed of multiple slices; includes distributed parity.
  • Trans: Master metadevice plus logging device.

Solaris Volume Manager can build metadevices either by using partitions as the basic building blocks, or by dividing a single large partition into soft partitions. Soft partitions are a way that SVM allows us to carve a single disk into more than 8 slices. We can either build soft partitions directly on a disk slice, or we can mirror (or RAID) slices, then carve up the resulting metadevice into soft partitions to build volumes.
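As a sketch of the second approach (mirror first, then carve the mirror into soft partitions), a hypothetical md.tab might look like the following. The device names and sizes are invented for illustration:

```
# md.tab sketch (hypothetical devices): mirror two slices,
# then carve the mirror into soft-partition volumes
d5 1 1 c1t0d0s4
d6 1 1 c1t1d0s4
d7 -m d5
d10 -p d7 2g
d11 -p d7 4g
```

After metainit -a builds these devices, the second submirror would be attached with metattach d7 d6.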

Disksets are collections of disks that are managed together, in the same way that a Veritas Volume Manager (VxVM) disk group is managed together. Unlike in VxVM, SVM does not require us to explicitly specify a disk group. If Disksets are configured, we need to specify the set name for monitoring or management commands with a -s setname option. Disksets may be created as shared disksets, where multiple servers may be able to access them. (This is useful in an environment like Sun Cluster, for example.) In that case, we specify some hosts as mediators who determine who owns the diskset. (Note that disks added to shared disksets are re-partitioned in the expectation that we will use soft partitions.)

When metadevices need to be addressed by OS commands (like mkfs), we can reference them with device links of the form /dev/md/rdsk/d# or, for disksets, /dev/md/disksetname/rdsk/d#.

Here are the main command-line utilities within SVM:

Command        Description
metaclear      Deletes active metadevices and hot spare pools.
metadb         Manages state database replicas.
metadetach     Detaches a metadevice from a mirror, or a logging device from a trans-metadevice.
metahs         Manages hot spares and hot spare pools.
metainit       Configures metadevices.
metaoffline    Takes submirrors offline.
metaonline     Places submirrors online.
metaparam      Modifies metadevice parameters.
metarename     Renames and switches metadevice names.
metareplace    Replaces slices of submirrors and RAID5 metadevices.
metaroot       Sets up system files for mirroring root.
metaset        Administers disksets.
metastat       Checks metadevice health and state.
metattach      Attaches a metadevice to a mirror, or a log to a trans-metadevice.

Here is how to perform several common types of operations in Solaris Volume Manager:

Create state database replicas:
    metadb -a -f -c 2 c#t0d#s# c#t1d#s#

Mirror the root partition:
    Create a metadevice for the root partition:
        metainit -f d0 1 1 c#t0d#s#
    Create a metadevice for the root mirror partition:
        metainit d1 1 1 c#t1d#s#
    Set up a one-way mirror:
        metainit d2 -m d0
    Edit the vfstab and system files:
        metaroot d2
        lockfs -fa
        reboot
    Attach the root mirror:
        metattach d2 d1

Mirror the swap partition:
    Create metadevices for the swap partition and mirror:
        metainit -f d5 1 1 c#t0d#s#
        metainit -f d6 1 1 c#t1d#s#
    Set up a one-way mirror:
        metainit d7 -m d5
    Attach the submirror to the mirror:
        metattach d7 d6
    Edit vfstab to mount the swap mirror as a swap device, using the root entry as a template.

Create a striped metadevice:
    metainit d# #stripes #slices c#t#d#s#...

Create a striped metadevice with a non-default interlace size:
    Add an -i interlace option (for example, -i 32k).

Concatenate slices:
    metainit d# #slices 1 c#t#d#s# 1 c#t#d#s#...

Create a soft partition metadevice:
    metainit dnew# -p dsource# size

Create a RAID5 metadevice:
    metainit d# -r c#t#d#s# c#t#d#s# c#t#d#s#...

Manage hot spares:
    Create a hot spare pool:
        metainit hsp001 c#t#d#s#...
    Add a slice to a pool:
        metahs -a hsp### /dev/dsk/c#t#d#s#
    Add a slice to all pools:
        metahs -a all /dev/dsk/c#t#d#s#

Diskset management:
    Deport a diskset:
        metaset -s setname -r
    Import a diskset:
        metaset -s setname -t -f
    Add hosts to a shared diskset:
        metaset -s setname -a -h hostname1 hostname2
    Add mediators to a shared diskset:
        metaset -s setname -a -m hostname1 hostname2
    Add devices to a shared diskset:
        metaset -s setname -a /dev/did/rdsk/d# /dev/did/rdsk/d#
    Check diskset status:
        metaset
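One step worth flagging when mirroring a root disk: after attaching the root submirror, the second disk needs its own boot block, or the system cannot boot from it if the primary fails. This sketch only prints the command to run by hand; the slice name c0t1d0s0 is hypothetical, and the path shown is the usual SPARC/UFS bootblk location:

```shell
#!/bin/sh
# Print the installboot command for the root mirror disk
# (hypothetical slice c0t1d0s0); run it by hand after metattach.
bootblk="/usr/platform/$(uname -i)/lib/fs/ufs/bootblk"
cmd="installboot $bootblk /dev/rdsk/c0t1d0s0"
echo "$cmd"
```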

Solaris Volume Manager Monitoring

Solaris Volume Manager provides facilities for monitoring its metadevices. In particular, the metadb command monitors the database replicas, and the metastat command monitors the metadevices and hot spares.

Status messages that may be reported by metastat for a disk mirror include:

  • Okay: No errors, functioning correctly.
  • Resyncing: Actively being resynced following error detection or maintenance.
  • Maintenance: I/O or open error; all reads and writes have been discontinued.
  • Last Erred: I/O or open errors encountered, but no other copies available.

Hot spare status messages reported by metastat are:

  • Available: Ready to accept failover.
  • In-Use: Other slices have failed onto this device.
  • Attention: Problem with hot spare or pool.
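For unattended monitoring, metastat output can be scanned for the trouble states listed above. A minimal sketch, run here against canned sample output (on a live system, replace the variable with output=$(metastat)):

```shell
#!/bin/sh
# Count metadevice states needing attention in metastat output.
# Sample output is canned here for illustration; on a real system,
# use:  output=$(metastat)
output='d0: Mirror
    Submirror 0: d1
      State: Okay
    Submirror 1: d2
      State: Needs maintenance'

# One line per metadevice/component in a trouble state
problems=$(echo "$output" | grep -c -e "Needs maintenance" -e "Last Erred")
echo "$problems problem state(s) found"
```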

Solaris Volume Manager Maintenance

Solaris Volume Manager is very reliable. As long as it is not misconfigured, there should be relatively little maintenance to be performed on Volume Manager itself. If the Volume Manager database is lost, however, it may need to be rebuilt in order to recover access to the data. To recover a system configuration:

 1. Make a backup copy of /etc/opt/SUNWmd/md.cf.
 2. Re-create the state databases:
        metadb -a -f -c 2 c#t#d#s# c#t#d#s#
 3. Copy md.cf to md.tab.
 4. Edit md.tab so that all mirrors are one-way mirrors and all RAID5 devices are re-created with -k (to prevent re-initialization).
 5. Verify that the md.tab configuration is valid:
        metainit -n -a
 6. Re-create the configuration:
        metainit -a
 7. Re-attach any mirrors:
        metattach dmirror# dsubmirror#
 8. Verify that things are okay:
        metastat
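When editing md.tab during recovery, a mirror entry copied from md.cf would be trimmed to one submirror, and a RAID5 entry would gain -k. The entries below are hypothetical, invented to illustrate the shape of the edit:

```
# As copied from md.cf (two-way mirror, RAID5):
d0 -m d1 d2 1
d30 -r c1t1d0s4 c1t2d0s4 c1t3d0s4

# As edited in md.tab (one-way mirror; -k preserves RAID5 data):
d0 -m d1 1
d30 -r c1t1d0s4 c1t2d0s4 c1t3d0s4 -k
```

The remaining submirrors (d2 here) would be re-attached with metattach after metainit -a completes.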

More frequently, Solaris Volume Manager will be needed when replacing a failed piece of hardware. To replace a disk which is reporting errors but has not yet failed (as in Example 10-2):

 1. Add database replicas to unaffected disks until at least three exist outside of the failing disk.
 2. Remove any replicas from the failing disk:
        metadb -d c#t#d#s#
 3. Detach and remove submirrors and hot spares on the failing disk from their mirrors and pools:
        metadetach dmirror# dsubmirror#
        metaclear -r dsubmirror#
        metahs -d hsp# c#t#d#s#
 4. If the boot disk is being replaced, find the /devices name of the boot disk mirror:
        ls -l /dev/rdsk/c#t#d#s0
 5. If the removed disk is a fibre channel disk, remove the /dev/dsk and /dev/rdsk links for the device.
 6. Physically replace the disk. This may involve shutting down the system if the disk is not hot-swappable.
 7. Re-build any /dev and /devices links: drvconfig; disks (or boot -r).
 8. Format and re-partition the disk appropriately.
 9. Re-add any removed database replicas:
        metadb -a -c #databases c#t#d#s#
10. Re-create and re-attach any removed submirrors:
        metainit dsubmirror# 1 1 c#t#d#s#
        metattach dmirror# dsubmirror#
11. Re-create any removed hot spares.

Replacing a disk that has already failed (as in Example 10-3) is a similar procedure. The differences are:

  • Remove database replicas and hot spares as above; the submirrors will not be removable.
  • After replacing the disk as above, re-enable the submirror components with metareplace:
        metareplace -e dmirror# c#t#d#s#

Barring a misconfiguration, Solaris Volume Manager is a tremendous tool for increasing the reliability and redundancy of a server. More importantly, it allows us to postpone maintenance for a hard drive failure until the next maintenance window. The metastat tool is quite useful for identifying and diagnosing problems; along with iostat -Ee, we can often catch problems before they reach the point where the disk has actually failed. Example 10-2 shows how to replace a failing (but not yet failed) mirrored disk.
(In this case, we were able to hot-swap the disk, so no reboot was necessary. Since the disks were SCSI, we also did not need to remove or rebuild any /dev links.)

Replacing a Failing Disk with Solaris Volume Manager

# metastat
d0: Mirror
Submirror 0: d1
State: Okay
Submirror 1: d2
State: Okay
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 20484288 blocks

d1: Submirror of d0
State: Okay
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t0d0s0 0 No Okay


d2: Submirror of d0
State: Okay
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t1d0s0 0 No Okay
...
# iostat -E
sd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0707 Serial No: 3HZ...
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd1 Soft Errors: 593 Hard Errors: 28 Transport Errors: 1
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0707 Serial No: 3HZ...
Size: 73.40GB <73400057856 bytes>
Media Error: 24 Device Not Ready: 0 No Device: 1 Recoverable: 593
Illegal Request: 0 Predictive Failure Analysis: 1
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t0d0s3
a p luo 1050 1034 /dev/dsk/c0t0d0s3
a p luo 2084 1034 /dev/dsk/c0t0d0s3
a p luo 16 1034 /dev/dsk/c0t1d0s3
a p luo 1050 1034 /dev/dsk/c0t1d0s3
a p luo 2084 1034 /dev/dsk/c0t1d0s3

# metadb -d c0t1d0s3
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t0d0s3
a p luo 1050 1034 /dev/dsk/c0t0d0s3
a p luo 2084 1034 /dev/dsk/c0t0d0s3
# metadetach d40 d42
d40: submirror d42 is detached
# metaclear -r d42
d42: Concat/Stripe is cleared
...
# metadetach d0 d2
d0: submirror d2 is detached
# metaclear -r d2
d2: Concat/Stripe is cleared
...
[Disk hot-swapped. No reboot or device reconfiguration necessary for this replacement]
...
# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c0t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c0t1d0
/pci@1c,600000/scsi@2/sd@1,0
Specify disk (enter its number): 0
selecting c0t0d0
[disk formatted]


FORMAT MENU:
...
format> part


PARTITION MENU:
0 - change `0' partition
...
print - display the current table
label - write partition map and label to the disk
! - execute , then return
quit
partition> pr
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
...
partition> q


FORMAT MENU:
...
format> di


AVAILABLE DISK SELECTIONS:
0. c0t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c0t1d0
/pci@1c,600000/scsi@2/sd@1,0
Specify disk (enter its number)[0]: 1
selecting c0t1d0
[disk formatted]
format> part
...
[sd1 partitioned to match sd0's layout]
...
partition> 7
Part Tag Flag Cylinders Size Blocks
7 unassigned wm 0 0 (0/0/0) 0

Enter partition id tag[unassigned]:
Enter partition permission flags[wm]:
Enter new starting cyl[0]: 4835
Enter partition size[0b, 0c, 0.00mb, 0.00gb]: 9252c
partition> la
Ready to label disk, continue? y

partition> pr
Current partition table (unnamed):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
...
partition> q
...
# metadb -a -c 3 c0t1d0s3
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t0d0s3
a p luo 1050 1034 /dev/dsk/c0t0d0s3
a p luo 2084 1034 /dev/dsk/c0t0d0s3
a u 16 1034 /dev/dsk/c0t1d0s3
a u 1050 1034 /dev/dsk/c0t1d0s3
a u 2084 1034 /dev/dsk/c0t1d0s3
# metainit d2 1 1 c0t1d0s0
d2: Concat/Stripe is setup
# metattach d0 d2
d0: submirror d2 is attached
[Re-create and attach the remainder of the submirrors.]
...
# metastat
d0: Mirror
Submirror 0: d1
State: Okay
Submirror 1: d2
State: Resyncing
Resync in progress: 10 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 20484288 blocks

d1: Submirror of d0
State: Okay
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t0d0s0 0 No Okay


d2: Submirror of d0
State: Resyncing
Size: 20484288 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c0t1d0s0 0 No Okay

It is important to format the replacement disk to match the cylinder layout of the disk that is being replaced. If this is not done, mirrors and stripes will not rebuild properly.

When you replace a disk that has already failed, there is no ability to remove the submirrors. Instead, the metareplace -e command is used to re-sync the mirror onto the new disk.

Replacing a Failed Disk with Solaris Volume Manager

# iostat -E
...
sd1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 5
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7Z3CJ00007505
Size: 73.40GB <73400057856 bytes>
...
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c1t0d0s3
a p luo 1050 1034 /dev/dsk/c1t0d0s3
W p l 16 1034 /dev/dsk/c1t1d0s3
W p l 1050 1034 /dev/dsk/c1t1d0s3
a p luo 16 1034 /dev/dsk/c1t2d0s3
a p luo 1050 1034 /dev/dsk/c1t2d0s3
a p luo 16 1034 /dev/dsk/c1t3d0s3
a p luo 1050 1034 /dev/dsk/c1t3d0s3
# metadb -d /dev/dsk/c1t1d0s3
# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c1t1d0
/pci@1c,600000/scsi@2/sd@1,0
2. c1t2d0
/pci@1c,600000/scsi@2/sd@2,0
3. c1t3d0
/pci@1c,600000/scsi@2/sd@3,0
Specify disk (enter its number): 0
selecting c1t0d0
[disk formatted]


FORMAT MENU:
...
partition - select (define) a partition table
...
format> part


PARTITION MENU:
...
print - display the current table
...
partition> pr
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
1 swap wu 2013 - 2214 1003.69MB (202/0/0) 2055552
2 backup wm 0 - 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 2215 - 2217 14.91MB (3/0/0) 30528
4 unassigned wm 2218 - 5035 13.67GB (2818/0/0) 28675968
5 unassigned wm 5036 - 12080 34.18GB (7045/0/0) 71689920
6 var wm 12081 - 12684 2.93GB (604/0/0) 6146304
7 home wm 12685 - 14086 6.80GB (1402/0/0) 14266752

partition> q


FORMAT MENU:
disk - select a disk
...
format> di


AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@1c,600000/scsi@2/sd@0,0
1. c1t1d0
/pci@1c,600000/scsi@2/sd@1,0
2. c1t2d0
/pci@1c,600000/scsi@2/sd@2,0
3. c1t3d0
/pci@1c,600000/scsi@2/sd@3,0
Specify disk (enter its number)[0]: 1
format> part


PARTITION MENU:
...
partition> pr
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 25 129.19MB (26/0/0) 264576
1 swap wu 26 - 51 129.19MB (26/0/0) 264576
2 backup wu 0 - 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 usr wm 52 - 14086 68.10GB (14035/0/0) 142820160
7 unassigned wm 0 0 (0/0/0) 0

...
partition> 7
Part Tag Flag Cylinders Size Blocks
7 unassigned wm 0 0 (0/0/0) 0

Enter partition id tag[unassigned]: home
Enter partition permission flags[wm]:
Enter new starting cyl[0]: 12685
Enter partition size[0b, 0c, 0.00mb, 0.00gb]: 1402c
partition> pr
Current partition table (unnamed):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 2012 9.77GB (2013/0/0) 20484288
1 swap wu 2013 - 2214 1003.69MB (202/0/0) 2055552
2 backup wu 0 - 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 2215 - 2217 14.91MB (3/0/0) 30528
4 unassigned wm 2218 - 5035 13.67GB (2818/0/0) 28675968
5 unassigned wm 5036 - 12080 34.18GB (7045/0/0) 71689920
6 var wm 12081 - 12684 2.93GB (604/0/0) 6146304
7 home wm 12685 - 14086 6.80GB (1402/0/0) 14266752

partition> la
Ready to label disk, continue? y

partition> q
...
# metastat
...
d19: Mirror
Submirror 0: d17
State: Okay
Submirror 1: d18
State: Needs maintenance
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 14266752 blocks

d17: Submirror of d19
State: Okay
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t0d0s7 0 No Okay


d18: Submirror of d19
State: Needs maintenance
Invoke: metareplace d19 c1t1d0s7
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t1d0s7 0 No Maintenance
...
# metareplace -e d19 c1t1d0s7
d19: device c1t1d0s7 is enabled
# metareplace -e d16 c1t1d0s6
d16: device c1t1d0s6 is enabled
# metareplace -e d13 c1t1d0s5
d13: device c1t1d0s5 is enabled
# metareplace -e d10 c1t1d0s4
d10: device c1t1d0s4 is enabled
# metareplace -e d2 c1t1d0s0
d2: device c1t1d0s0 is enabled
# metastat
...
d19: Mirror
Submirror 0: d17
State: Okay
Submirror 1: d18
State: Resyncing
Resync in progress: 10 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 14266752 blocks

d17: Submirror of d19
State: Okay
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t0d0s7 0 No Okay


d18: Submirror of d19
State: Resyncing
Size: 14266752 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c1t1d0s7 0 No Resyncing
...
# metadb -a -c 2 c1t1d0s3
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c1t0d0s3
a p luo 1050 1034 /dev/dsk/c1t0d0s3
a u 16 1034 /dev/dsk/c1t1d0s3
a u 1050 1034 /dev/dsk/c1t1d0s3
a p luo 16 1034 /dev/dsk/c1t2d0s3
a p luo 1050 1034 /dev/dsk/c1t2d0s3
a p luo 16 1034 /dev/dsk/c1t3d0s3
a p luo 1050 1034 /dev/dsk/c1t3d0s3
