Sun Cluster
Introduction
Sun Cluster 3.2 has the following features and limitations:
- Support for 2-16 nodes.
- Global device capability: devices can be shared across the cluster.
- Global file system: allows a file system to be accessed
simultaneously by all cluster nodes.
- Tight implementation with Solaris--The cluster framework
services have been implemented in the kernel.
- Application agent support.
- Tight integration with zones.
- Each node must run the same revision and update of the
Solaris OS.
- Two-node clusters must have at least one quorum device.
- Each cluster needs at least two separate private networks.
(Supported hardware, such as ce and bge, may use tagged VLANs to run
private and public networks on the same physical connection.)
- Each node's boot disk should include a 500 MB partition mounted at
/globaldevices prior to cluster installation. At least 750 MB of swap
is also required.
- Attached storage must be multiply connected to the nodes.
- ZFS is a supported file system and volume manager. Veritas Volume
Manager (VxVM) and Solaris Volume Manager (SVM) are also supported
volume managers.
- Veritas multipathing (vxdmp) is not supported. Since vxdmp must be
enabled for current VxVM versions, it must be used in conjunction with
mpxio or a similar solution such as EMC's Powerpath.
- SMF services can be integrated into the cluster, and all framework
daemons are defined as SMF services.
- PCI and SBus based systems cannot be mixed in the same cluster.
- Boot devices cannot be on a disk that is shared with other
cluster nodes. Doing this may lead to a locked-up cluster
due to data fencing.
The overall health of the cluster may be monitored using the
cluster status or scstat -v commands.
Other useful options include:
- scstat -g: Resource group status
- scstat -D: Device group status
- scstat -W: Heartbeat status
- scstat -i: IPMP status
- scstat -n: Node status
Failover applications (also known as "cluster-unaware" applications
in the Sun Cluster documentation) are controlled by rgmd
(the resource group manager daemon). Each application has a data
service agent, which is the way that the cluster controls application
startups, shutdowns, and monitoring. Each application is typically
paired with an IP address, which will follow the application to the
new node when a failover occurs.
"Scalable" applications are able to run on
several nodes concurrently. The clustering software provides load
balancing and makes a single service IP address available for outside
entities to query the application.
"Cluster aware" applications take this one step further, and have
cluster awareness programmed into the application. Oracle RAC is
a good example of such an application.
All the nodes in the cluster may be shut down with
cluster shutdown -y -g0. To boot a node outside of the cluster
(for troubleshooting or recovery operations), run boot -x.
clsetup is a menu-based utility that can be used to perform a broad
variety of configuration tasks, including configuration of resources
and resource groups.
Cluster Configuration
The cluster's configuration information is stored in global files
known as the "cluster configuration repository" (CCR). The cluster
framework files in /etc/cluster/ccr should not be edited manually;
they should be managed via the administrative commands.
The cluster show command displays the cluster configuration in a
nicely-formatted report.
The CCR contains:
- Names of the cluster and the nodes.
- The configuration of the cluster transport.
- Device group configuration.
- Nodes that can master each device group.
- NAS device information (if relevant).
- Data service parameter values and callback method paths.
- Disk ID (DID) configuration.
- Cluster status.
Some commands to directly maintain the CCR are:
- ccradm: Allows (among other things) a checksum re-configuration of
files in /etc/cluster/ccr after manual edits. (Do NOT edit these files
manually unless there is no other option. Even then, back up the
original files.)
ccradm -i /etc/cluster/ccr/filename -o
- scgdevs: Brings new devices under cluster control after they have
been discovered by devfsadm.
The scinstall
and clsetup
commands may
We have observed that the installation process may disrupt a previously
installed NTP configuration (even though the installation notes promise
that this will not happen). It may be worth using ntpq
to verify that NTP is still working properly after a cluster installation.
Resource Groups
Resource groups are collections of resources, including data services.
Examples of resources include disk sets, virtual IP addresses, or
server processes like httpd.
Resource groups may either be failover or scalable resource groups.
Failover resource groups allow groups of services to be started on a
node together if the active node fails. Scalable resource groups run
on several nodes at once.
rgmd is the resource group manager daemon. It is responsible for
monitoring, stopping, and starting the resources within the different
resource groups.
Some common resource types are:
- SUNW.LogicalHostname: Logical IP address associated with a failover
service.
- SUNW.SharedAddress: Logical IP address shared between nodes running a
scalable resource group.
- SUNW.HAStoragePlus: Manages global raw devices, global file systems,
non-ZFS failover file systems, and failover ZFS zpools.
Resource groups also handle resource and resource group dependencies.
Sun Cluster allows services to start or stop in a particular order.
Dependencies are a particular type of resource property. The
r_properties man page contains a list of resource properties and their
meanings. The rg_properties man page has similar information for
resource groups. In particular, the Resource_dependencies property
specifies something on which the resource is dependent.
Some resource group cluster commands are:
- clrt register resource-type: Register a resource type.
- clrt register -n node1name,node2name resource-type: Register a
resource type to specific nodes.
- clrt unregister resource-type: Unregister a resource type.
- clrt list -v: List all resource types and their associated node lists.
- clrt show resource-type: Display all information for a resource type.
- clrg create -n node1name,node2name rgname: Create a resource group.
- clrg delete rgname: Delete a resource group.
- clrg set -p property-name=value rgname: Set a property.
- clrg show -v rgname: Show resource group information.
- clrs create -t HAStoragePlus -g rgname -p AffinityOn=true
-p FilesystemMountPoints=/mountpoint resource-name: Create an
HAStoragePlus resource in a resource group.
- clrg online -M rgname: Bring a resource group online in a managed
state.
- clrg switch -M -n nodename rgname: Switch a resource group to the
named node.
- clrg offline rgname: Offline the resource group, but leave it in a
managed state.
- clrg restart rgname: Restart a resource group.
- clrs disable resource-name: Disable a resource and its fault monitor.
- clrs enable resource-name: Re-enable a resource and its fault monitor.
- clrs clear -n nodename -f STOP_FAILED resource-name: Clear the
STOP_FAILED error flag on a resource.
- clrs unmonitor resource-name: Disable the fault monitor, but leave
the resource running.
- clrs monitor resource-name: Re-enable the fault monitor for a
resource that is currently enabled.
- clrg suspend rgname: Preserves online status of group, but does not
continue monitoring.
- clrg resume rgname: Resumes monitoring of a suspended group.
- clrg status: List status of resource groups.
- clrs status -g rgname: List status of the resources in a resource
group.
Data Services
A data service agent is a set of components that allow a data service
to be monitored and fail over within the cluster. The agent includes
methods for starting, stopping, monitoring, or failing over the data
service. It also includes a registration information file that allows
the CCR to store the information about these methods. This information
is encapsulated as a resource type.
The fault monitors for a data service place the daemons under the
control of the process monitoring facility (rpc.pmfd), and probe the
service itself using client commands.
Public Network
The public network uses pnmd (Public Network Management Daemon) and
the IPMP in.mpathd daemon to monitor and control the public network
addresses.
IPMP should be used to provide failovers for the public network paths.
The health of the IPMP elements can be monitored with scstat -i.
The clrslh and clrssa commands are used to configure logical and
shared hostnames, respectively.
clrslh create -g rgname logical-hostname
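The logical hostname command combines with the resource group commands listed earlier to build a simple failover group. The following is a minimal sketch, not a definitive procedure: it only assembles command strings (nothing is executed), the function name and all example values (web-rg, web-lh, /web/data, web-stor) are hypothetical, and it assumes the SUNW.HAStoragePlus resource type has not yet been registered.

```python
from typing import List


def failover_group_commands(rgname: str, nodes: List[str],
                            logical_hostname: str, mount_point: str,
                            storage_resource: str) -> List[str]:
    """Return, in order, the commands that would build a failover group.

    The command forms are the ones listed in this document; the
    arguments are placeholders supplied by the caller.
    """
    node_list = ",".join(nodes)
    return [
        # Register the storage resource type before using it.
        "clrt register SUNW.HAStoragePlus",
        # Create the (still empty) resource group on the candidate nodes.
        f"clrg create -n {node_list} {rgname}",
        # Add the logical hostname that follows the service on failover.
        f"clrslh create -g {rgname} {logical_hostname}",
        # Add the failover file system.
        f"clrs create -t HAStoragePlus -g {rgname} -p AffinityOn=true "
        f"-p FilesystemMountPoints={mount_point} {storage_resource}",
        # Bring the group online in a managed state.
        f"clrg online -M {rgname}",
    ]


if __name__ == "__main__":
    for command in failover_group_commands(
            "web-rg", ["node1", "node2"], "web-lh", "/web/data", "web-stor"):
        print(command)
```

Printing the commands rather than running them makes it possible to review the sequence before pasting it into a root shell on a cluster node.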
Private Network
The "private," or "cluster transport" network is used to provide a
heartbeat between the nodes so that they can determine which nodes are
available. The cluster transport network is also used for traffic related
to global devices.
While a 2-node cluster may use crossover cables to construct a private
network, switches should be used for anything more than two nodes.
(Ideally, separate switching equipment should be used for each path
so that there is no single point of failure.)
The default base IP address is 172.16.0.0, and private networks are
assigned subnets based on the results of the cluster setup.
Available network interfaces can be identified by using a combination
of dladm show-dev and ifconfig.
Private networks should be installed and configured using the
scinstall command during cluster configuration. Make sure that the
interfaces in question are connected, but down and unplumbed before
configuration. The clsetup command also has menu options to guide you
through the private network setup process.
Alternatively, something like the following command string can be used
to establish a private network:
clintr add nodename1:ifname1
clintr add nodename2:ifname2
clintr add switchname
clintr add nodename1:ifname1,switchname
clintr add nodename2:ifname2,switchname
clintr status
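For clusters with more than two nodes, the same command pattern repeats for every node. The sketch below generates that sequence from a list of endpoints. It only builds strings in the order shown above; the function name and the example node, interface, and switch names are hypothetical.

```python
from typing import List, Tuple


def private_network_commands(endpoints: List[Tuple[str, str]],
                             switch: str) -> List[str]:
    """Build the clintr command sequence for one private network path.

    endpoints is a list of (nodename, interface) pairs that all attach
    to the same switch.
    """
    commands = []
    # 1. Add each node's transport adapter.
    for node, ifname in endpoints:
        commands.append(f"clintr add {node}:{ifname}")
    # 2. Add the switch itself.
    commands.append(f"clintr add {switch}")
    # 3. Add a cable from each adapter to the switch.
    for node, ifname in endpoints:
        commands.append(f"clintr add {node}:{ifname},{switch}")
    # 4. Check the result.
    commands.append("clintr status")
    return commands


if __name__ == "__main__":
    path = [("node1", "bge1"), ("node2", "bge1"), ("node3", "bge1")]
    for command in private_network_commands(path, "switch1"):
        print(command)
```

Since each cluster needs at least two private networks, the function would be called once per path, ideally with a different switch each time.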
The health of the heartbeat networks can be checked with the scstat -W
command. The physical paths may be checked with clintr status or
cluster status -t intr.
Quorum
Sun Cluster uses a quorum voting system to prevent split-brain and
cluster amnesia. The Sun Cluster documentation refers to "failure fencing"
as the mechanism to prevent split-brain (where two nodes run the same
service at the same time, leading to potential data corruption).
"Amnesia" occurs when a change is made to the cluster while a node is down,
then that node attempts to bring up the cluster. This can result in the
changes being forgotten, hence the use of the word "amnesia."
One result of this is that the last node to leave a cluster when it is shut
down must be the first node to re-enter the cluster. Later in this section,
we will discuss ways of circumventing this protection.
Quorum voting is defined by allowing each device one vote. A quorum device
may be a cluster node, a specified external server running quorum software,
or a disk or NAS device. A majority of all defined quorum votes is required
in order to form a cluster. At least half of the quorum votes must be present
in order for cluster services to remain in operation. (If a node cannot contact
at least half of the quorum votes, it will panic. During the reboot, if a
majority cannot be contacted, the boot process will be frozen. Nodes that
are removed from the cluster due to a quorum problem also lose access to any
shared file systems. This is called
"data fencing" in the Sun Cluster documentation.)
- Quorum devices must be available to at least two nodes in the cluster.
- Disk quorum devices may also contain user data. (Note that if a ZFS disk
is used as a quorum device, it should be brought into the zpool before being
specified as a quorum device.)
- Sun recommends configuring n-1 quorum devices (the number of nodes minus 1).
Two node clusters must contain at least one quorum device.
- Disk quorum devices must be specified using the DID names.
- Quorum disk devices should be at least as available as the storage
underlying the cluster resource groups.
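The voting rules above reduce to a little arithmetic. The sketch below assumes one vote per node and per quorum device, as described in this section; the function names are illustrative and are not part of any Sun Cluster tool.

```python
def votes_needed_to_form(total_votes: int) -> int:
    """A strict majority of all configured votes is needed to form a cluster."""
    return total_votes // 2 + 1


def can_form_cluster(present_votes: int, total_votes: int) -> bool:
    """True if the votes present are a majority of the configured votes."""
    return present_votes >= votes_needed_to_form(total_votes)


def can_stay_up(present_votes: int, total_votes: int) -> bool:
    """True if at least half of the configured votes are still reachable."""
    return 2 * present_votes >= total_votes


def recommended_quorum_devices(node_count: int) -> int:
    """The n-1 recommendation: number of nodes minus one."""
    return max(node_count - 1, 0)


if __name__ == "__main__":
    # Two nodes plus one quorum disk: three configured votes.
    total = 2 + recommended_quorum_devices(2)
    print("votes configured:", total)
    print("needed to form:", votes_needed_to_form(total))
    # A surviving node that holds the quorum disk has two votes.
    print("node + disk can form:", can_form_cluster(2, total))
    # A node that lost the race to the disk has only its own vote.
    print("lone node stays up:", can_stay_up(1, total))
```

The two-node case shows why a quorum device is mandatory there: without it, a single surviving node would hold one of two votes, which is not a majority.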
Quorum status and configuration may be investigated using:
These commands report on the configured quorum votes, whether they are present,
and how many are required for a majority.
Quorum devices can be manipulated through the following commands:
- clq add did-device-name
- clq remove did-device-name: (Only removes the device from the quorum
configuration. No data on the device is affected.)
- clq enable did-device-name
- clq disable did-device-name: (Removes the quorum device from the
total list of available quorum votes. This might be valuable if the
device is down for maintenance.)
- clq reset: (Resets the configuration to the default.)
By default, doubly-connected disk quorum devices use SCSI-2 locking.
Devices connected to more than two nodes use SCSI-3 locking. SCSI-3
offers persistent reservations, but SCSI-2 requires the use of emulation
software. The emulation software uses a 64-bit reservation key written
to a private area on the disk.
In either case, the cluster node that wins a race to the quorum device
attempts to remove the keys of any node that it is unable to contact,
which cuts that node off from the quorum device. As noted before,
any group of nodes that cannot communicate with at least half of the quorum
devices will panic, which prevents a cluster partition (split-brain).
In order to add nodes to a 2-node cluster, it may be necessary to
change the default fencing with scdidadm -G prefer3 or
cluster set -p global_fencing=prefer3, create a SCSI-3 quorum device
with clq add, then remove the SCSI-2 quorum device with clq remove.
NetApp filers and systems running the scqsd daemon may also be
selected as quorum devices. NetApp filers use SCSI-3 locking over the
iSCSI protocol to perform their quorum functions.
The claccess deny-all command may be used to deny all other nodes
access to the cluster. claccess allow nodename re-enables access for a
node.
Purging Quorum Keys
CAUTION: Purging the keys from a quorum device may result in amnesia. It should
only be done after careful diagnostics have been done to verify why the cluster
is not coming up. This should never be done as long as the cluster is able to
come up. It may need to be done if the last node to leave the cluster is unable
to boot, leaving everyone else fenced out. In that case, boot one of the other
nodes to single-user mode, identify the quorum device, and:
For SCSI-2 disk reservations, the relevant command is pgre, which is
located in /usr/cluster/lib/sc:
pgre -c pgre_inkeys -d /dev/did/rdsk/d#s2
(List the keys in the quorum device.)
pgre -c pgre_scrub -d /dev/did/rdsk/d#s2
(Remove the keys from the quorum device.)
Similarly, for SCSI-3 disk reservations, the relevant command is scsi:
scsi -c inkeys -d /dev/did/rdsk/d#s2
(List the keys in the quorum device.)
scsi -c scrub -d /dev/did/rdsk/d#s2
(Remove the keys from the quorum device.)
Global Storage
Sun Cluster provides a unique global device name for every disk, CD,
and tape drive in the cluster. The format of these global device names
is /dev/did/device-type (e.g., /dev/did/dsk/d2s3).
(Note that the DIDs are a global naming system, which is separate from
the global device or global file system functionality.)
DIDs may be used as components of SVM volumes, though VxVM does not
recognize DID device names as components of VxVM volumes.
DID disk devices, CD-ROM drives, tape drives, SVM volumes, and VxVM
volumes may be used as global devices. A global device is physically
accessed by just one node at a time, but all other nodes may access
the device by communicating across the global transport network.
The file systems in /global/.devices store the device files for global
devices on each node. These are mounted on mount points of the form
/global/.devices/node@nodeid, where nodeid is the identification
number assigned to the node. These are visible on all nodes. Symbolic
links may be set up to the contents of these file systems, if they are
desired. Sun Cluster sets up some such links in the /dev/global
directory.
Global file systems may be ufs, VxFS, or hsfs. To mount a file system
as a global file system, add a "global" mount option to the file
system's vfstab entry and remount. Alternatively, run a
mount -o global... command.
(Note that all nodes in the cluster should have the same vfstab entry
for all cluster file systems. This is true for both global and
failover file systems, though ZFS file systems do not use the vfstab
at all.)
In the Sun Cluster documentation, global file systems are also known as
"cluster file systems" or "proxy file systems."
Note that global file systems are different from failover file systems.
The former are accessible from all nodes; the latter are only accessible
from the active node.
Maintaining Devices
New devices need to be read into the cluster configuration as well as
the OS. As usual, we should run something like devfsadm or
drvconfig; disks to create the /devices and /dev links across the
cluster. Then we use the scgdevs or scdidadm command to add more disk
devices to the cluster configuration.
Some useful options for scdidadm are:
- scdidadm -l: Show local DIDs
- scdidadm -L: Show all cluster DIDs
- scdidadm -r: Rebuild DIDs
We should also clean up unused links from time to time with
devfsadm -C and scdidadm -C.
The status of device groups can be checked with scstat -D. Devices may
be listed with cldev list -v. A device group can be switched to a
different node with the cldg switch -n target-node dgname command.
Monitoring for devices can be enabled and disabled by using commands like:
cldev monitor all
cldev unmonitor d#
cldev unmonitor -n nodename d#
cldev status -s Unmonitored
Parameters may be set on device groups using the cldg set
command, for example:
cldg set -p failback=false dgname
A device group can be taken offline or placed online with:
cldg offline dgname
cldg online dgname
VxVM-Specific Issues
Since vxdmp cannot be disabled, we need to make sure that VxVM can
only see one path to each disk. This is usually done by implementing
mpxio or a third party product like Powerpath. The order of
installation for such an environment would be:
- Install Solaris and patches.
- Install and configure multipathing software.
- Install and configure Sun Cluster.
- Install and configure VxVM.
If VxVM disk groups are used by the cluster, all nodes attached to the
shared storage must have VxVM installed. Each vxio number in
/etc/name_to_major must also be the same on each node. This can be
checked (and fixed, if necessary) with the clvxvm initialize command.
(A reboot may be necessary if the /etc/name_to_major file is changed.)
The clvxvm encapsulate command should be used if the boot drive is
encapsulated (and mirrored) by VxVM. That way the /global/.devices
information is set up properly.
The clsetup "Device Groups" menu contains items to register a VxVM
disk group, unregister a device group, or synchronize volume
information for a disk group. We can also re-synchronize with the
cldg sync dgname command.
Solaris Volume Manager-Specific Issues
Sun Cluster allows us to add metadb or partition information in the
/dev/did format or in the usual format. In general:
- Use local format for boot drive mirroring in case we need to boot
outside the cluster framework.
- Use cluster format for shared disksets because otherwise we will
need to assume the same controller numbers on each node.
Configuration information is kept in the metadatabase replicas.
At least three local replicas are required to boot a node; these
should be put on their own partitions on the local disks. They
should be spread across controllers and disks to the degree possible.
Multiple replicas may be placed on each partition; they should be
spread out so that if any one disk fails, there will still be at
least three replicas left over, constituting at least half of the
total local replicas.
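The placement rule for local replicas can be checked mechanically. The following sketch takes a mapping of disk name to replica count and tests whether losing any single disk still leaves at least three replicas that make up at least half of the total. The function name and the disk names are illustrative only.

```python
from typing import Mapping


def survives_any_single_disk_failure(replicas_per_disk: Mapping[str, int]) -> bool:
    """Check a local replica layout against the single-disk-failure rule."""
    if not replicas_per_disk:
        return False
    total = sum(replicas_per_disk.values())
    for lost in replicas_per_disk.values():
        remaining = total - lost
        # Need at least three replicas, and at least half of the total.
        if remaining < 3 or 2 * remaining < total:
            return False
    return True


if __name__ == "__main__":
    layouts = {
        "two disks, 2 replicas each": {"c0t0d0": 2, "c1t0d0": 2},
        "two disks, 3 replicas each": {"c0t0d0": 3, "c1t0d0": 3},
        "three disks, 2 replicas each": {"c0t0d0": 2, "c1t0d0": 2, "c2t0d0": 2},
    }
    for name, layout in layouts.items():
        print(name, "->", survives_any_single_disk_failure(layout))
```

With only two local disks, the rule forces three replicas on each, which is why multiple replicas per partition are sometimes needed.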
When disks are added to a shared diskset, database replicas are
automatically added. These will always be added to slice 7, where
they need to remain. If a disk containing replicas is removed,
the replicas must be removed using metadb.
If fewer than 50% of the replicas in a diskset are available, the
diskset ceases to operate. If exactly 50% of the replicas are
available, the diskset will continue to operate, but will not
be able to be enabled or switched on another node.
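The three replica availability cases can be summarized in a small function. This is only an illustration of the 50% rule as stated here; the state labels are made up for the example and mediator votes are not modeled.

```python
def diskset_state(available: int, total: int) -> str:
    """Classify a shared diskset by the fraction of replicas available."""
    if total <= 0 or not 0 <= available <= total:
        raise ValueError("need 0 <= available <= total and total > 0")
    if 2 * available < total:
        # Fewer than 50%: the diskset ceases to operate.
        return "stopped"
    if 2 * available == total:
        # Exactly 50%: keeps running, but cannot be enabled or
        # switched on another node.
        return "running, cannot switch"
    # More than 50%: normal operation.
    return "running"


if __name__ == "__main__":
    for available in range(0, 5):
        print(f"{available}/4 replicas:", diskset_state(available, 4))
```

The exact-half case is the one that a two-string configuration hits when one string fails, which is the situation mediators are meant to address.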
A mediator can be assigned to a shared diskset. The mediator data
is contained within a Solaris process on each node and counts for
two votes in the diskset quorum voting.
Standard c#t#d#s# naming should be used when creating local metadb
replicas, since it will make recovery easier if we need to boot the
node outside of a cluster context. On the other hand,
/dev/did/rdsk/d#s# naming should be used for shared disksets, since
otherwise the paths will need to be identical on all nodes.
Creating a new shared diskset involves the following steps:
(Create an empty diskset.)
metaset -s set-name -a -h node1-name node2-name
(Create a mediator.)
metaset -s set-name -a -m node1-name node2-name
(Add disks to the diskset.)
metaset -s set-name -a /dev/did/rdsk/d# /dev/did/rdsk/d#
(Check that the diskset is present in the cluster configuration.)
cldev list -v
cldg status
cldg show set-name
ZFS-Specific Issues
ZFS is only available as a Sun Cluster failover file system, not as a
global file system. No vfstab entries are required, since that
information is contained in the zpools.
Unlike VxVM, no synchronization commands are required; Sun Cluster
takes care of the synchronization automatically.
Zones
Non-global zones may be treated as virtual nodes. Keep in mind that some
services, such as NFS, will not run in non-global zones.
Services can be failed over between zones, even zones on the same server.
Where possible, it is best to use full rather than sparse zones. Certain
types of failures within the non-global zone can cause a crash in the
global zone.
Configuration of cluster resources and resource groups must be
performed in the global zone. The rgmd runs in the global zone.
To specify a non-global zone as a node, use the form nodename:zonename
or specify -n nodename -z zonename.
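The two notations carry the same information. As a small illustration, the helper below splits a nodename:zonename string and rewrites it in the -n/-z form; the function names are hypothetical and are not part of the cluster command set.

```python
from typing import Optional, Tuple


def parse_node_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'nodename' or 'nodename:zonename' into (node, zone)."""
    node, separator, zone = spec.partition(":")
    if not node or (separator and not zone):
        raise ValueError(f"malformed node specification: {spec!r}")
    return node, (zone or None)


def as_flags(spec: str) -> str:
    """Rewrite a node specification in the -n nodename -z zonename form."""
    node, zone = parse_node_spec(spec)
    if zone is None:
        # A bare node name refers to the global zone on that node.
        return f"-n {node}"
    return f"-n {node} -z {zone}"


if __name__ == "__main__":
    for spec in ("node1", "node1:webzone"):
        print(spec, "=>", as_flags(spec))
```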