Sun Cluster
Introduction
Sun Cluster 3.2 has the following features and limitations:
- Support for 2-16 nodes.
- Global device capability--devices can be shared across the cluster.
- Global file system --allows a file system to be accessed simultaneously by all cluster nodes.
- Tight implementation with Solaris--The cluster framework services have been implemented in the kernel.
- Application agent support.
- Tight integration with zones.
- Each node must run the same revision and update of the Solaris OS.
- Two node clusters must have at least one quorum device.
- Each cluster needs at least two separate private networks. (Supported hardware, such as ce and bge, may use tagged VLANs to run private and public networks on the same physical connection.)
- Each node's boot disk should include a 500MB partition mounted at /globaldevices prior to cluster installation. At least 750MB of swap is also required.
- Attached storage must be multiply connected to the nodes.
- ZFS is a supported file system and volume manager. Veritas Volume Manager (VxVM) and Solaris Volume Manager (SVM) are also supported volume managers.
- Veritas multipathing (vxdmp) is not supported. Since vxdmp must be enabled for current VxVM versions, VxVM must be used in conjunction with MPxIO or a similar solution, such as EMC's PowerPath.
- SMF services can be integrated into the cluster, and all framework daemons are defined as SMF services.
- PCI and SBus based systems cannot be mixed in the same cluster.
- Boot devices cannot be on a disk that is shared with other cluster nodes. Doing this may lead to a locked-up cluster due to data fencing.
The overall health of the cluster may be monitored using the cluster status or scstat -v commands. Other useful options include:
- scstat -g: Resource group status
- scstat -D: Device group status
- scstat -W: Heartbeat status
- scstat -i: IPMP status
- scstat -n: Node status
Failover applications (also known as "cluster-unaware" applications
in the Sun Cluster documentation) are controlled by rgmd
(the resource group manager daemon). Each application has a data
service agent, which is the way that the cluster controls application
startups, shutdowns, and monitoring. Each application is typically
paired with an IP address, which will follow the application to the
new node when a failover occurs.
"Scalable" applications are able to run on several nodes concurrently. The clustering software provides load balancing and makes a single service IP address available for outside entities to query the application.
"Cluster aware" applications take this one step further, and have cluster awareness programmed into the application. Oracle RAC is a good example of such an application.
All the nodes in the cluster may be shut down with cluster shutdown -y -g0. To boot a node outside of the cluster (for troubleshooting or recovery operations), run boot -x.
clsetup is a menu-based utility that can be used to perform a broad variety of configuration tasks, including configuration of resources and resource groups.
Cluster Configuration
The cluster's configuration information is stored in global files
known as the "cluster configuration repository" (CCR). The cluster
framework files in /etc/cluster/ccr
should not be edited
manually; they should be managed via the administrative commands.
The cluster show
command displays the cluster configuration
in a nicely-formatted report.
The CCR contains:
- Names of the cluster and the nodes.
- The configuration of the cluster transport.
- Device group configuration.
- Nodes that can master each device group.
- NAS device information (if relevant).
- Data service parameter values and callback method paths.
- Disk ID (DID) configuration.
- Cluster status.
Some commands to directly maintain the CCR are:
- ccradm: Allows (among other things) a checksum re-configuration of files in /etc/cluster/ccr after manual edits, e.g. ccradm -i /etc/cluster/ccr/filename -o. (Do NOT edit these files manually unless there is no other option. Even then, back up the original files.)
- scgdefs: Brings new devices under cluster control after they have been discovered by devfsadm.
The scinstall and clsetup commands may be used to install and configure the cluster. We have observed that the installation process may disrupt a previously installed NTP configuration (even though the installation notes promise that this will not happen). It may be worth using ntpq to verify that NTP is still working properly after a cluster installation.
Resource Groups
Resource groups are collections of resources, including
data services. Examples of
resources include disk sets, virtual IP addresses, or server processes
like httpd
.
Resource groups may either be failover or scalable resource groups. Failover resource groups allow groups of services to be started on a node together if the active node fails. Scalable resource groups run on several nodes at once.
The rgmd
is the Resource Group Management Daemon. It is
responsible for monitoring, stopping, and starting the resources within
the different resource groups.
Some common resource types are:
- SUNW.LogicalHostname: Logical IP address associated with a failover service.
- SUNW.SharedAddress: Logical IP address shared between nodes running a scalable resource group.
- SUNW.HAStoragePlus: Manages global raw devices, global file systems, non-ZFS failover file systems, and failover ZFS zpools.
Resource groups also handle resource and resource group dependencies.
Sun Cluster allows services to start or stop in a particular order.
Dependencies are a particular type of resource property. The r_properties man page contains a list of resource properties and their meanings. The rg_properties man page has similar information for resource groups. In particular, the Resource_dependencies property specifies the resources on which a given resource depends.
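As a sketch of how such a dependency is set (the resource names app-rs and hasp-rs are hypothetical), an application resource can be made to wait for its storage resource:

```
# Hypothetical names: app-rs depends on the storage resource hasp-rs.
# With this property set, rgmd will not start app-rs until hasp-rs is online.
clrs set -p Resource_dependencies=hasp-rs app-rs

# Verify the property:
clrs show -v app-rs
```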
Some resource group cluster commands are:
- clrt register resource-type: Register a resource type.
- clrt register -n node1name,node2name resource-type: Register a resource type to specific nodes.
- clrt unregister resource-type: Unregister a resource type.
- clrt list -v: List all resource types and their associated node lists.
- clrt show resource-type: Display all information for a resource type.
- clrg create -n node1name,node2name rgname: Create a resource group.
- clrg delete rgname: Delete a resource group.
- clrg set -p property-name rgname: Set a property.
- clrg show -v rgname: Show resource group information.
- clrs create -t HAStoragePlus -g rgname -p AffinityOn=true -p FilesystemMountPoints=/mountpoint resource-name: Create an HAStoragePlus resource.
- clrg online -M rgname: Bring a resource group online in a managed state.
- clrg switch -M -n nodename rgname: Switch a resource group to the named node.
- clrg offline rgname: Offline the resource group, but leave it in a managed state.
- clrg restart rgname: Restart a resource group.
- clrs disable resource-name: Disable a resource and its fault monitor.
- clrs enable resource-name: Re-enable a resource and its fault monitor.
- clrs clear -n nodename -f STOP_FAILED resource-name: Clear a STOP_FAILED state.
- clrs unmonitor resource-name: Disable the fault monitor, but leave the resource running.
- clrs monitor resource-name: Re-enable the fault monitor for a resource that is currently enabled.
- clrg suspend rgname: Preserve the online status of the group, but stop monitoring.
- clrg resume rgname: Resume monitoring of a suspended group.
- clrg status: List status of resource groups.
- clrs status -g rgname: List status of the resources in a resource group.
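Putting several of these together, here is a hedged sketch of building a failover web service from scratch. All names (web-rg, web-lh, web-hasp, node1, node2, /global/web) are placeholders, not values from this document:

```
# Register the storage resource type, create the group, and add resources.
clrt register SUNW.HAStoragePlus
clrg create -n node1,node2 web-rg

# Logical hostname that follows the service on failover:
clrslh create -g web-rg web-lh

# Storage resource managing the service's file system:
clrs create -t SUNW.HAStoragePlus -g web-rg \
    -p AffinityOn=true -p FilesystemMountPoints=/global/web web-hasp

# Bring the group online in a managed state and check it:
clrg online -M web-rg
clrg status web-rg
```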
Data Services
A data service agent is a set of components that allow a data service to be monitored and to fail over within the cluster. The agent includes methods for starting, stopping, monitoring, and failing over the data service, as well as a registration information file that allows the CCR to store information about these methods. This information is encapsulated as a resource type.
The fault monitors for a data service place the daemons under the control of the process monitoring facility (rpc.pmfd) and monitor the service using client commands.
Public Network
The public network uses pnmd
(Public Network Management
Daemon) and the IPMP in.mpathd
daemon to monitor and control the
public network addresses.
IPMP should be used to provide failovers for the public network paths. The health of the IPMP elements can be monitored with scstat -i.
The clrslh
and clrssa
commands are used to
configure logical and shared hostnames, respectively.
clrslh create -g rgname logical-hostname
Private Network
The "private," or "cluster transport" network is used to provide a heartbeat between the nodes so that they can determine which nodes are available. The cluster transport network is also used for traffic related to global devices.
While a 2-node cluster may use crossover cables to construct a private network, switches should be used for anything more than two nodes. (Ideally, separate switching equipment should be used for each path so that there is no single point of failure.)
The default base IP address is 172.16.0.0, and private networks are assigned subnets based on the results of the cluster setup.
Available network interfaces can be identified by using a combination of
dladm show-dev
and ifconfig
.
Private networks should be installed and configured using the
scinstall
command during cluster configuration. Make sure that
the interfaces in question are connected, but down and unplumbed before
configuration. The clsetup
command also has menu options to
guide you through the private network setup process.
Alternatively, something like the following command sequence can be used to establish a private network:
clintr add nodename1:ifname1
clintr add nodename2:ifname2
clintr add switchname
clintr add nodename1:ifname1,switchname
clintr add nodename2:ifname2,switchname
clintr status
The health of the heartbeat networks can be checked with the
scstat -W
command. The physical paths may be checked with
clintr status
or cluster status -t intr
.
Quorum
Sun Cluster uses a quorum voting system to prevent split-brain and cluster amnesia. The Sun Cluster documentation refers to "failure fencing" as the mechanism to prevent split-brain (where two nodes run the same service at the same time, leading to potential data corruption).
"Amnesia" occurs when a change is made to the cluster while a node is down, then that node attempts to bring up the cluster. This can result in the changes being forgotten, hence the use of the word "amnesia."
One result of this is that the last node to leave a cluster when it is shut down must be the first node to re-enter the cluster. Later in this section, we will discuss ways of circumventing this protection.
Quorum voting is defined by allowing each device one vote. A quorum device may be a cluster node, a specified external server running quorum software, or a disk or NAS device. A majority of all defined quorum votes is required in order to form a cluster. At least half of the quorum votes must be present in order for cluster services to remain in operation. (If a node cannot contact at least half of the quorum votes, it will panic. During the reboot, if a majority cannot be contacted, the boot process will be frozen. Nodes that are removed from the cluster due to a quorum problem also lose access to any shared file systems. This is called "data fencing" in the Sun Cluster documentation.)
- Quorum devices must be available to at least two nodes in the cluster.
- Disk quorum devices may also contain user data. (Note that if a ZFS disk is used as a quorum device, it should be brought into the zpool before being specified as a quorum device.)
- Sun recommends configuring n-1 quorum devices (the number of nodes minus 1). Two node clusters must contain at least one quorum device.
- Disk quorum devices must be specified using the DID names.
- Quorum disk devices should be at least as available as the storage underlying the cluster resource groups.
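The vote arithmetic above can be illustrated with a small sketch, assuming a two-node cluster with one disk quorum device (a disk quorum device receives one fewer vote than the number of nodes attached to it):

```shell
#!/bin/sh
# Two nodes, one vote each; a quorum disk connected to both nodes
# gets (2 - 1) = 1 vote.
NODE_VOTES=2
QDEV_VOTES=1
TOTAL=$((NODE_VOTES + QDEV_VOTES))   # 3 configured votes
MAJORITY=$((TOTAL / 2 + 1))          # 2 votes needed to form the cluster
echo "total=$TOTAL majority=$MAJORITY"
# If one node dies, the survivor plus the quorum disk holds 2 of the 3
# votes, which meets the majority, so the cluster stays up.
```

This is why two-node clusters require a quorum device: without one, the loss of either node leaves the survivor with 1 of 2 votes, short of a majority.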
Quorum status and configuration may be investigated using:
scstat -q
clq status
These commands report on the configured quorum votes, whether they are present, and how many are required for a majority.
Quorum devices can be manipulated through the following commands:
- clq add did-device-name: Add a device to the quorum configuration.
- clq remove did-device-name: Remove the device from the quorum configuration. No data on the device is affected.
- clq enable did-device-name: Return a disabled quorum device to the list of available quorum votes.
- clq disable did-device-name: Remove the quorum device from the total list of available quorum votes. This might be valuable if the device is down for maintenance.
- clq reset: Reset the configuration to the default.
By default, doubly-connected disk quorum devices use SCSI-2 locking. Devices connected to more than two nodes use SCSI-3 locking. SCSI-3 offers persistent reservations, but SCSI-2 requires the use of emulation software. The emulation software uses a 64-bit reservation key written to a private area on the disk.
In either case, the cluster node that wins a race to the quorum device attempts to remove the keys of any node that it is unable to contact, which cuts that node off from the quorum device. As noted before, any group of nodes that cannot communicate with at least half of the quorum devices will panic, which prevents a cluster partition (split-brain).
In order to
add nodes to a 2-node cluster, it may be necessary to change the default
fencing with scdidadm -G prefer3
or cluster set -p
global_fencing=prefer3
, create a SCSI-3
quorum device with clq add
, then remove the SCSI-2 quorum device
with clq remove
.
NetApp filers and systems running the scqsd
daemon may also
be selected as quorum devices. NetApp filers use SCSI-3 locking over the
iSCSI protocol to perform their quorum functions.
The claccess deny-all
command may be used to deny all other nodes
access to the cluster. claccess allow nodename
re-enables
access for a node.
Purging Quorum Keys
CAUTION: Purging the keys from a quorum device may result in amnesia. It should only be done after careful diagnostics have been done to verify why the cluster is not coming up. This should never be done as long as the cluster is able to come up. It may need to be done if the last node to leave the cluster is unable to boot, leaving everyone else fenced out. In that case, boot one of the other nodes to single-user mode, identify the quorum device, and:
For SCSI-2 disk reservations, the relevant command is pgre, which is located in /usr/cluster/lib/sc:
- pgre -c pgre_inkeys -d /dev/did/rdsk/d#s2: List the keys on the quorum device.
- pgre -c pgre_scrub -d /dev/did/rdsk/d#s2: Remove the keys from the quorum device.
Similarly, for SCSI-3 disk reservations, the relevant command is scsi:
- scsi -c inkeys -d /dev/did/rdsk/d#s2: List the keys on the quorum device.
- scsi -c scrub -d /dev/did/rdsk/d#s2: Remove the keys from the quorum device.
Global Storage
Sun Cluster provides a unique global device name for every disk, CD, and tape
drive in the cluster. The format of these global device names is
/dev/did/device-type (e.g. /dev/did/dsk/d2s3).
(Note that the DIDs are a global naming system, which is separate from
the global device or global file system functionality.)
DIDs are components of SVM volumes, though VxVM does not recognize DID device names as components of VxVM volumes.
DID disk devices, CD-ROM drives, tape drives, SVM volumes, and VxVM volumes may be used as global devices. A global device is physically accessed by just one node at a time, but all other nodes may access the device by communicating across the global transport network.
The file systems in /global/.devices
store the device files
for global devices on each node. These are mounted on mount points
of the form /global/.devices/node@nodeid
, where
nodeid is the identification number assigned to the node. These are visible
on all nodes. Symbolic links may be set up to the contents of these file
systems, if they are desired. Sun Cluster sets up some such links in
the /dev/global
directory.
Global file systems may be ufs, VxFS, or hsfs. To mount a file system as a
global file system, add a "global" mount option to the file system's vfstab
entry and remount. Alternatively, run a mount -o global...
command.
(Note that all nodes in the cluster should have the same vfstab
entry for all cluster file systems. This is true for both global and failover
file systems, though ZFS file systems
do not use the vfstab
at all.)
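For example, a global UFS file system's vfstab entry (the metadevice and mount point here are hypothetical) might look like:

```
/dev/md/webds/dsk/d100  /dev/md/webds/rdsk/d100  /global/web  ufs  2  yes  global,logging
```

The same line would appear in the vfstab on every node, so any node can mount and serve the file system.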
In the Sun Cluster documentation, global file systems are also known as "cluster file systems" or "proxy file systems."
Note that global file systems are different from failover file systems. The former are accessible from all nodes; the latter are only accessible from the active node.
Maintaining Devices
New devices need to be read into the cluster configuration as well as
the OS. As usual, we should run something like devfsadm
or
drvconfig; disks
to create the /device and /dev links across
the cluster. Then we use the scgdevs
or scdidadm
command to add more disk devices to the cluster configuration.
Some useful options for scdidadm are:
- scdidadm -l: Show local DIDs
- scdidadm -L: Show all cluster DIDs
- scdidadm -r: Rebuild DIDs
We should also clean up unused links from time to time with devfsadm -C and scdidadm -C.
The status of device groups can be checked with scstat -D
.
Devices may be listed with cldev list -v
. They can be switched
to a different node via a cldg switch -n target-node dgname
command.
Monitoring for devices can be enabled and disabled by using commands like:
cldev monitor all
cldev unmonitor d#
cldev unmonitor -n nodename d#
cldev status -s Unmonitored
Parameters may be set on device groups using the cldg set
command, for example:
cldg set -p failback=false dgname
A device group can be taken offline or placed online with:
cldg offline dgname
cldg online dgname
VxVM-Specific Issues
Since vxdmp
cannot be disabled, we need to make sure
that VxVM can only see one path to each disk. This is usually done by
implementing mpxio or a third party product
like Powerpath. The order of installation for such an environment
would be:
- Install Solaris and patches.
- Install and configure multipathing software.
- Install and configure Sun Cluster.
- Install and configure VxVM.
If VxVM disk groups are used by the cluster, all nodes attached to the
shared storage must have VxVM installed. Each vxio
number
in /etc/name_to_major
must also be the same on each node.
This can be checked (and fixed, if necessary) with the clvxvm
initialize
command. (A reboot may be necessary if the
/etc/name_to_major
file is changed.)
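A quick manual check (the node name node2 is a placeholder) is to compare the vxio entry on each node:

```
# The vxio major number must be identical on every node in the cluster.
grep '^vxio' /etc/name_to_major
ssh node2 grep '^vxio' /etc/name_to_major
```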
The clvxvm encapsulate
command should be used if the
boot drive is encapsulated (and mirrored) by VxVM. That way the
/global/.devices
information is set up properly.
The clsetup
"Device Groups" menu contains items to
register a VxVM disk group, unregister a device group, or
synchronize volume information for a disk group. We can also
re-synchronize with the cldg sync dgname
command.
Solaris Volume Manager-Specific Issues
Sun Cluster allows us to add metadb or partition information in the
/dev/did
format or in the usual format. In general:
- Use local format for boot drive mirroring in case we need to boot outside the cluster framework.
- Use cluster format for shared disksets because otherwise we will need to assume the same controller numbers on each node.
Configuration information is kept in the metadatabase replicas. At least three local replicas are required to boot a node; these should be put on their own partitions on the local disks. They should be spread across controllers and disks to the degree possible. Multiple replicas may be placed on each partition; they should be spread out so that if any one disk fails, there will still be at least three replicas left over, constituting at least half of the total local replicas.
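As a sketch of this placement rule (disk names are hypothetical, and slice 7 on each local disk is assumed to be reserved for replicas), two replicas on each of three disks means any single disk failure leaves four of six replicas, which is both at least three and more than half:

```
# Create two replicas on each of three local disks (6 total).
metadb -a -f -c 2 c0t0d0s7 c1t0d0s7 c2t0d0s7

# Verify replica placement and status:
metadb -i
```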
When disks are added to a shared diskset, database replicas are
automatically added. These will always be added to slice 7, where
they need to remain. If a disk containing replicas is removed,
the replicas must be removed using metadb
.
If fewer than 50% of the replicas in a diskset are available, the diskset ceases to operate. If exactly 50% of the replicas are available, the diskset will continue to operate, but will not be able to be enabled or switched on another node.
A mediator can be assigned to a shared diskset. The mediator data is contained within a Solaris process on each node and counts for two votes in the diskset quorum voting.
Standard c#t#d#s#
naming should be used when creating
local metadb replicas, since it will make recovery easier if we
need to boot the node outside of a cluster context. On the other
hand, /dev/did/rdsk/d#s#
naming should be used for
shared disksets, since otherwise the paths will need to be
identical on all nodes.
Creating a new shared diskset involves the following steps:
(Create an empty diskset.)
metaset -s set-name -a -h node1-name node2-name
(Create a mediator.)
metaset -s set-name -a -m node1-name node2-name
(Add disks to the diskset.)
metaset -s set-name -a /dev/did/rdsk/d# /dev/did/rdsk/d#
(Check that the diskset is present in the cluster configuration.)
cldev list -v
cldg status
cldg show set-name
ZFS-Specific Issues
ZFS is only available as a Sun Cluster failover
file system, not as a global
file system. No vfstab
entries are required, since that
information is contained in the zpools.
Unlike with VxVM, no synchronization commands are required; Sun Cluster takes care of the synchronization automatically.
Zones
Non-global zones may be treated as virtual nodes. Keep in mind that some services, such as NFS, will not run in non-global zones.
Services can be failed over between zones, even zones on the same server. Where possible, it is best to use full rather than sparse zones. Certain types of failures within the non-global zone can cause a crash in the global zone.
Configuration of cluster resources and resource groups must be performed
in the global zone. The rgmd
runs in the global zone.
To specify a non-global zone as a node, use the form nodename:zonename, or specify -n nodename -z zonename.