Tuesday, May 07, 2013

Sun Cluster


Introduction

Sun Cluster 3.2 has the following features and limitations:

  • Support for 2-16 nodes.
  • Global device capability--devices can be shared across the cluster.
  • Global file system--allows a file system to be accessed simultaneously by all cluster nodes.
  • Tight implementation with Solaris--The cluster framework services have been implemented in the kernel.
  • Application agent support.
  • Tight integration with zones.
  • Each node must run the same revision and update of the Solaris OS.
  • Two node clusters must have at least one quorum device.
  • Each cluster needs at least two separate private networks. (Supported hardware, such as ce and bge, may use tagged VLANs to run private and public networks on the same physical connection.)
  • Each node's boot disk should include a 500M partition mounted at /globaldevices prior to cluster installation. At least 750M of swap is also required.
  • Attached storage must be multiply connected to the nodes.
  • ZFS is a supported file system and volume manager. Veritas Volume Manager (VxVM) and Solaris Volume Manager (SVM) are also supported volume managers.
  • Veritas multipathing (vxdmp) is not supported. Since vxdmp must be enabled for current VxVM versions, it must be used in conjunction with mpxio or another similar solution like EMC's Powerpath.
  • SMF services can be integrated into the cluster, and all framework daemons are defined as SMF services.
  • PCI and SBus based systems cannot be mixed in the same cluster.
  • Boot devices cannot be on a disk that is shared with other cluster nodes. Doing this may lead to a locked-up cluster due to data fencing.

The overall health of the cluster may be monitored using the cluster status or scstat -v commands. Other useful options include:

  • scstat -g: Resource group status
  • scstat -D: Device group status
  • scstat -W: Heartbeat status
  • scstat -i: IPMP status
  • scstat -n: Node status

Failover applications (also known as "cluster-unaware" applications in the Sun Cluster documentation) are controlled by rgmd (the resource group manager daemon). Each application has a data service agent, which is the way that the cluster controls application startups, shutdowns, and monitoring. Each application is typically paired with an IP address, which will follow the application to the new node when a failover occurs.

"Scalable" applications are able to run on several nodes concurrently. The clustering software provides load balancing and makes a single service IP address available for outside entities to query the application.

"Cluster aware" applications take this one step further, and have cluster awareness programmed into the application. Oracle RAC is a good example of such an application.

All the nodes in the cluster may be shut down with cluster shutdown -y -g0. To boot a node outside of the cluster (for troubleshooting or recovery operations), run boot -x.

clsetup is a menu-based utility that can be used to perform a broad variety of configuration tasks, including configuration of resources and resource groups.

Cluster Configuration

The cluster's configuration information is stored in global files known as the "cluster configuration repository" (CCR). The cluster framework files in /etc/cluster/ccr should not be edited manually; they should be managed via the administrative commands.

The cluster show command displays the cluster configuration in a nicely-formatted report.

The CCR contains:

  • Names of the cluster and the nodes.
  • The configuration of the cluster transport.
  • Device group configuration.
  • Nodes that can master each device group.
  • NAS device information (if relevant).
  • Data service parameter values and callback method paths.
  • Disk ID (DID) configuration.
  • Cluster status.

Some commands to directly maintain the CCR are:

  • ccradm: Allows (among other things) the checksums of files in /etc/cluster/ccr to be regenerated after manual edits. (Do NOT edit these files manually unless there is no other option. Even then, back up the original files.) Example: ccradm -i /etc/cluster/ccr/filename -o
  • scgdevs: Brings new devices under cluster control after they have been discovered by devfsadm.

The scinstall and clsetup commands may

We have observed that the installation process may disrupt a previously installed NTP configuration (even though the installation notes promise that this will not happen). It may be worth using ntpq to verify that NTP is still working properly after a cluster installation.

Resource Groups

Resource groups are collections of resources, including data services. Examples of resources include disk sets, virtual IP addresses, or server processes like httpd.

Resource groups may either be failover or scalable resource groups. Failover resource groups allow groups of services to be started on a node together if the active node fails. Scalable resource groups run on several nodes at once.

The rgmd is the Resource Group Management Daemon. It is responsible for monitoring, stopping, and starting the resources within the different resource groups.

Some common resource types are:

  • SUNW.LogicalHostname: Logical IP address associated with a failover service.
  • SUNW.SharedAddress: Logical IP address shared between nodes running a scalable resource group.
  • SUNW.HAStoragePlus: Manages global raw devices, global file systems, non-ZFS failover file systems, and failover ZFS zpools.

Resource groups also handle resource and resource group dependencies. Sun Cluster allows services to start or stop in a particular order. Dependencies are a particular type of resource property. The r_properties man page contains a list of resource properties and their meanings. The rg_properties man page has similar information for resource groups. In particular, the Resource_dependencies property specifies something on which the resource is dependent.

Some resource group cluster commands are:

  • clrt register resource-type: Register a resource type.
  • clrt register -n node1name,node2name resource-type: Register a resource type to specific nodes.
  • clrt unregister resource-type: Unregister a resource type.
  • clrt list -v: List all resource types and their associated node lists.
  • clrt show resource-type: Display all information for a resource type.
  • clrg create -n node1name,node2name rgname: Create a resource group.
  • clrg delete rgname: Delete a resource group.
  • clrg set -p property-name rgname: Set a property.
  • clrg show -v rgname: Show resource group information.
  • clrs create -t HAStoragePlus -g rgname -p AffinityOn=true -p FilesystemMountPoints=/mountpoint resource-name
  • clrg online -M rgname
  • clrg switch -M -n nodename rgname
  • clrg offline rgname: Offline the resource group, but leave it in a managed state.
  • clrg restart rgname
  • clrs disable resource-name: Disable a resource and its fault monitor.
  • clrs enable resource-name: Re-enable a resource and its fault monitor.
  • clrs clear -n nodename -f STOP_FAILED resource-name
  • clrs unmonitor resource-name: Disable the fault monitor, but leave resource running.
  • clrs monitor resource-name: Re-enable the fault monitor for a resource that is currently enabled.
  • clrg suspend rgname: Preserves online status of group, but does not continue monitoring.
  • clrg resume rgname: Resumes monitoring of a suspended group.
  • clrg status: List status of resource groups.
  • clrs status -g rgname

Data Services

A data service agent is a set of components that allow a data service to be monitored and failed over within the cluster. The agent includes methods for starting, stopping, monitoring, or failing the data service. It also includes a registration information file that allows information about these methods to be stored in the CCR. This information is encapsulated as a resource type.

The fault monitors for a data service place the daemons under the control of the process monitoring facility (rpc.pmfd), and monitor the service itself using client commands.

Public Network

The public network uses pnmd (Public Network Management Daemon) and the IPMP in.mpathd daemon to monitor and control the public network addresses.

IPMP should be used to provide failovers for the public network paths. The health of the IPMP elements can be monitored with scstat -i.

The clrslh and clrssa commands are used to configure logical and shared hostnames, respectively.

  • clrslh create -g rgname logical-hostname
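
The resource group commands listed earlier and the clrslh command above can be strung together to build a simple failover group. The following Python sketch only assembles the command lines and prints them; it does not run anything. The group, resource, node, and mount point names are hypothetical, and the ordering is our assumption of a sensible sequence rather than a documented procedure.

```python
def failover_group_commands(rg, nodes, hostname, mountpoint, storage_rs):
    """Return the commands, in order, as argument lists. Nothing is executed."""
    return [
        # Register the storage resource type on the cluster.
        ["clrt", "register", "SUNW.HAStoragePlus"],
        # Create the resource group on the nodes that may host it.
        ["clrg", "create", "-n", ",".join(nodes), rg],
        # Add the logical hostname that follows the service between nodes.
        ["clrslh", "create", "-g", rg, hostname],
        # Add the failover storage resource.
        ["clrs", "create", "-t", "HAStoragePlus", "-g", rg,
         "-p", "AffinityOn=true",
         "-p", f"FilesystemMountPoints={mountpoint}", storage_rs],
        # Bring the group online in a managed state.
        ["clrg", "online", "-M", rg],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for argv in failover_group_commands(
            "web-rg", ["node1", "node2"], "web-lh", "/web/data", "web-stor"):
        print(" ".join(argv))
```

Printing the commands first gives us a chance to review them before pasting them into a root shell on a cluster node.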

Private Network

The "private," or "cluster transport" network is used to provide a heartbeat between the nodes so that they can determine which nodes are available. The cluster transport network is also used for traffic related to global devices.

While a 2-node cluster may use crossover cables to construct a private network, switches should be used for anything more than two nodes. (Ideally, separate switching equipment should be used for each path so that there is no single point of failure.)

The default base IP address is 172.16.0.0, and private networks are assigned subnets based on the results of the cluster setup.

Available network interfaces can be identified by using a combination of dladm show-dev and ifconfig.

Private networks should be installed and configured using the scinstall command during cluster configuration. Make sure that the interfaces in question are connected, but down and unplumbed before configuration. The clsetup command also has menu options to guide you through the private network setup process.

Alternatively, something like the following command string can be used to establish a private network:

  • clintr add nodename1:ifname1
  • clintr add nodename2:ifname2
  • clintr add switchname
  • clintr add nodename1:ifname1,switchname
  • clintr add nodename2:ifname2,switchname
  • clintr status

The health of the heartbeat networks can be checked with the scstat -W command. The physical paths may be checked with clintr status or cluster status -t intr.

Quorum

Sun Cluster uses a quorum voting system to prevent split-brain and cluster amnesia. The Sun Cluster documentation refers to "failure fencing" as the mechanism to prevent split-brain (where two nodes run the same service at the same time, leading to potential data corruption).

"Amnesia" occurs when a change is made to the cluster while a node is down, then that node attempts to bring up the cluster. This can result in the changes being forgotten, hence the use of the word "amnesia."

One result of this is that the last node to leave a cluster when it is shut down must be the first node to re-enter the cluster. Later in this section, we will discuss ways of circumventing this protection.

Quorum voting is defined by allowing each device one vote. A quorum device may be a cluster node, a specified external server running quorum software, or a disk or NAS device. A majority of all defined quorum votes is required in order to form a cluster. At least half of the quorum votes must be present in order for cluster services to remain in operation. (If a node cannot contact at least half of the quorum votes, it will panic. During the reboot, if a majority cannot be contacted, the boot process will be frozen. Nodes that are removed from the cluster due to a quorum problem also lose access to any shared file systems. This is called "data fencing" in the Sun Cluster documentation.)

  • Quorum devices must be available to at least two nodes in the cluster.
  • Disk quorum devices may also contain user data. (Note that if a ZFS disk is used as a quorum device, it should be brought into the zpool before being specified as a quorum device.)
  • Sun recommends configuring n-1 quorum devices (the number of nodes minus 1). Two node clusters must contain at least one quorum device.
  • Disk quorum devices must be specified using the DID names.
  • Quorum disk devices should be at least as available as the storage underlying the cluster resource groups.
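
The vote arithmetic described above can be worked through with a few lines of Python. This is a minimal sketch that assumes one vote per node and one per quorum device, as stated in this section; the function names are our own.

```python
def total_votes(nodes, quorum_devices):
    """One vote per node plus one per quorum device (assumption from the text)."""
    return nodes + quorum_devices


def votes_to_form(total):
    """A strict majority of all configured votes is needed to form a cluster."""
    return total // 2 + 1


def votes_to_stay_up(total):
    """At least half of all configured votes is needed to keep running."""
    return (total + 1) // 2


if __name__ == "__main__":
    for nodes, devices in [(2, 0), (2, 1)]:
        total = total_votes(nodes, devices)
        print(f"{nodes} nodes, {devices} quorum devices: "
              f"{total} votes, {votes_to_form(total)} to form, "
              f"{votes_to_stay_up(total)} to stay up")
```

With two nodes and no quorum device, both votes are needed to form the cluster, so a single surviving node could never boot into it. Adding one quorum device brings the total to three, and a node that holds the quorum device has the two votes it needs.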

Quorum status and configuration may be investigated using:

  • scstat -q
  • clq status

These commands report on the configured quorum votes, whether they are present, and how many are required for a majority.

Quorum devices can be manipulated through the following commands:

  • clq add did-device-name
  • clq remove did-device-name: (Only removes the device from the quorum configuration. No data on the device is affected.)
  • clq enable did-device-name
  • clq disable did-device-name: (Removes the quorum device from the total list of available quorum votes. This might be valuable if the device is down for maintenance.)
  • clq reset: (Resets the configuration to the default.)

By default, doubly-connected disk quorum devices use SCSI-2 locking. Devices connected to more than two nodes use SCSI-3 locking. SCSI-3 offers persistent reservations, but SCSI-2 requires the use of emulation software. The emulation software uses a 64-bit reservation key written to a private area on the disk.

In either case, the cluster node that wins a race to the quorum device attempts to remove the keys of any node that it is unable to contact, which cuts that node off from the quorum device. As noted before, any group of nodes that cannot communicate with at least half of the quorum devices will panic, which prevents a cluster partition (split-brain).
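
The race for the quorum device can be pictured with a toy model. The sketch below treats each reservation key as simply the name of the node that owns it, which is a simplification made for illustration; the function name is hypothetical.

```python
def fence_after_race(registered_keys, winner, reachable):
    """Return the keys left on the quorum device after `winner` wins the race.

    The winner keeps its own key and the keys of the nodes it can still
    contact. Every other key is removed, which cuts those nodes off from
    the quorum device.
    """
    keep = {winner} | set(reachable)
    return {key for key in registered_keys if key in keep}


if __name__ == "__main__":
    # Two-node cluster where the interconnect has failed: node1 wins the race.
    print(fence_after_race({"node1", "node2"}, "node1", set()))
```

In the two-node case, node2 loses its key, can no longer count the quorum device's vote, and is left with one vote out of three.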

In order to add nodes to a 2-node cluster, it may be necessary to change the default fencing with scdidadm -G prefer3 or cluster set -p global_fencing=prefer3, create a SCSI-3 quorum device with clq add, then remove the SCSI-2 quorum device with clq remove.

NetApp filers and systems running the scqsd daemon may also be selected as quorum devices. NetApp filers use SCSI-3 locking over the iSCSI protocol to perform their quorum functions.

The claccess deny-all command may be used to deny all other nodes access to the cluster. claccess allow nodename re-enables access for a node.

Purging Quorum Keys

CAUTION: Purging the keys from a quorum device may result in amnesia. It should only be done after careful diagnostics have been done to verify why the cluster is not coming up. This should never be done as long as the cluster is able to come up. It may need to be done if the last node to leave the cluster is unable to boot, leaving everyone else fenced out. In that case, boot one of the other nodes to single-user mode, identify the quorum device, and purge its keys with the commands below.

For SCSI 2 disk reservations, the relevant command is pgre, which is located in /usr/cluster/lib/sc:

  • pgre -c pgre_inkeys -d /dev/did/rdsk/d#s2 (List the keys in the quorum device.)
  • pgre -c pgre_scrub -d /dev/did/rdsk/d#s2 (Remove the keys from the quorum device.)

Similarly, for SCSI 3 disk reservations, the relevant command is scsi:

  • scsi -c inkeys -d /dev/did/rdsk/d#s2 (List the keys in the quorum device.)
  • scsi -c scrub -d /dev/did/rdsk/d#s2 (Remove the keys from the quorum device.)

Global Storage

Sun Cluster provides a unique global device name for every disk, CD, and tape drive in the cluster. The format of these global device names is /dev/did/device-type (e.g. /dev/did/dsk/d2s3). (Note that the DIDs are a global naming system, which is separate from the global device or global file system functionality.)

DIDs can be components of SVM volumes, though VxVM does not recognize DID device names as components of VxVM volumes.

DID disk devices, CD-ROM drives, tape drives, SVM volumes, and VxVM volumes may be used as global devices. A global device is physically accessed by just one node at a time, but all other nodes may access the device by communicating across the global transport network.

The file systems in /global/.devices store the device files for global devices on each node. These are mounted on mount points of the form /global/.devices/node@nodeid, where nodeid is the identification number assigned to the node. These are visible on all nodes. Symbolic links may be set up to the contents of these file systems, if they are desired. Sun Cluster sets up some such links in the /dev/global directory.

Global file systems may be ufs, VxFS, or hsfs. To mount a file system as a global file system, add a "global" mount option to the file system's vfstab entry and remount. Alternatively, run a mount -o global... command.

(Note that all nodes in the cluster should have the same vfstab entry for all cluster file systems. This is true for both global and failover file systems, though ZFS file systems do not use the vfstab at all.)
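
A quick way to sanity-check vfstab entries is to parse them and compare them across nodes. The Python sketch below assumes the standard seven-field vfstab layout; the sample entry, the diskset name in it, and the function names are hypothetical.

```python
FIELD_NAMES = ("device", "fsck_device", "mount_point", "fstype",
               "fsck_pass", "mount_at_boot", "options")


def vfstab_fields(line):
    """Split a vfstab entry into its seven whitespace-separated fields."""
    fields = line.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 7 vfstab fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


def is_global(line):
    """True if the entry carries the 'global' mount option."""
    return "global" in vfstab_fields(line)["options"].split(",")


def entries_match(entries_by_node):
    """True if every node has the same entry, ignoring spacing differences."""
    return len({tuple(e.split()) for e in entries_by_node.values()}) == 1


if __name__ == "__main__":
    # Hypothetical entry for a UFS file system on an SVM shared diskset.
    entry = ("/dev/md/webds/dsk/d100 /dev/md/webds/rdsk/d100 "
             "/global/web ufs 2 yes global,logging")
    print(is_global(entry))
    print(entries_match({"node1": entry, "node2": entry}))
```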

In the Sun Cluster documentation, global file systems are also known as "cluster file systems" or "proxy file systems."

Note that global file systems are different from failover file systems. The former are accessible from all nodes; the latter are only accessible from the active node.

Maintaining Devices

New devices need to be read into the cluster configuration as well as the OS. As usual, we should run something like devfsadm or drvconfig; disks to create the /devices and /dev links across the cluster. Then we use the scgdevs or scdidadm command to add the new disk devices to the cluster configuration.

Some useful options for scdidadm are:

  • scdidadm -l: Show local DIDs
  • scdidadm -L: Show all cluster DIDs
  • scdidadm -r: Rebuild DIDs

We should also clean up unused links from time to time with devfsadm -C and scdidadm -C.

The status of device groups can be checked with scstat -D. Devices may be listed with cldev list -v. They can be switched to a different node via a cldg switch -n target-node dgname command.

Monitoring for devices can be enabled and disabled by using commands like:

  • cldev monitor all
  • cldev unmonitor d#
  • cldev unmonitor -n nodename d#
  • cldev status -s Unmonitored

Parameters may be set on device groups using the cldg set command, for example:

  • cldg set -p failback=false dgname

A device group can be taken offline or placed online with:

  • cldg offline dgname
  • cldg online dgname

VxVM-Specific Issues

Since vxdmp cannot be disabled, we need to make sure that VxVM can only see one path to each disk. This is usually done by implementing mpxio or a third party product like Powerpath. The order of installation for such an environment would be:

  1. Install Solaris and patches.
  2. Install and configure multipathing software.
  3. Install and configure Sun Cluster.
  4. Install and configure VxVM.

If VxVM disk groups are used by the cluster, all nodes attached to the shared storage must have VxVM installed. Each vxio number in /etc/name_to_major must also be the same on each node. This can be checked (and fixed, if necessary) with the clvxvm initialize command. (A reboot may be necessary if the /etc/name_to_major file is changed.)

The clvxvm encapsulate command should be used if the boot drive is encapsulated (and mirrored) by VxVM. That way the /global/.devices information is set up properly.

The clsetup "Device Groups" menu contains items to register a VxVM disk group, unregister a device group, or synchronize volume information for a disk group. We can also re-synchronize with the cldg sync dgname command.

Solaris Volume Manager-Specific Issues

Sun Cluster allows us to add metadb or partition information in the /dev/did format or in the usual format. In general:

  • Use local format for boot drive mirroring in case we need to boot outside the cluster framework.
  • Use cluster format for shared disksets because otherwise we will need to assume the same controller numbers on each node.

Configuration information is kept in the metadatabase replicas. At least three local replicas are required to boot a node; these should be put on their own partitions on the local disks. They should be spread across controllers and disks to the degree possible. Multiple replicas may be placed on each partition; they should be spread out so that if any one disk fails, there will still be at least three replicas left over, constituting at least half of the total local replicas.
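
The placement rule above can be checked mechanically for a proposed layout. This Python sketch takes the number of replicas planned for each local disk and tests whether any single disk failure would still leave at least three replicas and at least half of the total. The disk names and the function name are hypothetical.

```python
def survives_any_single_disk_failure(replicas_per_disk):
    """Check a local replica layout against the rule described in the text.

    replicas_per_disk maps a disk name to the number of replicas on it.
    """
    total = sum(replicas_per_disk.values())
    for count in replicas_per_disk.values():
        remaining = total - count
        # Need at least three replicas and at least half of the total.
        if remaining < 3 or remaining * 2 < total:
            return False
    return True


if __name__ == "__main__":
    print(survives_any_single_disk_failure({"c0t0d0": 3, "c1t0d0": 3}))
    print(survives_any_single_disk_failure({"c0t0d0": 2, "c1t0d0": 2}))
```

Three replicas on each of two disks passes the check; two on each does not, because losing either disk leaves only two replicas.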

When disks are added to a shared diskset, database replicas are automatically added. These will always be added to slice 7, where they need to remain. If a disk containing replicas is removed, the replicas must be removed using metadb.

If fewer than 50% of the replicas in a diskset are available, the diskset ceases to operate. If exactly 50% of the replicas are available, the diskset will continue to operate, but will not be able to be enabled or switched on another node.
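
The thresholds in the previous paragraph can be expressed as a small Python function. This is a sketch of the replica counting only; it does not model mediators, and the state labels are our own wording.

```python
def diskset_state(available, total):
    """Classify a shared diskset by how many of its replicas are available."""
    if total <= 0 or not 0 <= available <= total:
        raise ValueError("need 0 <= available <= total and total > 0")
    if available * 2 < total:
        return "stopped"            # fewer than 50% of replicas
    if available * 2 == total:
        return "running, pinned"    # exactly 50%: cannot move to another node
    return "healthy"                # more than 50%


if __name__ == "__main__":
    for available in range(0, 5):
        print(available, "of 4:", diskset_state(available, 4))
```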

A mediator can be assigned to a shared diskset. The mediator data is contained within a Solaris process on each node and counts for two votes in the diskset quorum voting.

Standard c#t#d#s# naming should be used when creating local metadb replicas, since it will make recovery easier if we need to boot the node outside of a cluster context. On the other hand, /dev/did/rdsk/d#s# naming should be used for shared disksets, since otherwise the paths will need to be identical on all nodes.

Creating a new shared diskset involves the following steps:
(Create an empty diskset.)
metaset -s set-name -a -h node1-name node2-name
(Create a mediator.)
metaset -s set-name -a -m node1-name node2-name
(Add disks to the diskset.)
metaset -s set-name -a /dev/did/rdsk/d# /dev/did/rdsk/d#
(Check that the diskset is present in the cluster configuration.)
cldev list -v
cldg status
cldg show set-name

ZFS-Specific Issues

ZFS is only available as a Sun Cluster failover file system, not as a global file system. No vfstab entries are required, since that information is contained in the zpools. Unlike VxVM, ZFS requires no synchronization commands; Sun Cluster takes care of the synchronization automatically.

Zones

Non-global zones may be treated as virtual nodes. Keep in mind that some services, such as NFS, will not run in non-global zones.

Services can be failed over between zones, even zones on the same server. Where possible, it is best to use full rather than sparse zones. Certain types of failures within the non-global zone can cause a crash in the global zone.

Configuration of cluster resources and resource groups must be performed in the global zone. The rgmd runs in the global zone.

To specify a non-global zone as a node, use the form
nodename:zonename
or specify
-n nodename -z zonename
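
Scripts that accept node lists may need to handle both plain node names and the nodename:zonename form. A minimal Python sketch of such a parser follows; the function name and the sample names are hypothetical.

```python
def parse_node_spec(spec):
    """Split 'nodename' or 'nodename:zonename' into (node, zone or None)."""
    node, sep, zone = spec.partition(":")
    if not node or (sep and not zone):
        raise ValueError(f"malformed node specification: {spec!r}")
    return node, (zone or None)


if __name__ == "__main__":
    print(parse_node_spec("node1"))
    print(parse_node_spec("node1:webzone"))
```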
