Sunday, April 27, 2008

Sun Cluster, HAStoragePlus and VxFS fsck

It turns out that when you have a decent number of VxFS file systems, there is a particularly nasty bug that can jump out and bite you in the *.

Sun Cluster wants the vfstab entries for these file systems to be listed in the first fsck pass; that is the workaround for Bug ID 6572900. Note that the bug report says this workaround does not work for all sites. Fortunately, it seems to be working here.
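
For example, a vfstab entry for one of these file systems would look something like this (the disk group, volume, and mount point names here are made up; the important part is the 1 in the fsck pass field, and mount-at-boot is no because HAStoragePlus handles the mount itself):
/dev/vx/dsk/datadg/datavol /dev/vx/rdsk/datadg/datavol /oradata vxfs 1 no -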

This bug report is a good example of what I hate to see in bug reports from vendors (not just Sun). The status on the bug is reported as "fix delivered" (presumably the above workaround, since I don't see any patches), but the bug report notes that the workaround doesn't work for all sites. If customers are still having the problem, I don't see how anyone can claim that a fix has been delivered.

--Scott

Enclosure Based Naming and Sun Cluster

As I posted previously, we are working on getting our Sun Cluster installation running. We had originally hoped to use ZFS for this project, which would have made deployment much easier. Unfortunately, the risks associated with the bug discussed in my ZFS Difficulties post are just too high to allow us to deploy it for production use at this time.

Instead, we are moving to Veritas Volume Manager and File System (VxVM and VxFS) for control of the data directories and shared storage.

We hit a snag associated with Enclosure Based Naming. It turns out that if your storage drivers return device names in a format that includes the WWN, Enclosure Based Naming is mandatory. Since the names are assigned by some obscure, undocumented algorithm, you are virtually guaranteed that the different systems in your cluster will end up with different Enclosure Based Names for the same devices.

Sun Cluster, on the other hand, insists that all of your devices be named the same way on all systems in the cluster.

Fortunately, Sun Infodoc 215296 describes how to resolve this impasse for VxVM 4.1+.

First, boot the node that you will be fixing in non-cluster mode. You can do this with a
boot -x
command. Make sure that no disk groups are imported from the shared storage before proceeding.
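
A quick way to confirm that nothing from the shared storage is imported is to check the output of
vxdg list
which should show only the local disk groups (the boot disk group, for example), not any of the shared ones.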

The /etc/vx/disk.info file contains a mapping from the WWN-based device names (that you see in OS-based commands like format) to the Enclosure Based Names. You can edit this file directly, changing only the numerical part of the Enclosure Based Name that does not match between the different servers.
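
The entries in that file look roughly like the following, with the WWN-based OS name on the left and the Enclosure Based Name on the right (the names here are invented, and your file's exact layout may differ slightly):
c4t50060E8005278801d3 emc0_3
c4t50060E8005278801d4 emc0_7
If the reference node calls that second device emc0_4, for example, you would change the 7 to a 4 here.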

After making this change, and before you reboot, run the following command to re-configure the storage:
vxconfigd -k
(Note that volumes and disk groups will freak out at this point if you have them imported on the system, since they see the disks being migrated to a "different" location. You really want to have deported the disk groups before you start, or else you may find yourself doing plex recovery.)
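
For the record, deporting a disk group looks like this (datadg is a placeholder for whatever your shared disk group is called):
vxdg deport datadg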

You can verify that the changes took effect properly by running
vxdisk -e list
and matching up the disk names to your reference configuration.
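
The output will look roughly like this (names invented for illustration); the -e flag adds the native OS device name, which is what lets you line the disks up against the other node:
DEVICE       TYPE           DISK   GROUP   STATUS    OS_NATIVE_NAME
emc0_3       auto:cdsdisk   -      -       online    c4t50060E8005278801d3s2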

Good luck!

--Scott

Friday, April 18, 2008

ZFS Difficulties

We've been evaluating ZFS as a replacement for VxVM and VxFS in some of our production clusters. We encountered some difficulties.

ZFS has supported our development environment for about a year now, and we have enjoyed its flexibility and feature set--especially the snapshot management and the ease of volume management. The performance, however, has left something to be desired. We had hoped to demonstrate that we could get adequate performance on ZFS by being more aggressive with tuning.

During the initial testing, we did not use ZFS mirroring or RAIDZ, though we did separate the Oracle log and temp files into separate pools as suggested in the ZFS Best Practices Guide, and we did tweak several tuning parameters as suggested in the ZFS Evil Tuning Guide.
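
Roughly, the kind of layout the Best Practices Guide suggests looks like this (pool, dataset, and device names are made up, and the 8K record size assumes an 8K Oracle block size):
zpool create oradata c3t0d0
zpool create oralog c3t1d0
zfs create oradata/dbfiles
zfs set recordsize=8k oradata/dbfiles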

We were able to get performance that met our requirements, but we ran into a much more serious problem when our test system blew a CPU. One of our pools became corrupted, which caused the server to panic. This was unexpected; we had isolated our test environment in a non-global zone and used the zonecfg dataset resource to delegate ZFS administration to the zone. We had expected a corrupted zpool to cause problems that were isolated to the zone, not problems for the entire server.
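
For reference, delegating a dataset to a zone with zonecfg looks something like this (the zone and dataset names are placeholders):
zonecfg -z dbzone
zonecfg:dbzone> add dataset
zonecfg:dbzone:dataset> set name=oradata/dbfiles
zonecfg:dbzone:dataset> end
zonecfg:dbzone> commit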

This server was part of a nascent Sun Cluster configuration that we were also testing; the failover server attempted to import the zpool and promptly panicked as well. Investigation with zdb -lv revealed that the metadata had become corrupted as a result of the processor failure and the panic that followed. We opened a call with Sun, and I posted a message with the details to the ZFS Discuss list.
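
For anyone following along, zdb is pointed at one of the pool's devices, along the lines of
zdb -lv /dev/rdsk/c3t2d0s0
(the device name here is just an example); it dumps the vdev labels, which is where the corrupted metadata showed up.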

The upshot is that Solaris 10u5 and earlier do not have a way to prevent a panic on zpool corruption. Nevada allows you to specify how the OS will react to zpool corruption, but this will not be brought forward to Solaris 10 until update 6.
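
If I understand correctly, the Nevada feature in question is the failmode pool property, so once it reaches Solaris 10 the setting should look something like
zpool set failmode=continue datapool
(the pool name is a placeholder, and I have not been able to verify this behavior myself yet).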

We could have reduced the likelihood of zpool metadata corruption by having more than one vdev in each pool (allowing the redundant metadata copies to land on different vdevs). Sun tells us that this would have narrowed the window in which the metadata could have been corrupted, but that the problem itself will not be fixed until update 6 later this year.
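
In other words, building the pool from two or more devices, for example
zpool create datapool c3t0d0 c3t1d0
(device and pool names are only examples), gives the redundant metadata copies a chance to land on separate vdevs.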

Until we get this problem resolved, I don't see being able to use ZFS in a critical production environment. If you have any light to shine on any of these issues, please post a comment here or on the ZFS Discuss list.

--Scott

Update on Book Project

I'm in the process of re-writing a couple of the chapters, which is going more slowly than I would like, but the book is getting closer to being finished. If you send me your email address, I will let you know when the book is released.

(NOTE: The 2011 re-print of this book is available from the printer at: https://www.createspace.com/3617377 )