ZFS has supported our development environment for about a year now, and we have enjoyed the flexibility and feature set of ZFS--especially the snapshot management and the ease of volume management. The performance, however, has left something to be desired. We had hoped to demonstrate that we could get adequate performance on ZFS by being more aggressive with tuning.
During the initial testing, we did not use ZFS mirroring or RAIDZ, though we did separate the Oracle log and temp files into separate pools as suggested in the ZFS Best Practices Guide, and we did tweak several tuning parameters as suggested in the ZFS Evil Tuning Guide.
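For reference, the layout we used looked roughly like the following; each pool sat on a single device with no ZFS-level redundancy. Device names and the recordsize example are hypothetical, not our exact configuration:

```shell
# One pool per workload, as suggested by the ZFS Best Practices Guide,
# but each pool backed by a single device (no mirroring or RAIDZ).
# Device names are hypothetical.
zpool create oradata c1t0d0    # Oracle data files
zpool create oralog  c1t1d0    # Oracle redo logs
zpool create oratemp c1t2d0    # Oracle temp files

# One example of the kind of tuning we applied: matching the dataset
# recordsize to the database block size (8k here is illustrative).
zfs set recordsize=8k oradata
```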
We were able to get performance that met our requirements, but we ran into a much more serious problem when our test system blew a CPU: one of our pools became corrupted, which caused the server to panic. This was unexpected; we had isolated our test environment in a non-global zone and used zonecfg's dataset resource to delegate zpool administration to the zone. We had expected a corrupted zpool to cause problems isolated to that zone, not to take down the entire server.
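For anyone unfamiliar with the delegation mechanism: a dataset is handed to a non-global zone through zonecfg. A minimal sketch, with hypothetical zone and dataset names:

```shell
# Delegate a ZFS dataset to the non-global zone "testzone"
# (names are hypothetical). Once delegated, the zone administrator
# can create and manage child datasets and snapshots, but the
# underlying zpool still belongs to the global zone.
zonecfg -z testzone <<EOF
add dataset
set name=oradata/testzone
end
commit
EOF
```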
This server was part of a nascent Sun Cluster configuration that we were also testing; the failover server attempted to import the zpool and promptly panicked as well. Investigation with zdb -lv revealed that the metadata had been corrupted by the processor failure and the resulting panic. We opened a call with Sun, and I posted a message with the details to the ZFS Discuss list.
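zdb -l dumps the vdev labels from a device; each device carries four copies of its label, and mismatched or unreadable labels point at damaged pool metadata. A hypothetical invocation against the device backing a suspect pool:

```shell
# Dump the vdev labels (verbosely) from the device backing the
# suspect pool. The device path is hypothetical. Healthy devices
# show four consistent labels; inconsistencies indicate metadata
# damage of the kind we hit.
zdb -lv /dev/dsk/c1t0d0s0
```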
The upshot is that Solaris 10u5 and earlier do not have a way to prevent a panic on zpool corruption. Nevada allows you to specify how the OS will react to zpool corruption, but this will not be brought forward to Solaris 10 until update 6.
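In Nevada the behavior is controlled by the failmode pool property; assuming it arrives unchanged in Solaris 10 update 6, setting it would look roughly like this (pool name hypothetical):

```shell
# failmode controls how ZFS reacts to catastrophic pool failure.
# It exists in Nevada/OpenSolaris; Solaris 10 is expected to get it
# with update 6. Pool name is hypothetical.
#   wait     - block I/O until the device is restored (default)
#   continue - return EIO to new writes, keep satisfying reads
#   panic    - panic the host (the behavior we observed)
zpool set failmode=continue oradata
zpool get failmode oradata
```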
We could have reduced the likelihood of zpool metadata corruption by having more than one vdev in each pool, allowing the replicated metadata blocks to be spread across different vdevs. Sun tells us that this would have narrowed the window in which the metadata could have been corrupted, but that the underlying problem will not be fixed until update 6 later this year.
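With more than one vdev in a pool, ZFS can place its redundant metadata copies (ditto blocks) on separate physical devices. A sketch of what such a layout would look like, with hypothetical device names:

```shell
# A mirrored pool: data and metadata replicas land on different
# physical devices. Per Sun, this narrows -- but does not close --
# the corruption window. Device names are hypothetical.
zpool create oradata mirror c1t0d0 c2t0d0

# Alternatively, two top-level vdevs without mirroring; the ditto
# blocks of metadata can then be placed on separate vdevs.
zpool create oradata c1t0d0 c2t0d0
```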
Until this problem is resolved, I don't see how we can use ZFS in a critical production environment. If you have any light to shine on any of these issues, please post a comment here or on the ZFS Discuss list.