ZFS was first publicly released in the 6/2006 distribution of
Solaris 10. Previous versions of Solaris 10 did not include ZFS.
ZFS is flexible, scalable and reliable. It is a POSIX-compliant
filesystem with several important features:
No separate filesystem creation step is required. The mount of the
filesystem is automatic and does not require vfstab maintenance.
Mounts are controlled via the mountpoint attribute of each file system.
Pool Management
Members of a storage pool may either be hard drives
or slices of at least 128MB in size.
To create a mirrored pool:
zpool create -f pool-name mirror c#t#d# c#t#d#
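For example, a two-way mirror built from two disks would be created as follows (the pool name and device names here are purely illustrative):
zpool create -f datapool mirror c1t0d0 c1t1d0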
To check a pool's status, run:
zpool status -v pool-name
To list existing pools:
zpool list
To remove a pool and free its resources:
zpool destroy pool-name
A destroyed pool can sometimes be recovered as follows:
zpool import -D
Additional disks can be added to an existing pool. When this happens in a mirrored or RAID-Z pool, the pool is resilvered to redistribute the data.
To add storage to an existing mirrored pool:
zpool add -f pool-name mirror c#t#d# c#t#d#
Pools can be exported and imported to transfer them between hosts.
zpool export pool-name
zpool import pool-name
Without a specified pool, the import
command lists available pools.
zpool import
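As a sketch, moving a hypothetical pool named datapool to another host would look like this:
zpool export datapool
(Move or re-cable the disks so the new host can see them.)
zpool import datapool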
To clear a pool's error count, run:
zpool clear pool-name
Although virtual volumes (such as those from DiskSuite or VxVM) can be used as base devices, this is not recommended for performance reasons.
Filesystem Management
Similar filesystems should be grouped together
in hierarchies to make management easier. Naming
schemes should be thought out as well to make
it easier to group administrative commands for
similarly managed filesystems.
When a new pool is created, a new filesystem is
mounted at /pool-name.
To create another filesystem:
zfs create pool-name/fs-name
To delete a filesystem:
zfs destroy filesystem-name
To rename a ZFS filesystem:
zfs rename old-name new-name
Properties are set via the zfs set command.
To turn on compression:
zfs set compression=on pool-name/filesystem-name
To share the filesystem via NFS:
zfs set sharenfs=on pool-name/fs-name
zfs set sharenfs="mount-options" pool-name/fs-name
Rather than editing /etc/vfstab, set the mount point directly:
zfs set mountpoint=mountpoint-name pool-name/filesystem-name
Quotas are also set via the same command:
zfs set quota=#gigG pool-name/filesystem-name
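As an illustration, assuming a hypothetical filesystem datapool/home, the property commands above could be applied as:
zfs set compression=on datapool/home
zfs set sharenfs=on datapool/home
zfs set quota=10G datapool/home
zfs set mountpoint=/export/home datapool/home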
RAID Levels
ZFS filesystems automatically stripe across all
top-level disk devices. (Mirrors and RAID-Z
devices are considered to be top-level devices.)
It is not recommended that RAID types be mixed in a pool. (zpool tries to prevent this, but it can be forced with the -f flag.)
The following RAID levels are supported:
- RAID-0 (striping)
- RAID-1 (mirror)
- RAID-Z (similar to RAID 5, but with variable-width
stripes to avoid the RAID 5 write hole)
- RAID-Z2
The zfs man page recommends 3-9 disks for RAID-Z
pools.
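For example, a single RAID-Z vdev within the recommended 3-9 disk range might be created as follows (pool and disk names are hypothetical):
zpool create -f datapool raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0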
Performance Monitoring
ZFS performance management is handled differently
than with older generation file systems. In ZFS,
I/Os are scheduled similarly to
how jobs are scheduled on CPUs.
The ZFS I/O scheduler tracks a priority and a deadline for
each I/O. Within each deadline group, the I/Os are scheduled
in order of logical block address.
Writes are assigned lower priorities than reads,
which can help to avoid traffic jams where reads
are unable to be serviced because they are queued
behind writes. (If a read is issued for a write
that is still underway, the read will be executed
against the in-memory image and will not hit the
hard drive.)
In addition to scheduling, ZFS attempts to intelligently
prefetch information into memory. The algorithm tries
to pick information that is likely to be needed.
Any forward or backward linear access patterns are
picked up and used to perform the prefetch.
The zpool iostat command can monitor performance on ZFS objects:
- USED CAPACITY: Data currently stored
- AVAILABLE CAPACITY: Space available
- READ OPERATIONS: Number of operations
- WRITE OPERATIONS: Number of operations
- READ BANDWIDTH: Bandwidth of all read operations
- WRITE BANDWIDTH: Bandwidth of all
write operations
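For example, per-vdev statistics for a hypothetical pool can be sampled every 5 seconds with:
zpool iostat -v datapool 5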
The health of an object can be monitored with zpool status.
Snapshots and Clones
To create a snapshot:
zfs snapshot pool-name/filesystem-name@snapshot-name
To clone a snapshot:
zfs clone snapshot-name filesystem-name
To roll back to a snapshot:
zfs rollback pool-name/filesystem-name@snapshot-name
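A minimal worked example, using a hypothetical filesystem datapool/home:
zfs snapshot datapool/home@monday
zfs clone datapool/home@monday datapool/home-dev
zfs rollback datapool/home@monday
(The rollback assumes monday is the most recent snapshot of the filesystem.)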
zfs send and zfs receive allow clones of filesystems to be sent to a development environment.
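As a sketch, a snapshot of a hypothetical filesystem could be copied to a development host like this (host and pool names are illustrative only):
zfs send datapool/home@monday | ssh devhost zfs receive devpool/home-copy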
The difference between a snapshot and a clone is that a
clone is a writable, mountable copy of the file system.
This capability allows us to store multiple copies of
mostly-shared data in a very space-efficient way.
Each snapshot is accessible through the .zfs/snapshot directory in the /pool-name directory. This can allow end users to recover their files without system administrator intervention.
Zones
If the filesystem is created in the global zone and added to the local zone via zonecfg, it may be assigned to more than one zone unless the mountpoint is set to legacy.
zfs set mountpoint=legacy pool-name/filesystem-name
To import a ZFS filesystem within a zone:
zonecfg -z zone-name
add fs
set dir=mount-point
set special=pool-name/filesystem-name
set type=zfs
end
verify
commit
exit
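Filled in with hypothetical values (a zone named webzone and a filesystem datapool/webdata mounted at /export/data), the session would look like:
zonecfg -z webzone
add fs
set dir=/export/data
set special=datapool/webdata
set type=zfs
end
verify
commit
exit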
Administrative rights for a filesystem can be granted to a local zone:
zonecfg -z zone-name
add dataset
set name=pool-name/filesystem-name
end
commit
exit
Data Protection
ZFS is a transactional file system.
Data consistency is protected via
Copy-On-Write (COW). For each write request, a copy is
made of the specified block. All changes are made to
the copy. When the write is complete, all pointers are
changed to point to the new block.
Checksums are used to validate data during reads and writes.
The checksum algorithm is user-selectable. Checksumming and data recovery are done at the filesystem level; they are not visible to applications. If a block becomes corrupted on a pool protected by mirroring or RAID, ZFS will identify the correct data value and fix the corrupted value. RAID protection is also part of ZFS.
Scrubbing is an additional type of data protection available on ZFS. This is a mechanism that performs regular validation of all data. Manual scrubbing can be performed by:
zpool scrub pool-name
The results can be viewed via:
zpool status
Any issues should be cleared with:
zpool clear pool-name
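For a hypothetical pool named datapool, the full sequence would be:
zpool scrub datapool
zpool status -v datapool
zpool clear datapool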
The scrubbing operation walks through the pool
metadata to read each copy of each block. Each copy
is validated against its checksum and corrected if
it has become corrupted.
Hardware Maintenance
To replace a hard drive with another device, run:
zpool replace pool-name old-disk new-disk
To offline a failing drive, run:
zpool offline pool-name disk-name
(A -t flag allows the disk to come back online after a reboot.)
Once the drive has been physically replaced, run the replace command against the device:
zpool replace pool-name device-name
After an offlined drive has been replaced, it can be brought back online:
zpool online pool-name disk-name
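Putting the steps together for a hypothetical failing disk c1t3d0 in a pool named datapool:
zpool offline -t datapool c1t3d0
(Physically swap the drive.)
zpool replace datapool c1t3d0
zpool online datapool c1t3d0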
Firmware upgrades may cause the disk device ID to change.
ZFS should be able to update the device ID automatically,
assuming that the disk was not physically moved during the update.
If necessary, the pool can be exported and re-imported to
update the device IDs.
Troubleshooting ZFS
The three categories of errors experienced by ZFS are:
- missing devices: Missing devices are placed in a "faulted" state.
- damaged devices: Caused by things like transient errors from the disk or controller, driver bugs, or accidental overwrites (usually on misconfigured devices).
- data corruption: Data damage to top-level devices; usually requires a restore. Since ZFS is transactional, this only happens as a result of driver bugs, hardware failure, or filesystem misconfiguration.
It is important to check for all three categories of errors.
One type of problem is often connected to a problem from a different
family. Fixing a single problem is usually not sufficient.
Data integrity can be checked by running a manual scrubbing:
zpool scrub pool-name
zpool status -v pool-name checks the status after the scrubbing is complete.
The status command also reports on recovery suggestions for any errors it finds. These are reported in the action section.
To diagnose a problem, use the output of the status command and the fmd messages in /var/adm/messages.
The config section of the status output reports the state of each device. The state can be:
- ONLINE: Normal
- FAULTED: Missing, damaged, or mis-seated device
- DEGRADED: Device being resilvered
- UNAVAILABLE: Device cannot be opened
- OFFLINE: Administrative action
The status command also reports READ, WRITE, or CHKSUM errors.
To check if any problem pools exist, use:
zpool status -x
This command only reports problem pools.
If a ZFS configuration becomes damaged, it can be fixed by running export and import.
Devices can fail for any of several reasons:
- "Bit rot:" Corruption caused by random environmental
effects.
- Misdirected Reads/Writes: Firmware or hardware
faults cause reads or writes to be addressed to the
wrong part of the disk.
- Administrative Error
- Intermittent, Sporadic or Temporary Outages: Caused
by flaky hardware or administrator error.
- Device Offline: Usually caused by administrative
action.
Once the problems have been fixed, transient errors should be cleared:
zpool clear pool-name
In the event of a panic-reboot loop caused by a
ZFS software bug, the system can be instructed to
boot without the ZFS filesystems:
boot -m milestone=none
When the system is up, remount / as rw and remove the file /etc/zfs/zpool.cache.
The remainder of the boot can proceed with the svcadm milestone all command. At that point, import the good pools. The damaged pools may need to be re-initialized.
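A sketch of that recovery sequence, assuming a hypothetical good pool named datapool:
mount -o rw,remount /
rm /etc/zfs/zpool.cache
svcadm milestone all
zpool import datapool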
Scalability
The filesystem is 128-bit; 256 quadrillion zettabytes of information is addressable. Directories can have up to 256 trillion entries. No limit exists on the number of filesystems or files within a filesystem.
ZFS Recommendations
Because ZFS uses kernel addressable memory, we need to
make sure to allow enough system resources to take advantage
of its capabilities. We should run on a system with a
64-bit kernel, at least 1GB of physical memory, and adequate
swap space.
While slices are supported for creating storage pools, their
performance will not be adequate for production uses.
Mirrored configurations should be set up across multiple
controllers where possible to maximize performance and
redundancy.
Scrubbing should be
scheduled on a regular basis to identify problems before they
become serious.
When workloads have distinct latency or other requirements, it makes sense to separate them onto different pools with distinct hard drives. For example, database log files should be on separate pools from the data files.
Root pools are not yet supported in the Solaris 10 6/2006 release, though they are anticipated in a future release. When they become available, it is best to keep the root filesystem in a pool separate from the other filesystems.
On filesystems with many file creations and deletions,
utilization should be kept under 80% to protect performance.
The recordsize parameter can be tuned on ZFS filesystems. When it is changed, it only affects new files. zfs set recordsize=size tuning can help where large files (like database files) are accessed via small, random reads and writes. The default is 128KB; it can be set to any power of two between 512B and 128KB. Where the database uses a fixed block or record size, the recordsize should be set to match. This should only be done for the filesystems actually containing heavily-used database files.
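For example, a hypothetical filesystem holding database files with an 8KB block size could be tuned and verified with:
zfs set recordsize=8K datapool/oracle
zfs get recordsize datapool/oracle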
In general, recordsize should be reduced when iostat regularly shows a throughput near
the maximum for the I/O channel. As with any tuning,
make a minimal change to a working system, monitor it
for long enough to understand the impact of the change,
and repeat the process if the improvement was not good
enough or reverse it if the effects were bad.
The ZFS Evil Tuning Guide contains a number of tuning methods that may or may not be appropriate to a particular installation. As the document suggests, these tuning mechanisms will have to be used carefully, since they are not appropriate to all installations. For example, the Evil Tuning Guide provides instructions for the following (a sample /etc/system fragment appears after the list):
- Turning off file system checksums to reduce CPU usage. This is done on a per-filesystem basis:
zfs set checksum=off filesystem
zfs set checksum='on | fletcher2 | fletcher4 | sha256' filesystem
- Limiting the ARC size by setting set zfs:zfs_arc_max in /etc/system on 8/07 and later.
- If the I/O includes multiple small reads, the file prefetch can be turned off by setting zfs:zfs_prefetch_disable on 8/07 and later.
- If the I/O channel becomes saturated, the device level prefetch can be turned off with set zfs:zfs_vdev_cache_bshift = 13 in /etc/system for 8/07 and later.
- I/O concurrency can be tuned by setting set zfs:zfs_vdev_max_pending = 10 in /etc/system in 8/07 and later.
- If storage with an NVRAM cache is used, cache flushes may be disabled with set zfs:zfs_nocacheflush = 1 in /etc/system for 11/06 and later.
- The ZIL (ZFS Intent Log) can be disabled. (WARNING: Don't do this.)
- Metadata compression can be disabled. (Read this section of the Evil Tuning Guide first; you probably do not need to do this.)
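As a sample only (the ARC cap value here is illustrative, and each parameter applies only to the releases noted above), an /etc/system fragment using some of these settings might look like:
set zfs:zfs_arc_max = 0x40000000
set zfs:zfs_prefetch_disable = 1
set zfs:zfs_nocacheflush = 1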
Sun Cluster Integration
ZFS can be used as a failover-only file system with Sun Cluster
installations.
If it is deployed on disks also used by Sun Cluster,
do not deploy it on any Sun Cluster quorum disks. (A ZFS-owned disk
may be promoted to be a quorum disk on current Sun Cluster versions,
but adding a disk to a ZFS pool may result in quorum keys being overwritten.)
ZFS Internals
Max Bruning wrote an
excellent paper on how to examine the internals of a ZFS data structure.
(Look for the article on the ZFS On-Disk Data Walk.) The structure is defined in
ZFS On-Disk Specification.
Some key structures:
- uberblock_t: The starting point when examining a ZFS file system. A 128KB array of 1KB uberblock_t structures starts at 0x20000 bytes within a vdev label. Defined in uts/common/fs/zfs/sys/uberblock_impl.h. Only one uberblock is active at a time; the active uberblock can be found with zdb -uuu zpool-name.
- blkptr_t: Locates, describes, and verifies blocks on a disk. Defined in uts/common/fs/zfs/sys/spa.h.
- dnode_phys_t: Describes an object. Defined by uts/common/fs/zfs/sys/dmu.h.
- objset_phys_t: Describes a group of objects. Defined by uts/common/fs/zfs/sys/dmu_objset.h.
- ZAP Objects: Blocks containing name/value pair attributes. ZAP stands for ZFS Attribute Processor. Defined by uts/common/fs/zfs/sys/zap_leaf.h.
- Bonus Buffer Objects:
  dsl_dir_phys_t: Contained in a DSL directory dnode_phys_t; contains the object ID for a DSL dataset dnode_phys_t.
  dsl_dataset_phys_t: Contained in a DSL dataset dnode_phys_t; contains a blkptr_t pointing indirectly at a second array of dnode_phys_t for objects within a ZFS file system.
  znode_phys_t: In the bonus buffer of dnode_phys_t structures for files and directories; contains attributes of the file or directory. Similar to a UFS inode in a ZFS context.