Thursday, April 04, 2013

Component Parts of Disk I/O

What we blithely call a "Disk I/O" is actually made up of several components, each of which may have an impact on overall performance. These layers may be broken down as follows for a typical I/O operation:

  • POSIX: Application calls a POSIX library interface. (These frequently map directly to system calls; the asynchronous interfaces are the exception, and are implemented in the library via pread() and pwrite().)
  • System Call: The relevant vnode and vfs system calls are:
    vnode system calls:
    • close()
    • creat()
    • fsync()
    • ioctl()
    • link()
    • mkdir()
    • open()
    • read()
    • rename()
    • rmdir()
    • seek()
    • unlink()
    • write()
    vfs system calls:
    • mount()
    • statfs()
    • sync()
    • umount()
  • VOP: The vnode operations interface is the architectural layer between the system calls and the filesystems. DTrace provides the best way to examine this layer. Starting in version 0.96, the DTrace Toolkit's vopstat command allows direct monitoring at this level.
  • Filesystems: There is some discussion of filesystem tuning and filesystem caching at the bottom of this page. Further information on troubleshooting a particular filesystem is contained in each filesystem's web page. (This site contains pages for NFS, UFS and ZFS filesystems.)
  • Physical Disk I/O: This is the portion of the I/O that involves the transfer of data to or from the physical hardware. Traditionally, I/O troubleshooting focuses on this portion of the I/O process.

McDougall, Mauro and Gregg suggest that the best way to see if I/O is a problem at all is to look at the amount of time spent on library and system calls via DTrace.

For example, the DTrace Toolkit's procsystime utility tracks time spent on each system call. Similarly, the dtruss -t syscall -p PID command can examine the time spent on a particular system call for a process. The truss -D -p PID command also reveals the time spent by a process in I/O system calls, but it imposes a severe performance penalty.
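
For example (a sketch; the PID and syscall name are placeholders, and option details vary between Toolkit versions):
# elapsed time per system call for process 1234
procsystime -eT -p 1234
# elapsed time spent in read() calls by process 1234
dtruss -e -t read -p 1234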

If the system call statistics reveal a problem, we should look at the raw disk I/O performance.

Physical Disk I/O

The primary tool to use in troubleshooting disk I/O problems is iostat. sar -d provides useful historical context. vmstat can provide information about disk saturation. For Solaris 10 systems, dtrace can provide extremely fine-grained information about I/O performance and what is causing any utilization or saturation problems. The DTrace Toolkit provides a number of ready-to-use scripts to take advantage of DTrace's capabilities.

To start, use iostat -xn 30 during busy times to look at the I/O characteristics of your devices. Ignore the first block of output (it reports summary statistics since boot) and look at the subsequent 30-second samples. If you are seeing svc_t (service time) values of more than 20 ms on disks that are in use (more than, say, 10% busy), then the end user will see noticeably sluggish performance.
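
As a rough sketch, this one-liner prints only device lines where the service time (the asvc_t column in -xn output) exceeds 20 ms while the disk is more than 10% busy; the column positions assume the Solaris 10 iostat -xn layout:
iostat -xn 30 | awk 'NF == 11 && $11 != "device" && $8 > 20 && $10 > 10'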

(With modern disk arrays that contain significant amounts of cache, it may be more useful to compare against the service times observed during periods when no performance problems are occurring. If the reads and writes are largely hitting the cache on a fiber-attached disk array, average service times in the 3-5 ms range are achievable. If you are seeing a large increase in service time during the problem periods, you may need to look at your disk array's monitoring features to identify whether or not more disk array cache would be useful. With modern disk arrays, throughput measurements are the most useful ones, since large front-end caches mask most other issues.)

Disk Utilization

If a disk is more than 60% busy over sustained periods of time, this can indicate overuse of that resource. The %b iostat statistic provides a reasonable measure for utilization of regular disk resources. (The same statistic can be viewed via iostat -D in Solaris 10.)

Utilization figures need careful interpretation: they do not reflect the usage pattern, utilization numbers reported for disk arrays are almost impossible to interpret correctly, and they do not show whether I/O caching is absorbing the application's load. The service times are the key to seeing whether a high utilization is actually causing a problem.

The DTrace Toolkit provides a way to directly measure disk utilization via the iotop -CP command. This command shows UIDs, process IDs and device names, which can help identify a culprit. (The -C option provides a rolling output rather than having it clear at each time step. The -P option shows the %I/O utilization.)

Disk Saturation

A high disk saturation (as measured via iostat's %w) always causes some level of performance impact, since I/Os are forced to queue up. Even if the disk is not saturated now, it is useful to look at throughput numbers and compare them to the expected maximums to make sure that there is adequate head room for unusually high activity. (We can measure the maximum directly by doing something like a dd or mkfile and looking at the reported throughput.)
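
A rough way to measure that maximum directly (the paths are hypothetical; use an otherwise idle filesystem, and note that array caches can inflate the numbers):
# write 1 GB of sequential data and note the elapsed time
time dd if=/dev/zero of=/testmnt/ddtest bs=1024k count=1024
time mkfile 1g /testmnt/mkfiletest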

If iostat consistently reports %w > 5, the disk subsystem is too busy. In this case, one thing that can be done is to reduce the size of the wait queue by setting sd_max_throttle to 64. (The sd_max_throttle parameter determines how many jobs can be queued up on a single HBA, and is set to 256 by default. If the sd_max_throttle threshold is exceeded, the result is a transport failure error message.)
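
This is set in /etc/system; a minimal sketch using the value suggested above:
* reduce the I/O queue depth from the default of 256
set sd:sd_max_throttle=64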

Reducing sd_max_throttle is a temporary quick fix. Its primary effect is to keep things from getting quite so backed up and spiraling out of control. One of the permanent remedies below needs to be implemented.

Another possible cause for a persistently high %w is SCSI starvation, where low SCSI ID devices receive a lower precedence than a higher-numbered device (such as a tape drive). (See the System Bus/SCSI page for more information.)

Another indication of a saturated disk I/O subsystem is when the procs/b section of vmstat persistently reports a number of blocked processes that is comparable to the run queue (procs|kthr/r). (The run queue is roughly comparable to the load average.)

The DTrace Toolkit's iotop -o 10 command shows disk I/O time summaries. Each process's UID, process ID and device names are shown, along with the number of nanoseconds of disk time spent. This can help us to identify the heavy hitters on a saturated disk.

Usage Pattern

It is useful to know whether our I/O is predominantly random or sequential. Sequential I/O is typical of large file reads and writes, and typically involves operating on one block immediately after its neighbor. With this type of I/O, there is little penalty associated with the disk drive head having to move to a new location. Random I/O, on the other hand, involves large numbers of seeks and rotations, and is usually much slower.

Disk I/O can be investigated to find out whether it is primarily random or sequential. sar -d reports both blocks per second (blks/s) and transfers per second (r+w/s); dividing the former by the latter gives the average I/O size. If the average is below ~32 blocks (16 KB), the I/O is predominantly random; above ~256 blocks (128 KB), it is predominantly sequential. For example, 4,000 blks/s over 500 r+w/s works out to 8 blocks (4 KB) per transfer, which indicates random access. This analysis may be useful when examining alternative disk configurations.

The DTrace Toolkit provides us a way to directly measure seek times using the seeksize.d script. This script is a direct measurement of disk usage patterns. If there are large numbers of large seeks, it indicates that our physical drives are spending a lot of time moving heads around rather than reading or writing data.

To identify the culprit, the DTrace Toolkit contains a script called bitesize.d, which provides a graph of I/O sizes carried out by each process. If there are a large number of small I/Os, the pattern is predominantly random. If there are mostly large I/Os, the process is exhibiting sequential behavior.

DTrace also provides a way to track which files are accessed how often. The args[2]->fi_pathname value from the io provider gives us a handle into this. For example, we could use a one-liner like:

dtrace -n 'io:::start { printf("%6d %-12s %s", pid, execname, args[2]->fi_pathname); }'
to provide raw data for further processing, or we could use an aggregation to collect statistics. The DTrace Toolkit's iosnoop program provides a flexible way to collect this sort of information. (The -h option provides usage notes on how to use it.)
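
As a sketch of the aggregation approach, this one-liner sums the bytes transferred per file and prints the results when interrupted:
dtrace -n 'io:::start { @bytes[args[2]->fi_pathname] = sum(args[0]->b_bcount); }'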

Disk Errors

iostat -eE reports on disk error counts since the last reboot. Keep in mind that several types of events (such as ejecting a CD or some volume manager operations) are counted in this output. Once these error messages rise above 10 in any category, further investigation is warranted.

Restructuring I/O

The usual solutions to a disk I/O problem are:

  • Check filesystem kernel tuning parameters to make sure that DNLC and inode caches are working appropriately. (See "Filesystem Caching" below.)
  • Spread out the I/O traffic across more disks. This can be done in hardware if the I/O subsystem includes a RAID controller, or in software by striping the filesystem (using Solaris Volume Manager/DiskSuite, Veritas Volume Manager or ZFS), by splitting up the data across additional filesystems on other disks, or even by splitting the data across other servers. (In extreme cases, you can even consider striping data over only the outermost cylinders of several otherwise empty disk drives in order to maximize throughput.) Cockcroft recommends 128 KB as a good stripe width for most applications. In an ideal world, the stripe width would be an integer divisor of the average I/O size, so that the traffic is split over all disks in the stripe.
  • Redesign the problematic process to reduce the number of disk I/Os. (Caching is one frequently-used strategy, either via cachefs or application-specific caching.)
  • The write throttle can be adjusted to provide better performance if there are large amounts of sequential write activity. The parameters in question are ufs:ufs_HW and ufs:ufs_LW. These are very sensitive and should not be adjusted too far at one time. When ufs_WRITES is set to 1 (the default), the write throttle is enabled. When the number of outstanding writes exceeds ufs_HW, writes are suspended until the number of outstanding writes drops below ufs_LW. Both can be increased where large amounts of sequential writes are occurring (see the sketch after this list).
  • tune_t_fsflushr sets the number of seconds between fsflush runs; autoup dictates how frequently each bit of memory is checked. Setting fsflush to run less frequently can also reduce disk activity, but it runs the risk of losing data that has been written to memory but not yet flushed to disk. These parameters can be adjusted on the fly using adb while searching for an optimum value; the final values should then be set in the /etc/system file.
  • Check for SCSI starvation, i.e., for busy high-numbered SCSI devices (such as tape drives) that have a higher priority than lower-numbered devices.
  • Database I/O should be done to raw disk partitions or direct unbuffered I/O.
  • In some cases, it may be worthwhile to move frequently-accessed data to the outer edge of a hard drive. In the outer cylinders, the read and write rates are higher.
  • It may be worthwhile to match observed and configured I/O sizes by tuning maxphys and maxcontig.
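
As an example of the write throttle adjustment mentioned above, the following /etc/system entries raise ufs_HW and ufs_LW (the values are illustrative; keep ufs_LW below ufs_HW and adjust in small steps):
* raise the UFS write throttle for heavy sequential writes
set ufs:ufs_HW=16777216
set ufs:ufs_LW=8388608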

Filesystem Performance

Physical disk I/O is usually the focus of I/O troubleshooting sessions. McDougall, Mauro and Gregg suggest that it is more appropriate to focus on overall service times of I/O related system calls. (As noted above, the DTrace Toolkit's procsystime utility tracks time spent on each system call, and the dtruss -t syscall -p PID command can examine the time spent on a particular system call for a process. The pfilestat utility in the newer versions of the Toolkit also gives an indication of how much time a process spends on different I/O-related system calls.)

This approach allows end-to-end monitoring of the important portions of the I/O process. The traditional approach ignores performance problems introduced by the filesystem itself.

Filesystem latency may come from any of the following:

  • Disk I/O wait: This may be as short as zero, in the event of a read cache hit. For a synchronous I/O event, this can be reduced by restructuring disk storage or by altering caching parameters. Disk I/O wait can be monitored directly through dtrace, including through the iowait.d script.
  • Filesystem cache misses: These include block, buffer, metadata and name lookup caches. These may be adjustable by increasing the size of the relevant caches.
  • I/Os being broken into multiple pieces, incurring the penalty of additional operations. This may be a result of the maximum cluster size for the filesystem or the OS.
  • Filesystem locking: Most filesystems have per-file reader/writer locks. This can be most significant when there is a large file (like a database file) where reads have to wait for writes to a different portion of the file. Direct I/O is a mechanism for bypassing this limitation.
  • Metadata updating: Creations, renames, deletions and some file extensions cause some extra latency to allow for updates to filesystem metadata.

The DTrace Toolkit's vopstat command allows monitoring of the number and duration of operations at the VOP level. (VOP is the architectural layer between the system calls and the filesystems, so it is at a high enough level to provide interesting information.)

Filesystem Caching

There are several types of cache used by the Solaris filesystems to cache name and attribute lookups. These are:

  • DNLC (Directory Name Lookup Cache): This cache stores name-to-vnode lookup information, preventing the need to search directories on the fly. (Solaris 7 and higher removed a previous restriction on the length of cached names.)
  • inode cache: This cache stores logical metadata information about files (size, access time, etc). It is a linked list that stores the inodes and pointers to all pages that are part of that file and are currently in memory. The inode cache is dedicated for use by UFS.
  • rnode cache: This is maintained on NFS clients to store information about NFS-mounted nodes. In addition, an NFS attribute cache stores logical metadata information.
  • buffer cache: The buffer cache stores inode, indirect block and cylinder group-related disk I/O. This references the physical metadata (e.g., block placement in the filesystem), as opposed to the logical metadata that is stored in other caches.

(Note that cache statistics will be skewed by things that walk the directory tree like find.)

The block cache provides performance enhancement by using otherwise idle memory in the page cache to keep copies of recently requested information. Cache hits in the block cache obviously have a huge performance advantage.

ZFS uses an adaptive replacement cache (ARC) rather than using the page cache for file data (as most other filesystems do).

Directory Name Lookup Cache

The DNLC stores directory vnode/path translation information. (Starting with Solaris 7, a previous restriction that limited caching to names of 30 characters or fewer was lifted.)

sar -a reports on the activity of this cache. In this output, namei/s reports the name lookup rate and iget/s reports the number of directory lookups per second. Note that an iget is issued for each component of a file's path, so the hit rate cannot be calculated directly from the sar -a output. The sar -a output is useful, however, when looking at cache efficiency in a more holistic sense.

For our purposes, the most important number is the total name lookups line in the vmstat -s output, or the dir_hits and dir_misses statistics in kstat -n dnlcstats. If the cache hit percentage is not above 90%, the DNLC should be resized. (Unless the activity is such that we would not expect a good hit ratio, such as large numbers of file creations.)
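
Both sets of numbers can be pulled quickly from the command line:
# the hit percentage appears on the "total name lookups" line
vmstat -s | grep 'total name lookups'
# raw counters for calculating the ratio (Solaris 8 and later)
kstat -p -n dnlcstats | egrep 'dir_hits|dir_misses'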

DNLC size is determined by the ncsize kernel parameter. By default, this is set to (17 x maxusers) + 90 (Solaris 2.5.1) or 4 x (maxusers + max_nprocs) + 320 (Solaris 2.6-10). It is not recommended that it be set any higher than a value corresponding to a maxusers value of 2048 for Solaris 2.5.1-7 or 4096 for Solaris 8-10. The current value can be viewed via mdb -k by querying ncsize/D.
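
For example (the second line of output shows the current value; the number here is illustrative):
echo "ncsize/D" | mdb -k
ncsize:
ncsize:         68352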

To set ncsize, add a line to the /etc/system as follows:
set ncsize=10000

The DNLC can be disabled by setting ncsize to a negative number (Solaris 2.5.1-7) or a non-positive number (Solaris 8-10).

Inode Cache

The inode cache is a linked list that stores the inodes that have been accessed along with pointers to all pages that are part of that file and are currently in memory.

sar -g reports %ufs_ipf, which is the percentage of inodes that were overwritten while still having active pages in memory. If this number is consistently nonzero, the inode cache should be increased. By default, this number (ufs_ninode) is set to the same value as ncsize, unless otherwise specified in the /etc/system file. As with ncsize, it is not recommended that ufs_ninode be set any higher than the ncsize corresponding to a maxusers value of 2048 for Solaris 2.5.1-7 or 4096 for Solaris 8-10.

The netstat -k output also contains summary information about the inode cache in its inode_cache section (kstat -n inode_cache on Solaris 8 and later). Among other things, this section includes sizing and hit rate information.

(The inode cache can grow beyond the ufs_ninode limit. When this happens, unused inodes will be flushed from the linked list.)

netstat -k (through Solaris 9) or kstat -n ufs_inode_cache (Solaris 8 and later) also report on inode cache statistics.

While resizing the inode cache, it is important to remember that each inode will use about 300 bytes of kernel memory. Check your kernel memory size (perhaps with sar -k) when resizing the cache. Since ufs_ninode is just a limit, it can be resized on the fly with adb.
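
A permanent setting goes in /etc/system (the value is illustrative):
set ufs:ufs_ninode=20000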

Rnode Cache

The information in the rnode cache is similar to that from the inode cache, except that it is maintained for NFS-mounted files. The default rnode cache size is 2xncsize, which is usually sufficient. Rnode cache statistics can be examined in the rnode_cache section of netstat -k or via the kstat command.

Buffer Cache

The buffer cache is used to store inode, indirect block and cylinder group-related disk I/O. The hit rate on this cache can be discovered by examining the biostat section of the output from netstat -k and comparing the buffer cache hits to the buffer cache lookups. This cache acts as a buffer between the inode cache and the physical disk devices.

Sun suggests tuning bufhwm in the /etc/system file if sar -b reports less than 90% hit rate on reads or 65% on writes.

Cockcroft notes that performance problems can result from allowing the buffer cache to grow too large, resulting in kernel memory allocation starvation. The default setting for bufhwm allows the buffer cache to consume up to 2% of system memory, which may be excessive. The buffer cache can probably be limited to 8 MB safely by setting bufhwm (in KB) in the /etc/system file:
set bufhwm=8000

Obviously, the effects of such a change should be examined by checking the buffer cache hit rate with sar -b.

Page Cache

The virtual memory on UltraSPARC systems is carved into 8 KB chunks known as "pages." When a file is read, it is first loaded into memory, a process known as "paging in." These page-ins are recorded in the virtual memory statistics, such as the pi column in vmstat.

Items that are paged into memory are cached there for a time. Since the same files are frequently accessed repeatedly, this caching can dramatically improve I/O performance. We would expect the size of the page cache from read and write operations to be limited by segmap_percent, which has a default of 12% of physical memory.

The page scanner's job is to free up memory by reclaiming cached pages that have not been accessed recently. Pages are made available by placing them on the free list.

The size of the page cache and its components can be viewed by running mdb -k and using the ::memstat dcmd. The performance of the cache can be viewed with utilities available in the DTrace Toolkit; the rfileio and rfsio utilities provide cache hit rates.
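
For example (rfileio ships with newer versions of the DTrace Toolkit):
# page cache composition by memory segment
echo "::memstat" | mdb -k
# file read cache hit rate
rfileio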

The page cache is bypassed by using direct I/O.

Physical Disk Layout

The disk layout for a hard drive includes the following:

  • bootblock
  • superblock: Superblock contents can be reported via the fstyp -v /dev/dsk/* command.
  • inode list: The number of inodes for a filesystem is calculated based upon a presumption of an average file size of ~2 KB. If this is not a good assumption, the number of inodes can be set via the newfs -i or mkfs command.
  • data blocks

Inodes

Each inode contains the following information:
  • file type, permissions, etc
  • number of hard links to the file
  • UID
  • GID
  • byte size
  • array of block addresses:
    The first several block addresses are used for data storage. Other block addresses store indirect blocks, which point at arrays containing pointers to further data blocks. Each inode contains 12 direct block pointers and 3 indirect block pointers.
  • generation number (incremented each time the inode is re-used)
  • access time
  • modification time
  • change time
  • Number of sectors: This is kept to allow support for holey files, and can be reported via ls -s
  • Shadow inode location: This is used for ACLs (access control lists).

Using the indirection provided in the array of block addresses, files can be created that contain holes, or large sets of null-filled bytes.

Physical I/O

Disk I/Os include the following components:
  • I/O bus access: If the bus is busy, the request is queued by the driver. This queuing is reported by iostat -x (the wait and %w columns) and by sar -d (avwait).
  • Bus transfer time: Arbitration time (which device gets to use the bus--see the System Bus/SCSI page), time to transfer the command (usually ~ 1.5 ms), data transfer time (in the case of a write).
  • Seek time: Time for the head to move to the proper cylinder. Average seek times are reported by hard drive manufacturers. Usage patterns and the layout of data on the disks will determine the number of seeks that are required.
  • Rotation time: Time for the correct sector to rotate under the head. This is usually calculated as 1/2 the time for a disk rotation. Rotation speeds (in RPM) are reported by hard drive manufacturers.
  • ITR time: Internal Throughput Rate. This is the amount of time required for a transfer between the hard drive's cache and the device media. The ITR time is the limiting factor for sequential I/O, and is reported by the hard drive manufacturer.
  • Reconnection time: After the data has been moved to/from the hard drive's internal cache, a connection with the host adapter must be completed. This is similar to the arbitration/ command transfer time discussed above.
  • Interrupt time: Time for the completion interrupt to be processed. This is very hard to measure, but high interrupt rates on the CPUs associated with this system board may be an indication of problems.

The disk's ITR rating and internal cache size can be critical when tuning maxcontig (maximum contiguous I/O size). Note: maxphys and maxcontig must be tuned at the same time. The unit of measurement for maxphys is bytes; maxcontig is in blocks.

maxcontig can be changed via the mkfs, newfs or tunefs commands.

By default, maxphys is set to 128KB for Sparc and 56KB for x86 systems. maxcontig should be set to the same size (but in blocks). We would tune these smaller for random I/O and larger for sequential I/O.
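
A sketch of matching the two for large sequential I/O (the values and device name are illustrative; maxcontig here is counted in 8 KB UFS blocks, so 128 blocks = 1 MB):
* /etc/system: allow 1 MB physical transfers
set maxphys=1048576
# match maxcontig on an existing filesystem: 1048576 / 8192 = 128
tunefs -a 128 /dev/rdsk/c0t0d0s6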

Direct I/O

Large sequential I/O can cause performance problems due to excessive use of the memory page cache. One way to avoid this problem is to use direct I/O on filesystems where large sequential I/Os are common.

Direct I/O is a mechanism for bypassing the memory page cache altogether. It is enabled by the directio() library function or by the forcedirectio option to mount.
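
For example (the device and mount point are hypothetical):
mount -F ufs -o forcedirectio /dev/dsk/c0t0d0s5 /u01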

VxFS enables direct I/O for large sequential operations. It determines which operations are "large" by comparing them to the vxtunefs parameter discovered_direct_iosz (default 256KB).

One problem that can emerge is that if large sequential I/Os are handed to VxFS as several smaller operations, caching will still occur; this can be a particular problem in OLTP environments. It can be alleviated by reducing discovered_direct_iosz to a level that prevents caching of the smaller operations.
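
A sketch of lowering the threshold with vxtunefs (the mount point is hypothetical, and 128 KB is an illustrative value):
vxtunefs -o discovered_direct_iosz=131072 /u02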
