Intuitively, the load average is an average over time of the number of processes in the run queue. uptime reports load averages over 1-, 5- and 15-minute intervals. Typically, load averages are divided by the number of CPU cores to find the load per CPU. A load average of 1 per CPU indicates that the CPUs are fully utilized; above that level, runnable work is queuing for a processor. Depending on the type of load and the I/O requirements, user-visible performance may not suffer until levels of 2 per CPU are reached. A general rule of thumb is that load averages persistently above 4 times the number of CPUs will result in sluggish performance.
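For example, the per-CPU figure can be eyeballed by comparing uptime against the processor count reported by psrinfo. The output below is illustrative only, not from any particular system:

    uptime
     10:15am  up 12 day(s),  3:02,  2 users,  load average: 3.85, 3.60, 3.42
    psrinfo | wc -l
           4
    # 3.85 / 4 is roughly 0.96 load per CPU: the processors are essentially
    # saturated, but still well under the 4-per-CPU "sluggish" threshold.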
Prior to Solaris 10, the algorithm computed the load average directly by periodically sampling the length of the run queue. Since this measurement can be skewed by threads that enter and exit the queue more quickly than the sampling interval, Solaris 10 altered the algorithm to use microstate accounting instead. Solaris 10 applies an exponential decay function to a combination of high-resolution usr, sys and thread wait times; the resulting numbers are comparable to a traditional load average.
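The exact kernel computation isn't reproduced here, but the general shape of an exponentially decayed average is easy to sketch. The awk fragment below is purely illustrative; the decay period, update interval and sample values are made up:

    awk 'BEGIN {
        T = 60          # decay period in seconds (a 1-minute average)
        dt = 5          # update interval in seconds
        avg = 0.50      # previous average
        sample = 3.0    # current instantaneous value being folded in
        f = exp(-dt / T)
        printf "new average = %.3f\n", avg * f + sample * (1 - f)
    }'
    new average = 0.700

Each update pulls the average a fraction of the way toward the current sample, so brief spikes fade gradually rather than disappearing at the next sample.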
The load averages can be monitored intermittently via uptime, or over extended time periods by looking at run queue lengths and the amount of time the run queue is occupied via sar -q.
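As a rough sketch (the numbers are invented), sar -q reports the average run queue length (runq-sz) and the percentage of time the run queue was occupied (%runocc):

    sar -q 5 3
    10:20:01  runq-sz %runocc swpq-sz %swpocc
    10:20:06      2.0      45     0.0       0
    10:20:11      3.5      62     0.0       0
    10:20:16      1.0      20     0.0       0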
One issue to watch for is the number of processes that are blocked while waiting for I/O. Check the disk I/O page for information on monitoring this.
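A quick first look is the b column in vmstat's kthr section, which counts kernel threads blocked waiting for I/O (sample output is illustrative):

    vmstat 5
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
     2 3 0 812416 94321   5  20  0  0  0  0  0  1  0  0  0  412 9800 1200 55 20 25
    # b is 3 here: three threads were blocked on I/O during the interval.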
Solaris 10 allows us to directly monitor the amount of time threads wait for a processor via the LAT column of prstat -mL output.
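For example (the values are invented), LAT is the percentage of its time each LWP spent runnable but waiting for a CPU:

    prstat -mL 5
       PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
      1234 oracle    55  10 0.1 0.0 0.0 0.0  15  20  1K 300  5K   0 oracle/2
    # A LAT of 20 means this thread spent about a fifth of the interval
    # waiting for a processor.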
For non-NFS servers, another danger sign is when the system consistently spends more time in sys than usr mode. (nfsd operates in the kernel, in sys mode.) McDougall and Mauro comment that a typical usr/sys ratio is in the neighborhood of 70/30 on a reasonably loaded system.
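The usr/sys split can be checked with sar -u; in the illustrative sample below the ratio is roughly 62/28, close to the 70/30 guideline:

    sar -u 5 3
    10:30:01   %usr   %sys   %wio  %idle
    10:30:06     62     28      4      6
    10:30:11     64     26      3      7
    10:30:16     60     30      4      6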
Another issue to watch for is a high number of system calls per second per processor. With today's faster CPUs, 20,000 would represent a reasonable threshold. This can be monitored via sar -c.
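sar -c breaks the rate down by call type; scall/s is the total, with fork/s and exec/s called out separately (sample numbers are invented):

    sar -c 5 3
    10:35:01  scall/s sread/s swrit/s  fork/s  exec/s rchar/s wchar/s
    10:35:06    24000    3200    1800    2.20    1.80  812416  418212
    10:35:11    26500    3500    2100    2.60    2.00  903112  450031
    # On a 2-CPU system this is over 12,000 calls per second per processor,
    # still under the 20,000 threshold mentioned above.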
In particular, large numbers of forks or execs may indicate excessive context switching. (Slower processors will be able to handle fewer system calls per second.) Context switching can be monitored with vmstat or mpstat.
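mpstat reports per-CPU voluntary (csw) and involuntary (icsw) context switches per second, while vmstat's cs column gives a system-wide figure. The mpstat sample below is illustrative:

    mpstat 5
    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
      0   10   0   20   420  180 3400  220   15   40    0 21000   60  25   0  15
      1   12   0   25   380  150 3100  190   12   35    0 19500   58  27   0  15
    # A high icsw relative to csw suggests threads are being forced off the
    # CPU rather than yielding it voluntarily.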