Tuesday, April 16, 2013

Solaris Process Scheduling

In Solaris, the runnable thread with the highest priority is scheduled first. Kernel thread scheduling information can be revealed with ps -elcL.

A process can exist in one of the following states: running, sleeping or ready.

Kernel Threads Model

The Solaris 10 kernel threads model consists of the following major objects:

  • kernel threads: This is what is scheduled/executed on a processor
  • user threads: The user-level thread state within a process.
  • process: The object that tracks the execution environment of a program.
  • lightweight process (lwp): Execution context for a user thread. Associates a user thread with a kernel thread.

In the Solaris 10 kernel, kernel services and tasks are executed as kernel threads. When a user thread is created, the associated lwp and kernel threads are also created and linked to the user thread.

(This single-level model was first introduced in Solaris 8's alternative threads library, which was made the default in Solaris 9. Prior to that, user threads had to bind to an available lwp before becoming eligible to run on the processor.)

Priority Model

The Solaris kernel is fully preemptible. This means that all threads, including the threads that support the kernel's own activities, can be deferred to allow a higher-priority thread to run.

Solaris recognizes 170 different priorities, 0-169. Within these priorities fall a number of different scheduling classes:

  • TS (timeshare): This is the default class for processes and their associated kernel threads. Priorities within this class range 0-59, and are dynamically adjusted in an attempt to allocate processor resources evenly.
  • IA (interactive): This is an enhanced version of the TS class that applies to the in-focus window in the GUI. Its intent is to give extra resources to processes associated with that specific window. Like TS, IA's range is 0-59.
  • FSS (fair-share scheduler): This class is share-based rather than priority-based. Threads managed by FSS are scheduled based on their associated shares and the processor's utilization. FSS also has a range 0-59.
  • FX (fixed-priority): The priorities for threads associated with this class are fixed. (In other words, they do not vary dynamically over the lifetime of the thread.) FX also has a range 0-59.
  • SYS (system): The SYS class is used to schedule kernel threads. Threads in this class are "bound" threads, which means that they run until they block or complete. Priorities for SYS threads are in the 60-99 range.
  • RT (real-time): Threads in the RT class are fixed-priority, with a fixed time quantum. Their priorities range 100-159, so an RT thread will preempt a system thread.

Of these, FSS and FX were implemented in Solaris 9. (An extra-cost option for Solaris 8 included the SHR (share-based) class, but this has been subsumed into FSS.)

Fair Share Scheduler

The default Timesharing (TS) scheduling class in Solaris attempts to allow each process on the system to have relatively equal CPU access. The nice command allows some management of process priority, but the new Fair Share Scheduler (FSS) allows more flexible process priority management that integrates with the project framework.

Each project is allocated a certain number of CPU shares via the project.cpu-shares resource control. Each project is allocated CPU time based on its cpu-shares value divided by the sum of the cpu-shares values for all active projects.

Anything with a zero cpu-shares value will not be granted CPU time until all projects with non-zero cpu-shares are done with the CPU.

The maximum number of shares that can be assigned to any one project is 65535.

FSS can be assigned to processor sets, resulting in more sensitive control of priorities on a server than raw processor sets. The dispadmin command sets the system's default scheduling class, using a form like:
dispadmin -d FSS
This change takes effect at the next reboot. To move existing processes into FSS now, rather than waiting for the next reboot, run a command like the following:
priocntl -s -c FSS -i all
prctl can set cpu-shares for a running project:
prctl -n project.cpu-shares -v number-shares -r -i project project-name

The Fair Share Scheduler should not be combined with the TS, FX (fixed-priority) or IA (interactive) scheduling classes on the same CPU or processor set. All of these scheduling classes use priorities in the same range, so unexpected behavior can result from combining FSS with any of these. (There is no problem, however, with running TS and IA on the same processor set.)

To move a specific project's processes into FSS, run something like:
priocntl -s -c FSS -i projid project-ID

All processes can be moved into FSS by first converting init, then the rest of the processes:
priocntl -s -c FSS -i pid 1
priocntl -s -c FSS -i all

Implementation Details

Time Slicing for TS and IA

TS and IA scheduling classes implement an adaptive time slicing scheme that increases the priority of I/O-bound processes at the expense of compute-bound processes. The exact values that are used to implement this can be found in the dispatch table. To examine the TS dispatch table, run the command dispadmin -c TS -g. (If units are not specified, dispadmin reports time values in ms.)

The following values are reported in the dispatch table:

  • ts_quantum: This is the default length of time assigned to a process with the specified priority.
  • ts_tqexp: This is the new priority that is assigned to a process that uses its entire time quantum.
  • ts_slpret: The new priority assigned to a process that blocks before using its entire time quantum.
  • ts_maxwait: If a thread does not receive CPU time during a time interval of ts_maxwait, its priority is raised to ts_lwait.
  • ts_lwait: The new priority given to a thread that has waited at least ts_maxwait without receiving CPU time.

The man page for ts_dptbl contains additional information about these parameters.

dispadmin can be used to edit the dispatch table to affect the decay of priority for compute-bound processes or the growth in priority for I/O-bound processes. Obviously, the importance of the different types of processing on different systems will make a difference in how these parameters are tweaked. In particular, ts_maxwait and ts_lwait can prevent CPU starvation, and raising ts_tqexp slightly can slow the decline in priority of CPU-bound processes.

In any case, the dispatch tables should only be altered slightly at each step in the tuning process, and should only be altered at all if you have a specific goal in mind.

The following are some of the sorts of changes that can be made:

  • Decreasing ts_quantum favors IA class objects.
  • Increasing ts_quantum favors compute-bound objects.
  • ts_maxwait and ts_lwait control CPU starvation.
  • ts_tqexp can cause compute-bound objects' priorities to decay more or less rapidly.
  • ts_slpret can cause I/O-bound objects' priorities to rise more or less rapidly.

RT objects time slice differently in that ts_tqexp and ts_slpret do not increase or decrease the priority of the object. Each RT thread will execute until its time slice is up or it is blocked while waiting for a resource.

IA objects add 10 to the regular TS priority of the process in the active window. This priority shifts with the focus on the active window.

Time Slicing for FSS

In FSS, the time quantum is the length of time that a thread is allowed to run before it has to release the processor. This can be checked using
dispadmin -c FSS -g

The QUANTUM is reported in ms. (The output of the above command displays the resolution in the RES parameter. The default is 1000 slices per second.) It can be adjusted using dispadmin as well. First, run the above command and capture the output to a text file (filename.txt). Then run the command:
dispadmin -c FSS -s filename.txt


Solaris handles callouts with a callout thread that runs at maximum system priority, which is still lower than any RT thread. RT callouts are handled separately and are invoked at the lowest interrupt level, which ensures prompt processing.

Priority Inheritance

Each thread has two priorities: a global priority and an inherited priority. The inherited priority is normally zero unless the thread is holding a resource that is required by a higher-priority thread.

When a thread blocks on a resource, it attempts to "will" or pass on its priority to all threads that are directly or indirectly blocking it. The pi_willto() function checks each thread that is holding the resource or that is blocking a thread in the synchronization chain. When it sees threads with a lower priority, those threads inherit the priority of the blocked thread. It stops traversing the synchronization chain when it hits an object that is not blocked or a thread that has a higher priority than the willing thread.

This mechanism is of limited use with condition variables, semaphores, and read/write locks. For read/write locks, an owner-of-record is defined, and the inheritance works as above. If several threads share a read lock, however, the inheritance only works on one thread at a time.

Thundering Herd

When a resource is freed, all threads awaiting that resource are woken. This results in a footrace to obtain access to that object; one succeeds and the others return to sleep. This can lead to wasted overhead for context switches, as well as a problem with lower priority threads obtaining access to an object before a higher-priority thread. This is called a "thundering herd" problem.

Priority inheritance is an attempt to deal with this problem, but some types of synchronization do not use inheritance.


Each synchronization object (lock) contains a pointer to a structure known as a turnstile. These contain the data needed to manipulate the synchronization object, such as a queue of blocked threads and a pointer to the thread that is currently using the resource. Turnstiles are dynamically allocated based on the number of allocated threads on the system. A turnstile is allocated by the first thread that blocks on a resource and is freed when no more threads are blocked on the resource.

Turnstiles queue the blocked threads according to their priority. Turnstiles may issue a signal to wake up the highest-priority thread, or they may issue a broadcast to wake up all sleeping threads.

Adjusting Priorities

The priority of a process can be adjusted with priocntl or nice, and the priority of an LWP can be controlled with priocntl().

Real Time Issues

STREAMS processing is moved into its own kernel threads, which run at a lower priority than RT threads. If an RT thread places a STREAMS request, it may be serviced at a lower priority level than is merited.

Real time processes also lock all their pages in memory. This can cause problems on a system that is underconfigured for the amount of memory that is required.

Since real time processes run at such a high priority, system daemons may suffer if the real time process does not permit them to run.

When a real time process forks, the new process also inherits real time privileges. The programmer must take care to prevent unintended consequences. Loops can also be hard to stop, so the programmer also needs to make sure that the program does not get caught in an infinite loop.


Interrupt levels run between 0 and 15. Some typical interrupts include:
  • soft interrupts
  • SCSI/FC disks (3)
  • Tape, Ethernet
  • Video/graphics
  • clock() (10)
  • serial communications
  • real-time CPU clock
  • Nonmaskable interrupts (15)
