Friday, April 12, 2013

Kernel Memory

Kernel memory size can be tracked using the sar -k command. The total of the "alloc" fields is the kernel memory size. If it appears to be growing without bound, you may have a memory leak. It should be noted that not all buckets are tracked by sar -k, so the reported memory size is not as accurate as that reported by crash.

On occasion there are problems related to memory leaks in the kernel or one of the associated modules. In these cases, kmastat can provide useful information pinpointing the source of the leak.

To check on kernel memory allocations on a running system, run a crash session as follows:

# crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> kmastat

The first number on the "Total" line represents the total amount of memory allocated by the kernel. If this is a significant fraction of available system memory and growing, there is a problem.

The output from the kmastat command also contains information on a number of "buckets" or categories for memory allocation.

Additional information can be obtained via the kmausers command, but this requires that we load kadb prior to booting. To do this, reach the ok> prompt, then:

ok> boot kadb -d
kadb: (hit the "return" key)
kadb[0]: kmem_flags/W 01
kadb[0]: :c

Loading kadb this way means that kadb will only be effective for this current boot session.

Once the system is up, we can either force a core dump via STOP-A/ ok> sync, or we can examine the live system. In either case, inside the crash session we would type:

>kmausers bucket_name

The result will show memory allocations inside that bucket. The names of functions inside each allocation will be a tip-off to what is actually grabbing the memory allocation.

A script can be run from cron to capture this information. The format of this script would be something like:

#!/bin/sh
date >> log_file
echo "kmastat" | /usr/sbin/crash -w log_file
sleep 20
echo "kmausers kmem_alloc_2048" | /usr/sbin/crash -w log_file

Slab Allocator

Solaris 2.4+ uses a kernel memory allocator known as a slab allocator.

A kernel memory allocator performs the following functions:

  • Allocate memory
  • Initialize objects/structures
  • Use objects/structures
  • Deconstruct objects/structures
  • Free memory

The structures in the memory objects include sub-objects such as linked list headers, mutexes, reference counts and condition variables. In the case of Solaris, the deconstruction step includes setting objects to their initial settings, which can save time when the memory objects have to be re-initialized.

A translation lookaside buffer (TLB) is an associative cache of recent address translations. When the MMU (memory management unit) cannot find a translation in the TLB, it lookus it up in the address maps and loads the address into the TLB. Entries in the TLB are replaced on a least recently used basis.

The slab allocator is organized as a collection of object caches. Each of these caches contains only one type of object (proc structures, vnodes, etc). The kernel is responsible for restoring each object to its initial state when it is released. When a cache requires additional space, the allocator gives a slab of memory from the page-level allocator and creates objects from it. The slab contains enough memory for several object instances. A small part of the slab is used by the cache to manage memory in the slab; the rest is divided into buffers that are the size of the object. The allocator then initializes these buffers with the appropriate constructor.

When the page-level allocator needs to recover memory, unused slabs are reaped by deconstructing the objects on slabs whose objects are all free, then removing the slab from the cache in question.

The structure for each slab includes unused space at the beginning of the slab (coloring area), the set of objects, more unused space (the amount left over after the maximum number of objects has been created), and a slab data area. Each object also includes a four byte area for a free list pointer. The slab data area includes a count of in-use objects, pointers for a doubly-linked list of slabs in the same cache, and a pointer to the first free area in the slab. The coloring areas are different sizes for each slab in a cache (where possible). This allows a balanced distribution of traffic on the hardware caches and memory busses by varying the offsets for the different slabs.

Large object slabs are slightly different in that management data is stored in a separate pool of memory, since large slabs are usually multiples of a page in size. A hash table is also maintained to provide lookups between the management area and the slabs.

No comments: