Lookaside Lists

Software doesn’t have to be very complex before its performance-aware programmer wonders whether some of its dynamic memory allocations come and go with such frequency that they might better be freed not by returning them to the operating system’s memory manager but instead by caching them for quick reuse. This is especially attractive when the allocations are for a particular purpose and are all the same size, so that the cache for that purpose need be nothing more sluggish than a single-linked list. In a simple implementation, allocations are sought from the cache first and only then from the memory manager, and freed memory goes only to the cache. This works well enough when the demand is persistent, but more sophistication is needed when demand has significant peaks. Between such peaks, the cached allocations become a large amount of memory that is unavailable for other use and which might better be returned to the operating system after all. The aim, then, is for the operating system to provide for quickly reusing fixed-size blocks of memory with light-weight management of demand and waste.

The Windows kernel certainly counts as complex software, especially in combination with its ecology of drivers that execute in kernel mode, and so even version 3.10 provides some sort of caching of fixed-size memory allocations. That first draft, based around exported functions such as ExInitializeZone and macros such as ExAllocateFromZone, is perhaps better described as batching its memory allocations. It also is perhaps better passed over. Though the code that supports this early feature is retained to this day, the relevant functions and macros were documented as obsolete as early as the Device Driver Kit (DDK) for Windows 2000 in favour of the lookaside lists that had been introduced for Windows NT 4.0.

Supporting Functions

Each lookaside list has a control structure that provides access to a single-linked list of cached memory allocations. When they were introduced, lookaside lists came in two types, distinguished by whether the allocations are intended to be of non-paged or paged memory. Separate sets of exported functions and inline routines provided for operating on these separate types of lookaside list. Windows Vista unified these two types of lookaside list at the price of introducing a third set of functions and routines.

Function or Routine Versions Remarks
ExAllocateFromLookasideListEx   inline routine defined in WDM.H, starting with Windows Vista
ExAllocateFromNPagedLookasideList   inline routine defined in NTDDK.H or WDM.H, starting with Windows NT 4.0
ExAllocateFromPagedLookasideList 4.0 and higher (x86);
5.2 only (x64)
also as inline routine defined in WDM.H, starting with Windows XP,
except for x86 if _WIN2K_COMPAT_SLIST_USAGE defined
ExDeleteLookasideListEx 6.0 and higher  
ExDeleteNPagedLookasideList 4.0 and higher  
ExDeletePagedLookasideList 4.0 and higher  
ExFlushLookasideListEx 6.0 and higher  
ExFreeToLookasideListEx   inline routine defined in WDM.H, starting with Windows Vista
ExFreeToNPagedLookasideList   inline routine defined in NTDDK.H or WDM.H, starting with Windows NT 4.0
ExFreeToPagedLookasideList 4.0 and higher (x86);
5.2 only (x64)
also as inline routine defined in WDM.H, starting with Windows XP,
except for x86 if _WIN2K_COMPAT_SLIST_USAGE defined
ExInitializeLookasideListEx 6.0 and higher  
ExInitializeNPagedLookasideList 4.0 and higher  
ExInitializePagedLookasideList 4.0 and higher  

Each type of lookaside list has exported functions to initialise the list and eventually to delete it. Initialisation prepares the control structure, but not only to enter caller-supplied parameters for the list’s later explicit operation. It also links the control structure into a double-linked list of all control structures of the same type. This provides for the kernel’s implicit operation on the list, as when trimming the cache between peaks in demand. The corresponding deletion (before the control structure is itself removed from memory) is therefore vital. As Microsoft puts it, though not until the DDK for Windows XP: “It is a serious programming error to do otherwise.”

Between initialisation and deletion, inline routines allocate from or free to the list. Allocation from the list means first to pop one entry from the list’s single-linked cache, else to obtain a fresh allocation according to the caller-supplied parameters. Among these can be a caller-supplied allocation routine, but the default is to allocate from the non-paged or paged pool. For all practical effect, the ExAllocateFrom... routines are the closely corresponding ExAllocatePoolWithTag but with the frequent gain of retrieving from the specialised list instead of contending with the many other callers of the general function. To free to the list is to push the allocation onto the list’s single-linked cache in anticipation of quick reuse, but again with a fallback: if the list is deemed to be big enough, it is instead freed more permanently by passing it through a caller-supplied routine, the default being to return the allocation to whichever pool it came from.

Supporting Structures

Each set of functions that operate on lookaside lists works with a corresponding control structure:

All three control structures are documented as opaque, much of the point being that all manipulation of them (if not all useful inspection of them) can be done through the documented functions. If you looked only at the sizes of these structures as known to the kernel, e.g., from type information in public symbol files, you might think the structures have seen non-trivial change:

Version NPAGED_LOOKASIDE_LIST PAGED_LOOKASIDE_LIST LOOKASIDE_LIST_EX
Size (x86) Size (x64) Size (x86) Size (x64) Size (x86) Size (x64)
4.0 to 5.0 0x50   0x68      
5.1 to early 5.2 0x0100   0x0100      
late 5.2 0xC0 0x80 0xC0 0x80    
6.0 and higher 0xC0 0x80 0xC0 0x80 0x48 0x60

It happens, however, that the variation in the sizes of the NPAGED_LOOKASIDE_LIST and PAGED_LOOKASIDE_LIST is all explained by the introduction of cache alignment, a consequent shift of the last member, and then a change in the alignment requirement. When cache-alignment was first applied to lookaside lists for Windows XP, the cache line was taken as 0x80 bytes. Windows Server 2003 SP1 reduced it to 0x40. Either way, the cache-aligned size would be 0x80 bytes for both x86 and x64 builds were it not for the particular way that the cache alignment is done: it applies not only to the whole structure but also to the x86 build’s last member which thus shifts to offset 0x80. It is some sort of irony that this last member is ignored by the kernel’s x86 builds in all these versions that have cache alignment: this structure that is designed for efficient memory use is a cache block bigger for what looks to be no benefit to anyone.

Cache alignment is anyway not universal. The C-language definitions in WDM.H omit the specification of cache alignment when building for 32-bit Windows if WDM.H is included from any of the usual headers for kernel-mode programming, i.e., NTDDK.H, NTIFS.H or NDIS.H. For most kernel-mode programming, including Microsoft’s, the 32-bit NPAGED_LOOKASIDE_LIST and PAGED_LOOKASIDE_LIST are 0x50 and 0x68 bytes in all versions.

General Lookaside List

All three public structures for lookaside lists are based in some sense on what was originally the GENERAL_LOOKASIDE but has since Windows Vista been named the GENERAL_LOOKASIDE_POOL.

Version GENERAL_LOOKASIDE GENERAL_LOOKASIDE_POOL
Size (x86) Size (x64) Size (x86) Size (x64)
4.0 to 5.0 0x48      
5.1 to early 5.2 0x80 0x80    
6.0 and higher 0x80 0x80 0x48 0x60

The plain GENERAL_LOOKASIDE became larger in version 5.1 because of cache alignment. The GENERAL_LOOKASIDE_POOL was introduced, without cache alignment, for Windows Vista in a reworking of the slightly different lookaside lists that the kernel dedicates to per-processor management of small pool allocations. In Windows Vista and since, each KPRCB for a processor has 0x40 or 0x60 of these lookaside lists in two or three arrays named PPNPagedLookasideList, PPPagedLookasideList and (in version 6.2 and higher) PPNxPagedLookasideList. Cache alignment of each structure in these arrays would waste a lot of space. Thus did the unaligned layout from version 4.0 become restored but need a new name.

Offset (x86) Offset (x64) Definition Versions
0x00 0x00
SLIST_HEADER ListHead;
4.0 to 5.2
union {
    SLIST_HEADER ListHead;
    SINGLE_LIST_ENTRY SingleListHead;
};
6.0 and higher
0x08 0x10
USHORT Depth;
4.0 and higher
0x0A  
USHORT Pad;
4.0 only
0x12
USHORT MaximumDepth;
5.0 and higher
0x0C 0x14
ULONG TotalAllocates;
4.0 and higher
0x10  
ULONG AllocateMisses;
4.0 only
0x18
union {
    ULONG AllocateMisses;
    ULONG AllocateHits;
};
5.0 and higher
0x14 0x1C
ULONG TotalFrees;
4.0 and higher
0x18  
ULONG FreeMisses;
4.0 only
0x20
union {
    ULONG FreeMisses;
    ULONG FreeHits;
};
5.0 and higher
0x1C 0x24
POOL_TYPE Type;
4.0 and higher
0x20 0x28
ULONG Tag;
4.0 and higher
0x24 0x2C
ULONG Size;
4.0 and higher
0x28 0x30
PALLOCATE_FUNCTION Allocate;
4.0 to 5.2
union {
    PALLOCATE_FUNCTION_EX AllocateEx;
    PALLOCATE_FUNCTION Allocate;
};
6.0 and higher
0x2C 0x38
PFREE_FUNCTION Free;
4.0 to 5.2
union {
    PFREE_FUNCTION_EX FreeEx;
    PFREE_FUNCTION Free;
};
6.0 and higher
0x30 0x40
LIST_ENTRY ListEntry;
4.0 and higher
0x38 0x50
ULONG LastTotalAllocates;
4.0 and higher
0x3C  
ULONG LastAllocateMisses;
4.0 only
0x54
union {
    ULONG LastAllocateMisses;
    ULONG LastAllocateHits;
};
5.0 and higher
0x40 0x58
ULONG Future [2];
4.0 and higher

Though the word at offset 0x0A is defined as Pad for version 4.0, it is in fact used as the MaximumDepth all along.

The pool lookaside lists that are dedicated to per-processor management of small pool allocations date from version 5.0. Version 4.0 has pool lookaside lists too but not separately for each processor. The control structures are in the kernel’s data and have a slightly different implementation (see below) in which AllocateMisses, FreeMisses and LastAllocateMisses are instead AllocateHits, FreeHits and LastAllocateHits.

Initialisation of a lookaside list allows the specification of callback functions. In lookaside lists that are initialised by ExInitializeLookasideListEx, these callbacks are slightly different, such that Allocate and Free are instead AllocateEx and FreeEx.

The Non-Paged Lookaside List

The NPAGED_LOOKASIDE_LIST is for non-paged memory that can be allocated and freed even at DISPATCH_LEVEL. It elaborates on the GENERAL_LOOKASIDE by adding a spin lock that provided early versions with synchronisation for access to the ListHead.

Offset (x86) Offset (x64) Definition Versions
0x00 0x00
GENERAL_LOOKASIDE L;
4.0 and higher
0x48 (4.0 to 5.0);
0x80
 
KSPIN_LOCK Lock;
4.0 to 5.0
KSPIN_LOCK Lock__ObsoleteButDoNotDelete;
5.1 and higher

The spin lock never was defined for 64-bit Windows, which never has needed external synchronisation for operations on an SLIST_HEADER. The spin lock became unnecessary for 32-bit Windows too, once Windows XP required the cmpxchg8b instruction. Though the spin lock could not simply be dropped from the definition (lest a new driver with a new NPAGED_LOOKASIDE_LIST find itself running on an older Windows version that would expect to initialise and use the spin lock), its explicit retention for the kernel’s definition of the structure has been nothing but waste since space for the spin lock is covered by the contemporaneous introduction of cache alignment.

The Paged Lookaside List

Memory that is managed in a paged lookaside list is allocated and freed at no higher than APC_LEVEL. The PAGED_LOOKASIDE_LIST is a GENERAL_LOOKASIDE plus a mutex instead of a spin lock.

Offset (x86) Offset (x64) Definition Versions
0x00 0x00
GENERAL_LOOKASIDE L;
4.0 and higher
0x48 (4.0 to 5.0);
0x80
 
FAST_MUTEX Lock;
4.0 to 5.0
FAST_MUTEX Lock__ObsoleteButDoNotDelete;
5.1 and higher

Again the mutex never was defined for 64-bit Windows and its explicit retention for 32-bit Windows XP and higher is nothing but waste because of the structure’s expansion for cache alignment.

The Extended Lookaside List

The LOOKASIDE_LIST_EX, which to some extent replaces both the NPAGED_LOOKASIDE_LIST and PAGED_LOOKASIDE_LIST, is accessed only through functions that are new for Windows Vista and therefore can throw away what were anyway the redundant provisions for backwards compatibility with Windows 2000 and earlier.

Offset (x86) Offset (x64) Definition Versions
0x00 0x00
GENERAL_LOOKASIDE_POOL L;
6.0 and higher

Perhaps more importantly for programming practice is that the LOOKASIDE_LIST_EX also throws away the other structures’ cache alignment. Drivers that want a cache-aligned lookaside list for performance must arrange the alignment themselves.

The Original Pool Lookaside List

If only for completeness, it’s perhaps as well to pick up some history. The lookaside lists that the kernel uses internally to optimise small pool allocations were originally reductions of the public structure. Microsoft’s name for the reduced structure is not known (though POOL_LOOKASIDE, to contrast with GENERAL_LOOKASIDE, seems plausible as a guess). Names and types of members are inferred by assuming that the GENERAL_LOOKASIDE for version 5.0 incorporated them straightforwardly. The structure is 0x28 bytes.

Offset (x86) Definition
0x00
SLIST_HEADER ListHead;
0x08
USHORT Depth;
0x0A
USHORT MaximumDepth;
0x0C
ULONG TotalAllocates;
0x10
ULONG AllocateHits;
0x14
ULONG TotalFrees;
0x18
ULONG FreeHits;
0x1C
ULONG LastTotalAllocates;
0x20
ULONG LastAllocateHits;
0x24
KSPIN_LOCK Lock;

The version 4.0 kernel’s .data section has two arrays of these, for nonpaged and paged allocations respectively. The eight structures in each array cache successively larger pool blocks from 0x20 bytes to 0x0100. See that each structure has its own spin lock so that pool blocks can be allocated from and freed to the list only by one processor at a time. Version 5.0 has the arrays in each processor’s KPRCB. With no need for a spin lock, the pool lookaside list could as well be a general lookaside list.