Geoff Chappell, Software Analyst
Software doesn’t have to be very complex before its performance-aware programmer wonders whether some of its dynamic memory allocations come and go with such frequency that they might better be freed not by returning them to the operating system’s memory manager but instead by caching them for quick reuse. This is especially attractive when the allocations are for a particular purpose and are all the same size, so that the cache for that purpose need be nothing more sluggish than a single-linked list. In a simple implementation, allocations are sought from the cache first and only then from the memory manager, and freed memory goes only to the cache. This works well enough when the demand is persistent, but more sophistication is needed when demand has significant peaks. Between such peaks, the cached allocations become a large amount of memory that is unavailable for other use and which might better be returned to the operating system after all. The aim, then, is for the operating system to provide for quickly reusing fixed-size blocks of memory with light-weight management of demand and waste.
The Windows kernel certainly counts as complex software, especially in combination with its ecology of drivers that execute in kernel mode, and so even version 3.10 provides some sort of caching of fixed-size memory allocations. That first draft, based around exported functions such as ExInitializeZone and macros such as ExAllocateFromZone, is perhaps better described as batching its memory allocations. It also is perhaps better passed over. Though the code that supports this early feature is retained to this day, the relevant functions and macros were documented as obsolete as early as the Device Driver Kit (DDK) for Windows 2000 in favour of the lookaside lists that had been introduced for Windows NT 4.0.
Each lookaside list has a control structure that provides access to a single-linked list of cached memory allocations. When they were introduced, lookaside lists came in two types, distinguished by whether the allocations are intended to be of non-paged or paged memory. Separate sets of exported functions and inline routines provided for operating on these separate types of lookaside list. Windows Vista unified these two types of lookaside list at the price of introducing a third set of functions and routines.
Function or Routine | Versions | Remarks |
---|---|---|
ExAllocateFromLookasideListEx | inline routine defined in WDM.H, starting with Windows Vista | |
ExAllocateFromNPagedLookasideList | inline routine defined in NTDDK.H or WDM.H, starting with Windows NT 4.0 | |
ExAllocateFromPagedLookasideList | 4.0 and higher (x86); 5.2 only (x64) |
also as inline routine defined in WDM.H, starting with Windows XP, except for x86 if _WIN2K_COMPAT_SLIST_USAGE defined |
ExDeleteLookasideListEx | 6.0 and higher | |
ExDeleteNPagedLookasideList | 4.0 and higher | |
ExDeletePagedLookasideList | 4.0 and higher | |
ExFlushLookasideListEx | 6.0 and higher | |
ExFreeToLookasideListEx | inline routine defined in WDM.H, starting with Windows Vista | |
ExFreeToNPagedLookasideList | inline routine defined in NTDDK.H or WDM.H, starting with Windows NT 4.0 | |
ExFreeToPagedLookasideList | 4.0 and higher (x86); 5.2 only (x64) |
also as inline routine defined in WDM.H, starting with Windows XP, except for x86 if _WIN2K_COMPAT_SLIST_USAGE defined |
ExInitializeLookasideListEx | 6.0 and higher | |
ExInitializeNPagedLookasideList | 4.0 and higher | |
ExInitializePagedLookasideList | 4.0 and higher |
Each type of lookaside list has exported functions to initialise the list and eventually to delete it. Initialisation prepares the control structure, but not only to enter caller-supplied parameters for the list’s later explicit operation. It also links the control structure into a double-linked list of all control structures of the same type. This provides for the kernel’s implicit operation on the list, as when trimming the cache between peaks in demand. The corresponding deletion (before the control structure is itself removed from memory) is therefore vital. As Microsoft puts it, though not until the DDK for Windows XP: “It is a serious programming error to do otherwise.”
Between initialisation and deletion, inline routines allocate from or free to the list. Allocation from the list means first to pop one entry from the list’s single-linked cache, else to obtain a fresh allocation according to the caller-supplied parameters. Among these can be a caller-supplied allocation routine, but the default is to allocate from the non-paged or paged pool. For all practical effect, the ExAllocateFrom... routines are the closely corresponding ExAllocatePoolWithTag but with the frequent gain of retrieving from the specialised list instead of contending with the many other callers of the general function. To free to the list is to push the allocation onto the list’s single-linked cache in anticipation of quick reuse, but again with a fallback: if the list is deemed to be big enough, it is instead freed more permanently by passing it through a caller-supplied routine, the default being to return the allocation to whichever pool it came from.
Each set of functions that operate on lookaside lists works with a corresponding control structure:
All three control structures are documented as opaque, much of the point being that all manipulation of them (if not all useful inspection of them) can be done through the documented functions. If you looked only at the sizes of these structures as known to the kernel, e.g., from type information in public symbol files, you might think the structures have seen non-trivial change:
Version | NPAGED_LOOKASIDE_LIST | PAGED_LOOKASIDE_LIST | LOOKASIDE_LIST_EX | |||
---|---|---|---|---|---|---|
Size (x86) | Size (x64) | Size (x86) | Size (x64) | Size (x86) | Size (x64) | |
4.0 to 5.0 | 0x50 | 0x68 | ||||
5.1 to early 5.2 | 0x0100 | 0x0100 | ||||
late 5.2 | 0xC0 | 0x80 | 0xC0 | 0x80 | ||
6.0 and higher | 0xC0 | 0x80 | 0xC0 | 0x80 | 0x48 | 0x60 |
It happens, however, that the variation in the sizes of the NPAGED_LOOKASIDE_LIST and PAGED_LOOKASIDE_LIST is all explained by the introduction of cache alignment, a consequent shift of the last member, and then a change in the alignment requirement. When cache-alignment was first applied to lookaside lists for Windows XP, the cache line was taken as 0x80 bytes. Windows Server 2003 SP1 reduced it to 0x40. Either way, the cache-aligned size would be 0x80 bytes for both x86 and x64 builds were it not for the particular way that the cache alignment is done: it applies not only to the whole structure but also to the x86 build’s last member which thus shifts to offset 0x80. It is some sort of irony that this last member is ignored by the kernel’s x86 builds in all these versions that have cache alignment: this structure that is designed for efficient memory use is a cache block bigger for what looks to be no benefit to anyone.
Cache alignment is anyway not universal. The C-language definitions in WDM.H omit the specification of cache alignment when building for 32-bit Windows if WDM.H is included from any of the usual headers for kernel-mode programming, i.e., NTDDK.H, NTIFS.H or NDIS.H. For most kernel-mode programming, including Microsoft’s, the 32-bit NPAGED_LOOKASIDE_LIST and PAGED_LOOKASIDE_LIST are 0x50 and 0x68 bytes in all versions.
All three public structures for lookaside lists are based in some sense on what was originally the GENERAL_LOOKASIDE but has since Windows Vista been named the GENERAL_LOOKASIDE_POOL.
Version | GENERAL_LOOKASIDE | GENERAL_LOOKASIDE_POOL | ||
---|---|---|---|---|
Size (x86) | Size (x64) | Size (x86) | Size (x64) | |
4.0 to 5.0 | 0x48 | |||
5.1 to early 5.2 | 0x80 | 0x80 | ||
6.0 and higher | 0x80 | 0x80 | 0x48 | 0x60 |
The plain GENERAL_LOOKASIDE became larger in version 5.1 because of cache alignment. The GENERAL_LOOKASIDE_POOL was introduced, without cache alignment, for Windows Vista in a reworking of the slightly different lookaside lists that the kernel dedicates to per-processor management of small pool allocations. In Windows Vista and since, each KPRCB for a processor has 0x40 or 0x60 of these lookaside lists in two or three arrays named PPNPagedLookasideList, PPPagedLookasideList and (in version 6.2 and higher) PPNxPagedLookasideList. Cache alignment of each structure in these arrays would waste a lot of space. Thus did the unaligned layout from version 4.0 become restored but need a new name.
Offset (x86) | Offset (x64) | Definition | Versions |
---|---|---|---|
0x00 | 0x00 |
SLIST_HEADER ListHead; |
4.0 to 5.2 |
union { SLIST_HEADER ListHead; SINGLE_LIST_ENTRY SingleListHead; }; |
6.0 and higher | ||
0x08 | 0x10 |
USHORT Depth; |
4.0 and higher |
0x0A |
USHORT Pad; |
4.0 only | |
0x12 |
USHORT MaximumDepth; |
5.0 and higher | |
0x0C | 0x14 |
ULONG TotalAllocates; |
4.0 and higher |
0x10 |
ULONG AllocateMisses; |
4.0 only | |
0x18 |
union { ULONG AllocateMisses; ULONG AllocateHits; }; |
5.0 and higher | |
0x14 | 0x1C |
ULONG TotalFrees; |
4.0 and higher |
0x18 |
ULONG FreeMisses; |
4.0 only | |
0x20 |
union { ULONG FreeMisses; ULONG FreeHits; }; |
5.0 and higher | |
0x1C | 0x24 |
POOL_TYPE Type; |
4.0 and higher |
0x20 | 0x28 |
ULONG Tag; |
4.0 and higher |
0x24 | 0x2C |
ULONG Size; |
4.0 and higher |
0x28 | 0x30 |
PALLOCATE_FUNCTION Allocate; |
4.0 to 5.2 |
union { PALLOCATE_FUNCTION_EX AllocateEx; PALLOCATE_FUNCTION Allocate; }; |
6.0 and higher | ||
0x2C | 0x38 |
PFREE_FUNCTION Free; |
4.0 to 5.2 |
union { PFREE_FUNCTION_EX FreeEx; PFREE_FUNCTION Free; }; |
6.0 and higher | ||
0x30 | 0x40 |
LIST_ENTRY ListEntry; |
4.0 and higher |
0x38 | 0x50 |
ULONG LastTotalAllocates; |
4.0 and higher |
0x3C |
ULONG LastAllocateMisses; |
4.0 only | |
0x54 |
union { ULONG LastAllocateMisses; ULONG LastAllocateHits; }; |
5.0 and higher | |
0x40 | 0x58 |
ULONG Future [2]; |
4.0 and higher |
Though the word at offset 0x0A is defined as Pad for version 4.0, it is in fact used as the MaximumDepth all along.
The pool lookaside lists that are dedicated to per-processor management of small pool allocations date from version 5.0. Version 4.0 has pool lookaside lists too but not separately for each processor. The control structures are in the kernel’s data and have a slightly different implementation (see below) in which AllocateMisses, FreeMisses and LastAllocateMisses are instead AllocateHits, FreeHits and LastAllocateHits.
Initialisation of a lookaside list allows the specification of callback functions. In lookaside lists that are initialised by ExInitializeLookasideListEx, these callbacks are slightly different, such that Allocate and Free are instead AllocateEx and FreeEx.
The NPAGED_LOOKASIDE_LIST is for non-paged memory that can be allocated and freed even at DISPATCH_LEVEL. It elaborates on the GENERAL_LOOKASIDE by adding a spin lock that provided early versions with synchronisation for access to the ListHead.
Offset (x86) | Offset (x64) | Definition | Versions |
---|---|---|---|
0x00 | 0x00 |
GENERAL_LOOKASIDE L; |
4.0 and higher |
0x48 (4.0 to 5.0); 0x80 |
KSPIN_LOCK Lock; |
4.0 to 5.0 | |
KSPIN_LOCK Lock__ObsoleteButDoNotDelete; |
5.1 and higher |
The spin lock never was defined for 64-bit Windows, which never has needed external synchronisation for operations on an SLIST_HEADER. The spin lock became unnecessary for 32-bit Windows too, once Windows XP required the cmpxchg8b instruction. Though the spin lock could not simply be dropped from the definition (lest a new driver with a new NPAGED_LOOKASIDE_LIST find itself running on an older Windows version that would expect to initialise and use the spin lock), its explicit retention for the kernel’s definition of the structure has been nothing but waste since space for the spin lock is covered by the contemporaneous introduction of cache alignment.
Memory that is managed in a paged lookaside list is allocated and freed at no higher than APC_LEVEL. The PAGED_LOOKASIDE_LIST is a GENERAL_LOOKASIDE plus a mutex instead of a spin lock.
Offset (x86) | Offset (x64) | Definition | Versions |
---|---|---|---|
0x00 | 0x00 |
GENERAL_LOOKASIDE L; |
4.0 and higher |
0x48 (4.0 to 5.0); 0x80 |
FAST_MUTEX Lock; |
4.0 to 5.0 | |
FAST_MUTEX Lock__ObsoleteButDoNotDelete; |
5.1 and higher |
Again the mutex never was defined for 64-bit Windows and its explicit retention for 32-bit Windows XP and higher is nothing but waste because of the structure’s expansion for cache alignment.
The LOOKASIDE_LIST_EX, which to some extent replaces both the NPAGED_LOOKASIDE_LIST and PAGED_LOOKASIDE_LIST, is accessed only through functions that are new for Windows Vista and therefore can throw away what were anyway the redundant provisions for backwards compatibility with Windows 2000 and earlier.
Offset (x86) | Offset (x64) | Definition | Versions |
---|---|---|---|
0x00 | 0x00 |
GENERAL_LOOKASIDE_POOL L; |
6.0 and higher |
Perhaps more importantly for programming practice is that the LOOKASIDE_LIST_EX also throws away the other structures’ cache alignment. Drivers that want a cache-aligned lookaside list for performance must arrange the alignment themselves.
If only for completeness, it’s perhaps as well to pick up some history. The lookaside lists that the kernel uses internally to optimise small pool allocations were originally reductions of the public structure. Microsoft’s name for the reduced structure is not known (though POOL_LOOKASIDE, to contrast with GENERAL_LOOKASIDE, seems plausible as a guess). Names and types of members are inferred by assuming that the GENERAL_LOOKASIDE for version 5.0 incorporated them straightforwardly. The structure is 0x28 bytes.
Offset (x86) | Definition |
---|---|
0x00 |
SLIST_HEADER ListHead; |
0x08 |
USHORT Depth; |
0x0A |
USHORT MaximumDepth; |
0x0C |
ULONG TotalAllocates; |
0x10 |
ULONG AllocateHits; |
0x14 |
ULONG TotalFrees; |
0x18 |
ULONG FreeHits; |
0x1C |
ULONG LastTotalAllocates; |
0x20 |
ULONG LastAllocateHits; |
0x24 |
KSPIN_LOCK Lock; |
The version 4.0 kernel’s .data section has two arrays of these, for nonpaged and paged allocations respectively. The eight structures in each array cache successively larger pool blocks from 0x20 bytes to 0x0100. See that each structure has its own spin lock so that pool blocks can be allocated from and freed to the list only by one processor at a time. Version 5.0 has the arrays in each processor’s KPRCB. With no need for a spin lock, the pool lookaside list could as well be a general lookaside list.