CMPXCHG8B Support in the 32-Bit Windows Kernel

The 32-bit Windows kernel started using the 8-byte compare-exchange instruction (cmpxchg8b) in version 4.0. At first, the instruction had only a few uses, for more efficient coding of a few newly exported functions:

ExInterlockedCompareExchange64;
ExInterlockedPopEntrySList;
ExInterlockedPushEntrySList;

and of one internal routine that improves on the ancient export ExInterlockedAddLargeInteger. With successive versions, cmpxchg8b found ever more use, not just in more exported functions, such as ExInterlockedFlushSList (added in version 5.0), but especially internally, e.g., for working with 64-bit page table entries when using Physical Address Extension (PAE).

Curiously, the Driver Development Kit (DDK) for Windows NT 4.0 left ExInterlockedCompareExchange64 undocumented, which is conspicuous because that particular function is little but a wrapper to get C-language arguments into appropriate registers for executing the instruction:

Destination as the formal operand;
Exchange in ecx (high) and ebx (low);
Comperand in edx (high) and eax (low);
Lock ignored.

The Lock argument provides for the “interlocked” functionality to be implemented without the cmpxchg8b instruction. All use of cmpxchg8b can be coded without the instruction, but at the price (in a multi-processor coding, for which temporarily disabling interrupts does not suffice) of having the caller provide storage for a primitive synchronisation object known as a spin lock. For instance, the single instruction

        lock    cmpxchg8b qword ptr [esi]

is replaceable with the following sequence

        pushfd
try:
        cli
        lock    bts dword ptr [edi],0
        jnb     acquired
        popfd
        pushfd
wait:
        test    dword ptr [edi],1
        je      try
        pause                   ; if available
        jmp     wait

acquired:
        cmp     eax,[esi]
        jne     keep
        cmp     edx,[esi+4]
        je      exchange
keep:
        mov     eax,[esi]
        mov     edx,[esi+4]
        jmp     done

exchange:
        mov     [esi],ebx
        mov     [esi+4],ecx
done:
        mov     byte ptr [edi],0
        popfd

provided that the 8 bytes at esi are never modified without acquiring the spin lock at edi which is in turn never used for any other purpose than guarding those 8 bytes. Even putting aside the undesirability of depending on all users of those 8 bytes to cooperate regarding the spin lock, there is the problem that the replacement is a lot of code, not just in terms of space but of execution time. Just for its savings on this point, cmpxchg8b is clearly a nice feature to have to hand, and it was a natural addition when Intel’s processors started working with a 64-bit external bus.

In the early days of Windows NT, however, not all the extant processors implemented the cmpxchg8b instruction. In versions before 5.1, every function that uses the instruction has an alternate coding for processors that do not support the instruction. Very early during its initialisation, the kernel checks whether the boot processor supports the cmpxchg8b instruction. If the support is missing, the kernel patches jmp instructions at the start of each of those functions to redirect execution to their alternates. Conversely, if the boot processor does support the instruction, and the functions are left unpatched, then the kernel requires that all processors support the instruction, under pain of the bug check MULTIPROCESSOR_CONFIGURATION_NOT_SUPPORTED (0x3E). Version 5.1 dropped the alternate codings and made it mandatory that the boot processor supports the cmpxchg8b instruction. Without this support, these versions of the kernel raise the bug check UNSUPPORTED_PROCESSOR (0x5D).

Non-Intel Processors

If reading only Intel’s literature, one might think that testing for the cmpxchg8b instruction is a simple matter of executing the cpuid instruction with 1 in eax and testing for the CX8 bit (masked by 0x0100) in the feature flags that are returned in edx. However, there have always been quirks.

Early Restriction

Versions 4.0 and 5.0 test for the cmpxchg8b instruction twice. The first test applies only to the boot processor. Its purpose is to determine whether functions that use the instructions must be patched, as described above. If the CPUID bit (masked by 0x00200000) in the eflags register cannot be changed, then there is no cpuid instruction let alone cmpxchg8b. The kernel then sets the CPUID bit and executes the cpuid instruction with 1 in eax to produce the feature flags in edx. If these feature flags have the CX8 bit set, then cmpxchg8b is supported and no patches are needed. The second test is done for all processors, including the boot processor, as part of a wider examination of processor features. In builds of version 4.0 from before Windows NT 4.0 SP4, however, this second test recognises the CX8 bit only if the processor’s vendor string is GenuineIntel, AuthenticAMD or CyrixInstead. (The vendor string is the sequence of characters obtained by executing cpuid with 0 in eax and then storing the values of ebx, edx and ecx at successive memory locations.)

For processors that set the CX8 bit but are not made by Intel, AMD or Cyrix, the two tests do not agree and the processor falls foul of the requirement that if the boot processor supports cmpxchg8b then all processors must. The result is the bug check MULTIPROCESSOR_CONFIGURATION_NOT_SUPPORTED—yes, even if there is only one processor. In the Knowledge Base article CMPXCHG8B CPUs in Non-Intel/AMD x86 Compatibles Not Supported, Microsoft is at best disingenuous in suggesting that the first test is only a rough guess from the processor’s “type”, which a second test must “verify” by querying for “specific features”: both tests are specifically for the CX8 bit; the “specific features” that are sought for the second test are not for cmpxchg8b but for particular manufacturers, and Microsoft must have understood this when choosing its words for that article.

Despite its acknowledgement of trouble caused by restricting one CPU feature to known vendors, Microsoft certainly didn’t rush to relax similar restrictions for other features. Although the version 4.0 from Windows NT 4.0 SP4 removes the vendor restrictions from its test for the CX8 bit, other feature flags of interest to it continue to be recognised only for particular vendors:

TSC, VME and MMX require GenuineIntel, AuthenticAMD or CyrixInstead;
PSE, PGE and CMOV require GenuineIntel or CyrixInstead before SP4, or also AuthenticAMD in SP4 and higher;
MTRR requires GenuineIntel.

Only with version 5.0 did Microsoft stop routinely excluding unknown (or unfavoured) CPU manufacturers from having Intel-compatible features be usable by Windows. (This is not to say that exclusions don’t exist in Windows 2000 and higher, just that it credibly isn’t done as routine practice.)

Explicit Recognition

Meanwhile, of course, the CPU manufacturers who were initially excluded from having Windows use their cmpxchg8b instruction will naturally have wanted to sell their processors. Even once they obtained recognition in new Windows versions, they will just as naturally have wanted to sell their processors even to people who might want to run an early build of Windows NT 4.0. To avoid the bug check on these versions, the processor must start with CX8 clear. Inevitably, these vendors developed ways to turn the CX8 bit off, and even to have it turned off by default. After all, if they’re going to sell a CPU for use in computers that might be sold to just about anyone, then to compete at all they need that all Windows versions do at least start on their processors, even if the kernel’s execution without cmpxchg8b is less than optimal. Competing equally might then be possible if Microsoft would make up for the earlier omission and build into new versions of its kernel some recognition of these processors’ support for cmpxchg8b and perhaps even to turn the CX8 bit back on. Microsoft did eventually do this, but again it didn’t rush.

Starting with version 5.1, which is also the first version that won’t start without support for cmpxchg8b, the Windows kernel makes special cases for processors that may implement the cmpxchg8b instruction but do not show the CX8 bit in the feature flags. These provisions remain in the 32-bit kernel until at least the 1803 release of Windows 10. The processors that are catered for are identified by the vendor strings GenuineTMx86, CentaurHauls and (a little later) RiseRiseRise. The following notes describe the provisions as actually made by the Windows kernel, and are in no way concerned with how well the implementation corresponds with documentation by the vendors.

TransMeta

For GenuineTMx86 processors starting with family 5 model 4 stepping 2, if the cmpxchg8b instruction is not indicated in the CPU feature flags, it is enabled by setting the 0x0100 bit in the model-specific register 0x80860004.

The previous paragraph describes the presumed intention. As actually coded, the model and stepping, taken together, must be at least 4 and 2 even if the family is greater than 5. A hypothetical family 6 model 1 stepping 1 would need to have the CX8 bit set or cleared in advance of booting Windows, depending on which version is to be run.

Centaur

If the CPU feature flags for a CentaurHauls processor do not show support for cmpxchg8b, then the support is enabled by slightly different methods depending on the family:

if the family is 5, set the 0x02 bit in the model-specific register 0x0107;
if the family is 6 or higher, set the 0x02 bit and clear the 0x01 bit in the model-specific register 0x1107.

Clearing the 0x01 bit is omitted in early builds. It begins with the version 5.1 from Windows XP SP2 and the version 5.2 from Windows Server 2003 SP1.

Rise

Starting with the version 5.1 from Windows XP SP2 and the version 5.2 from Windows Server 2003 SP1, all RiseRiseRise processors are treated as supporting cmpxchg8b, even without the CX8 bit in the feature flags from cpuid.