RtlUnicodeStringToInteger

The RtlUnicodeStringToInteger function parses a 32-bit integer from a string.

Declaration

NTSTATUS 
RtlUnicodeStringToInteger (
    UNICODE_STRING const *String, 
    ULONG Base, 
    ULONG *Value);

Parameters

The required String argument indirectly provides the size and address of an array of Unicode characters. The input characters are as many as fit within the first Length bytes at Buffer, up to but not including the first null. Only the Length and Buffer in the input structure matter: the MaximumLength is ignored. The characters are treated as read-only.
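As a rough illustration of which characters count as input, the following sketch counts them for a portable stand-in structure. The names MODEL_UNICODE_STRING and model_input_chars are hypothetical, invented for this example. The real UNICODE_STRING comes from the Windows headers, and nothing here is taken from Microsoft's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for UNICODE_STRING (hypothetical, for illustration only). */
typedef struct {
    uint16_t Length;            /* bytes in use at Buffer */
    uint16_t MaximumLength;     /* ignored by the function */
    const uint16_t *Buffer;     /* 16-bit characters, not necessarily null-terminated */
} MODEL_UNICODE_STRING;

/* Count the characters that count as input: as many as fit within Length
   bytes, stopping before the first null. An odd trailing byte holds no
   whole character and so is not counted. */
size_t model_input_chars(const MODEL_UNICODE_STRING *s)
{
    assert(s != NULL);
    size_t limit = s->Length / sizeof(uint16_t);
    size_t n = 0;
    while (n < limit && s->Buffer[n] != 0) n++;
    return n;
}
```

MaximumLength plays no part in the count, which matches its being ignored by the function.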

The optional Base argument is the numerical base to use for parsing characters as digits. The supported bases are 2, 8, 10 and 16. This argument can be zero to direct that the base be inferred from a prefix in the string else be defaulted to 10.

The required Value argument is the address of a variable that is to receive the integer that the characters evaluate to.

Return Value

The function returns STATUS_SUCCESS if successful, else a negative error code.

Availability

The RtlUnicodeStringToInteger function is exported by name from the kernel in version 3.51 and higher. It is present in version 3.10 but only as an internal routine.

In user mode, the RtlUnicodeStringToInteger function is exported by name from NTDLL.DLL in all known versions, i.e., 3.10 and higher.

Documentation Status

The RtlUnicodeStringToInteger function is documented in all known editions of the Device Driver Kit (DDK) or Windows Driver Kit (WDK) since at least the DDK for Windows NT 3.51. Though this documentation is of the kernel-mode function as an export from the kernel, it is mostly applicable to the user-mode implementation too, both being plausibly compiled from the same source file.

Only relatively recently can documentation of RtlUnicodeStringToInteger be fairly described as accurate. This seems a little strange for what might otherwise be thought a simple function. It might understandably have been thought to be worth only a little trouble. Against this is the notion that utility functions such as this are supposed to be favoured by programmers, over writing their own routines, one argument being that the operating system’s manufacturer presumably does them better. Microsoft sometimes laments what trouble it is put to because programmers don’t paint between the lines, but if Microsoft accepted a matching responsibility to draw the lines accurately, then this article on an apparently straightforward utility function could not possibly have so long a section on Documentation Errors (see below).

Behaviour

The function examines the Unicode characters described by String, interprets them mostly as digits relative to some Base, and writes its evaluation to the address given by Value. The parsing allows for the following elements in sequence, each being optional:

white space;
a plus sign or minus sign;
a base specifier, but only if Base is zero;
digits.

Failure to parse into these elements is not failure for the function, but just means the string evaluates as zero.

Implementation Details

In version 6.0 and higher, if the Length member of the given String is zero or is odd, the function returns STATUS_INVALID_PARAMETER (but see below about Exception Handling). Earlier versions accept an odd Length, simply ignoring the excess byte. In these versions, the function ordinarily succeeds trivially if Length is zero or one but is liable to faulty behaviour (see below among the Coding Errors).

The function skips leading white space, meaning specifically characters that are numerically less than or equal to 0x0020, so that the space (0x0020) itself counts as white space.

There may then be a plus sign (0x002B) or minus sign (0x002D). If it is a plus sign, it is ignored, but a minus sign here has the effect of negating whatever evaluation results from subsequent characters.

If Base is anything other than 0, 2, 8, 10 or 16, the function returns STATUS_INVALID_PARAMETER (but, again, see below). If Base is 0, the next two characters can be a valid case-sensitive base specifier: '0' (0x0030) and then one of 'b' (0x0062) for 2, 'o' (0x006F) for 8, or 'x' (0x0078) for 16. Without a base specifier, the base defaults to 10.
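The rules for Base can be sketched as a small helper. This is a model under stated assumptions, not Microsoft's code: the name model_resolve_base and its parameters are hypothetical, and the bounds check on the remaining characters is my own addition for safety.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the base to parse with, or 0 if the given base is unsupported
   (the case for which the function returns STATUS_INVALID_PARAMETER).
   On return, *prefix_chars is 2 if a base specifier was recognised, else 0. */
unsigned model_resolve_base(unsigned base, const uint16_t *p,
                            size_t remaining, size_t *prefix_chars)
{
    assert(p != NULL && prefix_chars != NULL);
    *prefix_chars = 0;
    switch (base) {
    case 2: case 8: case 10: case 16:
        return base;                    /* explicit base: no prefix is looked for */
    case 0:
        break;                          /* infer from a prefix, else default to 10 */
    default:
        return 0;
    }
    if (remaining >= 2 && p[0] == 0x0030) {             /* '0' */
        switch (p[1]) {                                 /* case-sensitive */
        case 0x0062: *prefix_chars = 2; return 2;       /* 'b' */
        case 0x006F: *prefix_chars = 2; return 8;       /* 'o' */
        case 0x0078: *prefix_chars = 2; return 16;      /* 'x' */
        }
    }
    return 10;
}
```

Because the specifier is case-sensitive, this model treats "0X1F" with Base as zero as having no prefix, leaving the base as 10.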

Subsequent characters, if any, are parsed as case-insensitive digits relative to the given or inferred base. Characters '0' (0x0030) to '9' (0x0039) count as 0 to 9. Characters 'A' (0x0041) to 'F' (0x0046) and 'a' (0x0061) to 'f' (0x0066) count as 10 to 15. Evaluation starts as zero and accumulates as an unsigned 32-bit integer for as many characters as are valid digits for the base, including none. There is no defence against overflow: the evaluation is modulo 4G, i.e., modulo 2^32.
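The whole parse, as described above for version 6.0 and higher, can be modelled in portable C. This is a sketch for reasoning about expected results, not a reconstruction of Microsoft's source. The names model_string_to_integer, model_digit and the MODEL_ constants are hypothetical. The status values are the well-known numerical values of STATUS_SUCCESS and STATUS_INVALID_PARAMETER. The model stays within the given length, which the real function before version 6.0 does not always do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_STATUS_SUCCESS           0x00000000u
#define MODEL_STATUS_INVALID_PARAMETER 0xC000000Du

/* Value of ch as a digit, or 99 if it is no digit in any supported base. */
static unsigned model_digit(uint16_t ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    return 99;
}

/* buf and length_in_bytes stand for String->Buffer and String->Length. */
uint32_t model_string_to_integer(const uint16_t *buf, uint16_t length_in_bytes,
                                 unsigned base, uint32_t *value)
{
    assert(buf != NULL && value != NULL);
    *value = 0;                         /* 6.0 and higher write zero even when failing */
    if (length_in_bytes == 0 || (length_in_bytes & 1) != 0)
        return MODEL_STATUS_INVALID_PARAMETER;
    if (base != 0 && base != 2 && base != 8 && base != 10 && base != 16)
        return MODEL_STATUS_INVALID_PARAMETER;

    size_t n = length_in_bytes / sizeof(uint16_t);
    size_t i = 0;
    while (i < n && buf[i] <= 0x0020) i++;              /* leading white space */

    int negative = 0;
    if (i < n && (buf[i] == 0x002B || buf[i] == 0x002D)) {
        negative = (buf[i] == 0x002D);                  /* minus sign negates */
        i++;
    }

    if (base == 0) {                                    /* infer, else default to 10 */
        base = 10;
        if (i + 1 < n && buf[i] == 0x0030) {
            if (buf[i + 1] == 0x0062) { base = 2; i += 2; }
            else if (buf[i + 1] == 0x006F) { base = 8; i += 2; }
            else if (buf[i + 1] == 0x0078) { base = 16; i += 2; }
        }
    }

    uint32_t result = 0;
    for (; i < n; i++) {
        unsigned d = model_digit(buf[i]);
        if (d >= base) break;                           /* first non-digit ends the parse */
        result = result * base + d;                     /* wraps modulo 2^32 */
    }
    *value = negative ? (uint32_t) (0u - result) : result;
    return MODEL_STATUS_SUCCESS;
}
```

In this model, " \t-0xFf" with Base as zero evaluates to 0xFFFFFF01, and the decimal digits "4294967297" evaluate to 1 because the accumulation wraps.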

Exception Handling

While writing the evaluation to the variable whose address is given as Value, occurrence of an exception is failure for the function, which returns the exception code, e.g., STATUS_ACCESS_VIOLATION.

In version 6.0 and higher, even the failing function writes its evaluation (which will be zero). An exception then returns the exception code, not STATUS_INVALID_PARAMETER.

Documentation Errors

Though the function’s only significant change of code was for Windows Vista, it has been through two significant changes of documentation: first, for Windows Vista, but evidently not to describe the changed code; and then for Windows 8. There is also for Windows 7 the usual revision in which ancient functions, such as this, are said to be “Available in Windows 2000 and later versions of Windows.”

That the function fails if Length is zero dates from Windows Vista, and is not just a bug fix but is arguably the most noticeable change of behaviour in the function’s history. Yet it did not make it to the documentation until Windows 8. The documentation then changes to read as if “the string is empty” is the only cause of failure. A closely related failure which is also new for Windows Vista is that an odd Length is similarly rejected as an invalid parameter: this is not noted in any known edition of the documentation.

Though skipping white space at the beginning is ancient behaviour, Microsoft somehow managed not to document it for nearly two decades (again for Windows 8).

No known documentation from Microsoft says explicitly that Base is restricted. It lists 2, 8, 10 and 16 as the only possibilities for the inferred base when Base is given as zero, but it leaves open what Base is allowed when given as non-zero. This is specially remarkable because an invalid Base is originally the only failure case that the function tests for itself (as opposed to failing from a handled exception).

The prefix that the function looks for when Base is zero also was problematic for Microsoft to convey. Documentation before Windows Vista has it that the function “checks for a leading character” rather than for two, omitting the need for introduction by a '0'. Correcting this looks to be the main reason the documentation was revised for Windows Vista.

On the plus side, documentation for Windows 8 not only discovered the white space, so that it finally covers all elements of the expected syntax, but presents a helpful table of examples. Curiously, no example shows a sequence of digits whose evaluation overflows 32 bits. To this day, 12th March 2019, Microsoft’s online documentation still does not tell programmers what evaluation to expect from overflow.

From the DDK for Windows NT 3.51 through to the WDK for Windows 7, the documentation says of failure that “the Value is set to 0,” and the function “returns STATUS_INVALID_PARAMETER.” Presumably, this means that Microsoft intended from the beginning that callers can rely on the function to produce zero as its evaluation even when failing. The fact, however, is that the function does not actually do this until Windows Vista. Documentation for Windows 8 removes all talk of setting the Value on failure, even though the contemporaneous implementation does always try.

Coding Errors

Although version 3.10 understands well enough that it is reading Unicode characters from a UNICODE_STRING, it interprets only the low byte of each as if it had read only a single-byte character. Who’s ever to know how that happened, but it is represented below as if Microsoft used its TCHAR type (which helps the same source code work for either Unicode or ANSI characters, depending on conditional compilation) without having defined the UNICODE macro.
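To see what interpreting only the low byte would mean in practice, consider this small sketch. It models nothing more than the narrowing described above. The name model_narrow_like_v310 is hypothetical, and the example characters are my own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Keep only the low byte of each 16-bit character, as if each had been
   read as a single-byte character. */
void model_narrow_like_v310(const uint16_t *src, size_t count, unsigned char *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] = (unsigned char) (src[i] & 0xFF);
}
```

Under this narrowing, the characters 0x0131 and 0x0232 become 0x31 and 0x32, which are the digits '1' and '2'. If version 3.10 behaves as described, such characters would presumably be taken as digits.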

Empty Input

That Microsoft waited until Windows 8 to document that the function ignores leading white space is specially remarkable because Microsoft will by then have known for a few years that the loop for this undocumented skipping of white space had coding errors. In all versions, the loop for skipping white space at the start of the string is something like

PCWSTR p = String -> Buffer;
ULONG count = String -> Length / sizeof (WCHAR);
TCHAR first;
while (count -- != 0) {
    first = (TCHAR) *p ++;
    if (first > _T (' ')) break;
    if (count == 0) {
        first = _T ('\0');
        break;
    }
}

Here, count is my name for the function’s count of characters that remain for it to examine. The first character that is not white space is not only the first to examine on exit from the loop but is remembered to the function’s end in case it is a minus sign. There are two problems if the Length is 0 or 1 when entering the loop: count underflows; and (before version 10.0) first is uninitialised. Both these problems are immaterial in version 6.0 and higher because 0 and 1 are rejected before the loop.
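The underflow can be seen without reading any memory by applying just the loop's first test to a given Length. The sketch below is an illustration only. The names MODEL_LOOP_TEST and model_first_loop_test are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int entered;        /* non-zero if the loop body would execute at least once */
    uint32_t count;     /* the counter as left by the loop's first test */
} MODEL_LOOP_TEST;

/* Apply only the loop's first test, count -- != 0, to a given Length.
   Nothing is read from any buffer. */
MODEL_LOOP_TEST model_first_loop_test(uint16_t length_in_bytes)
{
    MODEL_LOOP_TEST t;
    t.count = length_in_bytes / (uint32_t) sizeof(uint16_t);
    t.entered = (t.count-- != 0);
    /* If the body is never entered, the post-decrement has wrapped zero round. */
    assert(t.entered || t.count == UINT32_MAX);
    return t;
}
```

For a Length of 0 or 1, the body is never entered, so nothing is assigned to first, and count is left as 0xFFFFFFFF. For a Length of 2, the body is entered with count as zero.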

Before Windows Vista, a Length of 0 or 1 can crash the function. The undefended underflow of count means the function will proceed in the mistaken belief that the Buffer continues for billions more characters. Whether the function tries to read any depends on what it believes is the first. If this variable is on the stack, then because it is uninitialised it can retain an essentially arbitrary value from prior execution. If this value happens (or is contrived) to be a plus sign, minus sign, a valid digit for the given Base, or '0' if Base is zero, then the function will attempt to read at least one character from Buffer even though the Length of 0 or 1 gives it no entitlement.

When Length is 0, Buffer can legitimately be NULL and the function’s unentitled attempt to read from Buffer will fault.

It must be stressed, though, that aside from this case, there ordinarily will be no harm. When Length is 0 or 1, Buffer can also legitimately be not NULL, the general idea being that it addresses MaximumLength bytes of which Length bytes are currently a Unicode string, not counting any null terminator, which the UNICODE_STRING documentation explicitly allows need not be present. But in ordinary practice, and certainly in what kernel-mode programmers have learnt is the safest practice for working with UNICODE_STRING structures, the Buffer will have been prepared from a null-terminated string, e.g., by feeding it to RtlInitUnicodeString, and a null terminator will be present (unless, before version 5.2, the string is very long). Though the function’s reading of this null is technically out-of-bounds, it causes no harm and the function succeeds with zero as its evaluation.

But ordinary practice is not all practice. Whatever the Length in a UNICODE_STRING, what the Buffer contains beyond its first Length bytes is essentially arbitrary, e.g., for being retained from previous use of the Buffer for some other string. When Length is 0 or 1 and the Buffer happens to be filled with valid digits, the function will continue reading them beyond the buffer’s end, where there is not certainly any more memory to read. This too will fault.

Both ways to crash RtlUnicodeStringToInteger before Windows Vista by giving zero for Length and contriving what’s on the stack and in the Buffer are reproduced easily enough within a test program using the NTDLL implementation. Whether these input cases can be arranged for a call to the function as made by a separate process or in kernel mode is not known.

IRQL

Only one reason is known why the kernel’s implementation of RtlUnicodeStringToInteger cannot safely execute at high IRQL even when the String, its Buffer and the Value are all in non-paged memory: all versions that export the function implement it in paged memory. The RtlUnicodeStringToInteger function must therefore be called only at PASSIVE_LEVEL, which Microsoft has always documented.