Profiling

Improved performance can be gained simply by thinking about a program’s design and implementation, hoping to realise where it might have been planned for different trade-offs, or written with more care, or even just compiled and linked with different options for optimisation. That is all guesswork, however, and even the most educated guesswork can take a programmer only so far. Especially for a complex program, significant reworking just to improve performance—even if better performance is needed urgently—is pretty much unthinkable without evidence that points convincingly to what change would be most productive.

The particular type of evidence considered here comes from sampling the computer’s execution in some orderly way that builds a statistical profile. It’s especially useful if what’s to be profiled executes the same way over and over or is left to its typical usage for a long time. Its great merit is that, unlike instrumentation, it requires no change to the code. Instead, the code’s execution is sampled from outside, by enlisting the operating system to arrange for hardware interrupts to recur frequently enough to build a good sample over time but not so frequently as to disturb the profiled code’s normal execution (let alone any other code’s). The profile is then a frequency distribution: how often did the profiled code get interrupted where?
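To make the idea of a frequency distribution concrete, here is a minimal sketch in C of the bookkeeping that a sampling profiler might do at each interrupt. It is an illustration only, not a description of how Windows does it: the profile_t structure, the function names and the choice of buckets whose size is a power of two are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory profile: one counter for each fixed-size bucket
   of the profiled address range. All names are illustrative. */
typedef struct {
    uintptr_t base;          /* first profiled address */
    size_t    size;          /* size of the profiled range, in bytes */
    unsigned  bucket_shift;  /* log2 of the bucket size, in bytes */
    uint32_t *counters;      /* one execution count per bucket */
    size_t    counter_count; /* number of elements in counters */
} profile_t;

/* Number of buckets needed to cover size bytes, rounding up. */
size_t profile_bucket_count(size_t size, unsigned bucket_shift)
{
    assert(bucket_shift < sizeof(size_t) * 8);
    size_t bucket_size = (size_t)1 << bucket_shift;
    return size / bucket_size + (size % bucket_size != 0);
}

/* What a profile interrupt would do with the interrupted address:
   if it lies in the profiled range, increment the matching bucket.
   Returns 1 if the sample was counted, else 0. */
int profile_record(profile_t *p, uintptr_t ip)
{
    if (ip < p->base || ip - p->base >= p->size) {
        return 0; /* interrupted somewhere that is not being profiled */
    }
    size_t index = (size_t)(ip - p->base) >> p->bucket_shift;
    if (index >= p->counter_count) {
        return 0; /* buffer too small for this bucket */
    }
    p->counters[index]++;
    return 1;
}
```

After many interrupts, the counters are the profile: a bucket with a large count covers code that was often executing when the interrupt arrived. Smaller buckets locate the hot code more precisely but need a larger buffer.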

Historically, and perhaps still most commonly, these hardware interrupts are generated by a high-performance timer of some sort and are reliably periodic. Given that the interrupts occur in the profiled code often enough to make a good sample, the profile thus answers the question “where did the profiled code spend its time?”

This profiling by time has had operating-system support in Windows right from the start, i.e., Windows NT 3.10. As early as version 3.51, however, it was at least imagined that the hardware interrupts may be generated from other profile sources. After all, since a periodic interrupt might be arranged as the response to a timer’s counter of clock cycles reaching some limit (or, perhaps more typically, counting down to zero), it’s hardly different to arrange for a profile interrupt whenever a processor’s count of branch mispredictions or cache misses reaches some limit. Then profiling would answer questions such as “where does the profiled code suffer from branch mispredictions?” and might prompt such questions as “did I expect that my code would be more delayed by cache misses here than there?”

Note that the preceding description talks not of the program’s execution but of the profiled code’s. Typical practice, if only as a first step, is indeed to have a separate profiling tool sample the whole of a program’s execution. But there is no rule about this. The profiled code can be any range of address space for any process that the profiler has sufficient access to. At one extreme, a profiler might sample all execution anywhere in a process’s address space, even its kernel-mode execution. More often in practice, profiling is directed to addresses where particular modules are known to be loaded. At the other extreme, a profiler that knows enough of the target program, e.g., by having symbol files, might study very closely the execution of particular routines.

That the profiler is most often its own program, developed explicitly as a tool for performance analysis of other programs, is inevitable given profiling’s advantage of measuring the execution of code exactly as it will run in real-world use after sale to customers. But there is no rule about this, either. For a high-performance encoding of an algorithm, both the development and then the continued evaluation after real-world deployment might easily be helped if, as some sort of mixture of sampling and instrumentation, the program that uses the algorithm collects for itself a profile just of the program’s own runs of the algorithm. A Demonstration of Self-Profiling, with source code, is presented separately.

Documentation Status

Perhaps the main constraint on innovative use of profiling for the performance analysis of Windows applications is that the operating system’s support is mostly undocumented. That’s not to say it’s unknown outside Microsoft. Programmers evidently have watched Microsoft’s tools for profiling and inferred what magic incantations they can put to use for their own purposes in their own ways. But it is to say that for a technique that plainly does interest many programmers, the details are relatively obscure.

API Summary

Broadly speaking, the incantations start by describing the desired profiling to the NtCreateProfileEx function or to its older form NtCreateProfile. This produces a handle to a newly created executive profile object that remembers the parameters, i.e., what execution to sample, subject to what conditions, with what granularity, and where to store the results. This handle can then be given to NtStartProfile and NtStopProfile, in turn, any number of times, to start and stop the profiling of whatever execution was described, until the handle is eventually closed, e.g., by an explicit call to CloseHandle.

Though the profiling functions are not documented, the abstracted profile sources have been semi-documented from the very beginning. They are modelled programmatically as the KPROFILE_SOURCE enumeration. Its C-language definition—unchanged in two decades—is in one or another header from every known Device Driver Kit (DDK) or Windows Driver Kit (WDK) except for the very earliest. Despite this, profile sources have been somewhat mysterious since the definition is never referenced from elsewhere in the same header or from other headers or from any sample code. Adding to the mystery is that for much of the history of Windows no profile source actually was implemented other than a periodic timer. It is not hard to find on the Internet people who write that profiling from a processor’s Performance Monitoring Counters (PMC), e.g., for branch misprediction, is something that Windows has only recently learnt. To some extent, they are correct: 32-bit Windows acquired PMC support only with Windows 8. But 64-bit Windows had this capability from the start, i.e., Windows Server 2003 SP1.
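For reference, the enumeration can be sketched in C as follows. This is reproduced from memory of the definition in WDK headers such as wdm.h, not copied from any one of them, so the order and therefore the numerical values should be treated as an assumption and checked against a real header before being relied on.

```c
/* Sketch of KPROFILE_SOURCE as it is understood to appear in the WDK.
   Members are numbered implicitly from 0; verify against the header. */
typedef enum _KPROFILE_SOURCE {
    ProfileTime,                 /* 0: the periodic timer */
    ProfileAlignmentFixup,
    ProfileTotalIssues,
    ProfilePipelineDry,
    ProfileLoadInstructions,
    ProfilePipelineFrozen,
    ProfileBranchInstructions,
    ProfileTotalNonissues,
    ProfileDcacheMisses,
    ProfileIcacheMisses,
    ProfileCacheMisses,          /* 10 */
    ProfileBranchMispredictions, /* 11 */
    ProfileStoreInstructions,
    ProfileFpInstructions,
    ProfileIntegerInstructions,
    Profile2Issue,
    Profile3Issue,
    Profile4Issue,
    ProfileSpecialInstructions,
    ProfileTotalCycles,
    ProfileIcacheIssues,
    ProfileDcacheAccesses,
    ProfileMemoryBarrierCycles,
    ProfileLoadLinkedIssues,
    ProfileMaximum               /* 24: count of sources, not itself a source */
} KPROFILE_SOURCE;
```

That a profile source has a name in the enumeration says nothing about whether a given Windows version or processor implements it.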

Because profiling is a statistical sampling, the quality of the data depends on the circumstances in which the sample is collected. One factor that’s in the control of the API is the frequency of sampling. The interval between interrupts can be learnt through NtQueryIntervalProfile and changed through NtSetIntervalProfile. For the basic profile source ProfileTime (0), the interval is measured in the standard unit of timekeeping in Windows: 100 nanoseconds. For the other profile sources, the interval is a count of events after which the processor is to raise an interrupt. The shorter the interval, the more overhead. Even though incrementing an execution count adds very little to each interrupt, certainly relative to tracing an event to an NT Kernel Logger session, the accumulated cost can be noticeable not just in the profiled execution but in all execution, i.e., of other people’s code. Profile sources are a communal resource. Because setting a shorter interval for a profile source affects other people’s code, many programs had better not be permitted to do it: since Windows 8 it requires SeSystemProfilePrivilege.
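The arithmetic of the timer interval is easily sketched. The following C helpers, whose names are invented for this illustration, convert an interval in 100-nanosecond units into a sampling rate and into the number of interrupts to expect over a run. The figures used with them below are arbitrary examples, not defaults or limits of any Windows version.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 100-nanosecond units in one second. */
#define UNITS_PER_SECOND 10000000u

/* Interrupts per second for a ProfileTime interval given in
   100-nanosecond units, rounded down. */
uint32_t samples_per_second(uint32_t interval_100ns)
{
    assert(interval_100ns != 0);
    return UNITS_PER_SECOND / interval_100ns;
}

/* Interrupts to expect over a run of duration_ms milliseconds,
   rounded down. One millisecond is 10,000 units of 100 nanoseconds. */
uint64_t expected_samples(uint32_t interval_100ns, uint32_t duration_ms)
{
    assert(interval_100ns != 0);
    return (uint64_t)duration_ms * 10000u / interval_100ns;
}
```

An interval of 10,000 units is one millisecond, so samples_per_second(10000) is 1,000 and a five-second run gives expected_samples(10000, 5000) as 5,000 interrupts in total. Only some of those will land in the profiled code, which is why a short run at a long interval can make for a poor sample.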

Mostly, though, profiling requires no privilege or unusual permission. Privilege is required to profile globally, but any program that can open a handle to a process and get PROCESS_QUERY_INFORMATION access can profile that process’s execution even if it can’t read what it’s profiling. That’s enough to stop a low-integrity process from profiling a higher-integrity process, e.g., to get data from which to infer addresses that are subject to Address Space Layout Randomisation (ASLR). But even a low-integrity program can profile its own user-mode execution—and before Windows 8.1 its kernel-mode execution too.