Geoff Chappell - Software Analyst
SKETCH OF HOW RESEARCH MIGHT CONTINUE AND RESULTS BE PRESENTED
The large-scale structure of an INDEX.DAT file is a header followed by an array of fixed-sized blocks. The header is 0x4000 bytes. The blocks are 0x80 bytes. Blocks are the allocation units for file-map entries. The name file-map entry is here taken from the WININET symbol file, which shows FILEMAP_ENTRY as Microsoft’s name for a structure that every file-map entry begins with. A file-map entry can have any size (less than 64KB) but necessarily consumes as many consecutive whole blocks as needed to contain the entry. As more space is required for file-map entries, blocks are added to the file, always in multiples of 0x4000 bytes. There is an upper limit because the allocation state of the blocks is recorded only in the header. Indeed, the header is mostly a bitmap in which successive bits represent successive blocks. Ignoring other data in the header gives 16MB as an approximate maximum size for an INDEX.DAT file.
File-map entries fall into two broad categories. The sort of entry that the file exists for holds a URL and associates it with other information. Some of this cached information for the URL is stored in the entry itself, but a significant provision is that information to be saved about a URL may be stored as a separate file, called the local file, such that the entry in INDEX.DAT needs only to save some record of where to find the local file. Indeed, the collection of these local files is the cache. What needs to be saved in INDEX.DAT for any one URL is therefore rarely huge but is typically a few blocks big, the size being dominated by the length of strings such as the URL itself and the filename for the cached storage. Each URL, together with its associated information, is usefully saved as its own file-map entry.
The other broad category exists because the file has to support efficient searching for URL entries and also allows for grouping of URL entries. Both purposes need a multitude of small structures that would be wasteful to store in the file as entries in their own right and which may anyway need to be kept together for ready access. Since the file is memory-mapped, a good scale for the notion of ready access is the CPU page size (0x1000 bytes). The INDEX.DAT file therefore has entries that are always page-sized and which hold in turn a collection of small control structures of one sort or another. As an aside, it seems most plausible that 0x80 is chosen as the block size so that a CPU page helpfully corresponds to one dword in the allocation bitmap: free space for a page-aligned page-sized entry could be found just by scanning the bitmap for the first clear dword. File-map entries that are at least a page big are always page-aligned.
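The bitmap scan that is supposed above can be sketched in a few lines. This is only an illustration of the idea, not a transcription of WININET's code: the function name `find_free_page` and the presentation of the bitmap as a Python `bytes` object are assumptions made for the example.

```python
import struct

BLOCK_SIZE = 0x80                        # bytes per block, from the text
HEADER_SIZE = 0x4000                     # bytes in the file header, from the text
BLOCKS_PER_PAGE = 0x1000 // BLOCK_SIZE   # 32 blocks, i.e. one dword of bitmap


def find_free_page(bitmap: bytes):
    """Return the file offset of the first wholly free page, or None.

    Hypothetical helper: each dword of the bitmap covers one page of
    blocks, so a zero dword means 32 consecutive free blocks.
    """
    for i in range(0, len(bitmap) - 3, 4):
        (dword,) = struct.unpack_from("<I", bitmap, i)
        if dword == 0:
            first_block = (i // 4) * BLOCKS_PER_PAGE
            return HEADER_SIZE + first_block * BLOCK_SIZE
    return None
```

Because the header is itself a whole number of pages, any offset found this way is page-aligned in the file. Note that the scan does not depend on the order of bits within each dword.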
Microsoft’s name for the file header is not recorded in the public symbols for WININET.
Offset | Size | Description |
---|---|---|
0x00 | 0x1C bytes | signature, necessarily “Client UrlCache MMF Ver 5.2”, including null terminator |
0x1C | dword | file size, in bytes |
0x20 | dword | file offset of first page in hash table, else zero |
0x24 | dword | total number of blocks following header |
0x28 | dword | number of allocated blocks |
0x2C | 4 bytes | apparently unused |
0x30 | qword | cache limit, in bytes |
0x38 | qword | cache size, in bytes |
0x40 | qword | cache usage exempt from scavenging, in bytes |
0x48 | dword | number of subdirectories in cache |
0x4C | 0x0180 bytes | array of 0x20 structures, each of 0x0C bytes, to describe subdirectories in cache |
0x01CC | 0x80 bytes | array of 0x20 dwords, apparently called header data |
0x024C | 4 bytes | apparently unused |
0x0250 | 0x3DB0 bytes | allocation bitmap for blocks following header |
The two members that are marked above as unused are plausibly just artefacts of the programming. This would be directly so for the unused dword at offset 0x2C, which could be compiler-generated padding for the 64-bit alignment of the next member. That the dword at offset 0x024C is unused may indicate that Microsoft’s definition of the header as a formal structure does not include the bitmap. Instead, following a typical practice at Microsoft, the programmer may have defined a single-element array (of bytes or dwords) at offset 0x024C to mark the intention of following the header with something else, even though other code then places the something else not at the marker but after the structure.
The signature at offset 0x00 is required for an existing INDEX.DAT file to be considered valid and is entered into any INDEX.DAT file that is initialised or re-initialised by this WININET version. It also appears in the registry, in each of several possible keys:
Key: | HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Internet Settings\5.0\Cache |
Key: | HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings\5.0\Cache |
Key: | HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings\5.0\LowCache |
Value: | Signature |
Type: | REG_SZ |
Data: | Client UrlCache MMF Ver 5.2 |
Its purpose there is to mark that WININET has already determined whether the current user, at the current process’s integrity level, has a per-user Content cache or must use a shared cache. This and other matters of cache configuration are left for a separate article.
Several fields in the header help with bookkeeping for the allocation bitmap at the header’s end. The size of a valid INDEX.DAT file, as saved at offset 0x1C in the header, is 0x4000 for the header plus 0x80 for each block that’s counted at offset 0x24 in the header. The dword at offset 0x28 tells how many of these blocks are allocated to file-map entries. A block is allocated if the corresponding bit at offset 0x0250 is set. The header has space for at most 0x0001ED80 such bits, for a maximum file size of 0x00F70000 bytes.
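The arithmetic and the consistency rules just described can be checked mechanically. The following is a sketch under the stated layout; `check_bookkeeping` is a hypothetical name, and the three values it takes are those from offsets 0x1C, 0x24 and 0x28 of the header.

```python
HEADER_SIZE = 0x4000
BLOCK_SIZE = 0x80
BITMAP_OFFSET = 0x0250
BITMAP_BITS = (HEADER_SIZE - BITMAP_OFFSET) * 8          # 0x0001ED80
MAX_FILE_SIZE = HEADER_SIZE + BITMAP_BITS * BLOCK_SIZE   # 0x00F70000


def check_bookkeeping(header: bytes, file_size: int,
                      total_blocks: int, allocated_blocks: int) -> list:
    """Return a list of inconsistencies, empty if none are found."""
    problems = []
    if file_size != HEADER_SIZE + total_blocks * BLOCK_SIZE:
        problems.append("file size does not match block count")
    if total_blocks > BITMAP_BITS:
        problems.append("more blocks than the bitmap can describe")
    bitmap = header[BITMAP_OFFSET:HEADER_SIZE]
    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    if set_bits != allocated_blocks:
        problems.append("allocated count does not match bitmap")
    return problems
```

Counting set bits needs no assumption about the order of bits within a byte, which is why the sketch stops short of reporting which particular blocks are allocated.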
Incidentally, some parts of the WININET code allow for a configurable block size. Other parts have hard-coded assumptions about the block size and won’t work correctly unless the block size is 0x80 bytes.
Local files for the cache are potentially numerous. They can be distributed among as many as 32 randomly named subdirectories of whichever directory holds the INDEX.DAT file. The number of subdirectories yet created for this purpose is saved at offset 0x48 in the file header. The subdirectories themselves are described at offset 0x4C in an array of unnamed structures:
Offset | Size | Description |
---|---|---|
0x00 | dword | number of files in this subdirectory |
0x04 | 8 bytes | name of subdirectory, without null terminator |
With these descriptions in the file header, each URL entry that has a local file in the cache need not hold a complete pathname for the local file, nor even reproduce the name of the directory that contains the local file, just the filename: a one-byte index into this array suffices for the path.
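How a pathname might be recovered from the one-byte index is sketched below. The helper names are hypothetical, and it is an assumption of the example that subdirectory names are plain ASCII that fills all eight bytes.

```python
import struct
from pathlib import PureWindowsPath

SUBDIR_ARRAY_OFFSET = 0x4C
SUBDIR_ENTRY_SIZE = 0x0C
SUBDIR_MAX = 0x20


def subdirectory(header: bytes, index: int):
    """Return (file_count, name) for one subdirectory descriptor."""
    if not 0 <= index < SUBDIR_MAX:
        raise IndexError("subdirectory index out of range")
    count, raw = struct.unpack_from(
        "<I8s", header, SUBDIR_ARRAY_OFFSET + index * SUBDIR_ENTRY_SIZE)
    return count, raw.decode("ascii")   # assumption: ASCII, no terminator


def local_file_path(cache_dir: str, header: bytes,
                    dir_index: int, filename: str) -> PureWindowsPath:
    """Join cache directory, indexed subdirectory and filename."""
    _, name = subdirectory(header, dir_index)
    return PureWindowsPath(cache_dir) / name / filename
```

A careful reader of real files would also check the index against the count of subdirectories at offset 0x48 before trusting the descriptor.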
The array at offset 0x01CC provides for indexed storage of an arbitrary dword whose interpretation varies with the index. Although this header data takes space in every INDEX.DAT file, most of it is meaningful in the Content container only.
Index | Symbolic Name | Interpretation |
---|---|---|
0x00 | CACHE_HEADER_DATA_CURRENT_SETTINGS_VERSION | number of changes to any of many WININET settings |
0x01 | CACHE_HEADER_DATA_CONLIST_CHANGE_COUNT | number of changes to container list for same registry set |
0x02 | CACHE_HEADER_DATA_COOKIE_CHANGE_COUNT | number of changes to Cookies container |
0x03 | CACHE_HEADER_DATA_NOTIFICATION_HWND | window handle for cache notifications |
0x04 | CACHE_HEADER_DATA_NOTIFICATION_MESG | window message for cache notifications |
0x05 | CACHE_HEADER_DATA_ROOTGROUP_OFFSET | file offset of first GROUP_ENTRY, else zero |
0x06 | CACHE_HEADER_DATA_GID_LOW | low 32 bits for generation of most recently allocated GROUPID, else zero |
0x07 | CACHE_HEADER_DATA_GID_HIGH | high 32 bits for generation of most recently allocated GROUPID, else zero |
0x0E | CACHE_HEADER_DATA_SSL_STATE_COUNT | potted description needed here! |
0x15 | CACHE_HEADER_DATA_NOTIFICATION_FILTER | bit flags to filter cache notifications |
0x16 | CACHE_HEADER_DATA_ROOT_LEAK_OFFSET | file offset of first leak entry |
0x1B | CACHE_HEADER_DATA_ROOT_GROUPLIST_OFFSET | file offset of first GROUP_LIST_ENTRY, else zero |
The list above shows every index that is meaningful to WININET. The header data cannot be defined exhaustively from inspection of WININET, because of exposure through the exported functions GetUrlCacheHeaderData, IncrementUrlCacheHeaderData and SetUrlCacheHeaderData. Since these functions are undocumented, external users may be few. The only ones supplied with Windows (Vista) are INETCPL.CPL and MSDRM.DLL, and they access only CACHE_HEADER_DATA_SSL_STATE_COUNT, for purposes not yet studied. These functions anyway affect only the Content container.
Perusal of earlier WININET versions confirms that more of the header data used to be meaningful, even as recently as version 6.0.
Indices 0x05, 0x16 and 0x1B point to the start of one or another chain of structures in file-map entries. Much like the dword at offset 0x20 in the header, which points to the hash table, they are essential for navigating the file. This is not so for the other indices. They appear to be kept in the file header because the file can be accessed from multiple processes concurrently and its memory-mapped image is conveniently to hand as shared memory.
The first three indices and 0x0E are global counters. Each counter governs some state that may be maintained by multiple processes but is invalidated for all if changed by one. For index zero, the relevant state is a large collection of settings, mostly loaded from the registry, that have no obvious or direct connection with URL caching.
Indices 0x03, 0x04 and 0x15 support the (undocumented) RegisterUrlCacheNotification function. No software supplied with Windows Vista imports it. That the header data is more valuable as shared memory than as persistent storage is especially marked for index 0x03: it holds a window handle, whose persistence in a file from one Windows session to another really can’t be much use.
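Reading an item of header data directly from the file is simple enough to show in full. The function name `header_data` is hypothetical and is not meant to reproduce GetUrlCacheHeaderData; the offset and the indices are those given above.

```python
import struct

HEADER_DATA_OFFSET = 0x01CC
HEADER_DATA_COUNT = 0x20

# The three indices that hold file offsets for navigating the file.
CACHE_HEADER_DATA_ROOTGROUP_OFFSET = 0x05
CACHE_HEADER_DATA_ROOT_LEAK_OFFSET = 0x16
CACHE_HEADER_DATA_ROOT_GROUPLIST_OFFSET = 0x1B


def header_data(header: bytes, index: int) -> int:
    """Return the dword of header data at the given index."""
    if not 0 <= index < HEADER_DATA_COUNT:
        raise IndexError("header data index out of range")
    (value,) = struct.unpack_from("<I", header, HEADER_DATA_OFFSET + 4 * index)
    return value
```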
All file-map entries, meaning entries allocated as whole blocks of an INDEX.DAT file, begin with an 8-byte FILEMAP_ENTRY structure:
Offset | Size | Description |
---|---|---|
0x00 | dword | signature |
0x04 | dword | number of blocks allocated to entry |
Some types of file-map entry are distinguished by their signature:
“HASH” | 0x48534148 | page in the hash table |
“LEAK” | 0x4B41454C | leak entry, actually a modified URL entry |
“REDR” | 0x52444552 | redirection entry |
“URL ” | 0x204C5255 | standard URL entry |
Here, for easier reading, each dword signature is presented first as characters, starting from the least significant. No signature is set explicitly for the page-sized entries that hold the several structures for supporting groups of URL entries. For these, the signature is 0xDEADBEEF, this being what all dwords in all the blocks for any entry are filled with when the entry is newly allocated (before the count of blocks is recorded at offset 0x04).
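The correspondence between characters and dword values is easily demonstrated. In this sketch, `read_filemap_entry` is a hypothetical helper and the descriptive strings in the table are only labels for the example.

```python
import struct

KNOWN_SIGNATURES = {
    0x48534148: "hash table page",
    0x4B41454C: "leak entry",
    0x52444552: "redirection entry",
    0x204C5255: "URL entry",
    0xDEADBEEF: "fill pattern (e.g. page of group structures)",
}


def read_filemap_entry(data: bytes, offset: int):
    """Return (signature, signature as text, block count, description)."""
    signature, blocks = struct.unpack_from("<II", data, offset)
    # Little-endian packing puts the least significant byte first,
    # which is the order in which the characters are presented above.
    text = struct.pack("<I", signature).decode("latin-1")
    return signature, text, blocks, KNOWN_SIGNATURES.get(signature, "unknown")
```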
Incidentally, the WININET code provides for filling blocks with 0x0BADF00D when they are deallocated from a file-map entry, but the option to do this is never exercised. Code for deleting URL entries is called with either “DEL ” or “UPD ” as an argument, presumably so that this signature can be set into the deleted entry. However, the called code never acts on this argument. Perusal of earlier WININET versions confirms that the code for both these cases used to be active.
A typical problem for accessing the cache is that a URL is known and information about this URL is either to be retrieved from the cache or saved into the cache. Of course, WININET does not search the whole INDEX.DAT file, nor even just where URLs are known to be stored. Instead, a 32-bit hash is computed of the URL and a small portion of the file is searched for matching hash items. This portion is here called the hash table. It is built as page-sized file-map entries, which are typically scattered through the file.
Each page in the hash table has a 0x10-byte LIST_FILEMAP_ENTRY structure as a header. This begins in turn as a FILEMAP_ENTRY, with “HASH” as its signature and 0x20 as its block count:
Offset | Size | Description |
---|---|---|
0x08 | dword | file offset to next page of hash table, else zero |
0x0C | dword | 0-based serial number of this page within hash table |
Pages that are allocated to the hash table are never deallocated. The file offset of the first page is recorded in the file header, at offset 0x20. The hash table is always examined from the first-allocated page to the last, following the links at offset 0x08 and checking each for the correct serial number at offset 0x0C.
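The walk through the pages might look as follows. This is a sketch of the traversal as described, with `hash_table_pages` a hypothetical name; what WININET does on finding a page that fails the checks is not addressed here, so the example simply raises an exception.

```python
import struct


def hash_table_pages(data: bytes):
    """Yield the file offset of each hash table page, first-allocated first."""
    (offset,) = struct.unpack_from("<I", data, 0x20)   # from the file header
    serial = 0
    while offset:
        signature, blocks, next_offset, number = struct.unpack_from(
            "<4sIII", data, offset)
        if signature != b"HASH" or blocks != 0x20 or number != serial:
            raise ValueError("bad hash table page at 0x%X" % offset)
        yield offset
        offset, serial = next_offset, serial + 1
```

A useful side-effect of checking the serial number is that a corrupt chain which loops back on itself cannot be followed forever: the revisited page has the wrong serial number.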
The LIST_FILEMAP_ENTRY on each page of the hash table is followed immediately by an array of 8-byte HASH_ITEM structures:
Offset | Size | Description |
---|---|---|
0x00 | 5 bits | flags |
0x00 | 1 bit | apparently unused |
0x00 | 26 bits | high 26 bits of hash |
0x04 | dword | file offset of corresponding file-map entry, else 3 |
The array of hash items on the page is two-dimensional. Specifically, the hash items are arranged as 64 sets of 7. The 64 comes about because the sets are indexed by the low 6 bits of the hash. That there are 7 items per set is because that’s as big as each set can be for 64 of them to fit the page. Space after the hash items, i.e., from offset 0x0E10, is unused. To enumerate all the URLs about which information is cached in an INDEX.DAT file, the hash table is examined by working upwards through all the hash items on each page of the hash table, from the first-allocated page to the last. To look up a particular URL is also to search the hash table from the first page to the last, but looking only at the seven hash items per page that are selected by the low 6 bits of the URL’s hash.
Since the low 6 bits of the hash are implied by the hash item’s position on the page, the whole hash need not be held in the hash item itself, just the high 26 bits. That leaves each hash item with 6 bits to use as flags.
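The division of the first dword, and the placement of sets on the page, can be put as a few lines of arithmetic. The helper names are hypothetical. That the 64 sets are laid out contiguously, in order of index, is an assumption of this sketch; the text establishes only that there are 64 sets of 7 and that they end at offset 0x0E10.

```python
LIST_HEADER_SIZE = 0x10     # LIST_FILEMAP_ENTRY at the start of each page
HASH_ITEM_SIZE = 8
ITEMS_PER_SET = 7
FLAGS_MASK = 0x0000001F     # the 5 bits used as flags
HASH_MASK = 0xFFFFFFC0      # the high 26 bits of the hash


def set_offset_on_page(url_hash: int) -> int:
    """Offset, from the start of a page, of the set selected by a hash."""
    set_index = url_hash & 0x3F
    return LIST_HEADER_SIZE + set_index * ITEMS_PER_SET * HASH_ITEM_SIZE


def split_hash_item(first_dword: int):
    """Return (flags, high 26 bits of hash) from a hash item's first dword."""
    return first_dword & FLAGS_MASK, first_dword & HASH_MASK


def full_hash(first_dword: int, set_index: int) -> int:
    """Recombine the stored high bits with the implied low 6 bits."""
    return (first_dword & HASH_MASK) | (set_index & 0x3F)
```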
The hash algorithm for URL caching is presented separately. For most practical purposes, it suffices to know just that the input for the computation is the URL exactly as given, i.e., without case conversion, up to but not including the null terminator, except to ignore at most one trailing forward slash. This last point helps when an original URL is redirected simply by appending a forward slash: with or without the slash, the lookup is the same.
The hash table has a HASH_ITEM for every file-map entry that might be sought from a URL. Such entries come in two types: URL entries and redirection entries. What type a hash item represents is recorded in the flags, along with a few other properties that might usefully be known immediately from the hash item without following the file offset to the file-map entry (which would have to be validated and interpreted).
The low 3 bits of the flags in a HASH_ITEM are more or less, but not formally, a single field. They are sometimes examined for equality after masking by 0x07, but are sometimes tested individually. The interpretation adopted here is that if the 0x01 bit is clear, then the hash item represents a URL entry and the other bits are independent:
0x01 | clear: file offset in hash item is of URL entry and hash is of URL |
0x02 | corresponding URL entry is locked |
0x04 | corresponding URL entry has trivial redirection |
A URL entry is locked while its local file is being accessed, most notably for the RetrieveUrlCacheEntryFile and RetrieveUrlCacheEntryStream functions, which require a subsequent call to UnlockUrlCacheEntryFile or UnlockUrlCacheEntryStream. Nested locking is supported through a count in the URL entry itself, at offset 0x58 (see later). An entry cannot be deleted while locked. An attempt at such deletion may appear to succeed, but the entry is not actually deleted until the final unlock.
The 0x04 flag eases a common case of URL redirection. It means simply that the redirection appended a forward slash. Put another way, the URL that is saved in the URL entry is the original URL plus a forward slash. With such a simple relationship, there’s no need to save the original URL separately as a redirection entry (see later). A search for either URL produces the one hash item for the one URL entry.
When the 0x01 flag is set, the hash item represents something other than a URL entry and the low 3 bits are better interpreted as one field:
0x01 | hash item is free; whole first dword of hash item should be 1 |
0x03 | hash item is unused; whole first dword of hash item should be 3 |
0x05 | file offset is of redirection entry; hash is of original URL |
Hash items in a new page for the hash table are initialised with 3 in both dwords, presumably just as a quick way to set 0x03 for the flags in the first dword. When searching the hash table, seven items per page, for a particular URL, finding an item that has 0x03 as its first dword means that the search is over and the URL is not in the hash table. If the URL is to be entered into the hash table, then the first free hash item, with 0x01 as its first dword, that was noticed on the way, is allocated to the URL. If there was no free hash item, then the unused item is allocated to the URL. If the search ended at the last page of the hash table without finding an unused item, then the hash table gets a new page.
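The search just described is sketched below on an in-memory model, so that the decisions stand clear of the file handling. The function name and the return convention are inventions for the example. Two points are assumptions: that a free item noticed on the way is used even when no unused item is ever met, and that a matching hash is merely a candidate, since a real lookup would presumably go on to compare the URL saved in the entry.

```python
HASH_MASK = 0xFFFFFFC0


def find_or_choose_slot(sets, url_hash):
    """Search the seven items per page that a hash selects.

    sets: one list per hash table page, first-allocated first, each
    holding the 7 (first dword, second dword) pairs for this hash.
    Returns ("found", page, slot), ("insert", page, slot) or
    ("grow", None, None) if the table needs a new page.
    """
    wanted = url_hash & HASH_MASK
    first_free = None
    for page, items in enumerate(sets):
        for slot, (first, _second) in enumerate(items):
            if first == 3:                      # unused: the search is over
                if first_free is not None:
                    return ("insert",) + first_free
                return ("insert", page, slot)
            if first == 1:                      # free: remember the first
                if first_free is None:
                    first_free = (page, slot)
                continue
            if first & HASH_MASK == wanted:     # candidate match
                return ("found", page, slot)
    if first_free is not None:
        return ("insert",) + first_free
    return ("grow", None, None)
```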
Other flags in a hash item are particular to the grouping of URL entries, and appear to be meaningful only in hash items for URL entries:
0x08 | corresponding URL entry belongs to a group |
0x10 | corresponding URL entry belongs to a list of groups |
These flags are not independent: the 0x10 flag is never set unless the 0x08 flag is also set. The 0x10 flag must, of course, be set if the URL entry belongs to more than one group. However, if the URL entry belongs to exactly one group, then this flag can be either set or clear. The difference is in the linkage from entry to group. If the flag is clear, the entry links directly to its group. If the flag is set, the entry links to a list of groups which happens to be a list of one.
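Putting the interpretations of the flags together gives something like the following. It follows the reading adopted in this article, which is an interpretation rather than anything formal in the code, and the function name and descriptive strings are hypothetical.

```python
def describe_hash_item(first_dword: int) -> str:
    """Describe a hash item from the flags in its first dword."""
    flags = first_dword & 0x1F
    low = flags & 0x07
    if low & 0x01:
        # With 0x01 set, the low 3 bits are read as a single field.
        names = {0x01: "free", 0x03: "unused", 0x05: "redirection entry"}
        return names.get(low, "unrecognised")
    parts = ["URL entry"]
    if flags & 0x02:
        parts.append("locked")
    if flags & 0x04:
        parts.append("trailing-slash redirection")
    if flags & 0x08:
        parts.append("in a group")
    if flags & 0x10:
        parts.append("in a list of groups")
    return ", ".join(parts)
```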
The descriptions above are anyway offered only as interpretations of what seems to be intended, not what actually is coded. It can happen that the 0x08 flag is set even though the URL entry does not belong to any group, but this is here taken to be the consequence of a coding error—indeed, of two coding errors. The essential point for both is that when a group is deleted, any entries that belong to the group but which will not be deleted with the group would better not be left still referring to the group. The main such reference is the dword at offset 0x28 in the IE6_URL_FILEMAP_ENTRY (see later). In one case, code that supports the DeleteUrlCacheGroup function clears this dword but does nothing about the 0x08 flag in the hash item. This might not matter—indeed, the intended meaning of the 0x08 flag could be just that the corresponding URL entry may belong to a group—except that some code for enumerating entries takes for granted that if the 0x08 flag is set in the hash item then the dword at offset 0x28 in the URL entry is a meaningful file offset. It just doesn’t defend against this dword being zero. One or other piece of code is faulty, though I must admit I can find no serious consequence, just a quirk:
The cumbersome parenthesis at step 2 is not just a necessary condition for triggering the coding error. It hints at a second coding error. For entries whose cache entry type has any set bit that is not in either of the collections URLCACHE_FIND_DEFAULT_FILTER or INCLUDE_BY_DEFAULT_CACHE_ENTRY, the cleanup of links from entry to group isn’t even attempted. Both the dword and the flag persist as if the entry still belongs to the deleted group. Creating this anomalous state is as easy as:
For confirmation that the state created by these steps genuinely is anomalous, create another group immediately and enumerate it for entries of whatever type you created at step 2. The entry from step 2 magically appears in the new group (whose creation reuses memory that held the definition of the old group, such that the entry’s stale references to those definitions are picked up for the new group).
The main type of file-map entry in an INDEX.DAT file is one that associates a URL with information that is cached for that URL. Each such entry has a fixed-sized header which is followed by variable-sized data, typically strings. The header is an IE6_URL_FILEMAP_ENTRY structure, based on FILEMAP_ENTRY. The signature is typically “URL ” but can be modified to “LEAK” as a special case.
Offset | Size | Description | |
---|---|---|---|
0x08 | 8 bytes | last modified time, as FILETIME structure | |
0x10 | 8 bytes | last access time, as FILETIME structure | |
0x18 | dword | expiry time, as DOS time | |
0x1C | dword | potted description needed here! | |
0x20 | dword | size of local file, in bytes | |
0x24 | dword | apparently unused, except for explicit initialisation to zero | |
0x28 | dword | file offset of GROUP_ENTRY or LIST_GROUP_ENTRY | |
0x2C | dword | in URL entry: exempt delta; in leak entry: file offset of next leak entry | |
0x30 | dword | size of structure in excess of FILEMAP_ENTRY, in bytes | |
0x34 | dword | offset from start of structure to URL, as saved in entry after header | |
0x38 | byte | index of directory containing local file | |
0x39 | byte | synchronisation count | |
0x3A | byte | potted description needed here! | |
0x3B | byte | potted description needed here! | |
0x3C | dword | offset from start of structure to name of local file, as saved in entry after header | |
0x40 | dword | cache entry type, as bit flags | |
0x44 | dword | offset from start of structure to header information, as saved in entry after header | |
0x48 | dword | size of header information, in bytes | |
0x4C | dword | offset from start of structure to file extension, as saved in entry after header | |
0x50 | dword | last synchronisation time, as DOS time | |
0x54 | dword | number of times entry has been locked | |
0x58 | dword | nesting level of locks on entry | |
0x5C | dword | creation time, as DOS time | |
0x60 | dword | potted description needed here! | |
0x64 | 4 bytes | apparently unused | |
That the last dword may truly be unused is again plausible as a programming artefact. The structure is perhaps defined with a one-element character array at the end as an allowance for variable-sized data to follow the structure, even though the data actually gets placed after the structure.
Perusal of symbol files for earlier WININET versions confirms that there has been defined an IE5_URL_FILEMAP_ENTRY and, before that, a plain URL_FILEMAP_ENTRY. The byte at offset 0x3A appears to exist, nowadays, only to distinguish an IE6_URL_FILEMAP_ENTRY from an IE5_URL_FILEMAP_ENTRY. It and the byte at offset 0x3B are set to 0x10 for the newer structure and 0x00 for the older. The two structures have the same layout except that the older is only 0x60 bytes. The member at offset 0x60 is not present unless the byte at offset 0x3A is at least 0x10, and is anyway barely used in version 7.0. Indeed, the dword at offset 0x60, the bytes at offsets 0x3A and 0x3B, and even the dwords at offsets 0x24 and 0x30 are all so little used, but with the look of having been more used, that meaningful description ought not be attempted without closer inspection of earlier WININET versions.
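The members that are well understood suffice for extracting the most useful information from a URL entry. The sketch below reads only those. The helper names are hypothetical, and it is assumed for the example that the strings are null-terminated and single-byte, and that a zero offset means the string is absent.

```python
import struct

BLOCK_SIZE = 0x80


def read_cstring(entry: bytes, start: int) -> str:
    """Read a null-terminated single-byte string (an assumption)."""
    end = entry.index(b"\x00", start)
    return entry[start:end].decode("latin-1")


def parse_url_entry(data: bytes, offset: int) -> dict:
    """Decode selected members of a URL or leak entry at a file offset."""
    signature, blocks = struct.unpack_from("<4sI", data, offset)
    if signature not in (b"URL ", b"LEAK"):
        raise ValueError("not a URL or leak entry")
    entry = data[offset:offset + blocks * BLOCK_SIZE]
    (url_off,) = struct.unpack_from("<I", entry, 0x34)
    (name_off,) = struct.unpack_from("<I", entry, 0x3C)
    (entry_type,) = struct.unpack_from("<I", entry, 0x40)
    hit_rate, use_count = struct.unpack_from("<II", entry, 0x54)
    return {
        "signature": signature.decode("ascii"),
        "url": read_cstring(entry, url_off) if url_off else None,
        "directory_index": entry[0x38],
        "local_file": read_cstring(entry, name_off) if name_off else None,
        "cache_entry_type": entry_type,
        "hit_rate": hit_rate,
        "use_count": use_count,
    }
```

The string offsets are taken from the start of the entry, not from the start of the file, which is why the sketch slices the entry out before reading any string.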
The cache entry type at offset 0x40 is a collection of bit flags. The following are generally meaningful:
0x00000001 | NORMAL_CACHE_ENTRY | set initially for all entries in Content container |
0x00000004 | STICKY_CACHE_ENTRY | entry is exempt from scavenging |
0x00000008 | EDITED_CACHE_ENTRY | local file need not be in cache |
0x00010000 | SPARSE_CACHE_ENTRY | potted description needed here! |
0x00100000 | COOKIE_CACHE_ENTRY | set initially for all entries in Cookies container |
0x00200000 | URLHISTORY_CACHE_ENTRY | set initially for all entries in History container |
0x00400000 | PENDING_DELETE_CACHE_ENTRY | set when deletion is attempted while entry is locked |
0x10000000 | INSTALLED_CACHE_ENTRY | potted description needed here! |
0x80000000 | IDENTITY_CACHE_ENTRY | potted description needed here! |
These are the bits that are interpreted, set or cleared by WININET itself while managing URL entries as a file-format feature. All bits in the cache entry type are exposed to external interpretation and control, even at the risk of conflicts with WININET’s own bookkeeping. See especially that the SetUrlCacheEntryInfo function can set the cache entry type in a URL entry to anything (exactly as given in the CacheEntryType member of the INTERNET_CACHE_ENTRY_INFO structure, when either CACHE_ENTRY_ATTRIBUTE_FC or CACHE_ENTRY_TYPE_FC is specified in the dwFieldControl argument).
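Since any bit may have been set from outside, a decoder for the cache entry type does well to report what it does not recognise. A minimal sketch, with a hypothetical function name and the names from the table above:

```python
CACHE_ENTRY_TYPES = {
    0x00000001: "NORMAL_CACHE_ENTRY",
    0x00000004: "STICKY_CACHE_ENTRY",
    0x00000008: "EDITED_CACHE_ENTRY",
    0x00010000: "SPARSE_CACHE_ENTRY",
    0x00100000: "COOKIE_CACHE_ENTRY",
    0x00200000: "URLHISTORY_CACHE_ENTRY",
    0x00400000: "PENDING_DELETE_CACHE_ENTRY",
    0x10000000: "INSTALLED_CACHE_ENTRY",
    0x80000000: "IDENTITY_CACHE_ENTRY",
}


def decode_cache_entry_type(value: int):
    """Return (names of recognised bits, mask of the remaining bits)."""
    names = [name for bit, name in CACHE_ENTRY_TYPES.items() if value & bit]
    known = sum(CACHE_ENTRY_TYPES)          # keys are distinct single bits
    return names, value & ~known & 0xFFFFFFFF
```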
Perusal of earlier WININET versions suggests that some of these flags have meant more. The INSTALLED_CACHE_ENTRY and IDENTITY_CACHE_ENTRY types look to be particularly affected by a reduction of support in version 7.0, such that description ought not be attempted without closer inspection of earlier versions.
The dwords at offsets 0x54 and 0x58 can be inspected through the GetUrlCacheEntryInfo function, in the dwHitRate and dwUseCount members of the INTERNET_CACHE_ENTRY_INFO structure. Both the hit rate and use count are incremented in the URL entry each time the entry is locked for retrieval of its local file. Only the use count is decremented each time the entry is unlocked.
Interpretation of the hit rate as the number of times the entry has been locked is, however, not strictly justified. The SetUrlCacheEntryInfo function can set the hit rate to anything (from the dwHitRate member of the INTERNET_CACHE_ENTRY_INFO structure, when CACHE_ENTRY_HITRATE_FC is specified in the dwFieldControl argument).
When a URL entry is deleted, the corresponding local file, if any, would ideally be deleted too. If it happens that the local file cannot be deleted because of an error that may just be temporary, which means specifically ERROR_ACCESS_DENIED or ERROR_SHARING_VIOLATION, then the URL entry is converted to a leak entry and is removed from the hash table. Leak entries are essentially URL entries with “LEAK” as the signature. Though they are removed from being found as URL entries, they are kept in a list so that deletion of the local file can eventually be re-attempted. The file offset of the leak entry at the head of the list is found from the header data, in the dword indexed by CACHE_HEADER_DATA_ROOT_LEAK_OFFSET. In each leak entry, the member at offset 0x2C is unnecessary (else there would have been no attempt to delete the leak entry while it was a URL entry) and is reused for linking to the next leak entry.
When a URL is entered into the cache, as through the CommitUrlCacheEntry function, it can be given together with a URL that it was redirected from, i.e., the original URL. Either URL can be searched for. When the redirection is just a matter of appending a forward slash, the redirection is accommodated by ignoring the forward slash when computing the hash and marking the URL entry’s hash item by setting its 0x04 flag. In general, however, both URLs are represented in the hash table. The URL that actually is entered into the cache has a hash item which links to an IE6_URL_FILEMAP_ENTRY. The original URL has a separate hash item that links to a structure which is not named in the public symbol file but is here called a redirection entry. It too is a file-map entry, based on FILEMAP_ENTRY, but with “REDR” as its signature:
Offset | Size | Description |
---|---|---|
0x08 | dword | file offset of hash item for URL entry |
0x0C | dword | hash of (target) URL, but with low 6 bits clear |
0x10 | varies | original URL |
The WININET code for creating a redirection entry computes the size of entry as a header of 0x14 bytes plus the original URL as a null-terminated string. Presumably, the structure is defined with a single-element character array at offset 0x10, and in this case the programmer actually does copy the characters to that placeholder instead of to the end of the structure.
Any number of redirection entries may link to one URL entry, to model that any number of original URLs redirect to the same target URL. Perhaps because of this, there is no link back from the URL entry. When a URL entry is deleted, the redirection entries that link to it are left alone. They retain the file offset of a hash item that may be reused, sooner or later, for a different URL entry or even for a redirection entry. The defence is provided by the saved hash at offset 0x0C. A redirection entry is invalid unless the hash item pointed to from offset 0x08 is plausibly still the one the redirection entry was created for. Specifically, the first dword of the supposed hash item must have the 0x01 flag clear (as expected of a hash item for a URL entry) and must have the same hash as saved at offset 0x0C in the redirection entry.
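The validity test for a redirection entry translates directly into code. In the sketch below, `redirection_target` and `original_url` are hypothetical names, and the original URL is assumed to be a null-terminated single-byte string.

```python
import struct


def redirection_target(data: bytes, offset: int):
    """Return the file offset of the target URL entry, or None if stale.

    offset is the file offset of a redirection entry.
    """
    signature, _blocks, item_offset, saved_hash = struct.unpack_from(
        "<4sIII", data, offset)
    if signature != b"REDR":
        raise ValueError("not a redirection entry")
    first_dword, entry_offset = struct.unpack_from("<II", data, item_offset)
    if first_dword & 0x01:
        return None                 # hash item no longer for a URL entry
    if first_dword & 0xFFFFFFC0 != saved_hash:
        return None                 # hash item reused for another URL
    return entry_offset


def original_url(data: bytes, offset: int) -> str:
    """Read the original URL saved at offset 0x10 in a redirection entry."""
    start = offset + 0x10
    return data[start:data.index(b"\x00", start)].decode("latin-1")
```

The test is one of plausibility only. A hash item that has been reused for a different URL with the same high 26 bits of hash would pass it.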
URL entries in a Content container can be grouped. Since groups are not much used nowadays, at least not by Microsoft in software supplied with Windows, a brief review may help. An empty group is created through the exported function CreateUrlCacheGroup, which returns a 64-bit group ID to represent the group in calls to other functions. There is also a built-in group with a preset group ID (used most notably by IEFRAME when caching FAVICON.ICO files). Properties can be set for a group by calling the exported function SetUrlCacheGroupAttribute. URL entries can be assigned to a group through the exported function SetUrlCacheEntryGroup. The most prominent merit to grouping URL entries is that enumeration of URL entries can be refined by supplying the group ID as a search parameter. A less prominent but conceivably very useful feature is that URL entries can be made sticky simply by assigning them to a sticky group. Another is that URL entries can be deleted en masse by assigning them to a group and then deleting the group (with a suitable flag specified). To delete a group, call the DeleteUrlCacheGroup function.
In the INDEX.DAT file format, each group is represented by a 0x28-byte GROUP_ENTRY structure:
Offset | Size | Description | |
---|---|---|---|
0x00 | qword | group ID, else zero in a free entry, or -1 in an index entry | |
0x08 | dword | in allocated entry: group flags; in index entry: file offset of first GROUP_ENTRY on next page of such structures, else zero | |
0x0C | dword | group type | |
0x10 | qword | disk usage, in bytes | |
0x18 | dword | disk quota, in kilobytes | |
0x1C | dword | in allocated entry: file offset of GROUP_DATA_ENTRY structure containing optional attributes, else zero; in first index entry: file offset of first free GROUP_DATA_ENTRY structure, else zero | |
0x20 | 8 bytes | apparently unused | |
The unused space at offset 0x20 may be an alignment artefact. For instance, in anticipation of variable-sized data at the end of the structure, a programmer may have thought to mark the spot with a one-element byte array. A wasteful side-effect, because of members that demand 64-bit alignment, would be that the structure acquires eight more bytes.
Note that a group entry is not a file-map entry. It is too small to justify consuming a whole block. Group entries are instead prepared collectively in page-sized file-map entries. Each such page is a FILEMAP_ENTRY followed immediately by an array of as many GROUP_ENTRY structures as fit the page. The file offset of the first group entry on the first page of group entries is saved in the file header, as the CACHE_HEADER_DATA_ROOTGROUP_OFFSET index in the header data. The last group entry on each page is marked specially as an index entry. It can never represent a group but instead provides the link to the next page of group entries. Group entries are always scanned from the first on a page up to but not including the index entry on that page, repeating for each page, starting from the first page that was ever allocated, proceeding to the most recently allocated.
A group entry is free, for representing a new group, simply because its group ID is zero. Deleting a group frees the corresponding group entry for reallocation to a subsequently created group. (Indeed, deleting a group clears all the bytes of the group entry.) Deleting all the groups that are defined on a page of group entries merely leaves a page of free group entries: once a file-map entry is allocated to hold group entries, it stays allocated.
The flags at offset 0x08 are acquired only from the dwFlags argument of the CreateUrlCacheGroup function. It would seem then that only two bits can ever be set:
0x01 | CACHEGROUP_FLAG_NONPURGEABLE |
0x02 | CACHEGROUP_FLAG_FLUSHURL_ONDELETE |
Neither is directly meaningful. The former records that the group was created to be sticky, but what matters for whether a group actually is sticky is that the 0x1000000000000000 bit is set in the group ID. The other flag can usefully be given to the DeleteUrlCacheGroup function but whether it is set or clear in the group entry appears to be entirely meaningless. Useful or not, the flags as recorded in the group entry can be retrieved through the GetUrlCacheGroupAttribute function, in the dwGroupFlags member of the INTERNET_CACHE_GROUP_INFO structure.
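The distinction between the recorded flags and the bit that actually determines stickiness can be made concrete with a small sketch. The flag names are from the table above. The name STICKY_BIT and both function names are hypothetical, introduced only for this illustration.

```python
CACHEGROUP_FLAG_NONPURGEABLE = 0x01
CACHEGROUP_FLAG_FLUSHURL_ONDELETE = 0x02

# Hypothetical name for the group ID bit that makes a group sticky.
STICKY_BIT = 0x1000000000000000

KNOWN_FLAGS = {
    CACHEGROUP_FLAG_NONPURGEABLE: "CACHEGROUP_FLAG_NONPURGEABLE",
    CACHEGROUP_FLAG_FLUSHURL_ONDELETE: "CACHEGROUP_FLAG_FLUSHURL_ONDELETE",
}


def flag_names(flags):
    """Name the bits in the dword at offset 0x08 of an allocated entry."""
    names = [name for bit, name in KNOWN_FLAGS.items() if flags & bit]
    unknown = flags & ~sum(KNOWN_FLAGS)
    if unknown:
        names.append(hex(unknown))     # bits not expected to be set
    return names


def is_sticky(group_id):
    """A group is sticky because of its ID, not its recorded flags."""
    return bool(group_id & STICKY_BIT)
```

A forensic tool would thus do better to report stickiness from the group ID and to present the flags only as a record of how the group was created.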
The disk usage at offset 0x10 is maintained by WININET as the total size of local files for all URL entries that belong to the group. Its current value, converted to KB, can be retrieved through the GetUrlCacheGroupAttribute function, in the dwDiskUsage member of the INTERNET_CACHE_GROUP_INFO structure.
The type member at offset 0x0C is exactly as accessed through the dwGroupType member of the INTERNET_CACHE_GROUP_INFO structure given to the GetUrlCacheGroupAttribute and SetUrlCacheGroupAttribute functions. Neither function interprets this member in any way. Except for access through these functions, the type appears to be unused.
The remaining attributes that can be set for a group through the SetUrlCacheGroupAttribute function are, or can be, relatively substantial. Since they are anyway optional, it would be wasteful to provide for storing them in every group entry. If they ever are set for a group, they are held separately, in a GROUP_DATA_ENTRY structure:
Offset | Size | Description |
---|---|---|
0x00 | GROUPNAME_MAX_LENGTH bytes | group name |
0x78 | GROUP_OWNER_STORAGE_SIZE dwords | owner storage |
0x88 | dword | in allocated entry: zero; in free entry: file offset of next free GROUP_DATA_ENTRY, else zero |
Again, GROUP_DATA_ENTRY structures are not file-map entries but are instead prepared collectively in page-sized file-map entries. Each such page is a FILEMAP_ENTRY followed immediately by an array of as many GROUP_DATA_ENTRY structures as fit the page. No page of group data entries is allocated until either a group name or owner storage is set for some group. The dword at offset 0x1C in an allocated GROUP_ENTRY is the file offset of its associated group data entry. Group data entries that are not allocated to a group, i.e., the free entries, are kept in a chain, linked through the member at offset 0x88. The current head of the chain is found from offset 0x1C in the index entry on the first page of group entries. Group data entries are allocated from the head of the chain. When a group data entry is freed, all its bytes are cleared and it is then returned to the head of the chain of free entries.
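A minimal sketch of decoding a group data entry and of following the chain of free entries might look like this. The numeric values of GROUPNAME_MAX_LENGTH (0x78 bytes) and GROUP_OWNER_STORAGE_SIZE (4 dwords), and the total size of 0x8C bytes, are inferred here from the offsets in the table. The function names are hypothetical. The group name is returned as raw bytes up to the first null, since this sketch makes no assumption about its character encoding.

```python
import struct

GROUPNAME_MAX_LENGTH = 0x78        # bytes, inferred from the offsets
GROUP_OWNER_STORAGE_SIZE = 4       # dwords, inferred from the offsets
NEXT_FREE_OFFSET = 0x88

_GROUP_DATA_ENTRY = struct.Struct("<120s4II")
GROUP_DATA_ENTRY_SIZE = _GROUP_DATA_ENTRY.size    # 0x8C


def parse_group_data_entry(data, offset):
    """Return (name, owner_storage, next_free) for one entry."""
    fields = _GROUP_DATA_ENTRY.unpack_from(data, offset)
    name = fields[0].split(b"\0", 1)[0]    # raw bytes before first null
    owner_storage = fields[1:5]
    next_free = fields[5]                  # zero in an allocated entry
    return name, owner_storage, next_free


def iter_free_data_entries(data, head_offset):
    """Yield the file offset of each free GROUP_DATA_ENTRY in turn.

    head_offset is the dword at offset 0x1C in the index entry on the
    first page of group entries.
    """
    offset = head_offset
    seen = set()
    while offset:
        if offset in seen:
            raise ValueError("free chain forms a loop")
        seen.add(offset)
        yield offset
        (offset,) = struct.unpack_from("<I", data, offset + NEXT_FREE_OFFSET)
```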
The name and owner storage at offsets 0x00 and 0x78 are exactly as accessed through the szGroupName and dwOwnerStorage members of the INTERNET_CACHE_GROUP_INFO structure as given to the GetUrlCacheGroupAttribute and SetUrlCacheGroupAttribute functions. The only interpretation of either member by either function is that SetUrlCacheGroupAttribute checks that a proposed group name is not too large. Except for access through these functions, the group name and owner storage appear to be unused.
Importantly, granted that groups have any importance at all, a URL entry may be assigned to multiple groups. When this happens, the dword at offset 0x28 in the URL entry no longer shows the way directly to a single GROUP_ENTRY but to a list of them (and the change is marked by setting the 0x10 flag in the URL entry’s hash item). Each element of the list is a LIST_GROUP_ENTRY:
Offset | Size | Description |
---|---|---|
0x00 | dword | file offset of GROUP_ENTRY structure, else zero |
0x04 | dword | file offset of next LIST_GROUP_ENTRY, else zero |
These LIST_GROUP_ENTRY structures are prepared collectively in page-sized file-map entries. Each such page is a FILEMAP_ENTRY structure followed immediately by an array of as many LIST_GROUP_ENTRY structures as fit the page. Note that no page of list group entries is allocated until at least one URL entry is assigned to more than one group.
Each list group entry is intended to be always (for all practical purposes) in exactly one list, linked through the dword at offset 0x04. It can be in a list for a URL, in which case the dword at offset 0x28 in the URL entry gives the file offset of the first entry in the list. Otherwise, the list group entry should be in a list of free entries. This free list, once it exists, has a permanent head. The file offset of this head entry is maintained in the file header, as the CACHE_HEADER_DATA_ROOT_GROUPLIST_OFFSET index in the header data.
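How a reader of the file might resolve a URL entry's groups can be sketched as below. The function names and parameters are hypothetical. The caller is assumed to have already read the dword at offset 0x28 in the URL entry and to have determined from the 0x10 flag in the URL entry's hash item whether that dword leads to a list.

```python
import struct


def iter_list_group_entries(data, first_offset):
    """Yield the GROUP_ENTRY file offsets held in a LIST_GROUP_ENTRY list."""
    offset = first_offset
    seen = set()
    while offset:
        if offset in seen:
            raise ValueError("list forms a loop")
        seen.add(offset)
        # dword at 0x00: GROUP_ENTRY offset; dword at 0x04: next in list
        group_offset, next_offset = struct.unpack_from("<II", data, offset)
        if group_offset:
            yield group_offset
        offset = next_offset


def group_entry_offsets(data, url_group_dword, is_list):
    """Return the GROUP_ENTRY offsets for one URL entry.

    url_group_dword is the dword at offset 0x28 in the URL entry.
    is_list says whether the 0x10 flag is set in the hash item.
    """
    if is_list:
        return list(iter_list_group_entries(data, url_group_dword))
    return [url_group_dword] if url_group_dword else []
```

The same iteration would serve for walking the free list from its head, though the offsets it yields would then have no meaning as group entries.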
Except where otherwise noted, this article is specific to the 32-bit WININET.DLL version 7.0.6000.16386 from the original Windows Vista.