UrlCrackW

This function breaks a URL into components: a scheme, user name, password, host name, port number, URL path and extra information.

Declaration

BOOL
UrlCrackW (
    LPWSTR lpszUrl,
    DWORD dwUrlLength,
    DWORD dwFlags,
    SHURL_COMPONENTSW *lpUrlComponents);

Since the SHURL_COMPONENTSW structure appears to be used only for this function, its format is also given here:

typedef struct {
    DWORD dwStructSize;
    LPWSTR lpszScheme;
    DWORD dwSchemeLength;
    SHINTERNET_SCHEME nScheme;          // enum, see below
    LPWSTR lpszHostName;
    DWORD dwHostNameLength;
    SHINTERNET_PORT nPort;              // WORD
    LPWSTR lpszUserName;
    DWORD dwUserNameLength;
    LPWSTR lpszPassword;
    DWORD dwPasswordLength;
    LPWSTR lpszUrlPath;
    DWORD dwUrlPathLength;
    LPWSTR lpszExtraInfo;
    DWORD dwExtraInfoLength;
} SHURL_COMPONENTSW;

The same applies to the SHINTERNET_SCHEME enumeration:

typedef enum {
    SHINTERNET_SCHEME_UNKNOWN           = -1,
    SHINTERNET_SCHEME_FTP               = 1,
    SHINTERNET_SCHEME_GOPHER,           // 2
    SHINTERNET_SCHEME_HTTP,             // 3
    SHINTERNET_SCHEME_HTTPS,            // 4
    SHINTERNET_SCHEME_FILE,             // 5
    SHINTERNET_SCHEME_NEWS,             // 6
    SHINTERNET_SCHEME_MAILTO,           // 7
    SHINTERNET_SCHEME_SOCKS,            // 8
    SHINTERNET_SCHEME_JAVASCRIPT,       // 9
    SHINTERNET_SCHEME_VBSCRIPT,         // 10
    SHINTERNET_SCHEME_RES,              // 11
} SHINTERNET_SCHEME;

Parameters

The lpszUrl argument is the address of the URL that is to be cracked. Beware that this URL may be corrupted by the function.

The dwUrlLength argument is either a count of Unicode characters in the URL (which then needs no terminating null) or zero, to denote that the URL is a null-terminated Unicode string.
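As a rough illustration of how the two interpretations of dwUrlLength combine, the following sketch computes how many characters would count as the URL. The helper name effective_url_length is hypothetical and is not part of SHLWAPI. The sketch assumes, in line with the description of URL syntax later in this article, that a non-zero count acts as a maximum and that a null character still ends the URL early.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper, not part of SHLWAPI: how many characters of the
   input would be treated as the URL. */
static size_t effective_url_length(const wchar_t *url, uint32_t dwUrlLength)
{
    assert(url != NULL);
    if (dwUrlLength == 0)
        return wcslen(url);     /* zero means a null-terminated string */
    size_t n = 0;
    while (n < dwUrlLength && url[n] != L'\0')
        n++;                    /* the count is a maximum; a null still ends the URL */
    return n;
}
```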

The dwFlags argument provides bit flags that vary the treatment of escape sequences. Valid flags are ICU_DECODE (0x10000000) and ICU_ESCAPE (0x80000000).

The lpUrlComponents argument is the address of a SHURL_COMPONENTSW structure that describes what components are wanted and where they are to be returned, and which receives information about them. The dwStructSize member should be set in advance to the size of the structure. Among the other members are pointer-and-length pairs, one for each component other than the port number. These too must be set in advance, as described in the next paragraphs. No initialisation is required for the nScheme and nPort members.

Pointers and Lengths

In general, a pointer-and-length pair in the SHURL_COMPONENTSW structure describes a buffer for receipt of the corresponding component. The length is counted in Unicode characters. The buffer should allow sufficient space for the component as a null-terminated Unicode string. A length of zero ensures that the buffer will be considered too small for receipt of the component.

The pointer may be NULL and the length zero to indicate that there is no buffer and that the corresponding component is not wanted.

The code provides explicitly for the pointer to be NULL with the length non-zero. However, the ensuing behaviour (as detailed below) is bizarre and this case is dismissed here as ill-defined.
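The preparation described above might look something like the following sketch, which asks for the host name only. So that it compiles without the Windows headers, it declares stand-in typedefs for DWORD, LPWSTR and the two SHLWAPI types; these stand-ins and the helper name want_host_only are assumptions made for illustration. On Windows, the real types would come from the usual headers and the structure would then be passed to UrlCrackW.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Stand-ins for the Windows types, so that the sketch is portable. */
typedef uint32_t DWORD;
typedef wchar_t *LPWSTR;
typedef int      SHINTERNET_SCHEME;
typedef uint16_t SHINTERNET_PORT;

typedef struct {
    DWORD dwStructSize;
    LPWSTR lpszScheme;     DWORD dwSchemeLength;
    SHINTERNET_SCHEME nScheme;
    LPWSTR lpszHostName;   DWORD dwHostNameLength;
    SHINTERNET_PORT nPort;
    LPWSTR lpszUserName;   DWORD dwUserNameLength;
    LPWSTR lpszPassword;   DWORD dwPasswordLength;
    LPWSTR lpszUrlPath;    DWORD dwUrlPathLength;
    LPWSTR lpszExtraInfo;  DWORD dwExtraInfoLength;
} SHURL_COMPONENTSW;

/* Hypothetical helper: prepare the structure so that only the host name
   is wanted, to be copied into hostBuf (capacity hostChars characters,
   which must allow for the terminating null). */
static void want_host_only(SHURL_COMPONENTSW *uc, wchar_t *hostBuf, DWORD hostChars)
{
    static const SHURL_COMPONENTSW zero = {0};
    assert(uc != NULL && hostBuf != NULL);
    *uc = zero;                             /* NULL and zero: component not wanted */
    uc->dwStructSize = (DWORD)sizeof *uc;   /* must be set in advance */
    uc->lpszHostName = hostBuf;
    uc->dwHostNameLength = hostChars;
    /* nScheme and nPort need no initialisation. */
}
```

Zeroing the whole structure first avoids the ill-defined case of a NULL pointer with a non-zero length, which could otherwise arise from uninitialised members.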

Return Value

The function returns TRUE for success and FALSE for failure. If the function fails, an error code is set for retrieval through GetLastError.

On success, the components of the given URL are returned through members of the SHURL_COMPONENTSW structure. For each pointer-and-length pair that described a buffer on input, the buffer now contains a copy of the corresponding component as a null-terminated Unicode string, and the length member is updated to the size of this component, in Unicode characters, but not counting the terminating null.

Failure with ERROR_INSUFFICIENT_BUFFER as the error code indicates that a successful return would have been possible except that at least one of the buffers described by a pointer-and-length pair was too small to receive the corresponding component (and its terminating null). The length member for each such component is updated to show what length would have sufficed. Other components are returned as if for success.

Behaviour

This function has very nearly the same prototype as the long-standing and long-documented WININET function InternetCrackUrl. The implementation is very nearly identical except that InternetCrackUrl has ANSI as the native character set (with a Unicode form converting to and from ANSI) while UrlCrackW has Unicode as native (with no ANSI form). However, there are differences, such that UrlCrackW cannot sensibly be deemed semi-documented by reference to the documentation of InternetCrackUrl, not that the latter is anyway accurate or comprehensive.

Parameter Validation

The function requires the following of its parameters, else it fails, with ERROR_INVALID_PARAMETER as the error code:

URL Syntax

The URL that is to be cracked consists of the non-null Unicode characters at the address lpszUrl, up to a maximum of dwUrlLength characters if dwUrlLength is non-zero. This URL is parsed as a sequence of components and separators, according to the following sketch but with numerous special cases:

Each component may be explicitly empty, e.g., when there are no characters between the relevant separators. Each component may be implicitly empty, because the URL fits some case that simply doesn’t provide for that component. Either way, an empty component is treated as having been found but with zero as its length.

Scheme

Characters up to but not including the first colon name the scheme. Eleven schemes have specific support:

file, ftp, gopher, http, https, javascript, mailto, news, res, socks and vbscript

Recognition is insensitive to case. Each has a corresponding value in the SHINTERNET_SCHEME enumeration, as returned through the nScheme member. For other schemes, this member receives the value SHINTERNET_SCHEME_UNKNOWN (-1).
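Case-insensitive recognition of the eleven schemes can be pictured as a table lookup. The sketch below is a model only, not the function's actual code: the helper name classify_scheme is hypothetical, and the numeric values are those of the SHINTERNET_SCHEME enumeration given earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Scheme names with their SHINTERNET_SCHEME values. */
static const struct { const wchar_t *name; int value; } known_schemes[] = {
    { L"ftp", 1 },  { L"gopher", 2 }, { L"http", 3 },   { L"https", 4 },
    { L"file", 5 }, { L"news", 6 },   { L"mailto", 7 }, { L"socks", 8 },
    { L"javascript", 9 }, { L"vbscript", 10 }, { L"res", 11 },
};

/* Hypothetical helper: classify the len characters at s (everything
   before the first colon), ignoring case. Returns -1 for an unknown
   scheme, including an empty one. */
static int classify_scheme(const wchar_t *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < sizeof known_schemes / sizeof known_schemes[0]; i++) {
        const wchar_t *name = known_schemes[i].name;
        if (wcslen(name) != len)
            continue;
        size_t j = 0;
        while (j < len && towlower((wint_t)s[j]) == (wint_t)name[j])
            j++;
        if (j == len)
            return known_schemes[i].value;
    }
    return -1;
}
```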

The function fails, with ERROR_INTERNET_UNRECOGNIZED_SCHEME as the error code, in any of the following conditions:

(It may help to enumerate some special cases, if only to confirm that they are not omitted. For a URL whose first character is a colon, the scheme is empty and counts as unknown, not as an error. For the news scheme and for all unknown schemes, two slashes after the colon are permitted but not required.)

If the scheme is followed by a colon, two slashes and at least one more character, then the colon and two slashes are discarded as separators, and the next component begins after the two slashes. Otherwise, the next component begins immediately after the colon.
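The rule for discarding separators after the scheme is simple enough to state in a few lines of C. This is a sketch under the assumption that the URL is null-terminated (the real function also has dwUrlLength to consider), and the helper name skip_scheme_separator is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: given the address of the colon that ends the
   scheme, return the address at which the next component begins. */
static const wchar_t *skip_scheme_separator(const wchar_t *colon)
{
    assert(colon != NULL && *colon == L':');
    /* Two slashes count as separators only if something follows them. */
    if (colon[1] == L'/' && colon[2] == L'/' && colon[3] != L'\0')
        return colon + 3;
    return colon + 1;
}
```

Note the consequence for a URL such as news:// with nothing after the slashes: the slashes are not discarded and instead belong to whatever component comes next.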

User Name and Password

The file and res schemes do not provide for a user name or password. Neither does the news scheme nor any unknown scheme unless the colon that follows it is given with two slashes and at least one more character.

In the general case however, a user name and password are indicated if an @ sign occurs before a slash or before the URL ends. Among the characters up to but not including the @ sign, the user name extends up to but not including the first colon. If a colon is present, the password consists of whatever characters follow the colon. A second colon causes the function to fail, with ERROR_INTERNET_INVALID_URL as the error code.
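The identification of user name and password can be modelled as below. This is a sketch of the rules as just described, not a transcription of the function's code. It assumes a null-terminated URL, it does no decoding of escape sequences, and the UserInfo type and split_user_info name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

typedef struct {
    const wchar_t *user; size_t userLen;
    const wchar_t *pass; size_t passLen;
    const wchar_t *next;    /* first character after the @ sign */
} UserInfo;

/* Hypothetical helper. s is where the component would begin. Returns 1
   if a user name and password are indicated, 0 if not, and -1 for a
   second colon (for which the real function fails with
   ERROR_INTERNET_INVALID_URL). */
static int split_user_info(const wchar_t *s, UserInfo *out)
{
    assert(s != NULL && out != NULL);
    size_t n = wcscspn(s, L"/@");   /* first slash or @ sign, else the end */
    if (s[n] != L'@')
        return 0;
    const wchar_t *colon = wmemchr(s, L':', n);
    size_t userLen = colon ? (size_t)(colon - s) : n;
    const wchar_t *pass = colon ? colon + 1 : s + n;
    size_t passLen = (size_t)(s + n - pass);
    if (wmemchr(pass, L':', passLen) != NULL)
        return -1;
    out->user = s;     out->userLen = userLen;
    out->pass = pass;  out->passLen = passLen;
    out->next = s + n + 1;
    return 1;
}
```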

Escape sequences in the user name and password are decoded, irrespective of dwFlags. An invalid escape sequence causes the function to fail, with ERROR_INTERNET_INVALID_URL as the error code. Note that the components are identified first and then decoded, so that escape sequences allow for inclusion of slashes, colons and @ signs which would not otherwise be possible for these components. Note also that this decoding corrupts the input URL at lpszUrl unless either ICU_DECODE or ICU_ESCAPE is set in dwFlags, or the user name and password happen to contain no percent signs.
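Why decoding should corrupt the input is easily seen from a sketch of decoding in place. The code below is an assumption about the general technique, not the function's actual algorithm: in particular, it takes an escape sequence to be a percent sign followed by exactly two hexadecimal digits, which this article does not confirm as the function's own test of validity. The names hex_value and decode_in_place are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

static int hex_value(wchar_t c)
{
    if (c >= L'0' && c <= L'9') return (int)(c - L'0');
    if (c >= L'a' && c <= L'f') return (int)(c - L'a') + 10;
    if (c >= L'A' && c <= L'F') return (int)(c - L'A') + 10;
    return -1;
}

/* Hypothetical helper: decode %XX sequences among the first len
   characters at s, overwriting them. Returns the decoded length, or -1
   for an invalid escape sequence. Characters beyond the decoded length
   are left as stale residue. */
static long decode_in_place(wchar_t *s, size_t len)
{
    assert(s != NULL);
    size_t in = 0, out = 0;
    while (in < len) {
        if (s[in] == L'%') {
            int hi, lo;
            if (len - in < 3
                || (hi = hex_value(s[in + 1])) < 0
                || (lo = hex_value(s[in + 2])) < 0)
                return -1;
            s[out++] = (wchar_t)(hi * 16 + lo);
            in += 3;
        } else {
            s[out++] = s[in++];
        }
    }
    return (long)out;
}
```

Applied to the five characters of a%40b at the start of a%40b@host, this leaves a@b0b@host in the buffer: the component is decoded, but the string as a whole is no longer the URL that the caller supplied.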

The @ sign is discarded, and the next component begins immediately after.

Host Name and Port Number

The file scheme does not provide for a host name or port number. Neither does the news scheme nor any unknown scheme unless the colon that follows it is given with two slashes and at least one more character.

The res scheme does not provide for a port number. All characters up to but not including a slash (else to the URL’s end) form the host name.

Otherwise, a host name and port number are drawn from the characters up to but not including a slash, or else up to the URL’s end. Within this range, the host name extends up to but not including the first colon. If a colon is present, the port number consists of whatever characters follow the colon. A second colon causes the function to fail, with ERROR_INTERNET_INVALID_URL as the error code.

Escape sequences in the host name and port number are decoded, irrespective of dwFlags. An invalid escape sequence causes the function to fail, with ERROR_INTERNET_INVALID_URL as the error code. Note that the components are identified first and then decoded, so that escape sequences allow for inclusion of slashes and colons which would not otherwise be possible for these components. Note also that this decoding corrupts the input URL at lpszUrl unless either ICU_DECODE or ICU_ESCAPE is set in dwFlags, or the host name and port number happen to contain no percent signs.

If the port number is not empty, it must be a sequence of decimal digits, evaluating to a maximum of 65535, else the function fails, with ERROR_INTERNET_INVALID_URL as the error code. Note that the port number is evaluated only after decoding escape sequences, so that although an escaped decimal digit may be unlikely in practice, it causes no error.
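Evaluation of the port number, as described, amounts to the following. The helper name evaluate_port is hypothetical, and the sketch assumes that any escape sequences have already been decoded. An empty port number evaluates to zero here; defaulting by scheme is a separate matter.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: evaluate the len characters at s as a decimal
   port number. Returns 0 to 65535, or -1 where the real function would
   fail with ERROR_INTERNET_INVALID_URL. */
static long evaluate_port(const wchar_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    unsigned long value = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < L'0' || s[i] > L'9')
            return -1;                      /* not a decimal digit */
        value = value * 10 + (unsigned long)(s[i] - L'0');
        if (value > 65535)
            return -1;                      /* too large for 16 bits */
    }
    return (long)value;
}
```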

If the host name and port number are followed by a slash, then the slash is not discarded but is instead the first character of the next component. (Without a slash, there is no next component.)

Path

Whatever remains becomes the path, up to but not including the first question mark or hash sign. In general, the path leads with the slash that separates it from the host name or port number. The only exceptions are the cases that do not provide for a host name or port number: thus, the file scheme always, and the news and all unknown schemes unless followed by a colon, two slashes and at least one more character.

Escape sequences in the path are decoded if ICU_ESCAPE is set in dwFlags. An invalid escape sequence causes the function to fail, with ERROR_INTERNET_INVALID_URL as the error code. Note that this decoding is done before separating the path from extra information. Thus, when parsing with the ICU_ESCAPE flag set, escape sequences do not let the path contain a question mark or hash sign.

If a URL has the file scheme, the path begins with the character after the colon and double slash (as noted above) but is further transformed by the PathCreateFromUrl function, so that the path as returned is the so-called MS-DOS path. Details of this transformation fall outside the scope of this article: refer to Microsoft’s documentation of PathCreateFromUrl.

Extra Information

Whatever remains is the so-called extra information, necessarily beginning with either a question mark or hash sign.

Escape sequences in the extra information are decoded if ICU_ESCAPE is set in dwFlags. An invalid escape sequence causes the function to fail, with ERROR_INTERNET_INVALID_URL as the error code. Note that this decoding is done before the extra information is separated from the path. Thus, when parsing with the ICU_ESCAPE flag set, the extra information begins with the first question mark or hash sign after the start of the path, even if escaped.
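The division between path and extra information reduces to finding the first question mark or hash sign. A minimal sketch, assuming a null-terminated remainder and using the hypothetical name path_length:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: rest is whatever remains of the URL after the
   host name and port number. The return value is the length of the
   path; the extra information, if any, begins at rest + path_length(rest). */
static size_t path_length(const wchar_t *rest)
{
    assert(rest != NULL);
    return wcscspn(rest, L"?#");
}
```

Because the split is made on the text as it stands after any decoding, an escaped question mark such as %3F ends the path when ICU_ESCAPE is set, as noted above.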

Component Return

For each component other than the port number, the SHURL_COMPONENTSW structure whose address is given by the lpUrlComponents argument provides a pointer and a length (counted in Unicode characters). Several modes of return operate depending on whether the input pointer is or is not NULL and the input length is or is not zero.

No Interest

If the pointer is NULL and the length is zero, the caller has indicated no interest in the corresponding component. No information is returned about this component.

Buffer Given

A non-NULL pointer gives the address of a buffer into which the function is to copy the corresponding component as a Unicode string with a terminating null. The length on input is given as the size of this buffer, in Unicode characters.

If the buffer is large enough for the component, as found, plus a terminating null, then the component is copied to the buffer as a null-terminated Unicode string. The length member is set to the number of characters in the component, not counting the terminating null. If ICU_DECODE is set in dwFlags, then escape sequences in this returned component are decoded. Note that the length member is set first and may then exceed the number of characters in the component as actually returned.

If the buffer is too small for the component, as found, plus a terminating null, then nothing is copied to the buffer. The length member is set to the number of characters in the component, plus one for the terminating null. The occurrence of this condition for any component causes the function to fail, with ERROR_INSUFFICIENT_BUFFER as the error code, but only after processing the return of all components.

It is permitted that the length on input be given as zero. The effect is that the buffer must be too small, so that the length member must be set to the length that would suffice.
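The buffer-given mode can be modelled as follows. This is a sketch of the observable behaviour only, without the ICU_DECODE complication, and the helper name return_component is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: return a component of compLen characters through
   a caller-supplied buffer whose capacity, in characters, is *len on
   input. Returns 1 if copied, or 0 if the buffer is too small (for which
   the real function eventually fails with ERROR_INSUFFICIENT_BUFFER). */
static int return_component(const wchar_t *comp, size_t compLen,
                            wchar_t *buf, uint32_t *len)
{
    assert(comp != NULL && buf != NULL && len != NULL);
    if (*len < compLen + 1) {
        *len = (uint32_t)(compLen + 1);     /* what would have sufficed */
        return 0;                           /* nothing is copied */
    }
    wmemcpy(buf, comp, compLen);
    buf[compLen] = L'\0';
    *len = (uint32_t)compLen;               /* terminating null not counted */
    return 1;
}
```

The asymmetry is worth noticing: on success the length excludes the terminating null, but on failure it includes the null, so that a caller can allocate exactly the returned number of characters and try again.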

Pointer Wanted

If the pointer is given as NULL and the length as non-zero, then the function updates both the pointer and length, apparently intending to describe the component as found in the input URL.

Note however that if either the ICU_DECODE or ICU_ESCAPE bit is set in dwFlags, then the returned pointer is not meaningful. Instead of pointing into the input URL, it points into a temporary copy that the function made of the URL and which is formally invalid by the time the function returns.

Port Number

The port number is returned in the nPort member, as a 16-bit numeric evaluation. An empty port number is evaluated as zero, except that if the URL has the ftp, gopher, http or https scheme, an empty port number is defaulted to 21, 70, 80 or 443 respectively.
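The defaulting of an empty port number can be summarised in a small table. In the sketch below, the helper name default_port_for is hypothetical, and the scheme is given by its numeric value from the SHINTERNET_SCHEME enumeration.

```c
#include <assert.h>

/* Hypothetical helper: the value returned in nPort when the URL has an
   empty port number. nScheme is -1 for an unknown scheme, else 1 to 11. */
static unsigned short default_port_for(int nScheme)
{
    assert(nScheme == -1 || (nScheme >= 1 && nScheme <= 11));
    switch (nScheme) {
    case 1:  return 21;     /* ftp */
    case 2:  return 70;     /* gopher */
    case 3:  return 80;     /* http */
    case 4:  return 443;    /* https */
    default: return 0;      /* no default for any other scheme */
    }
}
```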

Availability

The UrlCrackW function is exported from SHLWAPI.DLL as ordinal 480 in version 5.50 and higher.

Though this function dates from as long ago as 2000, it was still not documented by Microsoft in the MSDN Library at least as late as the CD edition dated January 2004.

Most symbolic names in this article are inventions, pending knowledge of Microsoft’s nomenclature. They are however modelled very closely on documentation of the WININET function InternetCrackUrl. That SHLWAPI renames WININET’s INTERNET_SCHEME to SHINTERNET_SCHEME is known for certain from Microsoft’s symbol file for SHLWAPI, and it is surmised here that similar renaming applies throughout.

Use By Microsoft

A known use of this function by Microsoft is for Internet Explorer, specifically for MSHTML.DLL to support the location scripting object and the IHTMLLocation interface.