Geoff Chappell - Software Analyst
SKETCH OF HOW RESEARCH MIGHT CONTINUE AND RESULTS BE PRESENTED
Character Types in JScript
Each script is received by JSCRIPT.DLL as an array of Unicode characters.
Many of the ASCII-compatible characters, i.e., those with values < 0x0080, each
have a highly specific meaning in JScript. Some < 0x0080 and all ≥ 0x0080 are of
interest to JScript only for whether they are line terminators or white space or
can be used in naming things.
As far as concerns JScript as a language, characters ≥ 0x0080 are classified
by reference to categories in the Unicode Standard. The JSCRIPT implementation
mostly classifies according to the CT_CTYPE1 flags
produced for each character by the Windows API function
GetStringType. If this function reports the character as defined but with
no other property, then JSCRIPT resorts to its own tables.
Line Terminators
JSCRIPT recognises four line-terminating characters:
- 0x000A
- 0x000D
- 0x2028
- 0x2029
White Space
The white-space characters are:
- 0x0009
- 0x000B
- 0x000C
- 0x0020
- any character ≥ 0x0080, other than 0x2028 and 0x2029, for which
C1_ALPHA is clear and either
C1_BLANK or C1_SPACE
is set
- 0xFEFF, if C1_DEFINED is set but all other
flags are clear
JSCRIPT applies a slightly different notion of white space when scanning for
the termination of an HTML comment, but this is not strictly a matter of JScript
interpretation, and details are left for elsewhere.
Letters
JScript leaves many characters for naming things. Any character in any of
several categories of letter may appear anywhere in an identifier, as may the
dollar sign (0x0024) and underline (0x005F). To JSCRIPT, the letters are:
- 0x0041 to 0x005A inclusive
- 0x0061 to 0x007A inclusive
- any character ≥ 0x0080, other than 0x2028 and 0x2029, for which
C1_ALPHA is set
and any character listed below
- 0x0192,
- 0x249C to 0x24E9,
- 0x3005, 0x3007, 0x309B to 0x309E, 0x30FC to 0x30FE,
- 0xFB29,
- 0xFE80 to 0xFEFC,
- 0xFF70, 0xFF9E, 0xFF9F and 0xFFFD
for which C1_DEFINED is set but all other flags
are clear.
Digits and Marks
Some more characters are also left for naming things except that they cannot
begin an identifier:
- 0x0030 to 0x0039
- any character ≥ 0x0080, other than 0x2028 and 0x2029, for which
C1_ALPHA is clear and either
C1_DIGIT or C1_PUNCT
is set
and any of the numerous characters listed below
- 0x02B0 to 0x02DE, 0x02E0 to 0x02E9,
- 0x0300 to 0x0345, 0x0360, 0x0361, 0x0374, 0x0375, 0x037A, 0x0384, 0x0385,
- 0x0482 to 0x0486,
- 0x0559, 0x0591 to 0x05A1, 0x05A3 to 0x05B9, 0x05BB to 0x05BD, 0x05BF,
0x05C1, 0x05C2, 0x05C4,
- 0x0640, 0x064B to 0x0652, 0x0670, 0x06D6 to 0x06ED,
- 0x0901 to 0x0903, 0x093C to 0x094D, 0x0950 to 0x0954, 0x0962, 0x0963,
0x0981 to 0x0983, 0x09BC, 0x09BE to 0x09C4, 0x09C7, 0x09C8, 0x09CB to 0x09CD,
0x09D7, 0x09E2, 0x09E3, 0x09F2 to 0x09FA,
- 0x0A02, 0x0A3C, 0x0A3E to 0x0A42, 0x0A47, 0x0A48, 0x0A4B to 0x0A4D, 0x0A70
to 0x0A74, 0x0A81 to 0x0A83, 0x0ABC to 0x0AC5, 0x0AC7 to 0x0AC9, 0x0ACB to
0x0ACD, 0x0AD0,
- 0x0B01 to 0x0B03, 0x0B3C to 0x0B43, 0x0B47, 0x0B48, 0x0B4B to 0x0B4D,
0x0B56, 0x0B57, 0x0B70, 0x0B82, 0x0B83, 0x0BBE to 0x0BC2, 0x0BC6 to 0x0BC8,
0x0BCA to 0x0BCD, 0x0BD7, 0x0BF0 to 0x0BF2,
- 0x0C01 to 0x0C03, 0x0C3E to 0x0C44, 0x0C46 to 0x0C48, 0x0C4A to 0x0C4D,
0x0C55, 0x0C56, 0x0C82, 0x0C83, 0x0CBE to 0x0CC4, 0x0CC6 to 0x0CC8, 0x0CCA to
0x0CCD, 0x0CD5, 0x0CD6,
- 0x0D02, 0x0D03, 0x0D3E to 0x0D43, 0x0D46 to 0x0D48, 0x0D4A to 0x0D4D,
0x0D57,
- 0x0E3F, 0x0EAF to 0x0EB9, 0x0EBB to 0x0EBD, 0x0EC0 to 0x0EC4, 0x0EC6,
0x0EC8 to 0x0ECD,
- 0x0F00 to 0x0F03, 0x0F13 to 0x0F1F, 0x0F2A to 0x0F39, 0x0F3E, 0x0F3F,
0x0F71 to 0x0F84, 0x0F86 to 0x0F8B,
- 0x1FBD to 0x1FC1, 0x1FCD to 0x1FCF, 0x1FDD to 0x1FDF, 0x1FED to 0x1FEF,
0x1FFD, 0x1FFE,
- 0x2000 to 0x200F, 0x2028 to 0x202E, 0x2044, 0x206A to 0x2070, 0x2074 to
0x207C, 0x207F to 0x208C, 0x20A0 to 0x20AC, 0x20D0 to 0x20E1,
- 0x2100 to 0x2138, 0x2153 to 0x2182, 0x2190 to 0x21EA,
- 0x2200 to 0x22F1,
- 0x2300, 0x2302 to 0x2328, 0x232B to 0x237A,
- 0x2400 to 0x2424, 0x2440 to 0x244A, 0x2460 to 0x249B, 0x24EA,
- 0x2500 to 0x2595, 0x25A0 to 0x25EF,
- 0x2600 to 0x2613, 0x261A to 0x266F,
- 0x2701 to 0x2704, 0x2706 to 0x2709, 0x270C to 0x2727, 0x2729 to 0x274B,
0x274D, 0x274F to 0x2752, 0x2756, 0x2758 to 0x275E, 0x2761 to 0x2767, 0x2776
to 0x2794, 0x2798 to 0x27AF, 0x27B1 to 0x27BE,
- 0x3004, 0x3006, 0x3012, 0x3013, 0x3020 to 0x302F, 0x3031 to 0x3037,
0x303F, 0x3099, 0x309A,
- 0x3190 to 0x319F,
- 0x3200 to 0x321C, 0x3220 to 0x3243, 0x3260 to 0x327B, 0x327F to 0x32B0,
0x32C0 to 0x32CB, 0x32D0 to 0x32FE,
- 0x3300 to 0x3376, 0x337B to 0x33DD, 0x33E0 to 0x33FE,
- 0xFB1E,
- 0xFE20 to 0xFE23, 0xFE62, 0xFE64 to 0xFE66, 0xFE69, 0xFE70 to 0xFE72,
0xFE74, 0xFE76 to 0xFE7F,
- 0xFF04, 0xFF0B, 0xFF1C to 0xFF1E, 0xFF3E, 0xFF40, 0xFF5C, 0xFF5E, 0xFFA0,
0xFFE0 to 0xFFE6 and 0xFFE8 to 0xFFEE
for which C1_DEFINED is set but all other flags
are clear.
Digression on Unicode Support
Script writers who think to use characters ≥ 0x0080 should appreciate that
JSCRIPT’s interpretation of these characters is suspect. The algorithms
presented above are unchanged from JSCRIPT version 5.6.0.6626 (in the original
Windows XP) through to version 5.7.0.6000 (in Windows Vista), but the
classification of characters can vary because of the information that JSCRIPT
gets from Windows.
It is ordinarily desirable, of course, that Windows software should use the
Windows API to get system-wide information about the properties of Unicode
characters, rather than depend on its own understanding. In practice, however,
those flags reported by GetStringType vary from
one Windows version to another. Indeed, since they come ultimately from tables
in a file (LOCALE.NLS in the System directory), and it must be supposed that the
file is updatable, it may be that the flags vary even within a Windows version.
So, if you really must use some such character as 0x037B (Greek Small
Reversed Lunate Sigma Symbol) when naming some variable, you take your chance on
behalf of your script’s users. Windows Vista recognises this character as a
lower-case letter (with the C1_DEFINED,
C1_ALPHA
and C1_LOWER flags), but Windows XP doesn’t
recognise the character at all.