Geoff Chappell - Software Analyst
SKETCH OF HOW RESEARCH MIGHT CONTINUE AND RESULTS BE PRESENTED - PREVIEW ONLY
The C++ language recognises some three-character escape sequences, known as trigraphs, each introduced by two question marks.
Trigraph | Translation |
??! | | |
??' | ^ |
??( | [ |
??) | ] |
??- | ~ |
??/ | \ |
??< | { |
??= | # |
??> | } |
When a trigraph is recognised in the input stream, the leading question marks are discarded and the last character of the trigraph is reinterpreted as if the input stream had instead provided the character that the trigraph translates to. For example,
??=define RTL_NUMBER_OF(a) ( ??/
    sizeof (a) / sizeof ((a) ??(0??)) ??/
)
translates to
#define RTL_NUMBER_OF(a) ( \
    sizeof (a) / sizeof ((a) [0]) \
)
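The rule is mechanical enough to sketch in code. What follows is only an illustrative model of the translation, not anything taken from the compiler: the names trigraph_replacement and translate_trigraphs are invented here. The test strings fed to it would themselves need the \? escape in source code, so that they are not translated before the function ever sees them.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: given the third character of a candidate trigraph,
// returns the character the trigraph translates to, or '\0' if there is none.
char trigraph_replacement(char third)
{
    switch (third) {
        case '!':  return '|';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '-':  return '~';
        case '/':  return '\\';
        case '<':  return '{';
        case '=':  return '#';
        case '>':  return '}';
        default:   return '\0';
    }
}

// Hypothetical model of trigraph translation over a whole input string.
// Three consecutive characters are replaced by one; everything else is copied.
std::string translate_trigraphs(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        if (i + 2 < in.size() && in[i] == '?' && in[i + 1] == '?') {
            char r = trigraph_replacement(in[i + 2]);
            if (r != '\0') {
                out += r;
                i += 3;         // consume all three characters
                continue;
            }
        }
        out += in[i++];         // not a trigraph: copy one character
    }
    return out;
}
```

Note that a run of three question marks followed by a hyphen yields one question mark and a tilde under this model, since only the last two question marks belong to the trigraph.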
Trigraphs have the highest translation precedence. For a trigraph to be recognised, the three characters really must be consecutive in the input stream. Even the intrusion of a line splice, as allowed for the << token in
int x = 1 <\
< 3;
stops the recognition of trigraphs.
The other side to this high precedence is that the characters for all other programming elements are read as if trigraphs are already translated. As Microsoft says in the product documentation, “translation of trigraphs takes place in the first translation phase, before the recognition of escape characters in string literals and character constants.”
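The ordering can be modelled as two passes, with trigraph replacement strictly before line splicing. This is a sketch of the "as if" behaviour only, under the assumption that the two phases are the whole story; the function names are hypothetical and no claim is made that any compiler is built this way.

```cpp
#include <cstddef>
#include <string>

// Hypothetical phase 1: replace each trigraph with its single character.
std::string replace_trigraphs(const std::string &in)
{
    static const std::string third = "!'()-/<=>";
    static const std::string repl  = "|^[]~\\{#}";
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i + 2 < in.size() && in[i] == '?' && in[i + 1] == '?') {
            std::size_t k = third.find(in[i + 2]);
            if (k != std::string::npos) {
                out += repl[k];
                i += 2;         // skip the rest of the trigraph
                continue;
            }
        }
        out += in[i];
    }
    return out;
}

// Hypothetical phase 2: delete each backslash that is immediately
// followed by a new-line, together with that new-line.
std::string splice_lines(const std::string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == '\n') {
            ++i;                // skip the new-line too
            continue;
        }
        out += in[i];
    }
    return out;
}

// The two phases in their required order.
std::string phases_1_and_2(const std::string &in)
{
    return splice_lines(replace_trigraphs(in));
}
```

Under this model, a splice between the two less-than signs still leaves them adjacent for later tokenisation, but a splice between the two question marks of a would-be trigraph comes too late: by the time the lines are joined, the chance to translate has passed. Conversely, a trigraph for the backslash placed at the end of a line becomes a backslash in phase 1 and so does act as a splice in phase 2.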
Support for trigraphs leads to two types of trouble. The type that seems to have concerned Microsoft for its documentation (and a handful of articles in the Knowledge Base) is in some sense a false positive, namely that a trigraph is detected where the programmer (perhaps in ignorance) had not intended one. As suggested by the documentation, a typical case would have consecutive question marks in a string constant, whether because the programmer goes overboard with punctuation, as in Microsoft’s example
printf ("What??!\n");
or (perhaps less plausibly) because question marks are used as single-character wildcards in filenames, as in
FindFirstFile ("???-schedule.txt", &data);
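For the false positive, the usual defences are the \? escape sequence and the splitting of a string literal so that the question marks are no longer adjacent in the source. A brief sketch follows; the variable names are invented for illustration, and the strings are the two examples above rewritten so that no trigraph can form whether or not the compiler has trigraph support enabled.

```cpp
#include <string>

// The \? escape yields an ordinary question mark but keeps two question
// marks from being adjacent in the source text.
const std::string exclaim_escaped = "What?\?!\n";

// Adjacent string literals are concatenated in a later translation phase
// than trigraph replacement, so splitting the literal works too.
const std::string exclaim_split = "What?" "?!\n";

// The wildcard pattern, with each question mark after the first escaped.
const std::string pattern = "?\?\?-schedule.txt";
```

Either form leaves the program with the characters the programmer meant: two question marks and an exclamation mark in the first case, three question marks and a hyphen in the second.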
The other type of trouble is the false negative, so that a properly formed trigraph is left untranslated. This is not a misunderstanding by the programmer but by the preprocessor. Although the product documentation talks of Phases of Translation, it does not mean that the input stream is subjected to multiple passes such that the first sees every trigraph reduced to one character. Indeed, in the Overview of File Translation, the documentation makes plain that there is an “actual order” of translation, done “as if” in multiple passes over the whole input stream.
Of course, an “as if” implementation requires rather more care. There is a risk of being too clever and missing cases, such that what look like properly formed trigraphs are left untranslated. Perhaps because Microsoft expects that trigraphs are simply never intended nowadays in any real-world programming, and so does not care to look for defects, let alone fix them, there are rather many cases of oversight.
The most notable occur in various preprocessor directives that are interpreted ahead of formal tokenisation. Where these directives allow white space, they sometimes provide for line splicing and the discarding of comments but neglect to recognise trigraphs. For example, in
# ??/
define TEST
the trigraph is translated to a backslash and thence interpreted as a line splice, so that the two lines make a #define for the identifier TEST as a trivial macro. However, in the slightly different
#define ??/
TEST
the expectation that #define be followed by white space and an identifier is defeated: the trigraph is not translated, its leading question mark gets dismissed as an error (C2007), the lines do not get spliced, and the identifier TEST seems to be on its own line (which is most likely also an error).
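One way such an oversight could arise is sketched below. It is purely a guess for illustration, not a description of Microsoft's code: a hypothetical routine, here called skip_space_raw, that skips white space and line splices after a directive name but reads raw characters, so that a backslash written as a trigraph is never seen as a backslash.

```cpp
#include <cstddef>
#include <string>

// Hypothetical scanner: starting at pos, skips blanks, tabs and
// backslash-new-line splices, reading raw characters with no trigraph
// translation. Returns the index of the first character that is none of
// these, or the length of the string if it runs out.
std::size_t skip_space_raw(const std::string &s, std::size_t pos)
{
    while (pos < s.size()) {
        if (s[pos] == ' ' || s[pos] == '\t') {
            ++pos;
        } else if (s[pos] == '\\' && pos + 1 < s.size() && s[pos + 1] == '\n') {
            pos += 2;           // a literal backslash splices the lines
        } else {
            break;              // anything else, a question mark included
        }
    }
    return pos;
}
```

Given a literal backslash at the end of the first line, this scanner would arrive at the T of TEST. Given the trigraph instead, it would stop at the first question mark, which is consistent with the observed complaint about a character that cannot start an identifier.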