Worst abuse of the C preprocessor (IOCCC winner, 1986)
The IOCCC (International Obfuscated C Code Contest) is an annual programming contest for “obfuscated” C code.
Obfuscated code is source or machine code that is deliberately written to be hard for humans to understand. It can be used for different purposes, but it is mainly done for security reasons: making the code obscure to avoid tampering, hide implicit values or conceal the logic used.
In 1986 the award for the category “worst abuse of the C preprocessor” was given to Jim Hague for writing the following piece of code:
Apart from the peculiar formatting, what jumps out is the number of “unnecessary” macros and the repeated use of DIT and DAH variations.
If we compile the code at this point we see many warnings, among them two for the implicit declaration of __DIT and _DAH. After that step, we can run the code: as we feed it sequences of ASCII characters, it prints sequences of “.” and “-”.
$ ./a.out
hello, world
.... . .-.. .-.. --- --..-- .-- --- .-. .-.. -..
This is actually Morse code (the one used in telegraph communications back in the 19th century). When decoded, it reads HELLO, WORLD.
Let’s first do the preprocessor’s job and replace the macros with their values. After a bit of reformatting, this is what we have:
char _DAH_[]="ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:";
main()
{
    char *_DIT, *DAH_, *DIT_, *_DIT_, *malloc(), *gets();
    for (_DIT = malloc(81), DIT_ = _DIT++; _DIT == gets(_DIT); __DIT('\n'))
        for (DAH_ = _DIT; *DAH_; __DIT(*_DIT_ ? _DAH(*DIT_) : '?'), __DIT(' '), DAH_++)
            for (*DIT_ = 2, _DIT_ = _DAH_; *_DIT_ && (*_DIT_ != (*DAH_ >= 'a' ? *DAH_ & 223 : *DAH_)); (*DIT_)++, _DIT_++)
                *DIT_ += (*_DIT_ >= 'a' ? *_DIT_ - 'a' : 0);
}
_DAH(DIT_)
{
    __DIT(DIT_ > 3 ? _DAH(DIT_ >> 1) : '\0');
    return DIT_ & 1 ? '-' : '.';
}
__DIT(DIT_) char DIT_;
{
    (void) write(1, &DIT_, 1);
}
We see the three functions we expected: main, _DAH, and __DIT. We also see an external variable _DAH_, a long string. __DIT is like the putchar function from the standard library, printing one char at a time.
And what about _DAH?
As long as the argument is a number that takes more than 2 bits to write (that is, greater than 3), it calls itself again with the last bit stripped off, and prints the character that call returns. Each call returns its own last bit, as - for 1 and . for 0. The net effect is that the number is written out in binary with its leading 1 dropped: every digit but the last is printed, and the last one is returned. As an example, if we call _DAH(5), 5 being 101 in binary, it will call _DAH(2). That is the base case: it prints a NUL character (nothing visible) and returns . because 10 & 1 == 0. Back in the first call, that . is printed and - is returned because 101 & 1 == 1. To print the last character as well, we have to call __DIT(_DAH(5)), which outputs .- in total: the digits 01 that follow the leading 1 of 101. _DAH(n) is a rather obfuscated function, which does not print/return n in binary but n in binary without its leading 1.
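The recursion can be easier to follow when the characters are collected in a buffer instead of being written one at a time. The following is a minimal sketch under that assumption; dah_to_string and its out parameter are hypothetical names that do not appear in the original program, and the argument is assumed to be at least 2.

```c
#include <stddef.h>

/* Hypothetical helper (not in the original program): stores the dot/dash
   form of n in out, NUL-terminated, and returns its length.
   Assumes n >= 2 and that out is large enough. */
size_t dah_to_string(unsigned n, char *out)
{
    size_t len = 0;

    if (n > 3)                          /* more than two bits left: */
        len = dah_to_string(n >> 1, out);   /* emit the higher bits first */
    out[len++] = (n & 1) ? '-' : '.';   /* then the lowest bit of this level */
    out[len] = '\0';
    return len;
}
```

With this sketch, dah_to_string(5, buf) leaves .- in buf, and dah_to_string(2, buf) leaves a single dot: in both cases the binary digits that follow the leading 1.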
The main function
char code[]="ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:";
main()
{
    char *line, *letter, *value, *code_copy, *malloc(), *gets();
    for (line = malloc(81), value = line++; line == gets(line); putchar('\n'))
    {
        for (letter = line; *letter; putchar(*code_copy ? _DAH(*value) : '?'), putchar(' '), letter++)
        {
            for (*value = 2, code_copy = code; *code_copy && (*code_copy != (*letter >= 'a' ? *letter & 223 : *letter)); (*value)++, code_copy++)
            {
                *value += (*code_copy >= 'a' ? *code_copy - 'a' : 0);
            }
        }
    }
}
The outer loop: its initializer allocates an 81-byte buffer once, points value at the first byte and line at the second. Then, for each line the user enters, gets reads it from the standard input into line. gets returns either the buffer it takes as an argument or NULL (at end of input, in this case), so the loop runs until the input is exhausted. The byte that value points to is the counter used by the inner loop. The outer loop prints a newline after each run of the two inner loops.
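For readers who want to try this loop with a current compiler: gets performs no bounds check and was removed from the standard library in C11. A bounded equivalent of the line == gets(line) test might look like the sketch below; read_line and its parameters are hypothetical names, not part of the original program.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not in the original program): reads one line into
   buf, which has room for cap bytes, strips the trailing newline as gets
   does, and returns 1 on success or 0 at end of input. */
int read_line(FILE *in, char *buf, size_t cap)
{
    if (fgets(buf, (int)cap, in) == NULL)
        return 0;                       /* plays the role of gets returning NULL */
    buf[strcspn(buf, "\n")] = '\0';     /* drop the newline, if one was read */
    return 1;
}
```

A loop such as while (read_line(stdin, line, 80)) would then take the place of the outer for loop's condition.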
The middle loop: it goes through each character of the line obtained above. For each one, it either prints its code using _DAH seen above or prints a ? if the character has no match in code, and then adds a space. When we looked at _DAH above, we used integers as arguments. Passing *value works just as well, because chars are small integers in C.
The inner loop: it sets *value to 2 and looks at the letter at this point. ((*letter >= 'a') ? *letter & 223 : *letter) means: if the letter is lowercase, use its uppercase version. The decimal 223, or 11011111 in binary, serves as a mask that clears the one bit (value 32) that differs between a lowercase letter and its uppercase version. Of course 223 is not the most obvious way to write it; the octal 0137 would more easily come to mind. Knowing this, we can see that the inner loop iterates as long as *letter does not match the current character in the code string. On each iteration, it increases *value by 1 and moves on to the next character in code. Interestingly, if at some point in this process the character in code is lowercase, the loop increases *value by an extra amount: the distance of that character from 'a'.
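The case-folding expression can be checked in isolation. This is a small sketch; fold_case is a hypothetical name for the expression used in the inner loop's condition.

```c
/* Hypothetical helper (not in the original program): the folding
   expression from the inner loop, applied to a single character.
   223 is 0xDF, i.e. every bit of the low byte set except bit 5 (32). */
int fold_case(int c)
{
    return c >= 'a' ? (c & 223) : c;
}
```

fold_case('h') gives 'H', while 'H' and ',' come back unchanged because they compare lower than 'a'. Note that the test is only c >= 'a', so characters that follow 'z' in ASCII, such as '{', are altered as well.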
Going from a letter to its Morse code: overall, the inner loop starts with *value = 2 and increases it by 1 each time it moves on to the next letter in code, until the letter in code is the letter from the input line. The middle loop then prints *value using _DAH. Let’s see some examples:
If *letter = 'E', then *value = 2, and since E is the first letter in code, the inner loop exits immediately and we are back in the middle loop, which calls putchar(*code_copy ? _DAH(*value) : '?'). Here *code_copy == 'E', so the expression becomes putchar(_DAH(2)). _DAH(2) prints nothing visible and returns ., the single bit that follows the leading 1 of 10. The output of this expression is therefore ., which is indeed the Morse code for E.
If *letter = 'I', then *value starts at 2 and, since I is the third letter in code, the inner loop exits with *value == 4. putchar(_DAH(4)) prints .., the Morse code for I.
We can observe a pattern: the order chosen for the letters in code is such that each letter’s Morse code is the binary value of its index in the code string plus 2, written without its leading 1, with . for 0 and - for 1. Of course, this is not perfect. Remember the inner loop’s special cases? If *code_copy == 'a', the loop skips that *value; said otherwise, this *value does not map to any Morse code. If *code_copy == 'b', the loop skips that *value and shifts by a further 'b' - 'a' == 1, which means that from then on, the characters in code map to the value of their index in the string plus 3. And so on: next comes a d, which shifts that new mapping by 'd' - 'a' == 3.
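Putting the pieces together, the whole translation can be written without the recursion or the shared buffer byte. The following is a de-obfuscated sketch, not the author's code: encode_char and encode_line are hypothetical names, the result is built in a caller-supplied buffer instead of being written to standard output, and the leading 1 is located with a loop rather than by recursion.

```c
#include <stddef.h>

static const char table[] =
    "ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:";

/* Hypothetical helper: writes the code for one character to out and
   returns the number of characters written ('?' if c is not in table). */
static size_t encode_char(int c, char *out)
{
    unsigned value = 2, bit;
    size_t len = 0;
    const char *p;

    if (c >= 'a')
        c &= 223;                           /* fold lowercase to uppercase */
    for (p = table; *p && *p != c; p++)     /* same walk as the inner loop */
        value += 1 + (*p >= 'a' ? (unsigned)(*p - 'a') : 0);
    if (!*p) {
        out[len++] = '?';                   /* no match in the table */
        return len;
    }
    for (bit = 1; (bit << 1) <= value; bit <<= 1)
        ;                                   /* bit is now the leading 1 of value */
    for (bit >>= 1; bit; bit >>= 1)         /* emit every bit below it */
        out[len++] = (value & bit) ? '-' : '.';
    return len;
}

/* Hypothetical helper: encodes a whole line, adding a space after each
   symbol as the middle loop does. out needs 8 bytes per input character
   plus one for the terminator. */
void encode_line(const char *line, char *out)
{
    size_t len = 0;

    for (; *line; line++) {
        len += encode_char((unsigned char)*line, out + len);
        out[len++] = ' ';
    }
    out[len] = '\0';
}
```

Calling encode_line("hello,", out) fills out with .... . .-.. .-.. --- --..-- followed by a trailing space, which matches the start of the sample run shown earlier.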