My Experience with the 1994 Loebner Competition
by Thom Whalen
The Loebner Prize Competition is a restricted Turing test to evaluate the
"humanness" of computer programs which interact in natural language.
This is the true story of my experience as a contestant in the 1994 Loebner Prize competition.
You can decide if this is tragedy, farce, or horror.
The first step in entering (after writing the program) was to submit an
application for the first cut. The implication was that only a select few
programs would be accepted for the contest. I suspect that any program that
was even close to reasonable was accepted, but I am not sure.
After being notified that my program was accepted as a contestant, I
was sent
a schedule of testing dates when my modem should be turned on and I
should be
available by phone. These testing dates turned out to be only
approximate and I
was not contacted by phone (as near as I can remember; maybe once or
twice). In
any event, about two weeks before the contest date, I was assured by
email that
all was working fine, so I concentrated on final tune-ups to the natural
language program; a tricky business because there is a real danger that
you will
get a wonderful idea at the last minute that you cannot resist trying to
implement and end up introducing a disastrous bug into your program.
The day before the contest was a disaster. I came into my office,
believing
that my safest strategy was to leave the program untouched until the
contest was
over, but found a message in my email box which said that final testing
had
found that my program was failing to handshake with their program, using
the
simple and obvious protocol specified in the official rules, and that I
would be
disqualified unless the problem was solved instantly.
The difficulty was that their turn-taking protocol sent two carriage
returns
to signal the end of the judge's turn. And, by the way, they would
insert
a line feed after every carriage return.
I assume that the logic was that this is the protocol that is
normally used
on human chat systems like IRC.
I discovered, after a couple of hours of frantic debugging, that the
problem
was not in my program, but in UNIX. The UNIX shell automatically
converted every
line feed to a carriage return, so my program was receiving a plethora
of
newline characters and was responding badly to them. I missed the
obvious
solution, which was to modify my program to wait for four carriage
returns in a
row to signal an end of turn. Instead, I spent a frantic day running back and
forth to
my UNIX administrators trying to figure out if I had to hack the shell,
rewrite
the termcaps file, or sacrifice a goat under a full moon to convince the
system
to let the line feeds remain as line feeds.
The full moon was not necessary. Two dead goats later, we got the shell to
stop converting the line feeds by modifying the login script to set the right
environment variables inside a subshell which then spawned the program. Then
I had to modify my program to terminate after two line feeds instead of two
carriage returns (in violation of the exact wording of the official protocol,
which specified responding to carriage returns and ignoring line feeds).
I dwell on this, not only because it was so traumatic, but because I
know for
a fact that I was not the only contestant who had difficulties with the
protocol
at the last minute. The problem is that we are application-level programmers
and the turn-taking protocol relies on transport-level interactions over which
we have little control. In addition, as near as I can tell, the preliminary
testing was conducted by sending a bunch of empty carriage returns to the
programs without paying attention to the results, so that our failure to
adhere strictly to the turn-taking protocol was not discovered until the
final testing.
As a consequence, when the day of the contest arrived, I did not have
any
idea whether my program had survived the final testing or was
disqualified. I
sat in my office, thankful that I had not traveled to California to witness
the contest but had stayed in Ottawa in case there were last-minute problems,
and watched my modem for three hours, wondering if the contest organizers
were going to log in.
Log in they did. And hung up. And logged in. And hung up. Finally, at
about
noon in my time zone, they phoned to tell me the schedule for the day. I
inferred that my program had not been disqualified and settled in for an
additional ten hours of nail biting over how badly my program would
fare.
I had adopted a high-risk strategy and fully expected my natural
language
system to crash and burn ignominiously.
For reasons unrelated to
the
contest, I had been putting all my effort into a sex information system and that was what
I had submitted to the contest.
Sex is a difficult topic linguistically. It is an especially broad topic, covering
everything from how to meet a girl to statistics about herpes
infections. It is
also a topic in which synonyms abound. You would be amazed at the number
of
synonyms for the female breast. Many of these synonyms are culture-,
age-, and
gender-specific. It is also a topic in which oblique phrasings are de rigueur.
Phrases like "do it" are common, both in a specific context, where they refer
to a specific act, and in a general context, where they may refer either
generally to sexual activity or narrowly to sexual intercourse.
Its difficulty makes it the perfect topic to exercise a new natural
language
shell. It also makes it a terrible topic for a public competition, and I knew
that my program would perform badly.
All the testing was done over the Internet.
I imagined the typical user as a young male computer scientist who has a
rich
sexual fantasy life, but has never had an actual girlfriend. A typical
question
that my sex information system expected to answer was something like,
"How do I find a girl who will rim me?"
You don't have to be Einstein to know that
no
middle-aged woman judge is going to stand in front of a television
camera and
type that on a computer terminal.
I was painfully aware that the judges were from a different
subculture,
probably had a lot more sexual experience, and were in a different
situation
than my intended user population.
I rationalized my choice of sex as a topic by telling myself that, at least,
it was the most human topic that I could imagine and that the judges might be
impressed by the program's broad range of knowledge and its wonderfully
detailed, honest, but generally politically correct answers. But there was no
way on earth that anyone would ever mistake my program for a live human being.
I had the additional worry that I had deliberately not told my
immediate
supervisor, senior management, or anyone else in the government that I
was
working on a sex information system; that I had let approximately 10,000
people
call a Canadian government computer and ask blunt questions about sex over the
course of
four months; and that I was now displaying this system to the
international
press without their knowledge or permission.
Even the hint that a
question could
be raised on the floor of the House by the Opposition as to why the Department of Industry
was
providing sex information to the public without the knowledge, consent,
or
participation of the Department of Health would have been sufficient to
shut the
project down and force my withdrawal from the contest. The Official
Opposition loses no opportunity to embarrass the government and the
government never hesitates to protect itself from potential embarrassment.
Even though I managed
to get as far as the contest without being discovered and shut down, the
potential political fallout after the contest made me more than a little
nervous.
So I sat and watched the transcripts scroll up my screen and waited
to see it
perform disastrously.
My most pessimistic predictions were dead-accurate.
In my laboratory, our rule of thumb is that natural language
information
systems are usable if they answer 50 per cent of questions
appropriately, but
they will not be liked. If they exceed 65 per cent, they will be
well-liked, and
if they exceed 75 per cent they will be very well-liked. In the Internet
testing, my sex system was exceeding 80 per cent and people were
spontaneously
indicating that they liked using it. When the competition started, it
performed
below 20 per cent for the first judge.
I thought briefly about
unplugging
the modem and pleading unresolvable technical difficulties, or at least
insanity.
I did not take the insanity plea, but hung in there and watched its
performance come up to about 50 per cent by the end of the competition.
I
suspect that performance was improving during the course of the contest
because
the judges were learning how to ask questions that were more likely to
elicit
meaningful answers. Bad as overall performance was, a detailed look
showed an
even worse picture. As I expected, many of the questions were on the
periphery
of the topic, a clear consequence of the judges trying to avoid blunt
questions
about sex in a public forum, so that fully a third of the system's
answers
consisted of an appropriate, "I have no information about that." Overall
only 32
per cent of the questions typed by the judges elicited correct
information from
my program.
Just when it looked like things could not get any worse, my program
literally
lost its mind. In essence, my program navigates around a kind of
dictionary and,
due to a programming error, it was able to navigate right off the edge.
It no
longer recognized any words at all. Fortunately, after thirteen
responses of "I
cannot give you an answer to that," to simple, obvious questions from
three
different judges, the human referee recognized that it had gone brain
dead and
rebooted the system. I did not expect to win any points with those
judges.
The contest organizers had promised to phone and let the losers know
that
they had lost so that they would not have to spend the night waiting for
nothing. In my time zone, after 10:00 PM, it looked like the contest had
ended
over an hour earlier, and I was already packed up and waiting for the
"Better-luck-next-time" call which would send me home to commiserate
with my
family.
Instead the caller told me that I had won. My first reaction was
the
rather graceless thought that the other programs must have really bombed
if my
program's miserable performance was rated the highest. It did occur to
me that
maybe I was the only computer contestant because all the other programs
had been
disqualified for failing to respond properly to carriage returns.
I was
faced
with the immediate problem of trying to sound pleased and excited in the
audio
press conference after spending two days and a night in black
depression.
Perversely, I was pleased when I got the final results and saw that
the other
programs did not do too badly. My program was the winner by a technical
decision, not a knockout. Only three of the five judges ranked it
highest and
one judge ranked it as the worst of the bunch. There may be hope for AI
yet.
Monday morning, I met with my director and confessed that I had won an
international competition and been interviewed on CNN. Neither one of us
wanted to dwell on the topic of my entry. When the cheque arrived, I signed
it over to the government — my research is government property — but I kept
the bronze medal, which is the real treasure in my mind.
Through the rose-tinted filter of hindsight, entering the Loebner Prize Competition
was an adventure that
I would
not have wanted to miss.
I expect to do a lot better next year.