Thom's Participation in the Loebner Competition 1995
or
How I Lost the Contest and Re-Evaluated My Concept of Humanity
by Thom Whalen
The Loebner Prize Competition is a restricted Turing test to evaluate
the "humanness" of computer programs which interact in natural language.
In 1994, I won the Loebner Prize Competition in San Diego, California,
by a hair. In fact, considering how poorly my program performed, I am
still surprised that I won at all.
In 1995 the competition was held in New York on 16 December.
I felt that I could not help but do better.
The 1995 rules were changed from previous competitions. For the first
time,
the judges would be permitted to ask any question that they wanted,
rather than
being restricted to a particular topic for each program. This was
intended to
make the competition more difficult.
In order to accommodate the "no-topic" rule, I decided that the best
approach
would be to try to model a human being. I would not simply try to answer
questions, but would try to incorporate into my program a personality, a personal
history, and a
unique view of the world. In short, I would try to invent a person.
This may sound daunting until you realize that people have been
inventing human beings for centuries; every novel or play is populated
with invented people. For example, Sir Arthur Conan Doyle created a
complete personality, personal history, and unique world view for
Sherlock Holmes which was so compelling that many people believe he was
a historical figure.
The only difference was that I would have to make a character which
would
respond to a variety of inputs. I had done this before in a simulation
of a
conversation with a university undergraduate. This time, with more
experience and newer, more powerful software, I could surely make an
even better
simulation.
To limit the scope of the conversation, I decided to create a character who had a
fairly
narrow world view; who was only marginally literate and, therefore, did
not read
books or newspapers; and who worked nights, and, therefore, was unable
to watch
prime-time television.
Furthermore, to give the conversation some direction and to capture the
judge's attention, I created a minor mystery plot. He would be a janitor
who was about to lose his job.
By
conversing with him, you could find out that he was actually the victim
of a
deliberate slander and learn enough to tell him how to keep his job.
I spent three months writing the conversation and testing it on the
Internet.
As the deadline approached, I had second thoughts about entering the
1995
competition at all. Unlike the previous four competitions, in 1995,
programs
could not participate through any kind of communications medium. They
would be
required to run on site at the Salmagundi Art Club in New York City.
My program had been developed on a Sun SPARC workstation, and would
not run
on a PC. I did not relish the thought of trying to carry a SPARC to New
York.
They do not work well when they are not connected to a network, they do
not fit
under an airplane seat, and I did not want to risk having my primary
development
platform stuck in some customs broker's office for weeks while I missed
the
competition.
Hugh Loebner agreed that I could enter the competition contingent
upon him
supplying a computer for me to use. And he did. Sun Microsystems agreed to
lend a
SPARC workstation to the competition.
My program, Joe the Janitor, was in. I was committed.
As the date for the contest approached, I devoted a couple of weeks
to
implementing the mundane technical details that would be required for
the
competition.
I decided that the easiest way to configure my entry was to have the
SPARC
communicate with a PC via the serial port. That way the appearance of the
screen
would be identical to the human confederates' screens. An added bonus was that
the PC
would take responsibility for collecting the transcripts in the required
format.
All I had to do was make my program communicate through the serial
port on a
standalone SPARC using the communications protocol specified for the
contest.
Yeah, right.
To get control of the serial port, I pored over the UNIX technical
manuals to learn all about "ioctl()" and "termio.h" and "non-canonical
mode" and other mysterious UNIX incantations.
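For readers who have never met non-canonical mode, here is a rough illustration. It is not the contest code, which was written against the C interfaces named above; it is a minimal sketch using Python's standard termios module, a present-day wrapper around the POSIX successor to termio.h. The function name set_noncanonical is made up for the example.

```python
import termios

def set_noncanonical(fd):
    """Put the terminal device on file descriptor fd into non-canonical mode.

    In canonical mode the driver hands over input a whole line at a time;
    in non-canonical mode each byte is available as soon as it arrives.
    """
    # tcgetattr returns [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    attrs = termios.tcgetattr(fd)
    # Turn off line buffering (ICANON) and local echo (ECHO).
    attrs[3] &= ~(termios.ICANON | termios.ECHO)
    # A read returns as soon as 1 byte is available, with no timeout.
    attrs[6][termios.VMIN] = 1
    attrs[6][termios.VTIME] = 0
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return attrs
```

On a real serial line the file descriptor would come from opening the port's device file; the device name, line speed, and the contest's handshake are left out here because they are specific to the machine and to Loebner's protocol.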
I also poked and prodded Loebner's communications program to learn
all about
double carriage returns and "CCC99" handshakes and other arcane rites.
Next, I had to learn all about how Sun Workstations are administered
in
stand-alone mode. Sun's motto is "The network is the computer." My worst
nightmare was traveling all the way to New York and then finding that I
could
not get the SPARC running properly. I thought that Sun would probably
deliver a
computer that worked in standalone mode, but I could not risk being
caught
unawares if their machine expected to find a network plugged into the
ethernet
port. So I learned about more obscure UNIX incantations called
"boot -s" and "localhost 127.0.0.1" and "hostname.xx0".
Finally, I had to introduce realistic keystroke delays, typing
errors, and
thoughtful pauses into the output of my program. Unlike the previous
year, the
judges would be seeing the output of the program displayed character by
character. The program not only had to appear to understand English, it
had to
look like there was a human being typing the answers.
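The idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the code Joe actually ran: the typo rate, the delay values, the random wrong key, and the names simulate_typing and render are all invented for the example.

```python
import random

PUNCTUATION = set(".,?!")
KEYS = "abcdefghijklmnopqrstuvwxyz"

def simulate_typing(text, rng, typo_rate=0.03, base_delay=0.12, pause=0.8):
    """Yield (character, delay_in_seconds) pairs for a human-looking typist.

    All numeric defaults are guesses chosen for illustration.
    """
    prev = ""
    for ch in text:
        delay = rng.uniform(0.5, 1.5) * base_delay
        if prev in PUNCTUATION:
            delay += pause  # a "thoughtful pause" after punctuation
        if ch.isalpha() and rng.random() < typo_rate:
            # Hit a wrong key, notice it, and backspace over it.
            yield rng.choice(KEYS), delay
            yield "\b", rng.uniform(0.5, 1.5) * base_delay * 2
            delay = rng.uniform(0.5, 1.5) * base_delay
        yield ch, delay
        prev = ch

def render(keystrokes):
    """Apply the backspaces to recover the text left on the screen."""
    screen = []
    for ch, _ in keystrokes:
        if ch == "\b":
            if screen:
                screen.pop()
        else:
            screen.append(ch)
    return "".join(screen)
```

A driver would sleep for each delay and then write the character to the serial port, so the judge sees the hesitations and corrections as they happen.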
Armed with my program disk, a sheet of instructions for configuring
UNIX in
standalone mode, another sheet of instructions for communicating with
Loebner's
program, and my own cables and manuals -- just in case -- I drove to the
Ottawa
airport on Thursday afternoon.
I toted my suitcase across the airport parking lot in -20 C, a wind
blowing
steadily at 30 km/hr, and 5 cm/hr snow accumulation -- in technical
terms, a
Canadian mid-winter blizzard -- wondering if the airplane would be able
to take
off at all. But my fears were for nought. Air Canada was not about to be
deterred by a little adverse weather.
In New York, I found that their balmy +3 C with no precipitation was
too warm for my eiderdown parka with the fur-rimmed snorkel hood.
Something else to make me sweat through the next two days.
Friday morning I found the Salmagundi Club with the help of my cab
driver
("What? Fifth Avenue? Where on Fifth Avenue? You don't know the cross
street?
Are you sure you don't know the cross street? How about guessing. Maybe
47th
street? Does that sound right? No? Well, pick a cross street!") and found
Hugh
Loebner waiting for me.
Let me make this perfectly clear. I found that only
Hugh
Loebner was waiting for me. ("Staff? Help? No, there's no one else. I'm
running
this contest myself. Like Blanche in 'A Streetcar Named Desire,' I'm
relying on
the kindness of strangers.")
He favored western-style string ties. ("Howdy, Stranger!").
There was a room stacked high with some thirty-odd crates. Sure
enough, two of
these crates held a SPARCstation and monitor and the rest held IBM PCs
and
monitors. None of these crates held null modem cables. None of these
crates held
power bars. And none of these crates held the video multiplexers
necessary to
show the contest to the audience.
Fortunately, I brought my own null modem and cables. Hugh went out
and bought
a couple of power bars.
The SPARCstation that Sun delivered was perfectly configured. There
was no
need for "boot -s" or "localhost 127.0.0.1" or even "mv hostname.le0
hostname.xx0". Twenty minutes later Joe the Janitor and I were on
speaking
terms.
Hugh Loebner got FRED, Robby Garner's program, running and announced
that we
had a contest. Even if no other contestants or confederates showed up,
we still
had two computer programs that could compete against each other.
I was not going to win by default.
I spent the rest of the day fiddling with Joe. Tweaking this and
twitching
that; uncertain whether I was improving his performance or introducing
more
bugs. But I was too nervous to leave him alone.
On Saturday, Joe the Janitor would face Joseph Weintraub's program,
the
PC-Therapist, which had won the first three Loebner Competitions. Though
Joseph
and I had both won Loebner medals in previous years, we had never
competed in
the same competition.
The courier did not arrive with the promised cables. Hugh went out
and bought
more power bars. He had some new null modem cables custom made.
That evening, two other competitors, Philip Maymin and Joseph
Weintraub, arrived at the club. We ate a Christmas dinner, and then
played some pool. Hugh preferred a game called "cowboy" ("Howdy,
Stranger!"). I won our game. At least I can claim that I won something
in New York.
We ended the evening making sure that the other programs
worked.
Now we had a four-way contest. As well, the courier had finally delivered
the video
cables and null modems, so we would have a contest that an audience
could see.
The next morning, bright and early (8:15 AM), we started setting up
the room
for the competition.
Unfortunately, the competition was held in the same
room as
the Christmas dinner, so there was no way to set up the computers before
the day
of the contest. Hugh ("I depend on the kindness of strangers"), Philip,
José the
superintendent, and I rolled up our sleeves and began uncrating the IBM
PCs and
carrying them up the stairs. I dearly wish elevators had been invented a
hundred
and fifty years ago when the Salmagundi Club was founded.
To be honest, I rather enjoyed helping set up the computers. It gave
me
something more productive to do than to sit and
stew about
how Joe would perform.
In three hours Philip, Hugh, and I managed to set up a dozen PCs, one
SPARC,
and twenty monitors, install the communications software everywhere,
yoke the
right machines together with the null modems, and install curtains to
separate
the judges from the confederates and the audience.
When the confederates arrived, Hugh led them to their terminals and
gave them their instructions. The judges arrived and were introduced.
The competition began. Judges typed questions for
fifteen
minutes on each terminal and programs and confederates responded. The
audience
watched. Philip and I watched. Joseph Weintraub spent most of his time
in the
club lounge, cool and confident.
In the second round, the judges were given an additional five minutes
to
query any terminal that they were uncertain about. None of the judges
bothered
trying Joe a second time. I knew that was a bad sign.
Finally, the judges were asked to rank-order the terminals from most
to least
human.
The results were tallied and Hugh announced the winner: "Joseph
Weintraub."
I lost.
Actually I came in second, but losing to Joseph Weintraub was still
losing.
Robby Garner from Robitron came in third. He was at a clear
disadvantage
because his program, FRED, ran from DOS so his screen looked different
from the
other seven screens.
Philip Maymin's strategy was to minimize the judges' opportunity to
interact
with his program. It produced very long output at a painfully slow
typing speed.
Many judges only had an opportunity to ask a single question and we only
saw
about three different answers during the whole contest. Cute idea. The
judges
were not impressed.
After the competition, we talked to the judges. They were mostly from
the
media and unanimously agreed that they enjoyed being judges. They were
in no
hurry to leave and the journalists among them took the time to interview
everyone in sight.
I was disappointed that Weintraub won again, but the rules were clear
and he
won fairly.
There are lessons for me to learn. Several of my hypotheses were
disproved.
Or at least cast into strong doubt.
First, I had hypothesized that the number of topics that would arise
in an
open conversation would be limited. If you look at Dale Carnegie, an
expert in
making small talk to strangers, he states that strangers talk about (in
this
approximate order):
- their names
- where they live
- where they used to live
- people that they know in common
- the weather
- sports
- politics
- books, television, movies and music
- hobbies
I believe that he is correct and I programmed Joe to have some
response for
common questions on each of these topics.
My error was that the judges, under Loebner's rules, did not treat
the
competitors as though they were strangers. Rather, they specifically
tested the
program with unusual questions like, "What did you have for dinner last
night?" or "What was Lincoln's first name?" These are questions that no one
would ever
ask a stranger in the first fifteen minutes of a conversation.
Robby Garner's program, FRED, encountered the same problem for about
the same
reason. It was prepared to answer questions about various aspects of his
personal life, but the judges never asked any questions which produced
those
answers.
Second, I hypothesized that, once the judge knew that he was talking
to a
computer, he would let the computer suggest a topic. I do not believe
that any
existing computer program can seriously pretend to be a human being for
more
than a half-dozen interactions, so I consider the human confederates to
be a red
herring. I believe that the real issue is whether my program appears
more human
than the other programs.
Thus, my program tried to interest the judges in Joe's employment
problems as
soon as possible. This would lead most quickly to the richest
interactions
because this was the part of the program that had been most highly
developed.
I was surprised to see how persistent some judges were in refusing to
ever
discuss Joe's job. It seemed that the judges would rather see the
program reply,
"I don't know," twenty times in a row to various strange questions than
to get
reasonable responses to questions about why he is worried about losing
his job.
I guess they really wanted to hammer home the point that Joe is not a
human
being.
Third, I hypothesized that the judges would be more tolerant of the
program
saying, "I don't know," than of a non sequitur. Thus, rather than having
the
program make a bunch of irrelevant statements when it could not
understand
questions, I simply had it rotate through four statements that were
synonymous
with "I don't know."
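The rotation is simple enough to sketch. What follows is an illustration only: the four phrases are stand-ins, since the text does not list the ones Joe used, and the exact-match dictionary is a placeholder for the program's real question matching.

```python
from itertools import cycle

# Assumed phrases; the text only says there were four ways to say "I don't know."
FALLBACKS = ["I don't know.", "What?", "Beats me.", "I couldn't tell you."]

class Responder:
    """Answer known questions; otherwise rotate through the fallbacks."""

    def __init__(self, known_answers, fallbacks=FALLBACKS):
        self.known = known_answers          # normalized question -> answer
        self._fallbacks = cycle(fallbacks)  # endless round-robin iterator

    def reply(self, question):
        key = question.lower().strip(" ?!.")
        if key in self.known:
            return self.known[key]
        return next(self._fallbacks)
```

Because the rotation never repeats until all four phrases have been used, a judge who asks one strange question sees variety, but a judge who asks twenty in a row sees the pattern very quickly.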
Weintraub's program, however, was a master of the non sequitur. It
would
continually reply with some wildly irrelevant statement, but throw in a
qualifying clause or sentence that used a noun or verb phrase from the
judge's
question in order to try to establish a thin veneer of relevance.
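A toy version of the general trick, as it looked from the outside, might run as follows. This is not Weintraub's method, which I have not seen; the canned statements, the stopword list, and the names borrow_phrase and non_sequitur are all invented, and the sketch borrows a single word where a real program would lift a whole noun or verb phrase.

```python
import random
import re

# Words too common to sound like an echo of the judge's question (assumed list).
STOPWORDS = {
    "a", "an", "the", "is", "was", "are", "do", "did", "does", "you", "your",
    "i", "what", "who", "where", "when", "why", "how", "have", "for", "of",
    "to", "in", "on", "last", "first",
}

# Irrelevant statements, written without a trailing period (assumed examples).
CANNED = [
    "My mother always said that life is short",
    "People worry far too much about money",
    "I have never trusted anyone who whistles",
]

def borrow_phrase(question):
    """Return the last content word of the question, or None if there is none."""
    words = re.findall(r"[a-z']+", question.lower())
    content = [w for w in words if w not in STOPWORDS and len(w) > 2]
    return content[-1] if content else None

def non_sequitur(question, rng):
    """Reply with a canned statement plus a clause echoing the question."""
    statement = rng.choice(CANNED)
    phrase = borrow_phrase(question)
    if phrase is None:
        return statement + "."
    return f"{statement}, but then who am I to talk about {phrase}?"
```

Asked "What did you have for dinner last night?", this picks one of the canned statements at random and ends it with "who am I to talk about night?", which is exactly the thin veneer of relevance described above.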
I am amazed at how cheerfully the judges tolerated that kind of
behaviour. I can only conclude that people do not require that their
conversational partners be consistent or even reasonable.
But I am not ready to draw any conclusion about whether this is a
fundamental
problem with the Turing test. Remember that we are talking about
conversational
partners that are fairly quickly recognizable as computer programs. To
appear
completely human, I would expect (hope) that the program would have to
be much
more responsive to the questions that were asked.
Fourth, I hypothesized that a critical component of "humanness" was
personality. I felt that it was important that my program have a
consistent and
identifiable personality.
I think I was successful in this. In discussion with the judges after
the
contest, when I confessed to being Joe's creator, one of the judges said
that he
thought Joe had the best defined personality. I then asked if he rated
Joe as
the most human computer and he said, "No." I probed and prodded for a
few
minutes, but he could not explain why he thought that humanness was
different from having a human personality.
I am still puzzled about that.
The failure of all four of my hypotheses leaves me in a quandary.
I believe that I could modify Joe to beat Weintraub simply by
replacing the
"I don't know" part of the program with a little Weintraub/ELIZA style
routine.
I estimate that it would take about two weeks of effort to produce a
routine
that would be adequate for my purposes, though much less sophisticated
than
Weintraub's. Then my program would still answer all of the questions
that it
already does, but, when it encountered an unfamiliar question, it would
never
say "What?" or "I don't know." Rather, it would just introduce a new
topic. Not
as smoothly as Weintraub's program, but smoothly enough.
But I don't know if I want to do that. Making that modification would
have no
purpose whatsoever, apart from winning the next competition. The primary
goal of
my TIPS software, like Robby's FRED, is to create useful information
systems
such as computer help systems, not to win Loebner's competition. I do
not
believe that Weintraub's approach, which follows in the footsteps of
Joseph
Weizenbaum's ELIZA, will ever lead to a useful way to deliver factual
information.
Lying awake in the middle of the night in my hotel room after losing
the
competition, I wrote out a list of eight major enhancements to my software
which would
make it a more powerful information delivery system. I would much rather
spend
my time implementing these enhancements than re-implementing an enhanced
ELIZA
that would not be good for anything.
As well, I am philosophically enamoured with the idea of writing a
program
which models a real human being rather than a program which simply tries
to
field random questions. And I am philosophically opposed to writing a
program
that performs syntactic tricks without any interesting semantics (or
more
specifically, semantic-based pragmatics). I would rather keep trying to
model a
human being that will do better than Weintraub's program than to beat
him at his
own game. If I have to resort to ELIZA's tricks, then I will be
admitting a
fundamental flaw in my own approach.
The next contest will be held in April '96. I enjoyed entering the
Loebner
competition immensely in the last two years. I would encourage everyone
working
in natural language to think about entering the next one. If for no
other reason
than to show that ELIZA-style programs are not the epitome of
natural-language
processing.
For myself, I do not know what the future will bring. Except that I
do know
that I will keep developing my software. And I will start thinking
seriously
about what it means to judge a conversational partner as "human."
A Retrospective Note from 2013: I always intended to enter the Loebner Prize Competition again, but my
research took me in other directions and I never returned to natural language
programming. It's a pity that life doesn't give us time to do everything we want because
I would have liked to have tried the competition again.