| Kessner, M. (2000), On the reliability of usability testing, Carleton
University Master's Thesis, Ottawa, Ontario, December.
Kessner, M., Wood, J., Dillon, R.F. and West, R.L. (2001), On the
reliability of usability testing, CHI 2001 Poster.
Molich, R., Thomsen, A.D., Karyukina, B., Schmidt, L., Ede, M.,
Oel, W.V. and Arcuri, M. (1999), Comparative evaluation of usability
tests, CHI'99 Extended Abstracts, 83-84. (Summary available.)
Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D.
and Kirakowski, J. (1998), Comparative evaluation of usability tests,
Proceedings of the Usability Professionals Association. (Summary
available.)
|
|
Rolf Molich of DialogDesign in Denmark published two articles
(Molich et al., 1998; Molich et al., 1999) over the past three years
that helped us better understand the limitations of even our best
usability testing method: performance testing.
He and his
colleagues did a comparative evaluation of usability tests by having
four commercial usability labs carry out tests on the same commercially
available calendar program. The purpose of the comparative evaluation
was to observe the different ways in which independent laboratories
conducted usability tests. The testers independently performed usability
tests that each involved about five typical users, and then prepared
a test report. Their results showed that some labs found few usability
problems (4), while others found many (98).
| Usability Laboratories | A | B | C | D |
| Usability Specialists | 2 | 2 | 1 | 3 |
| Number of Tests | 18 | 5 | 4 | 5 |
| Problems Found | 4 | 98 | 25 | 35 |
Only one problem was found by all four teams, and over 90% of the
problems found by each team were found only by that team.
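The kind of overlap figure quoted above can be computed from any set of team reports by counting how many teams reported each problem. The following is a minimal sketch in Python, not the procedure Molich and his colleagues used. The function name `overlap_summary` and the problem identifiers are hypothetical, and "share found by one team" is taken here to mean the share of distinct problems reported by exactly one team, which is only one plausible reading of the statistic.

```python
from collections import Counter


def overlap_summary(reports):
    """Summarize agreement between teams.

    `reports` maps a team name to the set of problem IDs that team reported.
    """
    # Count how many teams reported each distinct problem.
    counts = Counter(p for problems in reports.values() for p in problems)
    n_teams = len(reports)
    total = len(counts)
    found_by_all = sum(1 for c in counts.values() if c == n_teams)
    found_by_one = sum(1 for c in counts.values() if c == 1)
    return {
        "unique_problems": total,
        "found_by_all": found_by_all,
        "found_by_one": found_by_one,
        "share_found_by_one": found_by_one / total if total else 0.0,
    }


# Hypothetical reports (illustrative only, not data from the studies).
example_reports = {
    "A": {"p1", "p2"},
    "B": {"p1", "p3", "p4", "p5"},
    "C": {"p1", "p6"},
    "D": {"p1", "p3", "p7"},
}

summary = overlap_summary(example_reports)
print(summary)
```

In practice the hard part is deciding when two teams' descriptions refer to the same problem. The sketch assumes that matching has already been done and each problem carries a shared identifier.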
Molich and
his colleagues conducted a follow-up to the first test to determine
if the results were unique or could be replicated. In the second
study, seven different professional usability labs and two university
student teams independently carried out usability tests of a well-known
Web site, hotmail.com. They each prepared and submitted their
standard test report. Again, their results showed that some labs
found few problems (10), while others found many (150).
| Usability Laboratories | A | B | C | D | E | F | G | H | I |
| Usability Specialists | 2 | 7 | 1 | 1 | 3 | 1 | 1 | 3 | 7 |
| Number of Tests | 7 | 6 | 6 | 50 | 9 | 5 | 11 | 4 | 6 |
| Problems Found | 26 | 150 | 17 | 10 | 68 | 75 | 30 | 18 | 20 |
The results
from the first study were, indeed, replicated. Again, there seemed
to be little consistency across testing organizations. Over half
(55%) of the problems found by each team were found only by that
team.
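One way to see the inconsistency is to relate each lab's problem count to the number of tests it ran. The sketch below uses the figures from the second study's table. The "problems per test" ratio and the function name `problems_per_test` are illustrative assumptions, not a statistic reported by Molich et al.

```python
# (specialists, tests, problems found) per lab, copied from the table
# for the second (hotmail.com) study.
labs = {
    "A": (2, 7, 26),
    "B": (7, 6, 150),
    "C": (1, 6, 17),
    "D": (1, 50, 10),
    "E": (3, 9, 68),
    "F": (1, 5, 75),
    "G": (1, 11, 30),
    "H": (3, 4, 18),
    "I": (7, 6, 20),
}


def problems_per_test(data):
    """Return problems found divided by number of tests, per lab."""
    return {lab: problems / tests for lab, (_, tests, problems) in data.items()}


rates = problems_per_test(labs)
highest = max(rates, key=rates.get)
lowest = min(rates, key=rates.get)

for lab, rate in sorted(rates.items()):
    print(f"Lab {lab}: {rate:.1f} problems per test")
print(f"Highest: {highest}, lowest: {lowest}")
```

On these figures, Lab B reported 25 problems per test, while Lab D, which ran by far the most tests, reported 0.2. More testing did not translate into more reported problems.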
More recently,
Martin Kessner (Kessner, 2000; Kessner et al., 2001) from Carleton
University in Ottawa had six usability testing teams conduct usability
tests on a prototype of a system.
He attempted
to improve the agreement of the testing teams by
- testing
a prototype that had not yet been used by actual users,
- limiting
the issues to be evaluated to five questions specified by designers,
- focusing
exclusively on usability issues (excluding all marketing and other
issues),
- having two
evaluators group similar observations into categories of problems
that were essentially the same, and
- using only
professional usability teams (no student teams).
From the original
total of 117 potential "usability problems" reported by
all the testing teams, the evaluators excluded 31 as non-usability
problems. They then combined similar problems and ended up with
a final number of 36 unique usability problems. Consistent with
the first two studies, none of the problems was found by every team,
and a large proportion of the problems (44%) were found by one team
only.
When considering
the five specific questions that designers wanted answered, there
was moderate agreement among the teams on two questions, and low
agreement on the other three.
Taken together,
the findings of these three studies show that there is considerable
need for improvement in the usability testing process. Contrary
to what some would like us to believe, effective usability testing
is extremely difficult to do well. As a discipline, we need fewer
"discount" methods, and more research-based, truly valid
methods for finding true usability problems.
These findings
show that even experienced usability professionals have difficulty
in identifying usability problems. Should designers trust all observations
made by usability professionals? With this much variability in performance
testing results, should Web site designers trust any observations
made by usability professionals?
Usability professionals
do not let clients drop off a prototype Web site with the request
to find as many problems as possible; and professional designers
do not take seriously the never-ending list of "problems"
identified by someone who has a usability lab with fancy video equipment.
Any amateur with a conference room and a couple of subjects can
use a performance test to find all kinds of so-called "usability
problems." Some do not even need the test subjects; they
can find a multitude of "problems" just by staring at
a Web site and fiddling with the links.
I agree with Kessner et al. (2001): the one thing that will most
likely reduce the large-scale disagreements among usability testers
is to have designers specify precisely the usability questions they have.
Ideally, these questions will include the maximum allowable time
for task completion, and a clear definition of success for each
task. The true usability professional can then effectively use a
performance test to identify those usability problems that most
need finding and fixing.
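To make the idea of a precisely specified usability question concrete, here is a minimal sketch in Python of how a designer's question, time limit, and success definition might be written down and checked against test observations. All names (`TaskCriterion`, `Observation`, `pass_rate`) and the sample data are hypothetical; this illustrates the principle and is not a method taken from Kessner et al.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCriterion:
    question: str            # the designer's usability question
    max_seconds: float       # maximum allowable time for task completion
    success_definition: str  # plain-language definition of success


@dataclass(frozen=True)
class Observation:
    participant: str
    seconds: float   # time the participant took on the task
    succeeded: bool  # judged by the tester against success_definition


def pass_rate(criterion, observations):
    """Share of participants who succeeded within the time limit."""
    if not observations:
        return 0.0
    passed = sum(
        1
        for o in observations
        if o.succeeded and o.seconds <= criterion.max_seconds
    )
    return passed / len(observations)


# Hypothetical criterion and observations, for illustration only.
criterion = TaskCriterion(
    question="Can a new user send an e-mail with an attachment?",
    max_seconds=120,
    success_definition="Message with attachment appears in the Sent folder",
)
observations = [
    Observation("P1", 95, True),
    Observation("P2", 140, True),   # succeeded, but over the time limit
    Observation("P3", 80, False),
    Observation("P4", 110, True),
]

rate = pass_rate(criterion, observations)
print(f"{rate:.0%} of participants met the criterion")
```

With the time limit and success definition fixed in advance, two testing teams observing the same sessions have far less room to disagree about whether a task revealed a problem.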
|