Kessner, M. (2000), On the reliability of usability testing, Carleton University Master's Thesis, Ottawa, Ontario, December.
Kessner, M., Wood, J., Dillon, R.F. and West, R.L. (2001), On the reliability of usability testing, CHI 2001 Poster.
Molich, R., Thomsen, A.D., Karyukina, B., Schmidt, L., Ede, M., Oel, W.V. and Arcuri, M. (1999), Comparative evaluation of usability tests, CHI'99 Extended Abstracts, 83-84.
Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D. and Kirakowski, J. (1998), Comparative evaluation of usability tests, Proceedings of the Usability Professionals Association.
Rolf Molich of DialogDesign in Denmark published two articles (Molich
et al., 1998; Molich et al., 1999) over the past three years that helped
us better understand the limitations of even our best usability testing
method: performance testing.
He and his colleagues did a comparative evaluation of usability tests
by having four commercial usability labs carry out tests on the same commercially
available calendar program. The purpose of the comparative evaluation
was to observe the different ways in which independent laboratories conducted
usability tests. The testers independently performed usability tests that
each involved about five typical users, and then prepared a test report.
Their results showed that some labs found few usability problems (4),
while others found many (98).
| Usability Laboratories | A | B | C | D |
| Usability Specialists | 2 | 2 | 1 | 3 |
| Number of Tests | 18 | 5 | 4 | 5 |
| Problems Found | 4 | 98 | 25 | 35 |
Only one problem was found by all four teams, and over 90% of the problems
found by each team were found only by that team.
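Overlap figures like these can be tallied mechanically once each team's findings have been matched to a shared list of problems. The sketch below shows one way to do that count. The team labels and problem identifiers are hypothetical and are not data from the study, and the function reports the share of all distinct problems that only one team found, which is only one of several ways such a percentage could be defined.

```python
from collections import Counter


def overlap_summary(findings):
    """Summarize how many teams reported each distinct problem.

    `findings` maps a team label to the set of problem identifiers
    that the team reported (identifiers are assumed to be already
    matched across teams).
    """
    # Count, for every distinct problem, how many teams reported it.
    counts = Counter(p for problems in findings.values() for p in problems)
    total = len(counts)
    n_teams = len(findings)
    found_by_all = sum(1 for c in counts.values() if c == n_teams)
    found_by_one = sum(1 for c in counts.values() if c == 1)
    return {
        "unique_problems": total,
        "found_by_all": found_by_all,
        "found_by_one": found_by_one,
        "share_found_by_one": found_by_one / total if total else 0.0,
    }


# Hypothetical findings for four teams (illustration only).
example = {
    "A": {"p1", "p2"},
    "B": {"p1", "p3", "p4", "p5"},
    "C": {"p1", "p6"},
    "D": {"p1", "p3", "p7"},
}
summary = overlap_summary(example)
print(summary)
```

In this made-up example, one problem is shared by all four teams and five of the seven distinct problems are reported by a single team. The hard part in practice is the matching step that precedes the count: deciding when two teams' descriptions refer to the same problem.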
Molich and his colleagues conducted a follow-up to the first test to
determine if the results were unique or could be replicated. In the second
study, seven different professional usability labs and two university
student teams independently carried out usability tests of a well-known
Web site – hotmail.com. They each prepared and submitted their standard
test report. Again, their results showed that some labs found few problems
(10), while others found many (150).
| Usability Laboratories | A | B | C | D | E | F | G | H | I |
| Usability Specialists | 2 | 7 | 1 | 1 | 3 | 1 | 1 | 3 | 7 |
| Number of Tests | 7 | 6 | 6 | 50 | 9 | 5 | 11 | 4 | 6 |
| Problems Found | 26 | 150 | 17 | 10 | 68 | 75 | 30 | 18 | 20 |
The results from the first study were, indeed, replicated. Again, there
seemed to be little consistency across testing organizations. Over half
(55%) of the problems found by each team were found only by that team.
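The spread across labs in the two studies can be made concrete with a little arithmetic on the table figures. The sketch below copies the "Number of Tests" and "Problems Found" rows from both tables and computes the ratio between the largest and smallest problem counts, plus problems reported per test. The per-test figure is a crude measure chosen here for illustration, not one the studies themselves reported.

```python
# Figures copied from the two tables (labs in the order A, B, C, ...).
study_1 = {
    "tests": [18, 5, 4, 5],
    "problems": [4, 98, 25, 35],
}
study_2 = {
    "tests": [7, 6, 6, 50, 9, 5, 11, 4, 6],
    "problems": [26, 150, 17, 10, 68, 75, 30, 18, 20],
}


def spread(study):
    """Return simple measures of how much the labs' results differ."""
    problems = study["problems"]
    # Problems reported per test, lab by lab.
    per_test = [p / t for p, t in zip(problems, study["tests"])]
    return {
        "max_to_min_ratio": max(problems) / min(problems),
        "min_per_test": min(per_test),
        "max_per_test": max(per_test),
    }


for name, study in (("calendar program", study_1), ("hotmail.com", study_2)):
    print(name, spread(study))
```

In both studies the lab that ran the most tests reported the fewest problems, so the number of tests alone does not account for the differences.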
More recently, Martin Kessner (Kessner, 2000; Kessner et al., 2001)
from Carleton University in Ottawa had six usability testing teams conduct
usability tests on a prototype of a system.
He attempted to improve the agreement of the testing teams by:
- testing a prototype that had not yet been used by actual users,
- limiting the issues to be evaluated to five questions specified by
designers,
- focusing exclusively on usability issues (excluding all marketing
and other issues),
- having two evaluators group similar observations into categories of
problems that were essentially the same, and
- using only professional usability teams (no student teams).
From the original total of 117 potential "usability problems"
reported by all the testing teams, the evaluators excluded 31 as non-usability
problems. They then combined similar problems and ended up with a final
number of 36 unique usability problems. Consistent with the first two
studies, none of the problems was found by every team, and a large proportion
of the problems (44%) were found by one team only.
When considering the five specific questions that designers wanted answered,
there was moderate agreement among the teams on two questions, and low
agreement on the other three.
Taken together, the findings of these three studies show that there is
considerable need for improvement in the usability testing process. Contrary
to what some would like us to believe, effective usability testing is
extremely difficult to do well. As a discipline, we need fewer "discount"
methods, and more research-based, truly valid methods for finding true
usability problems.
These findings show that even experienced usability professionals have
difficulty in identifying usability problems. Should designers trust all
observations made by usability professionals? With this much variability
in performance testing results, should Web site designers trust any observations
made by usability professionals?
Usability professionals do not let clients drop off a prototype Web site
with the request to find as many problems as possible; and professional
designers do not take seriously the never-ending list of "problems"
identified by someone who has a usability lab with fancy video equipment.
Any amateur with a conference room and a couple of subjects can use a
performance test to find all kinds of so-called "usability problems."
Some do not even need the test subjects – they can find a multitude
of "problems" just by staring at a website and fiddling with
the links.
I agree with Kessner et al. (2001): the one thing that will most likely
reduce the large-scale disagreements among usability testers is to have
designers specify precisely the usability questions they have.
Ideally, these questions will include the maximum allowable time for task
completion, and a clear definition of success for each task. The true
usability professional can then effectively use a performance test to
identify those usability problems that most need finding and fixing.
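A precisely specified usability question of this kind can be written down so that a test result either meets it or does not. The following is a minimal sketch under stated assumptions: the `TaskCriterion` and `Attempt` names, the example task, and the idea of a required success rate are all hypothetical choices made for illustration. The text above asks only for a maximum allowable time and a clear definition of success.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCriterion:
    name: str
    max_seconds: float        # maximum allowable time for task completion
    min_success_rate: float   # assumed: share of participants who must succeed


@dataclass(frozen=True)
class Attempt:
    completed: bool   # did the participant reach the defined end state?
    seconds: float    # time taken on the task


def meets_criterion(criterion, attempts):
    """Return True if enough participants finished the task in time."""
    if not attempts:
        return False
    # An attempt counts as a success only if it was completed
    # within the maximum allowable time.
    successes = sum(
        1 for a in attempts
        if a.completed and a.seconds <= criterion.max_seconds
    )
    return successes / len(attempts) >= criterion.min_success_rate


# Hypothetical criterion and results for five participants.
criterion = TaskCriterion("send a message", max_seconds=60, min_success_rate=0.8)
attempts = [
    Attempt(True, 35.0),
    Attempt(True, 58.0),
    Attempt(True, 72.0),   # completed, but over the time limit
    Attempt(True, 41.0),
    Attempt(True, 20.0),
]
print(meets_criterion(criterion, attempts))
```

With the criterion fixed in advance, two teams observing the same sessions should reach the same pass or fail verdict for the task, even if their lists of contributing problems differ.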