You may be surprised. Even a task as (seemingly) transparent as usability
testing Microsoft's hotmail.com elicited different data based on different
approaches to usability testing.
Molich, Ede, Kaasgaard and Karyukin (2004) report on the findings of
the Comparative Usability Evaluation Study (CUE-2). This meta-analysis
describes the usability testing approaches and results across nine independent
usability groups asked to conduct a "standard" usability test
of hotmail.com. The teams included six industry labs, two university-based
teams with commercial activities and two student teams. Each team was
provided the same project background information and access to a "Marketing
Liaison" for further clarification or feedback on their proposed
methods.
Molich and colleagues compared and contrasted the usability testing approach,
usability problems discovered, and reporting of findings across the teams.
Their finding is jarring: "...our simple assumption that
we are all doing the same and getting the same results in a usability
test is plainly wrong" (p. 65).
The details – particularly if you think of usability testing as
a process-driven task – are equally jarring:
The teams
Usability teams ranged from 1 to 7 members in size. They spent between
16 and 218 hours conducting the test.
Selection of Method
Eight out of
the nine participating teams used some variation on think-aloud testing
to conduct the usability review. The commonalities largely end here.
The various teams tested 6.6 participants on average, with a range from
4 to 50 across the teams. (The team testing 50 participants used a semi-structured
exploration/questionnaire approach with no direct observation of users
completing tasks.)
Interacting with the "client"
Only two of the nine teams solicited client input beyond the initial
briefing during the usability testing project.
Developing the testing protocol
The project briefing provided to each
team outlined 18 features that the Hotmail team indicated could be enhanced
through user input. Five were listed as top priority.
Despite this client-based direction, the overlap in tested tasks was
limited: 51 different tasks appeared on the testing protocols. Only
two usability testing tasks were common across all of the teams
(Register, send someone e-mail). 25 (49%) of the tasks tested were proposed
by only one team.
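The overlap figures above come from a simple tally: for each task, count how many teams included it in their protocol. The following is a minimal sketch of that tally, not the method used in the study. The team names and task lists are hypothetical placeholders, and only the totals 25 and 51 are taken from the text.

```python
from collections import Counter


def task_overlap(protocols):
    """Summarize how widely each task was tested across teams.

    `protocols` maps a team name to the list of tasks on its protocol.
    """
    # Count each task once per team, even if a team lists it twice.
    counts = Counter(task for tasks in protocols.values() for task in set(tasks))
    n_teams = len(protocols)
    common = sorted(t for t, c in counts.items() if c == n_teams)
    singletons = sorted(t for t, c in counts.items() if c == 1)
    return {
        "distinct_tasks": len(counts),
        "common_to_all": common,
        "tested_by_one_team": singletons,
        "share_tested_by_one_team": len(singletons) / len(counts),
    }


# Hypothetical protocols for three teams (illustration only, not study data).
protocols = {
    "Team A": ["register", "send e-mail", "create signature"],
    "Team B": ["register", "send e-mail", "set up folders"],
    "Team C": ["register", "send e-mail", "create signature", "change password"],
}
summary = task_overlap(protocols)

# Check of the share reported in the text: 25 of 51 tasks.
reported_share = round(25 / 51 * 100)
```

With the real protocols as input, the same tally would yield the two tasks common to all teams and the 25 tasks proposed by a single team.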
Leading the witness...
Eight of the nine testing teams used leading questions in their testing protocols.
Leading questions are questions that contain hidden instructions or
cues, such as "Create a personal signature" in a context where
the user needs to click on a link with the word "signature"
in it. Leading questions test participants' ability to recognize keywords
rather than their ability to complete the task.
Usability problems uncovered
The usability teams reported from 10 to 149 problems. No single usability problem
was reported by all nine testing teams. One problem was reported
by 7 of the nine teams.
For the two tasks that were tested by all teams, 232 unique problems
were reported. 75% of the problems identified were identified by only
one of the teams.
Reporting the findings
Many violations of best practices in usability
testing reports (see Dumas and Redish, 1999) were noted. Key among those
were:
- 4 out of 9 reports failed to include an Executive Summary.
- 7 out of 9 reports contained two or fewer screenshots
(3 reports had no screenshots).
- Reports failed to indicate problem frequency.
- 3 of 9 reports failed to prioritize problems based on severity.
- Reports identified too many problems to be useful. (Molich and colleagues
consider 15-60 problems to be manageable.)
Quality of findings
Two interesting findings emerge. First, student reports
are not easy to distinguish from professional reports.
Second, the results of one indirect testing team differed from those
of the direct testing teams. The indirect testing team reported far fewer
problems than the direct observation teams. This group also failed to
observe the one serious problem that was identified by 7 of the 8 remaining
teams. Molich observes, "Unattended testing didn't lead to any more
(in fact, quite a bit less) reported problems and didn't provide
insights that other methods [missed]" (p. 73).