UI Design Newsletter – September, 2003

In This Issue

Pitting usability testing against heuristic review

Kath Straub, Ph.D., CUA, Chief Scientist of HFI, looks at when to use usability testing and when to use heuristic reviews.

The Pragmatic Ergonomist

Dr. Eric Schaffer, Ph.D., CUA, CPE, Founder and CEO of HFI, offers practical advice.

Pitting usability testing against heuristic review

Consider this scenario: You are managing the Intranet applications for a large company. You've spent the last year championing data-driven (re-)design approaches with some success. Now there is an opportunity to revamp a widely used application with significant room for improvement. You need to do the whole project on a limited dollar and time budget. It's critical that the method you choose models a user-centered approach that prioritizes the fixes in a systematic and repeatable way. It is also critical that the approach you choose be cost-effective and convincing. What do you do?

Independent of the method you pick, your tasks are essentially to:

  • Identify the problems
  • Prioritize them based on impact to use
  • Prioritize them based on time/cost benefits of fixing the problems
  • Design and implement the fixes

In this situation, most people think of usability testing and heuristic (or expert) review.

In usability testing, a set of representative users is asked (individually) to complete a series of critical and/or typical tasks using the interface. During the testing, usability specialists observe how self-evident the task process is. That is, they watch as participants work through preselected scenarios and note when participants stumble, make mistakes, or give up. The usability impact is prioritized objectively according to how frequently a given difficulty is observed across the participants. Prioritizing the cost-benefits of fixes and suggesting specific design improvements for that interface are typically based on the practitioner's broader experience with usability inspection and testing.
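The frequency-based prioritization described above can be sketched in a few lines of Python. This is only an illustration, not a prescribed procedure: the function name `rank_problems`, the participant IDs, and the problem labels are all hypothetical, and it assumes each observed difficulty has already been logged as a (participant, problem) pair.

```python
from collections import Counter


def rank_problems(observations):
    """Rank problems by how many distinct participants encountered them.

    observations: iterable of (participant_id, problem_label) pairs.
    Returns a list of (problem_label, participant_count), most frequent first.
    """
    # Deduplicate so a participant who stumbles twice on the same
    # problem is counted only once for that problem.
    unique = set(observations)
    counts = Counter(problem for _, problem in unique)
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical session notes, invented purely for illustration.
notes = [
    ("P1", "missed Submit button"),
    ("P1", "missed Submit button"),
    ("P2", "missed Submit button"),
    ("P2", "unclear date format"),
    ("P3", "missed Submit button"),
    ("P3", "unclear date format"),
    ("P3", "gave up on search"),
]

ranking = rank_problems(notes)
for problem, participants in ranking:
    print(f"{participants} of 3 participants: {problem}")
```

Counting participants rather than raw incidents keeps one struggling user from inflating a problem's rank. Weighing the cost of each fix still takes the practitioner's judgment.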

As the name implies, heuristic review offers a shortcut method for usability evaluation. Here, one or more specialists examine the application to identify potential usability problems. Problems are noted when system characteristics violate constraints known to influence usability. The constraints, which vary from practitioner to practitioner, are typically based on principles reviewed in usability tomes (e.g., Molich and Nielsen, 1990) and/or criteria derived from and validated by human factors research (e.g., Gerhardt-Powals, 1996). The critical steps of prioritizing the severity of the problems identified and suggesting specific remediation approaches for that interface are based on the practitioner's broader experience with usability inspection and testing.

Empirical evaluations of the relative merit of these approaches outline both strengths and drawbacks for each. Usability testing is touted as the optimal methodology because the results are derived directly from the experiences of representative users. For example, Nielsen and Phillips (1993) report that despite its greater absolute cost, user testing 'provided better performance estimates' of interface effectiveness. The tradeoff is that coordination, testing, and data reduction add time to the process and increase the overall man- and time-cost of usability testing. As such, proponents of heuristic review plug its speed of turnaround and cost-effectiveness. For example, Jeffries, Miller, Wharton, and Uyeda (1991) report a 12:1 superiority for an expert inspection method over usability testing based on a strict cost-benefit analysis. On the downside, there is broad concern that the heuristic criteria do not focus the evaluators on the right problems (Bailey, Allan and Raiello, 1992). That is, simply evaluating an interface against a set of heuristics generates a long list of false-alarm problems but does not effectively highlight the real problems that undermine the user experience.

There are many, many more studies that have explored this question. Overall, the findings of studies pitting usability testing against expert review lead to the same ambivalent (lack of) conclusions.

Hear what I say, or watch what I do?

In an attempt to find alternative approaches to compare usability testing and expert review, Muller, Dayton and Root (1993) reanalyzed the findings from four studies (Desurvire, Kondziela, and Atwood, 1992; Jeffries and Desurvire, 1992; Jeffries, Miller, Wharton and Uyeda, 1991; and Karat, Campbell and Fiegel, 1992). Rather than looking at the raw number of problems identified by each technique, their re-analysis categorized the findings of the previous studies on parameters such as:

  • # of problem classes per hour invested,
  • # of classes of usability problems identified,
  • likelihood of identifying severe problems,
  • uniqueness of results, and
  • average cost/problem identified for each technique.

Again, their re-analysis demonstrated no stable difference indicating that either usability testing or heuristic review (conducted by human factors professionals) is a superior technique.

An array of methodological inconsistencies makes interpreting the findings in toto even more challenging. The specific types of interfaces or tasks used in comparison vary widely from study to study. Many studies do not clearly articulate what the "experts" are expected to do to come up with their findings (much less, what they really DO). The specific heuristics applied are rarely specified clearly. The level of expertise of the evaluators is rarely described or clearly equated, although it is often offered informally as a factor in the diversity of outcomes. As such, it is possible that, among other things, the conclusions of specific studies falsely favor a method, when relative benefits really result from the broader and deeper experience of the individual implementing the method (Johns, 1994).

So, what's an Intranet manager to do?

Equivocal research findings aren't really helpful when your task is to select the most cost-effective means of identifying, prioritizing and fixing problems in your interface. To make it worse, it appears that the problems identified by the two approaches are largely non-overlapping. According to Law and Hvannberg (2002), the problems identified by usability testing tend to reflect flaws in the design of the task scenarios, such as task flows that do not reflect the steps/order that users expect. In contrast, expert reviews highlight problems intrinsic to the features of the system itself, such as design consistency. Actually that's the good news.

Findings from a recent study extend those results and may help identify the parameters for choosing the right solution. Instead of pitting one strategy against the other, this study focuses on identifying the qualitative differences between the findings of usability testing and expert review.

Levels of Understanding

Rasmussen (1986) identified three levels of behavior that could lead to interface challenges: skill-based, rule-based, and knowledge-based behavior.

Success at the skill-based level depends on the users' ability to recognize and pay attention to the right signals presented by the interface. Skill-based accuracy can be undermined, for example, when non-critical elements of an interface flash or move. These attention grabbing elements may pull a user's focus from the task at hand, causing errors.

Success at the rule-based level depends on the users' ability to perceive and respond to the signs that are associated with the ongoing procedure. Users stumble when they fail to complete a step or steps because the system (waiting) state or next-step information was not noticeable or clearly presented.

Success at the knowledge-based level depends on the users' ability to develop the right mental model for the system. Users acting based on an incorrect or incomplete mental model may initiate the wrong action resulting in an error.

Fu, Salvendy and Turley (2002) contrasted usability testing and heuristic review by evaluating their effectiveness at identifying interface problems at Rasmussen's three levels of processing. In their study, evaluators were assigned to evaluate an interface by either observing usability testing or conducting a heuristic review, starting with an identical set of scenario-based tasks. Across the study, 39 distinct usability problems were identified. Consistent with previous research, heuristic evaluation identified more of the problems (87%) than usability testing did (54%). There was a 41% overlap in the problems identified.
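As a quick sanity check, the three percentages above can be turned back into approximate problem counts. The sketch below assumes that each percentage is a share of the 39 distinct problems and that rounding to the nearest whole problem recovers the underlying count. The newsletter does not report raw counts, so the numbers are reconstructions, not data from the study.

```python
TOTAL_PROBLEMS = 39  # distinct problems reported across the study


def approx_count(fraction, total=TOTAL_PROBLEMS):
    """Convert a reported share of the total into a whole-problem count."""
    return round(fraction * total)


# Reconstructed counts (assumption: percentages are shares of all 39 problems).
heuristic = approx_count(0.87)  # found by heuristic review
testing = approx_count(0.54)    # found by usability testing
overlap = approx_count(0.41)    # found by both methods

# Inclusion-exclusion: problems found by at least one method.
found_by_either = heuristic + testing - overlap

# Problems that only one of the two methods surfaced.
heuristic_only = heuristic - overlap
testing_only = testing - overlap

print(heuristic, testing, overlap)   # 34 21 16
print(found_by_either)               # 39
print(heuristic_only, testing_only)  # 18 5
```

Under these assumptions the figures are internally consistent: the two methods together account for all 39 problems, and each surfaces problems the other misses.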

More interestingly, when the problems/errors were categorized on Rasmussen's behavior levels, heuristic review and usability testing identified complementary sets of problems: heuristic review identified significantly more skill-based and rule-based problems, whereas usability testing identified challenges that occur at the knowledge-based level of behavior.

Upon consideration, this distribution is not terribly surprising. Usability testing identifies significantly more problems at the knowledge-based level. Knowledge-based challenges arise when users are learning (creating or modifying their mental models) during the course of the task itself. Often the problems that surface here result from a mismatch between the expected and actual task flow. Because experts' mental models of interaction are usually well developed, experts are not good at conjuring the experiences or speculating about the expectations of novice users.

In contrast, skill- and rule-based levels of behavior are well studied and documented in the attention and perception literature (e.g., Proctor and Van Zandt, 1994). Human factors courses often focus on these theories, and the criteria for heuristic review are essentially derived from them. It is not surprising that usability specialists focusing on heuristic criteria for evaluation would identify relatively more problems at these levels. Not incidentally, parameters of interface design affecting skill- and rule-based levels of behavior reflect characteristics that are intrinsic to the interface itself. Intrinsic problems are more likely to challenge advanced users because they present a fundamental challenge to their experience-based "standard model" of interface interactions.

Fu, Salvendy and Turley (2002) conclude that the most effective approach is to integrate both techniques at different times in the design process. Based on their analysis, heuristic review should be applied first or early in the (re-)design process to identify problems associated with the lower levels of perception and performance. Once those problems are resolved, usability testing can be used to effectively focus on higher level interaction issues without the distraction of problems associated with skill- and rule-based levels of performance.

Mindful of budget and time constraints, Fu and colleagues also note that if redesign is concerned only with icon design or layout, heuristic review may be sufficient to the task. However, if modification affects software structure, interaction mapping or complex task flows, usability testing is the better choice. To that end, the more complex the to-be-performed tasks are, the more critical representative usability testing becomes. The more complex the proposed redesign, the more critical that both methods be employed.
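For readers who think in decision rules, the guidance above can be written as a small Python sketch. The function name, the three yes/no inputs, and the fallback to using both methods when no input applies are assumptions made for illustration. They paraphrase the recommendations summarized here and are not a tool published by Fu and colleagues.

```python
HEURISTIC_REVIEW = "heuristic review"
USABILITY_TESTING = "usability testing"


def suggest_methods(surface_only, affects_structure_or_flows, complex_redesign):
    """Suggest evaluation methods, in the order they would be applied.

    surface_only: redesign touches only icon design or layout.
    affects_structure_or_flows: redesign changes software structure,
        interaction mapping, or complex task flows.
    complex_redesign: the proposed redesign is complex overall.
    """
    if complex_redesign:
        # Complex redesigns call for both: review first, then testing.
        return [HEURISTIC_REVIEW, USABILITY_TESTING]
    if affects_structure_or_flows:
        # Structural or task-flow changes favor testing with real users.
        return [USABILITY_TESTING]
    if surface_only:
        # Icon or layout changes may be covered by expert review alone.
        return [HEURISTIC_REVIEW]
    # Assumed default: integrate both techniques, review before testing.
    return [HEURISTIC_REVIEW, USABILITY_TESTING]


print(suggest_methods(True, False, False))   # ['heuristic review']
print(suggest_methods(False, True, False))   # ['usability testing']
print(suggest_methods(False, True, True))    # ['heuristic review', 'usability testing']
```

A real decision would also weigh budget, evaluator expertise, and application maturity, none of which this sketch models.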

It depends...

So what should you do with your evaluation project? Like most other projects, it depends on the specific case. Despite the chaotic nature of the field, it is still possible to draw a few conclusions from these studies:

  • heuristic (expert) reviews and usability testing identify different types of usability problems
  • expert reviews really only work well when experts do them
  • combining techniques optimizes the return

Taken together, these suggest that to select the most appropriate method you will need to consider more than your budget and the possible methodologies. To make the best decision, you will also need to weigh the expertise of your evaluators, the maturity of your application, the complexity of the tasks, and possibly even the current status of your usability program.


I have to say that I find it totally unsurprising that the quality of the evaluator is the most important factor in determining the effectiveness of usability evaluation, regardless of the method.

We've known for thirty years in software development that quality of personnel was (by at least a factor of 4) the most important thing affecting the quality of the resulting product. Things like software development methods and tools are known to have roughly a 10-20% impact on metrics such as speed of delivery or errors in final product. It therefore would be unsurprising to see a similar result in the HCI field.

Alan Wexelblat

The Pragmatic Ergonomist, Dr. Eric Schaffer

When people ask me to run a usability test I usually recommend an expert review first.

In formative testing (intended to mold the design), expert reviews are just too good a deal to pass up. They have to be done by experts, and the experts must focus on systematic analysis and comparison with many hundreds of research-based principles. In fact, I think a proper expert review can get at issues of mental model and navigational structure as well as usability testing can.

The expert review can find opportunities that could never be discovered in a test. We might see that users take 45 seconds to request a wakeup call using an automated hotel wakeup system. But only an expert would realize that defaults could be used to cut this time significantly.

Usability testing is more expensive. It may be needed to graphically convince stakeholders that there are usability issues (it is hard to argue with video of users in tears). Testing will identify things that the experts miss and as such is a good follow-up to expert reviews. Usability testing is required for summative analysis (measuring whether usability objectives have been met).


