
UI Design Newsletter – July, 2006

In This Issue

Is usability testing as we know it about to radically change?

Susan Weinschenk, Ph.D., CUA, Chief of Technical Staff at HFI, looks at new trends in usability testing.

The Pragmatic Ergonomist

Eric Schaffer, Ph.D., CUA, CPE, Founder and CEO of HFI, offers practical advice.

Is usability testing as we know it about to radically change?

Usability testing is a tried and true methodology in our industry. Periodically it comes under fire from within and outside the usability community, but has always stood the test of time. Although there is some variation in usability testing protocols from tester to tester, or firm to firm, the basic concepts – think-aloud techniques during the test, usability engineer observing, logging data, interpreting user actions – remain the same.

Is this all about to change? Recent research on usability techniques is yielding some interesting results, and may point us in different directions in the future. Do we stay the course of tradition, or do we embrace growth and change?

(Please note that the comments in this newsletter are about actual research, not just trying out different methodologies.)

Question #1: Does the usability engineer even need to be there?

For several years there has been debate in the usability community about automated testing vs. having a usability engineer present and running the test. West and Lehman conducted a study in which they compared automated testing with traditional usability engineer-led testing. Are the results and the data generated by automated testing the same as if a usability engineer ran the test?

Here's what they found: There were a few differences in the data coming from users in the automated vs. traditional testing. For example, task times were longer in the automated test because the time spent reading the task was included in the overall task time, whereas in the traditional test the usability engineer didn't start the clock until after the participant had read the task instruction. But much of the data was the same for the automated test and the in-person test, both quantitatively and qualitatively. Failure rates were very similar, and both methods elicited plenty of participant comments.
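To make the timing difference concrete, here is a minimal sketch in Python. The timestamps, the `TaskTiming` class and the two function names are made up for illustration and do not come from the West and Lehman study. The sketch only shows how starting the clock at different moments yields different task times for the same attempt.

```python
from dataclasses import dataclass


@dataclass
class TaskTiming:
    """Timestamps in seconds for one task attempt (hypothetical values)."""
    instruction_shown: float  # the task text appears
    reading_finished: float   # the participant has read the task
    task_completed: float     # the participant finishes the task


def automated_task_time(t: TaskTiming) -> float:
    # Automated condition: the clock starts when the instruction appears,
    # so reading time is part of the measured task time.
    return t.task_completed - t.instruction_shown


def moderated_task_time(t: TaskTiming) -> float:
    # Traditional condition: the usability engineer starts the clock
    # only after the participant has read the instruction.
    return t.task_completed - t.reading_finished


attempt = TaskTiming(instruction_shown=0.0, reading_finished=12.0, task_completed=95.0)
print(automated_task_time(attempt))  # 95.0
print(moderated_task_time(attempt))  # 83.0
```

If reading time is logged separately, subtracting it from the automated figure puts both conditions on the same clock before task times are compared.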

Before you get too excited or too upset (depending on whether you are a fan of automated testing or not), there is one key difference that the researchers didn't consider very important. I disagree. In the in-person condition, the usability engineer found on average 13 additional usability problems that were not identified in the automated condition.

You can get valid information from automated testing. You can use it for major benchmarking measures, but don’t expect to find all the critical usability issues.

One vote for tradition.

Question #2: Does it matter if you test with low-fidelity or high-fidelity prototypes, or is that even the right question?

For many years low-fidelity prototypes – paper sketches, for example – were considered the preferred alternative for testing since they were easy and fast to create. It was believed that users would not assume you were "done with design" and would therefore be more likely to give feedback. Recently high-fidelity prototypes have taken over, as they allow more realistic depictions of today's complicated, colorful and richly interactive screens.

In a research study by McCurdy et al., the authors argue that we have both the question and the answers wrong. What should you be using? "Mixed fidelity" prototypes. Characterizing prototypes as low fidelity or high fidelity doesn't capture the range of differences they can have. The authors suggest that it is more useful to describe a prototype along five dimensions:

  • level of visual refinement
  • breadth of functionality
  • depth of functionality
  • richness of interactivity
  • richness of data model

You decide, based on the purpose of a particular usability test, whether to use low, medium or high amounts for each dimension. Their study suggests that carefully choosing from the dimensions results in data that is closer to "real" performance data, yet you have the advantages of a lower fidelity test (easier to create and change than a final product).
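One way to picture a mixed-fidelity prototype is as a profile with one level per dimension. The sketch below is only an illustration: the five field names follow the dimensions listed above, but the `Level` scale, the `FidelityProfile` class and the example values are assumptions, not something taken from the McCurdy et al. paper.

```python
from dataclasses import dataclass, fields
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class FidelityProfile:
    """One fidelity level per dimension (hypothetical representation)."""
    visual_refinement: Level
    breadth_of_functionality: Level
    depth_of_functionality: Level
    richness_of_interactivity: Level
    richness_of_data_model: Level

    def is_mixed(self) -> bool:
        # A profile is "mixed" when its dimensions are not all at the same level.
        levels = {getattr(self, f.name) for f in fields(self)}
        return len(levels) > 1


# Hypothetical profile for a navigation test: rough visuals, but every
# top-level screen is reachable so users can move around freely.
navigation_test = FidelityProfile(
    visual_refinement=Level.LOW,
    breadth_of_functionality=Level.HIGH,
    depth_of_functionality=Level.LOW,
    richness_of_interactivity=Level.MEDIUM,
    richness_of_data_model=Level.LOW,
)

# A classic paper sketch is low on every dimension.
paper_sketch = FidelityProfile(*([Level.LOW] * 5))

print(navigation_test.is_mixed())  # True
print(paper_sketch.is_mixed())     # False
```

Writing the profile down before building the prototype makes the trade-off explicit: effort goes only into the dimensions that the test's purpose depends on.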

One vote for change and growth.

Question #3: Should you test one design or many?

Although some usability tests involve testing multiple designs, most test one design and look for usability problems/issues in the one design which will then be iterated. Is there an advantage to testing alternative designs all at the same time?

Tohidi et al. studied whether the quantity and type of comments you receive during a usability test change if you show more than one design. If you show three alternative designs, for example, do you get different feedback, or more feedback, than if you tested one?

The data they collected contained interesting results and implications. When only one prototype was shown, it received higher ratings and more positive comments. People were being "nicer" about evaluating the single design. When users saw three alternative designs during the same test, they gave more critical feedback. They weren't so "nice." The authors refer to a previous analysis by Wiklund postulating that when participants view more than one prototype, it sends a clear message that the designers have not yet made up their minds as to which design to use. Since a commitment hasn't been made, the researchers are seen as more neutral, and thus the participant doesn't have to worry as much about disappointing the researcher with a negative reaction. This in turn allows the participant to be more critical.

Interestingly, the researchers were also testing the hypothesis that showing users multiple design solutions would help the users engage in participatory design. This proved not to be true. In both the one-design and the three-design conditions, users did not come up with redesign suggestions. (Well, we know that users are not usually designers... this finding is not surprising.)

A small but interesting finding in this study was that participants who reviewed only one design made comments, but did not totally "reject" the design. However, some of the participants in the multiple design condition did reject the entire design, saying things such as, "I would not buy this one."

One vote for change and growth.

Question #4: Usability testing = the think-aloud technique?

One of the hallmarks of an in-person usability test is the think-aloud technique. Can you imagine a usability test in which the user is not thinking aloud? Well, think again. In a study by Guan et al., the researchers challenge our assumptions. They look at a technique called Retrospective Think Aloud (RTA). The usual usability testing protocol is Concurrent Think Aloud (CTA). CTA has been criticized for not simulating normal tasks: in "real" life, users do not narrate each action while they are doing a task. Lately there has been some interest in using RTA instead of CTA. With RTA, users do the tasks silently and then talk about what they did afterwards. In this study the researchers compared participants' RTA accounts with eye tracking data to determine the validity of the RTA technique.

They found that people's accounts of their task performance followed the same sequence as what they had attended to according to the eye tracking data. And it didn't matter whether the task was simple or complex.

However, they also found that the participants left out a lot of information. The sequence of what they said they did and why they did it matched the sequence in eye tracking, but there was a lot of information omitted. The researchers attribute this to the fact that the participants are summarizing their actions, but I wonder if this may actually be hinting at a new frontier of usability testing instead – see Question #5.
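The combination of "same order, but with omissions" can be illustrated with a short Python sketch. It is not the analysis Guan et al. performed. The screen-area names, the function names and the coverage measure are all assumptions chosen for illustration.

```python
from typing import Sequence


def is_in_same_order(recounted: Sequence[str], fixated: Sequence[str]) -> bool:
    """True if every recounted area appears in `fixated` in the same relative order."""
    remaining = iter(fixated)
    # `item in remaining` advances the iterator, so each match must come
    # after the previous one in the eye tracking sequence.
    return all(item in remaining for item in recounted)


def coverage(recounted: Sequence[str], fixated: Sequence[str]) -> float:
    """Fraction of distinct fixated areas that the participant mentioned."""
    seen = set(fixated)
    if not seen:
        return 0.0
    return len(seen & set(recounted)) / len(seen)


# Hypothetical data: areas the eye tracker recorded, in order,
# and the areas the participant mentioned afterwards.
fixated = ["logo", "search box", "nav menu", "banner", "results list", "filter panel"]
recounted = ["search box", "results list", "filter panel"]

print(is_in_same_order(recounted, fixated))  # True
print(coverage(recounted, fixated))          # 0.5
```

In this made-up case the account is faithful in order but mentions only half of what the participant looked at, which mirrors the pattern described above.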

One vote for tradition.

Question #5: Coming attraction?

Both CTA and RTA assume that having users monitor their own actions and reactions results in valid data. But in a fascinating book called Strangers to Ourselves: Discovering the Adaptive Unconscious, Timothy Wilson reviews theories and research indicating that the vast majority of our actions and decisions are made from non-conscious processes. In other words, although we will prattle on about why we do what we do, the real reasons are not available to our conscious minds. It's a compelling argument, with real data to back it up. So what does this mean for usability testing and the think-aloud technique? I'm still working on this one... I'm hoping someone will devise a galvanic skin response mouse so that we can measure changes in bodily functions rather than relying on meta-cognition.

One vote for growth and change.

Where are we heading?

So what's the final tally? Two votes for tradition, and three for growth and change... Hang on, it might be a bumpy ride!


Guan, Z., Lee, S., Cuddihy, E., and Ramey, J. (2006). The Validity of the Stimulated Retrospective Think-Aloud Method as Measured by Eye Tracking, CHI 2006 Proceedings.

McCurdy, M., Connors, C., Pyrzak, G., Kanefsky, B., and Vera, A. (2006). Breaking the Fidelity Barrier, CHI 2006 Proceedings.

Tohidi, M., Buxton, W., Baecker, R., and Sellen, A. (2006). Getting the Right Design and the Design Right: Testing Many Is Better Than One, CHI 2006 Proceedings.

West, R. and Lehman, K. (2006). Automated Summative Usability Studies: An Empirical Evaluation, CHI 2006 Proceedings.

Wiklund, M., Thurrott, C., and Dumas, J. (1992). Does the Fidelity of Software Prototypes Affect the Perception of Usability? Proceedings Human Factors Society 36th Annual Meeting, 399-403.

Wilson, T. (2004). Strangers to Ourselves: Discovering the Adaptive Unconscious, Belknap Press, new edition.

I thought Susan’s article was fantastic: a concise, timely and relevant summary of academic research that may influence future directions in our field.

I wanted to also point out that in market research in Australia, at least, they are starting to use neuroscience to gauge reactions to things (e.g. advertisements) in much the way that Susan alludes to in her response to Question 5. See this link, for example.

I have concerns about the usefulness of such data and the quality of the resulting analysis (much as I do with eye tracking) but I thought you might be interested nonetheless.

Finally, thank you for a consistently great read.

Jessica Enders
The Hiser Group


One thought on the generalizability of the West and Lehman paper. They used SAS employees from marketing and UI design groups. I realize they did this for expediency's sake. However, I wonder whether less techy, communications-oriented people would provide qualitative feedback of the same quality in their typed-in automated test comments. I was a little suspicious when I saw that a few of the verbatims seemed more descriptive than what I would expect test participants from the public at large to say. I would not expect references such as "easy to find the dialog" and "in order to commit the changes" from the public at large. Also, at least in the examples given, some were very descriptive and task-focused, almost what I'd expect from a usability engineer describing a problem. It'd be nice to see the study done with a broader recruiting profile.

John Bierschwale
Quest Software

People looking for another perspective on "QUESTION #4: USABILITY TESTING = THE THINK-ALOUD TECHNIQUE?" should read relevant sections of "Blink" by Malcolm Gladwell.

David Robertson


The Pragmatic Ergonomist, Dr. Eric Schaffer

Use automated, unmoderated testing for SUMMATIVE testing only, where you just want to measure time and errors.

Test early and often, irrespective of the quality of your prototype. If your user praises the design, take this with a grain of salt (especially in Asia).

Use the retrospective method only when the task is too complex, or the distraction too great, for the normal think-aloud method. In formative testing, stop at key points and do short, in-depth interviews to understand the underlying motivations and feelings.


© 1996-2014 Human Factors International, Inc. All rights reserved