
UI Design Newsletter – May, 2007

In This Issue

Why "how many users" is just the wrong question – Rethinking the requirements for valid usability tests

HFI Chief Scientist Kath Straub, PhD, CUA, revisits the question of how many users are required for an effective usability test.

The Pragmatic Ergonomist

Dr. Eric Schaffer, CUA, CPE, Founder and CEO of HFI, offers practical advice.

Why "how many users" is just the wrong question

Death. Taxes. How-many-users.

Every day, in offices around the world, usability professionals ask and are asked this question: How many users do we need for our usability test? It's an important question. We want to find most of the problems, and especially the most severe ones, so we need to test enough people. But usability testing is expensive, and the cost increases with each participant, so we don't want to test too many, either.

On the one hand, the received theoretical wisdom suggests that there is an answer to this question. And the answer is "5" (Virzi, 1992; Nielsen and Landauer, 1993). That is, based on a probabilistic formula, you need to test 5 users to find about 85% of the problems that will trip up 1/3 or more of your users. The number 5 is very concrete. Practitioners like it. 5 is easy to remember.
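To make the arithmetic behind the number 5 concrete, here is a minimal sketch of the discovery formula in Python. The detection probability p = 0.31 is the figure commonly quoted in this literature, an assumption rather than a constant:

# Discovery formula behind the "5 users" rule (Virzi, 1992;
# Nielsen & Landauer, 1993): if p is the probability that one user
# encounters a given problem, n users are expected to uncover a
# 1 - (1 - p)^n share of the problems.

def proportion_found(n_users, p=0.31):
    """Expected share of problems uncovered by n_users participants."""
    return 1 - (1 - p) ** n_users

for n in range(1, 9):
    print(f"{n} users: {proportion_found(n):.0%} of problems expected")
# With p = 0.31, five users come out near 84% -- the familiar "85%" figure.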

On the other hand, this question gets debated every year at the CHI conference. You can count on it. Like death and taxes. The same debate. Given that the UX community (re-)debates this every year, it seems the wisdom has not been so well received.

Blue! No, Green!... No, 5!

That the number 5 has such staying power says something interesting about human memory and the way people reason. The 5-formula can work. But, like tossing a coin, it's probabilistic. If you keep flipping a coin over and over, it will come up heads about half the time. But it can also come up tails nine times in a row.

Similarly, if you run enough usability tests with 5 users, on average you will find most of the errors most of the time. But if you run only one test (or just a few) with 5 users, you may well uncover fewer errors than the formula projects (Spool and Schroeder, 2001; Faulkner, 2003; or, if you are less ambitious, there is the May 2004 newsletter).
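An illustrative simulation (ours, not from any of the cited papers) shows this scatter. It assumes 20 problems, each hit independently with probability 0.31 -- an idealization that real problems certainly violate:

import random

random.seed(1)
N_PROBLEMS, P_HIT, N_USERS = 20, 0.31, 5

def one_test():
    """Count the distinct problems a single 5-user test uncovers."""
    found = set()
    for _ in range(N_USERS):
        found.update(p for p in range(N_PROBLEMS) if random.random() < P_HIT)
    return len(found)

runs = [one_test() for _ in range(10_000)]
print("average share found:", sum(runs) / len(runs) / N_PROBLEMS)  # close to 0.84
print("worst single test:  ", min(runs) / N_PROBLEMS)              # well below it

On average the formula holds, but any one test can land far under it.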

There are other challenges with the 5-formula. For instance, to calculate the number of testing participants you need, you have to know a priori how many problems there are to find. If you knew that, you likely wouldn't need to test to find them, eh?

Reach beyond... # of users

Not surprisingly, the debate churned on in San Jose (CHI 2007). But this year, Lindgaard and Chattratichart (2007) threw down a different gauntlet. The obstacle to solving the problem, they said, is the question. "How many users" is the wrong way to think about it.

In usability testing, we are looking for mismatches between the site/app model and the user's mental model on the key and critical tasks. Framed this way, the criterion that determines how many problems get uncovered is how many tasks participants try, not how many participants there are.

To test their claim, Lindgaard and Chattratichart reanalyzed the usability testing data from CUE-4* (Molich & Jeffries, 2003). Within that project, 9 highly experienced teams used think-aloud techniques to independently test the same site. The teams received identical input from the coordinators (site objectives, problem criteria, testing focus). Each team shaped its own testing plan and protocol, conducted the testing, and aggregated the findings into a pre-determined feedback format.

Lindgaard and Chattratichart looked for similarities and differences across the methods and findings reported by each team. Specifically, they were seeking relationships between test design (e.g., # users, # tasks) and number of problems identified.

Their study reports that there was no reliable correlation between the number of users tested and the number of usability problems uncovered. Testing more users did not ensure that more problems would be discovered. Further, although each of the 9 teams tested 5 users or more, they reported only 7-43% of the known problems, not the 85% predicted by the 5-formula.

In contrast, their analysis showed a significant positive correlation between the number of tasks evaluated and the number of problems uncovered. That is, the more tasks a team included in their testing protocol, the more problems they uncovered.

They conclude that other things being equal (e.g., quality of recruiting), the better predictor of the productivity of usability testing is the number of tasks participants (try to) complete, not the number of participants who try to complete them.
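One way to see why task coverage matters is to build it into the classic formula. The sketch below is a hypothetical model, not Lindgaard and Chattratichart's analysis: it simply assumes a problem can only surface if some task in the protocol exercises it:

def expected_found(n_users, task_coverage, p=0.31):
    """Expected share of problems found when the tasks exercise only a
    task_coverage fraction (0..1) of the problems that exist."""
    return task_coverage * (1 - (1 - p) ** n_users)

print(f"{expected_found(5, 0.40):.0%}")  # ~34%, inside the 7-43% range above
print(f"{expected_found(5, 0.90):.0%}")  # ~76%, broader tasks find more

Under this toy model, a problem no task touches has zero chance of being found, no matter how many users you run.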


* The CUE studies (Molich & Dumas, in press; Molich, Ede, Kaasgaard, & Karyukin, 2004, among others) compare the methods and findings of different teams conducting the same usability test. CUE findings show that different usability testing teams evaluating the same interface report different numbers of usability problems, often with very little overlap in the problems identified. There's clearly more to it than the number of users.

References

Faulkner, L. Beyond the five-user assumption: Benefits of increased sample sizes in usability testing. Behavior Research Methods, Instruments & Computers, 35, 3, Psychonomic Society (2003), 379-383.

Lindgaard, G. and Chattratichart, J. Usability Testing: What Have We Overlooked? CHI 2007 Proceedings, ACM Press (2007).

Molich, R. & Dumas, J. S. Comparative Usability Evaluation (CUE-4). Behaviour & Information Technology, Taylor & Francis (in press).

Molich, R. & Jeffries, R. Comparative expert review. In Proceedings CHI 2003, Extended Abstracts, ACM Press (2003), 1060-1061.

Molich, R., Ede, M. R., Kaasgaard, K., & Karyukin, B. Comparative usability evaluation. Behaviour & Information Technology, 23, 1, Taylor & Francis (2004), 65-74.

Nielsen, J., & Landauer, T. K. A mathematical model of the finding of usability problems. In Proceedings of INTERCHI 1993, ACM Press (1993), 206-213.

Spool, J. & Schroeder, W. Testing Websites: Five users is nowhere near enough. In Proceedings CHI 2001, Extended Abstracts, ACM Press (2001), 285-286.

Virzi, R.A. Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, HFES (1992), 457-468.

I have for years read the debates over what number of users is statistically sufficient, and yes, the minimum of 5 has always been a safe bet. But seriously, I don't see how increasing the number of tasks is more beneficial. I design applications, and I generally test a scope of tasks per feature; there is no need for more tasks when my goal is to test a specific set of features that can be completed with a finite set of tasks. Also, the article references sites; in my experience, the types of tests we perform to validate the usability of a site are often different from the measures used to test the usability of product applications. I think this distinction needs to be made.

Shilpa
HP

Can you please comment on selling this idea to clients – three groups of 6-12 participants? This would be helpful because every different user group you recruit adds to the cost. Is it advisable to separate out common tasks across groups and special tasks per specific group? Somewhere it also connects to the maturity of the usability practice within an organization. Your recommendations can help practitioners in companies sell this to their own management or clients. And if an end-to-end software solution provider needs the bandwidth to address usability in projects, are there more automated tests or techniques that can be provided?

Raju


The Pragmatic Ergonomist, Dr. Eric Schaffer

This result is fantastic! It's like trying to find potholes in a city. Not every car hits every pothole in the road. So you need to send a number of cars down each road. But it is even more important to send cars down a larger NUMBER of roads. The key seems to be in more tasks, not just more users. The problem is that you can only run a given number of tasks with a single test participant. More than 60 or perhaps 90 minutes of testing won't work well.

I propose a "Lindgaard-Chattratichart Testing Strategy." Test 3 different groups of participants. Put maybe 6 to 12 people in each group. Then have each group do a different basket of tasks. This will allow us to test a LOT of different tasks and should give a far better level of reliability.
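As a rough sketch of the arithmetic behind this strategy (the task names, 24-task pool, and basket sizes below are illustrative assumptions, and the shared common tasks echo Raju's question above):

def make_baskets(special_tasks, common_tasks, n_groups=3):
    """Split special_tasks evenly across n_groups baskets, each
    prefixed with the common_tasks that every group performs."""
    per_group = len(special_tasks) // n_groups
    return [common_tasks + special_tasks[i * per_group:(i + 1) * per_group]
            for i in range(n_groups)]

common = ["log in", "check order status"]       # shared, comparable across groups
special = [f"task {i}" for i in range(1, 25)]   # 24 group-specific tasks
for g, basket in enumerate(make_baskets(special, common), 1):
    print(f"Group {g}: {len(basket)} tasks per 60-90 minute session")

Three groups now cover 26 distinct tasks per study, versus roughly 10 if every participant ran the same 60-90 minute protocol.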




