When Discount Usability Misleads Management – A Solution  

Summary
Usability folks often interact with marketing folks about progress on Web site design. Do they speak the same language? That is the question. If you tell someone that 80% of your test subjects "succeeded," how might you be misleading them? Know how to qualify your discount usability test results with a "margin of error."
Episode #287 of Usability Crossroads: "Telling It Like It Is" 
Scene 1: DAY, INTERIOR, HALLWAY
You: Hey, I just heard our last usability test of the redesigned shopping cart page only found a couple of problems. And we can fix those in the next day or so. We're almost ready to roll.
Marketing: Great. What percentage of your test subjects were actually able to get through the shopping cart?
You: Well, I don't have the details, but my buddy said we had a success rate of 80%. That's better than the 50% success rate he reported two weeks ago on the existing shopping cart.
Marketing: Great. We can take that to the bank.
You (smiling): Cut me a check while you're there.

Scene 2: DAY, INTERIOR, MARKETING OFFICE
Marketing (mumbling while setting up his Excel chart): We know that 1000 people a day put items in our shopping cart and try to check out. And 80% of that means 800 people will complete their order process instead of 500 people. Let's see, 300 extra people completing their order at $50 each means an extra $15,000 a day in income. Maybe I should ask for a bonus for the usability people. Nah, they're just doing their job.

Scene 3: DAY, INTERIOR, USABILITY OFFICE
You (to your buddy who did the test): I just told marketing we found some more problems we can fix on the shopping cart tasks. Solving those two or three problems should get us another bump up in revenue.
Your buddy: Yeah, I guess we'll have to wait and see what happens on the Web site. Anyway, it certainly will be better than what we had before the changes.

Reality quiz
Which is right: Scene 2, Scene 3, both, or neither? To end the suspense, the answer is "Neither".
Usability fails to qualify "80%" success rate
Yes, indeed, you "just told marketing we found some more problems". However, you probably misled your listener because you also said "80%" could do the task, without qualifying what you meant. Qualifying your answer means telling marketing that 80% really only applies to your subjects. Unfortunately, an unqualified 80% invites many other, inferred meanings: some useful, some downright dangerous.
As you saw in Scene 2, marketing interpreted 80% as a real number that was actionable in terms of all the shoppers on your site, not just the subjects in the test. Marketing made the "profit" calculations based on what you said and figured an extra 300 people adds $15,000 to the gross income per day.
But wait a minute, you and your usability buddies all subscribe to the "discount usability" model. Namely, you used just a few subjects with the intent of finding problems, not gathering statistics about your shoppers. Therefore, you may have followed Nielsen's model where you hope to find, on average, about 85% of problems with 5 users. Or, you may have followed Faulkner's update (reported here previously by Kath Straub) where you hope to discover at least 82% of problems with 10 users. Notice the difference between "on average" and "at least"... which would you rather present to your management? "At least" 82% is much more substantive.
But we still haven't gotten to the real issue. The real issue is that when you presented the "80%" success rate to management, you failed to give proper context. You failed to indicate the "margin of error" associated with "80%". The margin of error shows that it is still not "certain" that you found all the problems, contrary to your buddy's comment "it certainly will be better than what we had before." Consequently, marketing went off and drew conclusions the data cannot support.
Marketing fails to question "80%" success rate 
Marketing didn't do much better. Marketing failed to ask you how many subjects you had. Marketing apparently didn't know about "margins of error" when applying the test subject results to the population of buyers. These are both technical terms you should know. They each have important practical implications that you simply must have at your command at all times. By the way, many marketing people do know about margins of error. Are you prepared to answer their questions?
Test results like 80% rate of task success merely tell you what happened in the test session. To be honest, in the discount usability model where you use small numbers of subjects, it's a fairly uninteresting number. Your intent was to find some problems. More subjects let you find more problems. Very simple. But for people interested in making an inference about their population of buyers the 80% is a dangerous number unless it undergoes some transformation. The 80% must be qualified with a margin of error. (See Rea and Parker, 1997, for the whole story on making inferences from samples.) 
Voting polls use "margin of error" to qualify success rate 
This challenge is no different than qualifying the results of a voting poll. For example, when 49% of prospective voters claim they will vote for Candidate A and 51% claim they will vote for Candidate B, do we have a tie or will Candidate B win? We all know that such polls come with a "margin of error", typically plus or minus (+/-) 3%. In this case, the candidates are tied because the +/- 3% causes the 49% and 51% to overlap. Note that to get a margin of error as small as +/- 3% you need 1068 subjects. Whew. Fewer subjects mean a bigger margin of error. 385 subjects get you a margin of error of +/- 5%. 97 subjects get you a margin of error of +/- 10%. You see where we are going with small numbers of subjects.
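The subject counts quoted above can be reproduced with the usual survey sample-size formula. The sketch below is only an illustration under stated assumptions: a 95% confidence level (z = 1.96), the worst-case success rate of 50%, and a very large population. The function name `sample_size_for_margin` is made up for this example.

```python
import math

def sample_size_for_margin(margin, z=1.96, p=0.5):
    """Subjects needed so the margin of error is no larger than `margin`.

    Assumes the normal-approximation formula n = z^2 * p * (1 - p) / margin^2,
    with p = 0.5 as the worst case and z = 1.96 for a 95% confidence level.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Margins of error of 3%, 5% and 10%, as in the polling example.
for margin in (0.03, 0.05, 0.10):
    print(f"+/-{margin:.0%}: {sample_size_for_margin(margin)} subjects")
```

Because the margin is squared in the denominator, halving the margin of error roughly quadruples the number of subjects you need.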

The discount usability "fix" on margin of error 
Recent advances in calculating margins of error for small samples will let you properly qualify your discount usability test results. Jeff Sauro, a usability professional specializing in usability statistics, created an advanced tool at www.measuringusability.com/wald.htm . It supersedes other margin-of-error calculators available on the Web. It also supports discussions about interpreting surveys and test results (e.g., Bartlett, Kotrlik, Higgins, 2001). Specifically, Jeff's special calculator overcomes the shortcomings associated with small sample sizes (as found with discount usability testing). This advanced method is called the "Adjusted Wald Interval" (in contrast to the old method called the "Wald Interval" used by other Web calculators). For our scenario, his calculator shows the lower margin of error should be -32% and the upper margin of error should be +16%. These would show up as "Y-error" bars in Excel.
Read more about this in the paper by Jeff and James R. Lewis, Estimating Completion Rates From Small Samples Using Binomial Confidence Intervals: Comparisons and Recommendations. They will present it at the upcoming 2005 Human Factors and Ergonomics Society conference (Sept 2005).
Use a 95% confidence level 
Let's apply Jeff's calculator to our scenario. The usability team used 10 subjects and 8 passed. So enter those numbers...
What does the 95% confidence level mean? It means that if you did the same test 100 times and worked out the +/- margin of error each time, about 95 of those ranges would contain the true success rate for your whole population. So you would qualify your 80% success rate with a plus and minus "margin of error" wide enough to handle a commonly accepted confidence level. Conducting the "same test" means the same tasks, similar subjects with similar backgrounds, a similar testing environment, etc. Are you happy with a 95% level of confidence in your results? For a medical device, you may want a plus/minus range that holds up in 99 out of 100 tests, or even 99.99%. However, for practical purposes (and non-lethal interface designs) 95% is commonly accepted.
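One way to make the confidence level concrete is to pick a hypothetical "true" success rate and add up the probability of every test outcome whose interval would capture it. The sketch below does that for 10 subjects, assuming the textbook Adjusted Wald formula with z = 1.96; the function names and the 80% true rate are illustrative assumptions, not output from Jeff's calculator.

```python
import math

Z_95 = 1.96  # z-value commonly used for a 95% confidence level

def adjusted_wald(successes, n, z=Z_95):
    """Textbook Adjusted Wald interval: adjust the rate, then apply Wald."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - margin, p_adj + margin

def coverage(true_rate, n, z=Z_95):
    """Probability that a test with n subjects yields an interval
    containing true_rate, summed over every possible number of successes."""
    total = 0.0
    for k in range(n + 1):
        low, high = adjusted_wald(k, n, z)
        if low <= true_rate <= high:
            # Binomial probability of seeing exactly k successes.
            total += math.comb(n, k) * true_rate ** k * (1 - true_rate) ** (n - k)
    return total

# Hypothetical population in which 80% of shoppers really can check out.
print(f"Coverage with 10 subjects: {coverage(0.8, 10):.1%}")
```

With these assumptions the coverage comes out a little above 95%, which is the sense in which the interval "handles" the 95% confidence level even with only 10 subjects.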
Presenting your low and high bounds for success rates 
OK, now press "Calculate" and see your results.
First, check the "Wald" results. You'd get these results from a typical Web "marketing" calculator. (They work for surveys or tests in which you have more than 150 subjects.) For an 80% success rate, it shows a "low bound" of 55.21% (55%) and a "high bound" of 104.79% (105%). The true success rate for your population is expected to fall between these two bounds. Right away, you see one of the limitations of a small sample size. The Wald calculations "blow up" and give a result of more than 100%. Meanwhile, the +/- margin of error (given in the right column) shows what you must add to and subtract from the success rate (80%) to get the low and high boundaries: 24.79%.
Second, check the "Adjusted Wald" results. Use the Adjusted Wald results because they take into account the small number of subjects (under 150). Note the lower bound is now 48.29% (48%) instead of 55%. The upper bound is now 95.62% (96%) instead of 105%. All that said, even with this additional accuracy, you see that the range between 48% and 96% is a large departure from saying "80% success rate". Therefore, be reluctant to report success rates with small sample sizes. They mislead the unwary listener.
Jeff's calculator shows two other types of results: "Exact" and "Score". Use these measures when you have really high or low success rates (above 95% or below 5%). They let you compare with other calculators that use those methods. Most people don't need to worry about these methods.
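For readers who want to see the arithmetic behind the two rows, here is a minimal sketch using the textbook Wald and Adjusted Wald formulas with z = 1.96. It is not the code behind Jeff's calculator: the function names are invented for this example, and the adjusted bounds it produces land within about a percentage point of the figures quoted above rather than matching them exactly.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Plain Wald interval: observed rate +/- z * standard error."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin  # can stray outside 0..1 for small n

def adjusted_wald_interval(successes, n, z=1.96):
    """Adjusted Wald: add z^2/2 successes and z^2 trials, then apply Wald."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the meaningful range of 0% to 100%.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# The scenario from the article: 8 of 10 subjects completed the task.
for name, interval in (("Wald", wald_interval(8, 10)),
                       ("Adjusted Wald", adjusted_wald_interval(8, 10))):
    low, high = interval
    print(f"{name}: {low:.2%} to {high:.2%}")
```

The plain Wald row reproduces the bounds that run past 100%, while the adjusted row stays inside the meaningful range and is noticeably wider on the low side.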
Choosing which success rate to present 
Note that the "P" column ("Probability" of success = success rate) for the Adjusted Wald shows 71.96% (OK, make it 72%). Why does it not show the 80% we started with (8 out of 10)? This is where the method gets its name, the Adjusted Wald method. It's adjusted because you first adjust the success rate, then compute the margin of error around it. This new adjusted success rate has a slightly higher chance of being close to the true population success rate than the unadjusted 80%.
This is a choice you get to make. Jeff explains on his Web page that you could choose to report either of the two success rates (72% or 80%). However, you must stick with the low and high numbers he presents for the Adjusted Wald method. On the one hand, 72% is useful because you get a "symmetric" margin of error: the upper margin of error matches the lower margin of error. You can just add and subtract the "margin of error" given in the rightmost column of the Results. On the other hand, 80% is useful because, well, you just don't have to explain it to anybody. That's pretty much the main reason. If using 72% will cause confusion, just use 80%. But make sure you hand-calculate both the minus and the plus margins of error for your charting program. (In Excel, enter data into the "Y-error bars" dialog, discussed below.) The lower margin of error will be big, and the upper margin of error will be small.
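The hand calculation is just two subtractions. As a small sketch (the function name `error_bar_margins` is hypothetical), using the 80% reported rate and the Adjusted Wald bounds of 48.29% and 95.62% quoted earlier:

```python
def error_bar_margins(reported, low, high):
    """Return the (minus, plus) distances from the reported rate to the
    low and high bounds, all in percentage points."""
    return reported - low, high - reported

# Reported success rate and the Adjusted Wald bounds from the article.
minus, plus = error_bar_margins(80.0, 48.29, 95.62)
print(f"minus {minus:.0f}, plus {plus:.0f}")  # prints "minus 32, plus 16"
```

These are the two values a charting program needs for asymmetric error bars around the 80% bar.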
Next steps in making your slide show...
Now you know how to rewrite Scene 1 above.
Marketing: Great. What percentage of your test subjects were actually able to get through the shopping cart?
You: We don't collect that kind of data. We do discount usability testing to look for problems, not success rates. What kind of margin of error do you need? With 100 people, we can get about a plus or minus 10%. A tighter margin of error takes more people. Did you know we can do remote unattended usability tests pretty economically?
Now you have a dialog worth repeating. With discount usability testing of 10 subjects, margins of error like -32% and +16% necessarily accompany an 80% success rate. So you can see how the success rate of 80% can be misleading when given by itself. That 80% success rate is really the range of 48% to 96% success rates. This range is called the "confidence interval" and it affects every conclusion drawn from it. For example, the marketing calculations of increased profit should have used this confidence interval rather than just 80% for the success rate.
If you really, really, really (yup, 3 "reallys") want to compare 80% with a 50% success rate using 10 subjects, the best you can do is to say that the 80% success rate allows you to "expect some improvement". Note that a 50% success rate with 10 subjects gives a confidence interval from 24% to 77%. The overlap of these two confidence intervals prohibits you from saying there is any certainty of improvement. You only have a "chance" of improvement. (In subsequent reports, we'll discuss the advantages of other measures like satisfaction and time on task for comparing pre- and post-redesign data with small numbers of subjects.)
The two confidence intervals overlap between 48% and 77%. The overlap indicates possible test outcomes that could occur in future testing with similar subjects and the same tasks. The overlap shows that the outcomes could have been the same many times out of 100.
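To see what the confidence interval does to the marketing arithmetic, here is a rough sketch that uses only numbers already given in this article: 1000 shoppers a day, the 48% to 96% interval for the redesigned cart, and the 24% to 77% interval for the existing cart. The helper name `overlap` and the variable names are invented for the illustration.

```python
def overlap(a, b):
    """Return the overlapping part of two (low, high) intervals, or None."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None

# Confidence intervals in percent, as quoted in the article.
old_cart_ci = (24, 77)   # 50% success rate with 10 subjects
new_cart_ci = (48, 96)   # 80% success rate with 10 subjects
shoppers_per_day = 1000

# Completed orders per day at each end of the new cart's interval.
orders_low = shoppers_per_day * new_cart_ci[0] // 100
orders_high = shoppers_per_day * new_cart_ci[1] // 100

print(f"Completed orders per day: {orders_low} to {orders_high}")
print(f"Overlap with the old cart: {overlap(old_cart_ci, new_cart_ci)}")
```

Instead of a firm 800 completed orders a day, the honest statement is somewhere between 480 and 960, and the low end is no better than what the old cart's 50% result suggested.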

How to say it like it is... 
Jeff is a true educator. Last month, he received his master's degree from Stanford University in how to teach statistics for usability purposes. Check out his new tutorial on confidence intervals at www.measuringusability.com/stats. Meanwhile, pull out Excel and remember to right-click on the little blue bar in your chart. Select "Format data series". In the dialog that opens, fill out the "custom" fields with the +/- margins of error you get from Jeff's calculator.
Conclusion and future topics in quantitative usability 
Discount usability testing pays its way by helping you discover problems before your site or application goes public. It was not really meant to determine a success rate. But, discovering problems is important. As Jakob Nielsen and others have said about discount usability engineering "any data is data" and "anything is better than nothing." We continue to endorse that position.
Meanwhile, many things can be said about "quantitative usability" that help position or clarify the significance of your usability testing even with small sample sizes. For example, how can you characterize the efficiency of small sample sizes in discovering problems? Some problems have a greater chance of being found by any of your test subjects. Other problems have less chance of being found. In an upcoming newsletter, we'll discuss how you can report the efficiency of your discount usability testing in a manner that makes sense. James Lewis, the esteemed coauthor with Jeff on the paper mentioned here, is one of the leaders striving to characterize small sample test results. We'll discuss some of his and others' findings in future newsletters.
References
Bartlett, J.E., Kotrlik, J.W., and Higgins, C.C. (2001). Determining appropriate sample size in survey research. Information Technology, Learning, and Performance Journal, 19 (1), pp. 43-50.
Rea, L.M. and Parker, R.A. (1997). Designing and Conducting Survey Research, 2nd Ed., Jossey-Bass, Inc.
Sauro, J. and Lewis, J.R. (In Press). Estimating completion rates from small samples using binomial confidence intervals: comparisons and recommendations. Proceedings of the Human Factors and Ergonomics Society Annual Meeting (HFES 2005), Orlando, FL.