
UI Design Newsletter – June/July, 2005

In This Issue

When Discount Usability Misleads Management – A Solution

John Sorflaten, PhD, CUA, CPE, Project Director, looks at the difference between finding usability problems and predicting success rates.

two viewpoints - UT results vs financial effect

The Pragmatic Ergonomist

Eric Schaffer, PhD, CUA, CPE, Founder and CEO of HFI, offers practical advice.

When Discount Usability Misleads Management – A Solution

Summary

Usability folks often interact with marketing folks about progress on Web site design. Do they speak the same language? That is the question. If you tell someone that 80% of your test subjects "succeeded," how might you be misleading them? Know how to qualify your discount usability test results with a "margin of error."

Episode #287 of Usability Crossroads: "Telling It Like It Is"

Scene 1: DAY, INTERIOR, HALLWAY

You: Hey, I just heard our last usability test of the redesigned shopping cart page only found a couple problems. And we can fix those in the next day or so. We're almost ready to roll.

Marketing: Great. What percentage of your test subjects were able to actually get through the shopping cart?

You: Well, I don't have the details, but my buddy said we had a success rate of 80%. That's better than the 50% success rate he reported two weeks ago on the existing shopping cart.

Marketing: Great. We can take that to the bank.

You (smiling): Cut me a check while you're there.

Scene 2: DAY, INTERIOR, MARKETING OFFICE

Marketing (mumbling while setting up his Excel chart): We know that 1000 people a day put items in our shopping cart and try to check out. An 80% success rate means 800 people will complete their order instead of 500. Let's see, 300 extra people completing a $50 order means an extra $15,000 a day in revenue. Maybe I should ask for a bonus for the usability people. Nah, they're just doing their job.

Success Rates

Scene 3: DAY, INTERIOR, USABILITY OFFICE

You (to your buddy who did the test): I just told marketing we found some more problems we can fix on the shopping cart tasks. Solving those two or three problems should get us another bump up in revenue.

Your buddy: Yeah, I guess we'll have to wait and see what happens on the Web site. Anyway, it certainly will be better than what we had before the changes.

Reality quiz

Which is right – Scene 2, Scene 3, both, or neither?

To end the suspense, the answer is "Neither".

Usability fails to qualify "80%" success rate

Yes, indeed, you "just told marketing we found some more problems". However, you probably misled your listener, because you also said "80%" could do the task – without qualifying what you meant. Qualifying your answer means telling marketing that the 80% really applies only to your test subjects. Unfortunately, it carries many other inferred meanings – some useful, some downright dangerous.

As you saw in Scene 2, marketing interpreted 80% as a real number that was actionable in terms of all the shoppers on your site – not just the subjects in the test. Marketing made the "profit" calculations based on what you said and figured that an extra 300 people add $15,000 per day to gross income.

But wait a minute, you and your usability buddies all subscribe to the "discount usability" model. Namely, you used just a few subjects – with the intent of finding problems – not statistics about your shoppers.

Therefore, you may have followed Nielsen's model where you hope to find, on average, about 85% of problems with 5 users. Or, you may have followed Faulkner's update (reported here previously by Kath Straub) where you hope to discover at least 82% of problems with 10 users.

Notice the difference between "on average" and "at least"... which would you rather present to your management? "At least" 82% is much more substantive.

But we still haven't gotten to the real issue. The real issue is that when you presented the "80%" success rate to management, you failed to give proper context. You failed to indicate the "margin of error" associated with "80%".

The margin of error shows that you still cannot be "certain" you found all the problems – contrary to your buddy's comment that "it certainly will be better than what we had before." Consequently, marketing went off and drew conclusions that were flatly wrong.

Marketing fails to question "80%" success rate

Marketing didn't do much better. Marketing failed to ask how many subjects you had, and apparently didn't know about "margins of error" when applying test-subject results to the whole population of buyers. Sample size and margin of error are both technical terms with important practical implications that you simply must have at your command at all times. By the way, many marketing people do know about margins of error. Are you prepared to answer their questions?

Test results like an 80% task-success rate merely tell you what happened in the test session. To be honest, in the discount usability model, where you use small numbers of subjects, it's a fairly uninteresting number.

Your intent was to find some problems. More subjects let you find more problems. Very simple. But for people interested in making an inference about their population of buyers, the 80% is a dangerous number unless it undergoes some transformation. The 80% must be qualified with a margin of error. (See Rea and Parker, 1997, for the whole story on making inferences from samples.)

Voting polls use "margin of error" to qualify success rate

This challenge is no different than qualifying the results of a voting poll. For example, when 49% of prospective voters claim they will vote for Candidate A and 51% claim they will vote for Candidate B, do we have a tie or will Candidate B win?

We all know that such polls come with a "margin of error", typically plus or minus (+/-) 3%. In this case, the race is a statistical tie because the +/- 3% makes the 49% and 51% overlap.

Note that to get a margin of error as small as +/- 3% you need about 1068 subjects (at the usual 95% confidence level, explained below). Whew. Fewer subjects mean a bigger margin of error: 385 subjects get you +/- 5%, and 97 subjects get you +/- 10%. You can see where this is going with small numbers of subjects.
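If you want to check those figures yourself, here is a minimal Python sketch (not part of the original article) using the standard sample-size formula n = z^2 * p(1-p) / E^2, with the worst-case proportion p = 0.5 and a 95% confidence level (z = 1.96):

import math

def sample_size_for_margin(margin, z=1.96, p=0.5):
    # Subjects needed so the +/- margin of error is no larger than `margin`
    # (margin expressed as a proportion, e.g. 0.03 for +/- 3%).
    return math.ceil(z**2 * p * (1 - p) / margin**2)

for margin in (0.03, 0.05, 0.10):
    print(f"+/- {margin:.0%} margin of error -> about {sample_size_for_margin(margin)} subjects")
# Prints roughly 1068, 385, and 97 -- the figures quoted above.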

Margin of Error Gets Bigger with Smaller Sample N

The discount usability "fix" on margin of error

Recent advances in calculating margins of error for small samples will let you properly qualify your discount usability test results. Jeff Sauro, a usability professional specializing in usability statistics, created an advanced tool at www.measuringusability.com/wald.htm. It supersedes other margin-of-error calculators available on the Web. It also supports discussions about interpreting surveys and test results (e.g., Bartlett, Kotrlik, & Higgins, 2001).

Specifically, Jeff's special calculator overcomes the shortcomings associated with small sample sizes (as found with discount usability testing). This advanced method is called the "Adjusted Wald Interval" (in contrast to the old method called the "Wald Interval" used by other Web calculators).

For our scenario (8 of 10 subjects succeeding, detailed below), his calculator shows a lower margin of error of -32% and an upper margin of error of +16%. These would show up as "Y-error" bars in Excel, like this:

80% Success Rates

Read more about this in the paper by Jeff and James R. Lewis, Estimating Completion Rates From Small Samples Using Binomial Confidence Intervals: Comparisons and Recommendations. They will present it at the upcoming 2005 Human Factors and Ergonomics Society conference (Sept 2005).

Use a 95% confidence level

Let's apply Jeff's calculator to our scenario. The usability team used 10 subjects and 8 passed. So enter those numbers...

Input

What does the 95% confidence level mean? It means that if you repeated the same test 100 times, about 95 of the resulting confidence intervals would contain the true success rate of your whole population of shoppers. So you qualify your 80% success rate with plus and minus margins of error wide enough to meet a commonly accepted confidence level.

Conducting the "same test" means the same tasks, similar subjects with similar backgrounds, and similar testing environment, etc.

Are you happy with a 95% level of confidence in your results? For a medical device, you may want a plus/minus range wide enough for a 99% or even 99.99% confidence level. However, for practical purposes (and non-lethal interface designs) 95% is commonly accepted.

Presenting your low and high bounds for success rates

OK, now press "Calculate" and see your results.

Results

First, check the "Wald" results. You'd get these from a typical Web "marketing" calculator. (Those work for surveys or tests in which you have more than 150 subjects.) For an 80% success rate, it shows a low bound of 55.21% (55%) and a high bound of 104.79% (105%). The true result over many repeated tests should fall between these two bounds. Right away, you see one of the limitations of a small sample size: the Wald calculation "blows up" and gives a result greater than 100%. Meanwhile, the +/- margin of error (given in the right column) shows the amount (24.79%) you add to and subtract from the success rate (80%) to get the low and high boundaries.

Second, check the "Adjusted Wald" results. Use the Adjusted Wald results because they take into account the small number of subjects (under 150). Note the lower bound is now 48.29% (48%) instead of 55%. The upper bound is now 95.62% (96%) instead of 105%.
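If you would rather check these numbers in code than in the Web calculator, here is a minimal Python sketch of both formulas. It assumes the Adjusted Wald works by adding z^2/2 successes and z^2 trials before applying the usual Wald formula (the standard Agresti-Coull style adjustment); the calculator's exact decimals may differ by a fraction of a percent due to rounding.

import math

def wald_interval(successes, n, z=1.96):
    # Plain Wald interval: fine for large samples, misleading for small ones.
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def adjusted_wald_interval(successes, n, z=1.96):
    # Adjusted Wald: add z^2/2 "successes" and z^2 "trials", then apply
    # the Wald formula to the adjusted proportion.
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half_width, p_adj + half_width

print("Wald:         ", wald_interval(8, 10))           # about (0.55, 1.05) -- above 100%!
print("Adjusted Wald:", adjusted_wald_interval(8, 10))  # about (0.48, 0.95)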

All things considered, even with this additional accuracy, you can see that the range from 48% to 96% is a large departure from simply saying "80% success rate". Therefore, be reluctant to report success rates with small sample sizes. They mislead the unwary listener.

Jeff's calculator shows two other types of results: "Exact" and "Score" (see illustration above). Use these measures when you have really high or low success rates (above 95% or below 5%). They let you compare with other calculators that use those methods. Most people don't need to worry about these methods.

Choosing which success rate to present

Note that the "P" column ("Probability" of success = success rate) for the Adjusted Wald shows 71.96% (OK, make it 72%). Why does it not show the 80% we started with (8 out of 10)? This is where the method gets its name: it's "adjusted" because you adjust the success rate first, then compute the margin of error. The adjusted success rate has a slightly better chance of being close to the true population rate than the unadjusted 80%.

This is a choice you get to make. Jeff explains on his Web page that you could choose to report either of the two success rates (72% or 80%). However, you must stick with the low and high numbers he presents for the Adjusted Wald method.

On the one hand, 72% is useful because you get a "symmetric" margin of error – the upper margin of error matches the lower margin of error. You can just add and subtract the "margin of error" given in the rightmost column of the Results.

On the other hand, 80% is useful because, well, you just don't have to explain it to anybody. That's pretty much the main reason. If using 72% will cause confusion, just use 80%. But make sure you hand-calculate both the minus and the plus margins of error for your charting program. (In Excel, enter data into the "Y-error bars" dialog, discussed below.) The lower margin of error will be big, and the upper margin of error will be small.
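For example, using the rounded Adjusted Wald bounds from the Results above, the hand calculation around the unadjusted 80% is just two subtractions:

# Asymmetric Y-error bars around the unadjusted 80% success rate,
# using the Adjusted Wald bounds reported above (48.29% and 95.62%).
success_rate = 80.0
low_bound, high_bound = 48.29, 95.62

minus_error = success_rate - low_bound   # about 31.7 -> the big "minus" (lower) bar
plus_error = high_bound - success_rate   # about 15.6 -> the small "plus" (upper) bar
print(f"Excel Y-error bars: minus {minus_error:.1f}%, plus {plus_error:.1f}%")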

Next steps in making your slide show...

Now you know how to rewrite Scene 1 above.

Marketing: Great. What percentage of your test subjects were able to actually get through the shopping cart?

You: We don't collect that kind of data. We do discount usability testing to look for problems, not success rates. What kind of margin of error do you need? With 100 people, we can get about a plus or minus 10%. A tighter margin of error takes more people. Did you know we can do remote unattended usability tests pretty economically?

Now you have a dialog worth repeating.

With discount usability testing of 10 subjects, margins of error like -32% and +16% necessarily accompany an 80% success rate. So you can see how the success rate of 80% can be misleading when given by itself.

That 80% success rate is really a range of success rates from 48% to 96%. This range is called the "confidence interval", and it affects every conclusion drawn from it. For example, the marketing calculations of increased profit should have used this confidence interval rather than the single 80% success rate.

If you really, really, really (yup, 3 "reallys") want to compare 80% with a 50% success rate using 10 subjects, the best you can do is to say that 80% success rate allows you to "expect some improvement". Note that a 50% success rate with 10 subjects gives a confidence interval from 24% to 77%.

The overlap of these two confidence intervals prohibits you from saying there is any certainty of improvement. You only have a "chance" of improvement. (In subsequent reports, we'll discuss the advantages of other measures like satisfaction and time on task for comparing pre- and post-redesign data with small numbers of subjects.)
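To automate that comparison, here is a self-contained sketch that reuses the same Adjusted Wald formula as above and simply checks whether the two 95% confidence intervals overlap (again, the exact bounds differ from the Web calculator's by a fraction of a percent):

import math

def adjusted_wald(successes, n, z=1.96):
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half_width, p_adj + half_width

new_low, new_high = adjusted_wald(8, 10)  # redesigned cart: about 0.48 to 0.95
old_low, old_high = adjusted_wald(5, 10)  # existing cart: about 0.24 to 0.76, close to the 24%-77% quoted above

if new_low > old_high:
    print("The intervals do not overlap: you can claim a real improvement.")
else:
    print("The intervals overlap: only a 'chance' of improvement, not a certainty.")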

The oval in the illustration below shows the area of overlap between the two confidence intervals. The overlap represents test outcomes that could occur in future testing with similar subjects and the same tasks. In other words, the two designs could produce the same outcome many times out of 100.

Comparison of Success Rates

How to say it like it is...

Jeff is a true educator. Last month he received his master's degree from Stanford University, focused on how to teach statistics for usability purposes. Check out his new tutorial on confidence intervals at www.measuringusability.com/stats.

Meanwhile, pull out Excel and remember to right-click on the little blue bar in your chart. Select "Format data series". You'll get the following dialog. Fill out the "custom" fields with the +/- margins of error you get from Jeff's calculator.

Y Error Dialog

Conclusion and future topics in quantitative usability

Discount usability testing pays its way by helping you discover problems before your site or application goes public. It was never really meant to determine a success rate. But discovering problems is important. As Jakob Nielsen and others have said about discount usability engineering, "any data is data" and "anything is better than nothing." We continue to endorse that position.

Meanwhile, many things can be said about "quantitative usability" that help position or clarify the significance of your usability testing even with small sample sizes. For example, how can you characterize the efficiency of small sample sizes in discovering problems?

Some problems have a greater chance of being found by any of your test subjects. Other problems have less chance of being found. In an upcoming newsletter, we'll discuss how you can report the efficiency of your discount usability testing in a manner that makes sense.

James Lewis, the esteemed co-author with Jeff on the paper mentioned here, is one of the leaders striving to characterize small-sample test results. We'll discuss some of his and others' findings in future newsletters.

References

Bartlett, J.E., Kotrlik, J.W., & Higgins, C.C. (2001). Determining appropriate sample size in survey research. Information Technology, Learning, and Performance Journal, 19(1), 43-50.

Rea, L.M., & Parker, R.A. (1997). Designing and Conducting Survey Research, 2nd ed. Jossey-Bass.

Sauro, J., & Lewis, J.R. (in press). Estimating completion rates from small samples using binomial confidence intervals: Comparisons and recommendations. Proceedings of the Human Factors and Ergonomics Society Annual Meeting (HFES 2005), Orlando, FL.

Terrific article and a real service to call attention to Sauro's work and practical Web site. The HFI newsletters consistently provide insightful, dare I say useful, summaries and perspectives to the practicing professional. And they are free! May they bring you many eager customers. Thanks

John Imms
DST Systems

 

Very good article. It is important to understand when communicating the meaning of our testing results.

Robert Thompson
Option One Mortgage Company

A very thought-provoking article about how to communicate usability results (and set expectations) for tests with low numbers of participants. Having a high-level understanding of statistics, I really appreciate having the tool to use to calculate confidence intervals, too.

Dann Nebbe
Principal Financial Group


The Pragmatic Ergonomist, Dr. Eric Schaffer

Our scientific approach is what makes usability professionals different from designers who just guess. But science is more than just running tests. You have to be able to interpret the data. I have seen (to my horror) presentations at usability conferences that reported 6 out of 10 participants preferred a given version. As John points out, that does not mean a THING.

We don't need every usability professional to master experimental design and statistics. But we must ALL be able to understand the meaning of the results we read and create.

Studies with small numbers are designed to find problems and insights. It takes a very different type of study to create numbers that can be generalized to accurate projections.




