
UI Design Newsletter – November, 2004

In This Issue

Observer Effects in Usability Testing... or, how to collect data without messing it up

John Sorflaten, Ph.D., CUA, CPE, Project Director at HFI, discusses the effect of the observer on usability testing and the differing results between laboratory and unmoderated remote testing.

[Image: person in lab and person at home doing a usability test]

The Pragmatic Ergonomist

Dr. Eric Schaffer, Ph.D., CUA, CPE, Founder and CEO of HFI, offers practical advice.

Observer Effects in Usability Testing...

The big picture

The hallmark of modern quantum physics centers on the "uncertainty principle". Namely, you can know precisely either the position of a so-called wave-particle or its momentum, but not both at once. The reason for this conundrum is that the very act of observation changes "reality" from probability (the wave-particle might be here) to actuality (hah, gotcha pinned down).

Could the same hold true for usability testing?

Enter reality

As Bob Dylan once asked, "what is real?" Have you wondered about the effect of "thinking out loud" and whether it causes your test subject to be more attentive than they otherwise might be? Or perhaps the think-aloud effort competes for attention and distracts the subject from engaging fully in their task? As a clue to how this works, see if you can remember any moments when your test subject started to fade out... they squished up their brows, hunched over the keyboard, stared intensely at the monitor, and simply faded out... no talk, no anything. Of course, your standard procedure is to say "tell me what you're thinking," or "tell me what you're looking at." (Your measure of professional attainment is the variety of ways in which you can ask these questions without sounding like a parrot-savant.) But are you actually derailing an intense thought process with these innocent reminders? Have you occasionally just let your subject ruminate, get lost in airy spaces of problem solving, hoping they would soon come back to terra firma and report their findings to you? But then, can they remember the exquisite and lengthy chain of logic that led them to the "wrong choice"?

Giving up to win

As in Aikido, Kung Fu, maybe even kickboxing, the art of winning often comes down to the art of leverage... using your opponent's momentum and making it go where you want it to go rather than landing on your soft body. Definitely, this is the art of altering wave-particle probabilities in your favor.

In usability testing we have some choices, too. How can we reduce the probabilities of getting hit with bad data? Can we shape the trajectory of the testing event to allow subject insights we know can improve our design?

Van Den Haak and colleagues (2003) investigated the effect of thinking out loud and compared it to an alternative approach called "retrospective" reporting. They compared two approaches to the same usability testing scenarios – university students trying to use an online library catalog.

Subjects were 2nd- and 3rd-year Dutch students majoring in the same subject and with some knowledge of online catalogs.

Twenty of these students completed 7 tasks using the normal "concurrent" think-aloud method we all use, and twenty others did the same tasks with no verbal comments during the task. However, the latter gave post-test commentaries while reviewing a video of their performance.

Test parity achieved

The outcomes fell into two categories: problems that the facilitator observed and problems that the subject verbalized.

In the concurrent, think-aloud method, the facilitator observed more problems happening than during the "retrospective" test where the subject did not talk during the task.

So, why were there more problems for the concurrent, think-aloud method? Did the test subject pay more attention to their task and thus recognize more problems? Or did the extra effort of talking during the test "cause" more problems? More on this later.

When considering verbalization of problems, the advantage switched to the retrospective method. While watching their videos, the retrospective subjects offered many more comments than the concurrent subjects.

This implies that the concurrent subjects may not have said what they really felt – they failed to report all their observations. Or, as suggested by the researchers, the retrospective subjects were able to report additional problems that were not related to the observed problems.

In both cases, the researchers indicate that the combined observed and verbalized problems came out the same. Thus, they concluded the two methods of testing were more or less equally sensitive to problems. Consequently, they suggest, you could use either test method, depending on how easily subjects can verbalize during the test.

If subjects have difficulty speaking during the task (due to heavy mental work-out), then the retrospective method is fine. Otherwise, use the concurrent method because it takes only half the time (you don't need to review the video with the subject).

Probability again

However, the researchers asked if there was still some evidence that the extra cognitive work of concurrently thinking out loud during the test influenced the overall outcome. Indeed, they calculated that the concurrent subjects completed an average of only 2.6 of the 7 tasks successfully, whereas the retrospective subjects completed an average of 3.3.

Although this difference was not statistically significant, it was close enough to suggest that the extra workload of thinking out loud could influence the results.

Again, we are left with wave-particle duality and a probabilistic interpretation of the results. Well, if you must come to a concrete solution, the statistics say there was no real difference. Just a probability of a difference. You get to decide...
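For readers who want to see what "not statistically significant" can look like in practice, here is a minimal sketch of a permutation test on a difference in mean tasks completed. It is not the analysis the researchers ran. The function name `permutation_p_value` is our own, and the per-subject scores are invented purely for illustration, since the paper's raw data is not reproduced here.

```python
import random


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Repeatedly shuffles the pooled scores, splits them back into two
    groups of the original sizes, and counts how often the shuffled
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate from being exactly zero.
    return (hits + 1) / (n_resamples + 1)


# HYPOTHETICAL tasks-completed counts (out of 7). NOT the study's data.
concurrent = [2, 3, 1, 4, 2, 3, 2, 5, 1, 3]
retrospective = [3, 4, 2, 5, 3, 4, 2, 6, 1, 3]

p = permutation_p_value(concurrent, retrospective)
print(f"estimated p-value: {p:.3f}")
```

A small p-value would suggest the gap is unlikely to be a shuffle-of-the-deck accident; a larger one leaves you, like the researchers, with only a probability of a difference.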

Now, a real difference in results

A different research group investigated the influence of an unmoderated "remote testing environment" compared to the typical "laboratory" environment.

Schulte-Mecklenbeck and Huber (2003) asked 40 German university students to complete certain Web-based tasks in a typical lab setting. Meanwhile, they had 32 similar students do the same at their homes, using an automated, unmoderated, remote testing paradigm.

Interestingly, the students in the lab completed the tasks using about twice as many clicks and twice as much time compared to the students at home. Whew!

Was the actual task twice as hard in the lab? Or was the perceived pressure twice as much in the lab?

Observer affects the observed

Since the task was identical in both cases, the authors conclude that subjects in the lab felt more pressure to perform well, and thus made more effort to do well. The authors suggest that these subjects perceived the facilitator as an "authority figure". (How much authority do you command in your lab? Any at all?)

Also, subjects may have perceived the lab setting as "more important" than using the Web at home. (Well, home is for relaxing...but wait, maybe that's where your Web site is most often used?)

Either way, we see that the observer has influenced the results. Here the automated test, with subjects out of view of an overseer, appears to have produced more honest results, at least if that unobserved environment is the one where the tasks happen in real life.

These results differ from prior research comparing Web vs. lab. Prior research suggests that Web and laboratory behaviors are about the same. For example, results from a psychological test taken on the Web were comparable to test norms obtained through pencil and paper. In other cases, subjects responded to line drawings and photographs presented on the Web with results similar to face-to-face responses. Even risk-seeking in a lottery setting was similar between Web and laboratory settings.

What was different about this particular test? Why should this laboratory experience generate more diligent responses?

The authors speculate that the nature of the task lends itself to a greater range of responsiveness among subjects. The tasks in this study were open-ended, information-seeking tasks. That is, participants were not seeking a "correct answer." Rather, they sought to find enough information to make a decision.

Thus, participants decided for themselves when to stop. They decided when they had found enough information to make a decision.

Does this sound like the tasks your Web site supports? Probably so, if you work on one of the many large-scale information sites on the Internet – or even as found on Intranets.

In any event, we just saw an example where the act of observation definitely influenced the outcome. The uncertainty principle operates in usability testing, just as in testing the behavior of sub-atomic particles.

Back to the surface again

Just to give us a sense of the normal reality found in classical physics, let's take a look at another comparison study.

Thomas Tullis, a well-known usability author and researcher, worked with his colleagues (2002) to check whether automated, unmoderated, remote usability testing gives results similar to those of laboratory testing. Subjects were employees of a US corporation.

In contrast to the study reported above, Tullis used "closed-ended" tasks. That is, the results were either right or wrong. Does that sound like some of your testing outcomes?

If so, you can have some assurance that the laboratory setting doesn't upset your results. Tullis found that for the 13 tasks they used, subjects in the lab gave results similar to those of subjects in the unmoderated, remote testing environment. His team found similar task success rates and similar task times. No authority effect here.

Actually, Tullis and team were more interested in whether the unmoderated remote testing was as effective for finding problems as the lab environment. They found that remote testing worked well and had benefits that complemented the lab environment. (Do you recall the benefits of testing more subjects? See our May, 2004 newsletter.)

Aha! More subjects can be better, they found. Whereas the lab setting found 9 issues, the remote test found 17 issues. Well, what do we expect if the lab only has 8 subjects and the remote test has 88 subjects?
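Why would 88 subjects turn up more issues than 8? One common way to reason about it is the cumulative problem-discovery model often quoted in usability sample-size discussions: if each subject independently has probability p of hitting a given problem, the chance that at least one of n subjects hits it is 1 - (1 - p)^n. The sketch below is our illustration, not a calculation from the Tullis study. The function name `proportion_found` and the detection probabilities are assumptions chosen only to show the shape of the curve.

```python
def proportion_found(p_detect, n_subjects):
    """Chance that at least one of n independent subjects hits a problem.

    p_detect is the ASSUMED per-subject probability of encountering
    the problem; it is not a figure taken from the study.
    """
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be between 0 and 1")
    if n_subjects < 0:
        raise ValueError("n_subjects must be non-negative")
    return 1.0 - (1.0 - p_detect) ** n_subjects


# Hypothetical detection probabilities: a common problem and a rare one.
for p_detect in (0.30, 0.02):
    small_sample = proportion_found(p_detect, 8)   # lab-sized sample
    large_sample = proportion_found(p_detect, 88)  # remote-sized sample
    print(f"p={p_detect:.2f}: 8 subjects -> {small_sample:.2f}, "
          f"88 subjects -> {large_sample:.2f}")
```

Under these assumptions, a handful of subjects is already likely to surface the common problems, while the rare ones mostly show up only with the larger sample.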

Small is good, too

Interestingly, the law of diminishing returns did not penalize the lab environment unduly. Tullis and crew felt that both the lab and remote environments discovered the three major problems (overloaded home page, general terminology problems, and unclear navigation wording).

Plus, seven out of the nine problems found in the lab were also found in the remote test.

However, more subjects can be better, as we said. And that was the benefit of the remote test. After all, we test to find problems, don't we?
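The overlap figures above can be turned into a quick tally. This is simple arithmetic on the counts reported in this article (9 lab issues, 17 remote issues, 7 found by both), assuming the two counts were made on the same basis; the variable names are ours.

```python
# Counts as reported in the article.
lab_issues = 9
remote_issues = 17
found_by_both = 7

lab_only = lab_issues - found_by_both        # issues only the lab caught
remote_only = remote_issues - found_by_both  # issues only the remote test caught
total_distinct = lab_issues + remote_issues - found_by_both

print(f"lab only: {lab_only}")              # prints "lab only: 2"
print(f"remote only: {remote_only}")        # prints "remote only: 10"
print(f"distinct issues: {total_distinct}") # prints "distinct issues: 19"
```

So on these numbers, each method caught something the other missed: two issues for the lab and ten for the remote test.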

Benefits of both unobserved and observed testing

What other influences of the remote testing environment appear valuable, aside from the absence of the observer?

Tullis and group found they got greater diversity of user types in task experiences, computer experiences, and individual characteristics. They also got more hardware variety, such as screen resolution. And they were pleasantly surprised by the completeness and insights of the typed responses.

But the lab offered value, too. For example, the remote test revealed usage of 1024 screen resolution among nearly all subjects and revealed a problem with small fonts. The lab setting forced usage of 800 resolution and resulted in detection of excessive scroll requirements. The lab revealed that certain navigation options were overlooked, although the remote results showed most subjects found the options anyway over time.

Tullis and group recommend a combination of both remote and lab testing to cover the range of issues.

Certain conclusions probably

So, now we know the observer influences the observation, just like real-life physics.

  1. We saw that possibility first in the case of concurrent "think-aloud" usability testing – fewer tasks were successfully completed by subjects who thought aloud, but the authors said this difference was not statistically significant. It could be a fluke of the experiment.

But certainly, the concurrent, think-aloud method gave more observed problems than the retrospective reporting. And it gave fewer verbalized problems. However, the combined amount of observed and verbalized problems was equivalent between the two methods. Thus, if we have complex tasks that make it hard for subjects to talk aloud while working, feel comfortable showing them their video afterwards. They can talk plenty during the video.

  2. We saw the influence of the observer increase dramatically in the case of "active information search." Subjects in the lab spent twice as much effort on their tasks compared to remote subjects at home. The lab subjects probably felt obliged to please an "authority figure" appearing in the form of the facilitator. But remember, they worked on an open-ended task – unlike the many closed-ended tasks found in transaction-oriented applications. Still, open-ended tasks are common, too, such as those found on your information sites.
  3. Meanwhile, the observer does not always get in the way. In our last study, lab-based observation allowed sight of body language, strained perception, and vocal hesitancy that would be missed by unmoderated remote testing. Here, the observer added value. But the larger number of subjects possible in the remote testing also added a lot of new issues, otherwise missed in the lab. So both methods complement each other. Do both.

Amidst these suggestions, we do find some guidelines – albeit, the findings may be qualified by the nationality of the subjects, or their student status, or any other of many differences compared with your target population.

But that's life.

All we can say, like physicists do, is take a chance. And make it work.

New Directions in HCI

Late breaking news

Janni Nielsen of the Copenhagen Business School presented a paper at the India HCI conference today (12/6/2004). In her paper she reports an interesting hybrid approach. She records the task completion without thinking aloud. In this recording she has screen, interaction (including mouse movement), and facial expression. Her participants are trained to move their mouse to the areas of the screen they are paying attention to.

After task completion she shows the recording to participants and has them describe their mental process. She reports that participants have a "mental tape" of the session and can provide substantial insight based on prompting with their recorded interaction. She then tapes the user's interpretation along with the interaction.

Nielsen, J. (2004). Reflections on Concurrent and Retrospective User Testing. Session G2, New Directions in HCI, IHCI 2004, Dec 06-07, Bangalore, India.


Tullis, T., Fleischman, S., McNulty, M., Cianchette, C., Bergel, M. (2002). An Empirical Comparison of Lab and Remote Usability Testing of Web Sites, Usability Professionals Association Conference, Orlando, Florida.

Schulte-Mecklenbeck, M., Huber, O. (2003). Information Search in the Laboratory and on the Web: With or without an Experimenter. Behavior Research Methods, Instruments, & Computers.

Van Den Haak, M.J., De Jong, M.D.T., Schellens, P.J. (2003). Retrospective Versus Concurrent Think-Aloud Protocols: Testing the Usability of an Online Library Catalog. Behaviour & Information Technology, 22 (5), 339–351.

Enjoy your newsletter, as always.

A comment concerning the use of the retrospective think-aloud method, with video or without. I find myself a bit concerned about using this method. Has it been tested several times to:

  • discover to what degree participants' recall of task performance is fuzzy, even right after the test is over? Perhaps video usage helps reduce this problem, but I wonder if many participants are really able to replay complex series of thoughts and emotions they encounter when testing a site. Although this is not a great analog, I've read recently that trauma victims' recall of the trauma experience is not a direct replay, although they may perceive it to be accurate. Time's passage and other factors color/limit what they recall.
  • see to what degree participants will make up parts of what they report afterwards, in an effort themselves to understand and justify their behavior? To what degree will what they report be new interpretations of what happened?

Don't know how much these factors impact this method, but the potential threats are at least something that I would stick in as a caveat in a test report.

Otherwise, this article is such a keeper and should help as we decide what sort of testing to do on future redesigns.

Natalie Ferguson, MA, MPH
Centers for Disease Control and Prevention

Excellent information on observer effects that should be followed.

However, one brief point: let's remember that the folks who assist us in our research efforts are to be referred to as "participants" and not as "subjects". The APA publication manual (fifth edition, 2001) refers to subjects in its grammatical essence (i.e., subject-verb agreement) and "participants" as humans.

Dr. Sorflaten used "subjects" 40 times in the newsletter. Kudos to Janni Nielsen and Dr. Schaffer for their adherence to APA guidelines!

Further, as I sat in my cube this morning, I overheard a self-proclaimed expert in research methods and statistics remark that the "subjects" in their study... Need I say more?

Victor J. Ingurgio, Ph.D.
Human Factors Laboratory
Atlantic City International Airport

Reader comments on this and other articles.

The Pragmatic Ergonomist, Dr. Eric Schaffer

If you are doing summative testing (where you want to precisely measure user performance), then talk-aloud testing will affect the results. For summative testing you need the most realistic situation and you need to stay out of the way. Unobtrusive measures (like automated testing and clickstream analysis) can be useful.

If you are doing formative testing the situation changes. There you need to learn just what is wrong. You need to be there. You need to get into the participant's head. This will perturb the performance data, but you want insights not precise measurement.

In formative testing, think-aloud works well in most cases. If the task is very complex the retrospective technique may be better (or you can just let the participant go quiet when he/she appears to be overloaded and stops talking naturally). Also, in Asia we find people filter their speech much more (mostly to avoid hurting your feelings). This means we need special methods (like Apala Lahiri Chavan's "Bollywood Technique") to give users permission to be straightforward.

© 1996-2014 Human Factors International, Inc. All rights reserved