Insights from Human Factors International
|
 |
|
In This Issue:
|
|
Observer Effects in Usability Testing...
or, how to collect data without messing it up
|
|
John Sorflaten, Ph.D., CUA, CPE, Project Director at HFI, discusses
the effect of the observer on usability testing and the differing
results between laboratory and unmoderated remote testing.
|

|
|
|
The Pragmatic Ergonomist
|
Eric Schaffer, Ph.D., CPE, founder and CEO of HFI, offers practical
advice.
|
| |
|
| |
|
|
The big picture
|
A hallmark of modern quantum physics is the "uncertainty
principle". Namely, you cannot know both the position of a so-called
wave-particle and its momentum precisely at the same time. A popular way
to picture this conundrum is that the very act of observation changes
"reality" from probability (the wave-particle might be here) to actuality
(hah, gotcha pinned down).
Could the same hold true for usability testing?
|
 |
|
Enter reality
|
As Bob Dylan once asked, "what is real"?
Have you wondered about the effect of "thinking out loud" and
whether it causes your test subject to be more attentive than they otherwise
might be? Or perhaps the think-aloud effort competes for attention and
distracts the subject from engaging fully in their task?
As a clue to how this works, see if you can remember any moments when
your test subject started to fade out... they squished up their brows,
hunched over the keyboard, stared intensely at the monitor, and simply
faded out... no talk, no anything.
Of course, your standard procedure is to say "tell me what you're
thinking," or "tell me what you're looking at." (Your measure
of professional attainment is the variety of ways in which you can ask
these questions without sounding like a parrot-savant.)
But are you actually derailing an intense thought process with these
innocent reminders?
Have you occasionally just let your subject ruminate, get lost in airy
spaces of problem solving, hoping they would soon come back to terra firma
and report their findings to you?
But then, can they remember the exquisite and lengthy chain of logic
that led them to the "wrong choice"?
|
 |
|
Giving up to win
|
As in Aikido, Kung Fu, maybe even kick boxing, the art of winning often
comes down to the art of leverage... using your opponent's momentum
and making it go where you want it to go rather than landing on your soft
body. Definitely, this is the art of altering wave-particle probabilities
in your favor.
In usability testing we have some choices, too. How can we reduce the
probabilities of getting hit with bad data? Can we shape the trajectory
of the testing event to allow subject insights we know can improve our
design?
Van Den Haak and colleagues (2003) investigated the effect of thinking
out loud and compared it to an alternative approach called "retrospective"
reporting. They compared two approaches to the same usability testing
scenarios – university students trying to use an online library
catalog.
Subjects were 2nd- and 3rd-year Dutch students majoring in the same subject
and with some knowledge of online catalogs.
Twenty of these students completed 7 tasks using the normal "concurrent"
think aloud method we all use, and twenty others did the same tasks with
no verbal comments during the task. Instead, this second group gave their
commentary after the test, while reviewing a video of their performance.
|
 |
|
Test parity achieved
|
The outcomes fell into two categories: problems that the facilitator
observed and problems that the subject verbalized.
In the concurrent, think-aloud method, the facilitator observed
more problems happening than during the "retrospective" test
where the subject did not talk during the task.
So, why were there more problems for the concurrent, think-aloud method?
Did the test subject pay more attention to their task and thus recognize
more problems? Or did the extra effort of talking during the test "cause"
more problems? More on this later.
When considering verbalization of problems,
the advantage switched to the retrospective method. While watching their
videos, the retrospective subjects offered many more comments than the
concurrent subjects.
This implies that the concurrent subjects may not have said what they
really felt – they failed to report all their observations. Or,
as suggested by the researchers, the retrospective subjects were able
to report additional problems that were not related to the observed problems.
Either way, the researchers report that the combined number of observed
and verbalized problems came out about the same for the two methods.
Thus, they concluded the two methods of testing were more or less equally
sensitive to problems. Consequently, they suggest, you could use either
test method, depending on how easily subjects can verbalize during the
test.
If subjects have difficulty speaking during the task (due to heavy mental
work-out), then the retrospective method is fine. Otherwise, use the concurrent
method because it takes only half the time (you don't need to review the
video with the subject).
|
 |
|
Probability again
|
However, the researchers asked if there was still some evidence that
the extra cognitive work of concurrently thinking out loud during the
test influenced the overall outcome. Indeed, they calculated that the
concurrent subjects successfully completed an average of 2.6 of the 7 tasks,
whereas the retrospective subjects completed an average of 3.3.
Although this difference was not statistically significant, it was close
enough to suggest that the extra workload of thinking out loud could influence
the results.
Again, we are left with wave-particle duality and a probabilistic interpretation
of the results. Well, if you must come to
a concrete conclusion, the statistics say no reliable difference was shown.
Just a probability of a difference. You get to decide...
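For readers who want to see what "not statistically significant" means in practice, here is a minimal sketch of one way to check whether a gap between two groups' task-completion scores could be chance: a permutation test. This is not the analysis the researchers ran (the newsletter does not say which test they used). The function name and the toy scores below are invented purely for illustration and are not data from the study.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference between group means.

    Repeatedly shuffles the pooled scores, re-splits them into two groups
    of the original sizes, and counts how often the shuffled gap is at
    least as large as the observed gap.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Toy scores (tasks completed out of 7), invented for illustration only.
# They are NOT the data from the Van Den Haak study.
concurrent = [2, 3, 1, 4, 2, 3, 3, 2]
retrospective = [3, 4, 2, 5, 3, 3, 4, 3]
print(permutation_p_value(concurrent, retrospective))
```

A small p-value (conventionally below 0.05) would suggest the gap is unlikely to be a fluke; a larger one leaves you where the article does, with only a probability of a difference.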
|
 |
|
Now, a real difference in results
|
A different research group investigated the influence of an unmoderated
"remote testing environment" compared to the typical "laboratory"
environment.
Schulte-Mecklenbeck and Huber (2003) asked 40 German university students
to complete certain Web-based tasks in a typical lab setting. Meanwhile,
they had 32 similar students do the same at their homes, using an automated,
unmoderated, remote testing paradigm.
Interestingly, the students in the lab completed the tasks using about
twice as many clicks and twice as much time compared to the students at
home. Whew!
Was the actual task twice as hard in the lab? Or was the perceived pressure
twice as much in the lab?
|
 |
|
Observer affects the observed
|
Since the task was identical in both cases, the authors conclude that
subjects in the lab felt more pressure to perform well, and thus made
more effort to do well. The authors suggest that they perceived the facilitator
as an "authority figure". (How much authority do you command
in your lab? Any at all?)
Also, subjects may have perceived the lab setting as "more important"
than using the Web at home. (Well, home is for relaxing...but wait, maybe
that's where your Web site is most often used?)
Either way, we see that the observer has influenced the results. Here
the automated test, with subjects out of view of an overseer,
appears to have produced the more realistic results, provided that
home is the environment where the real-life version of the tasks takes place.
These results differ from prior research comparing Web vs. lab. Prior
research suggests that Web and laboratory behaviors are about the same.
For example, results from a psychological test taken on the Web were comparable
to test norms obtained through pencil and paper. In other cases, subjects
responded to line drawings and photographs presented on the Web with results
similar to face-to-face responses. Even risk-seeking in a lottery setting
was similar between Web and laboratory settings.
What was different about this particular test? Why should this laboratory
experience generate more diligent responses?
The authors speculate that the nature of the task lends itself to a greater
range of responsiveness among subjects. The tasks in this study were open-ended,
information-seeking tasks. That is, participants were not seeking a "correct
answer." Rather, they sought to find enough information to make a
decision.
Thus, participants decided for themselves when to stop. They decided
when they had found enough information to make a decision.
Does this sound like the tasks your Web site supports? Probably so, if
you work on one of the many large-scale information sites on the Internet,
or even on an intranet.
In any event, we just saw an example where the act of observation definitely
influenced the outcome. The uncertainty principle operates in usability
testing, just as in testing the behavior of sub-atomic particles.
|
 |
|
Back to the surface again
|
Just to give us a sense of the normal reality found in classical physics,
let's take a look at another comparison study.
Thomas Tullis, a well-known usability author and researcher, worked with
his colleagues (2002) to check whether automated, unmoderated, remote
usability testing gives results similar to those of laboratory testing.
Subjects were employees of a US corporation.
In contrast to the study reported above, Tullis used "closed-ended"
tasks. That is, the results were either right or wrong. Does that sound
like some of your testing outcomes?
If so, you can have some assurance that the laboratory setting doesn't
upset your results. Tullis found that for the 13 tasks they used, subjects
in the lab gave results similar to those of subjects in the unmoderated,
remote testing environment. His team found similar task success rates and
similar task times. No authority effect here.
Actually, Tullis and team were more interested in whether the unmoderated
remote testing was as effective for finding problems as the lab environment.
They found that remote testing worked well and had benefits that complemented
the lab environment. (Do you recall the benefits of testing more subjects?
See our May, 2004 newsletter.)
Aha! More subjects can be better, they found. Whereas the lab setting
found 9 issues, the remote test found 17 issues. Well, what do we expect
if the lab only has 8 subjects and the remote test has 88 subjects?
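One way to see why sample size matters so much is the simple problem-discovery model often used in usability work: if each participant independently has probability p of running into a given problem, the chance that at least one of n participants hits it is 1 - (1 - p)^n. The sketch below applies that model to the two sample sizes in the Tullis study. The values of p are assumptions chosen for illustration and do not come from the study; the function name is also hypothetical.

```python
def proportion_found(p_detect, n_participants):
    """Expected share of problems seen at least once: 1 - (1 - p)^n.

    Assumes every participant has the same, independent chance p_detect
    of encountering a given problem, which is a simplification.
    """
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be between 0 and 1")
    if n_participants < 0:
        raise ValueError("n_participants must not be negative")
    return 1.0 - (1.0 - p_detect) ** n_participants


# Assumed detection probabilities: one fairly common problem, one rare one.
for p in (0.30, 0.05):
    for n in (8, 88):  # lab and remote sample sizes from the article
        share = proportion_found(p, n)
        print(f"p={p:.2f}, n={n:2d}: about {share:.0%} chance of being seen")
```

Under this model, common problems surface in either setting, while rare problems are much more likely to show up only in the larger sample.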
|
 |
|
Small is good, too
|
Interestingly, the law of diminishing returns did not penalize the lab
environment unduly. Tullis and crew felt that both the lab and remote
environments discovered the three major problems (overloaded home page,
general terminology problems, and unclear navigation wording).
Plus, seven out of the nine problems found in the lab were also found
in the remote test.
However, more subjects can be better, as we said. And that was the benefit
of the remote test. After all, we test to find problems, don't we?
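The overlap figures above can be turned into a quick tally of what each method contributed. The sketch below uses only the counts reported in the article (9 lab issues, 17 remote issues, 7 found by both) and assumes the issues were counted at the same level of detail in both settings. The function name is hypothetical.

```python
def combine_findings(lab_total, remote_total, shared):
    """Split two overlapping issue counts into unique and combined totals."""
    if shared > min(lab_total, remote_total):
        raise ValueError("shared issues cannot exceed either total")
    return {
        "lab_only": lab_total - shared,
        "remote_only": remote_total - shared,
        # Inclusion-exclusion: count the shared issues only once.
        "distinct_total": lab_total + remote_total - shared,
    }


summary = combine_findings(lab_total=9, remote_total=17, shared=7)
print(summary)
# {'lab_only': 2, 'remote_only': 10, 'distinct_total': 19}
```

On these counts, the two methods together surfaced 19 distinct issues, 2 of which only the lab caught and 10 of which only the remote test caught.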
|
 |
|
Benefits of both unobserved and observed testing
|
What other influences of the remote testing environment appear valuable,
aside from the absence of the observer?
Tullis and group found they got greater diversity of user types in task
experiences, computer experiences, and individual characteristics. They
also got more hardware variety, such as screen resolution. And they were
pleasantly surprised by the completeness and insights of the typed responses.
But the lab offered value, too. For example, the remote test revealed
usage of 1024 screen resolution among nearly all subjects and revealed
a problem with small fonts. The lab setting forced usage of 800 resolution
and resulted in detection of excessive scroll requirements. The lab revealed
that certain navigation options were overlooked, although the remote results
showed most subjects found the options anyway over time.
Tullis and group recommend a combination of both remote and lab testing
to cover the range of issues.
|
 |
|
Certain conclusions – probably
|
So, now we know the observer influences the observation, just like real-life
physics.
- We saw that possibility first in the case of concurrent "think-aloud"
usability testing: fewer tasks were successfully completed by the
think-aloud subjects, but the authors said this difference was not
statistically significant. It could be a fluke of the experiment.
But certainly, the concurrent, think-aloud method gave more observed
problems than the retrospective reporting. And it gave fewer verbalized
problems. However, the combined number of observed and verbalized problems
was equivalent between the two methods. Thus, if we have complex
tasks that make it hard for subjects to talk aloud while working, then
feel comfortable showing them their video. They can talk plenty during
the video.
- We saw the influence of the observer increase dramatically in the
case of "active information search." Subjects in the lab spent
twice as much effort on their tasks compared to remote subjects at home.
The lab subjects probably felt obliged to please an "authority
figure" appearing in the form of the facilitator. But remember,
they worked on an open-ended task, unlike the many closed-ended tasks
found in transaction-oriented applications. Still, open-ended tasks are
common, too, as on your information sites.
- Meanwhile, the observer does not always get in the way. In our last
study, lab-based observation allowed sight of body language, strained
perception, and vocal hesitancy that would be missed by unmoderated
remote testing. Here, the observer added value. But the larger number
of subjects possible in the remote testing also added a lot of new issues,
otherwise missed in the lab. So both methods complement each other.
Do both.
Amidst these suggestions, we do find some guidelines, although
the findings may be qualified by the nationality of the subjects, or their
student status, or any of the many other ways they differ from your target
population.
But that's life.
All we can say, like physicists do, is take a chance. And make it work.
|
| |
|
|
Late breaking news
|
Janni Nielsen of the Copenhagen Business School presented a paper at
the India HCI conference today (12/6/2004). In her paper she reports an
interesting hybrid approach. She records the task completion without thinking
aloud. In this recording she has screen, interaction (including mouse
movement), and facial expression. Her participants are trained to move
their mouse to the areas of the screen they are paying attention to.
After task completion she shows the recording to participants and has
them describe their mental process. She reports that participants have
a "mental tape" of the session and can provide substantial insight
based on prompting with their recorded interaction. She then tapes the
user's interpretation along with the interaction.
Janni Nielsen (2004). Reflections on Concurrent and Retrospective User
Testing, Session G2, New Directions in HCI, IHCI 2004, Dec 06-07, Bangalore,
India
|
| |
|
| |
The Pragmatic Ergonomist
|
If you are doing summative testing (where
you want to precisely measure user performance), then talk-aloud testing
will affect the results. For summative testing you need the most realistic
situation and you need to stay out of the way. Unobtrusive measures (like
automated testing and clickstream analysis) can be useful.
If you are doing formative testing the situation
changes. There you need to learn just what is wrong. You need to be there.
You need to get into the participant's head. This will
perturb the performance data, but you want insights, not precise measurement.
In formative testing, think-aloud works well in most cases. If the task
is very complex the retrospective technique may be better (or you can
just let the participant go quiet when he/she appears to be overloaded
and stops talking naturally). Also, in Asia we find people filter their
speech much more (mostly to avoid hurting your feelings). This means we
need special methods (like Apala Lahiri Chavan's "Bollywood
Technique") to give users permission to be straightforward.
|
 |
|
References
|
Tullis, T., Fleischman, S., McNulty, M., Cianchette, C., and Bergel, M. (2002).
An Empirical Comparison of Lab and Remote Usability Testing of Web Sites.
Usability Professionals Association Conference,
Orlando, Florida.
Schulte-Mecklenbeck, M. and Huber, O. (2003). Information Search in the
Laboratory and on the Web: With or without an Experimenter. Behavior Research
Methods, Instruments, & Computers.
Van Den Haak, M.J., De Jong, M.D.T., and Schellens, P.J. (2003). Retrospective
Versus Concurrent Think-Aloud Protocols: Testing the Usability of an Online
Library Catalog. Behaviour & Information Technology,
22 (5), 339–351.
|
| |
|
Natalie Ferguson, MA, MPH
Centers for Disease Control and Prevention
Enjoy your newsletter, as always.
A comment concerning the use of the retrospective think-aloud method,
with video or without. I find myself a bit concerned about using this
method. Has it been tested several times to:
- discover to what degree participants' recall of task performance is fuzzy,
even right after the test is over? Perhaps video usage helps reduce this
problem, but I wonder if many participants are really able to replay complex
series of thoughts and emotions they encounter when testing a site. Although
this is not a great analog, I've read recently that trauma victims' recall
of the trauma experience is not a direct replay, although they may perceive
it to be accurate. Time's passage and other factors color/limit what they
recall.
- see to what degree participants will make up parts of what they report
afterwards, in an effort themselves to understand and justify their behavior?
To what degree will what they report be new interpretations of what happened?
Don't know how much these factors impact this method, but the potential
threats are at least something that I would stick in as a caveat in a
test report.
Otherwise, this article is such a keeper and should help as we decide
what sort of testing to do on future redesigns.
|
 |
Victor J. Ingurgio, Ph.D.
Human Factors Laboratory, Atlantic City International Airport
Excellent information on observer effects that should be
followed.
However, one brief point: let's remember that the folks who assist us
in our research efforts are to be referred to as "participants"
and not as "subjects". The APA publication manual (fifth edition,
2001) refers to subjects in its grammatical essence (i.e., subject-verb
agreement) and "participants" as humans.
Dr. Sorflaten used "subjects" 40 times in the newsletter. Kudos
to Janni Nielsen and Dr. Schaffer for their adherence to APA guidelines!
Further, as I sat in my cube this morning, I overheard a self-proclaimed
expert in research methods and statistics remark that the "subjects"
in their study... Need I say more?