Insights from Human Factors International
In This Issue Bob Bailey reviews:
Speech recognition
Why is it taking so long for speech to be used as a primary input method?
Automatic speech recognition technology has been under development for
over 25 years, but has not yet come into widespread use. One of the main
reasons is that speech recognition errors are fundamentally different
from keying errors. Most keying errors can be traced back to the user,
while most speech errors can be traced back to misrecognition of the
speech by the computer. In the latter case, what the user said simply
does not match what the computer produced.
Even though people can dictate faster than they can type, actual throughput
is usually much lower with automatic speech recognition systems than
with keying. A major problem is that error correction takes much longer
with speech. The most common correction methods for speech input are:
- deleting and repeating the last phrase,
- deleting and repeating a specific word,
- deleting and selecting a correct word from a list of alternative words,
- typing the correction.

Model-Based and Empirical Evaluation of Multimodal Interactive Error
Correction, Suhm, B., Myers, B. and Waibel, A., CHI 99 Conference
Proceedings, 584-591 (1999).
Past studies have suggested that switching modality could speed up interactive
correction of recognition errors. Suhm, Myers and Waibel (1999) at Carnegie
Mellon University found that switching between modalities eliminated repeated
recognition errors. They found that if users simply repeated their speech
to correct errors, correction accuracy was much lower than if users switched
to a different modality (keyboard and mouse). The correction accuracy
when keying depended on the user’s typing skill. For example, the
fastest typists using "keyboard and mouse" made almost three
times more corrections per minute than did subjects who made corrections
using "voice-only."
They concluded that multimodal correction strategies could reliably expedite
error correction in speech user interfaces.

Effect of Error Correction Strategy on Speech Dictation Throughput,
Lewis, J.R., Proceedings of the Human Factors and Ergonomics Society,
457-461 (1999).
Throughput is the number of correct words produced per minute. The key
variables are:
- the accuracy of the speech recognition system,
- the speaking rate of the user, and
- the time required to correct errors.
Lewis (1999) at IBM evaluated the performance of participants using a
speech recognition dictation system. The participants received training
in one of two correction strategies, either "voice-only" or
"voice, keyboard and mouse." In both cases, users spoke at about
105 uncorrected words per minute. The multimodal (voice, keyboard, mouse)
corrections were made three times faster than "voice-only" corrections,
and generated 63% more throughput.
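
To make the relationship among the three key variables concrete, here is a minimal sketch in Python. It assumes a simple linear model that is not taken from Lewis (1999): every misrecognized word costs a fixed correction time, and corrections always succeed on the first attempt. The function name, the parameter names, and the 95% accuracy and per-error correction times in the usage lines are all hypothetical values chosen for illustration; only the 105 words per minute speaking rate comes from the study described above.

```python
def dictation_throughput(speaking_rate_wpm, accuracy, seconds_per_correction):
    """Estimate correct words per minute for a dictation session.

    Assumed model (illustrative only):
      - the user speaks at `speaking_rate_wpm` uncorrected words per minute,
      - a fraction `accuracy` of words is recognized correctly,
      - each misrecognized word takes `seconds_per_correction` to fix.
    """
    if speaking_rate_wpm <= 0:
        raise ValueError("speaking_rate_wpm must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if seconds_per_correction < 0:
        raise ValueError("seconds_per_correction must not be negative")

    # Average time spent per word, in minutes: speaking plus correcting.
    minutes_speaking = 1.0 / speaking_rate_wpm
    minutes_correcting = (1.0 - accuracy) * seconds_per_correction / 60.0
    return 1.0 / (minutes_speaking + minutes_correcting)


if __name__ == "__main__":
    # Hypothetical comparison: same speaker and recognizer, but a slow
    # correction method (30 s per error) versus a fast one (10 s per error).
    slow = dictation_throughput(105, 0.95, 30)
    fast = dictation_throughput(105, 0.95, 10)
    print(f"slow correction: {slow:.1f} correct words per minute")
    print(f"fast correction: {fast:.1f} correct words per minute")
```

Under this model, even a small error rate pulls throughput well below the speaking rate, and the time per correction has a large effect on the result, which is consistent with the direction of the findings reported above.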

Patterns of Entry and Correction in Large Vocabulary Continuous Speech
Recognition Systems, Karat, C.M., Halverson, C., Horn, D. and Karat, J.,
CHI 99 Conference Proceedings, 568-575 (1999).
Karat et al. (1999) at IBM evaluated three speech recognition products,
with users correcting errors by using either "voice-only"
or "keyboard and mouse." Participants were native English speakers
with good typing skills.
Each participant trained one of the speech recognition systems to recognize
his or her voice more readily, and then completed two tasks: copying text
from a novel and composing replies to questions. The fastest users spoke at
an average of 107 uncorrected words per minute, which resulted in about 25
corrected words per minute. The "keyboard and mouse" group completed almost
three times more words per minute than did the "voice-only"
group.
Participants observed that they were usually aware when a typing error
occurred, but were much less confident that they would notice when a speech
error occurred. Users must either glance constantly at the display to check
for errors, or rely heavily on proofreading after they have finished speaking.
It appears that developers are avoiding speech for input primarily
because:
- speech recognition systems are still somewhat unreliable, and
- error correction continues to be difficult (and can lead to even more
errors).