
UI Design Newsletter – November, 2000

In This Issue

The Usability of Punched Ballots

Improving usability in America's voting systems. Calculating the number of test subjects required to find usability problems.

The Usability of Punched Ballots

Improving usability

Theresa LePore, the supervisor of elections in Palm Beach County, Florida, has received much criticism for the ballot she designed for this year's presidential election. In fact, she made several good decisions. For example, she attempted to improve the ballot for older voters by making the characters larger. She also wanted to have all presidential candidates on one page. Her solution was to use what has become known as the "butterfly ballot."

To ensure adequate reading performance, however, she needed to focus on at least five issues:

  • font size,
  • font type,
  • text versus background color,
  • the light level where the voting occurred, and
  • the overall layout of the ballot.

Font size – For the majority of voters, a font size of 10 points would have been satisfactory. Most books are printed using type that is 10 or 11 points (a "point" is 1/72 of an inch). To accommodate older users, however, the research suggests that the characters should have been at least 12 points (maybe even 14 points). It is acceptable to use smaller font sizes when users can move closer to the text (or move the text closer to them) in order to make the image in the eye (the angle subtended on the retina) larger.
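
The trade-off between character size and viewing distance can be made concrete with a little arithmetic. The Python sketch below is only an illustration; the viewing distances and the simplifying assumption that the nominal point size approximates the character height are mine, not part of the original analysis.

    import math

    POINTS_PER_INCH = 72  # a typographic point is 1/72 of an inch

    def visual_angle_arcmin(font_size_points, viewing_distance_inches):
        """Approximate visual angle (in arc-minutes) subtended by a character,
        assuming the nominal point size roughly equals the character height."""
        height_in = font_size_points / POINTS_PER_INCH
        angle_rad = 2 * math.atan(height_in / (2 * viewing_distance_inches))
        return math.degrees(angle_rad) * 60

    print(round(visual_angle_arcmin(12, 18), 1))  # about 32 arc-minutes at 18 inches
    print(round(visual_angle_arcmin(10, 12), 1))  # about 40 arc-minutes at 12 inches (a larger retinal image despite the smaller type)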

Using all uppercase letters, which she elected to do, made the characters slightly larger for users. The most recent research on using uppercase versus lowercase letters for names shows clearly that there is no reliable difference between them in reading performance.

Font type – She used a "sans serif" font for the names. This decision was acceptable. There is one study that suggests that people over age 60 read "serif" fonts faster than sans serif fonts. In this case, the speed of reading is not as important as the accuracy of reading. Florida law allows each voter five minutes in the voting booth.

Text vs. background – The fastest and most accurate readability comes from using black text on a white background. This is what she did. The ballot appears to be black print on "white" card stock.

Illumination level – We do not know about the illumination level where the votes were cast. One recent research study found that in 71% of over 50 different "public places" in Florida, the light level was too low for adequate reading. Older adults need more illumination in order to see well.

In general, because the main usability issue was reading accurately, not reading quickly, Ms. LePore did an adequate job of dealing with these basic human factors issues.

Layout of the ballot

The issues surrounding the layout of the ballot are much more difficult to deal with, and are not nearly as easy to detect and resolve. It is difficult, even for usability experts, to identify some types of layout and formatting problems. For this reason, usability professionals make considerable use of usability tests.

Usability testing – Usability tests are intended to identify and correct problems before products are used by large numbers of users. In Ms. LePore's case, the goal would be to find and fix most of the serious problems voters would have on Election Day. A usability test would require several people pretending to vote while using the proposed ballot. While voting, these test participants would be observed by experienced usability testers, who would note and record any difficulties the "voters" appeared to be having.

After voting, the participants would be individually interviewed about any concerns they had or any problems they may have experienced. This information would be used to change the ballot, and then a second round of usability testing would take place. Sometimes it takes three to five (or more) iterations (design, test, redesign) to achieve the desired outcome, i.e., to meet the performance goals for the ballot.

The Buchanan problem

Pat Buchanan got 3,411 machine-counted votes for president in this heavily Democratic county (62% voted for Al Gore and 35% voted for George Bush). This was more votes than Buchanan received in any other Florida county. One explanation for the large number of votes is the way Palm Beach County's punch-card ballot was laid out for the presidential race. Candidates were listed on both facing pages, with a single vertical row of holes between them where the voters punched their choices. The top hole was for Bush, listed at top left; the second hole was for Buchanan, listed at top right; and the third hole was for Gore, listed under Bush on the left. The layout is shown below.

[Image: the butterfly ballot layout]

Informal evaluations – Theresa LePore designed the ballot and then had it reviewed. Her usability testing, however, was limited in its scope. It initially consisted of seeking approval from two other members of the canvassing board of which she was a member. These two evaluators were intelligent and highly experienced in conducting elections – one was a county commissioner (Carol Roberts) and the other was a judge (Charles Burton). Even so, the probability of one or the other of these two people detecting the "Buchanan" problem by simply looking at the ballot was very low. I calculated it as being about two chances in 100.

Ms. LePore then sent the ballot to both the Democratic and Republican National Committees for review. If we assume that the two groups had a total of ten people look at the ballot, the probability that one or more people in this group would have found the "Buchanan" problem was also low. I calculated that they had about one chance in ten of finding the problem. Obviously, none of these reviewers identified the "Buchanan" problem.

Number of test participants

Ms. LePore was not familiar with usability testing, but neither are many other highly experienced designers. For example, shortly after the Florida voting issue became known, one highly experienced system developer wrote: "Would usability testing (which often only uses 5-20 people of each background) have caught it? I think so." He linked readers to Jakob Nielsen's Web site, where Nielsen has suggested that "100% of usability problems can be found using only 15 subjects." Neither estimate of the number of test subjects needed is correct.

How many usability test participants would have been required for Ms. LePore to feel confident of finding these types of problems?

This answer can be calculated.* If the voters in Palm Beach County had voted for Buchanan at the same rate as those in the other Florida counties, Buchanan would have received around 600 votes instead of 3,407. Many have proposed that this suggests about 2,800 votes (3,400 minus 600) were cast in error. We do not know this for sure – the votes may actually have been intended for Buchanan. Keep in mind that Buchanan received over 8,000 votes in Palm Beach County in the 1996 presidential primary when he was running against Bob Dole.

For our purposes, we will assume that the "Buchanan" problem was a difficulty for only about 1% of all the voters (2,800 "erroneous" votes divided by the 269,951 actual and potential Gore voters). My calculations show that Ms. LePore would have required 289 test participants to find 95% of the problems, which most likely would have led to detection of the "Buchanan" problem before the election. Over four hundred (423) Democratic test participants would have been required to find 99% of the problems.
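
These figures follow directly from the binomial formula described in the footnote. As a quick check, here is a small Python sketch (mine, not the author's original worksheet) that evaluates 1-(1-p)^n for the two sample sizes:

    def detection_probability(p, n):
        """Chance that at least one of n test participants encounters a problem
        that affects a proportion p of users."""
        return 1 - (1 - p) ** n

    p = 0.01  # roughly 2,800 "erroneous" votes out of about 270,000 potential Gore voters

    print(round(detection_probability(p, 289), 2))  # 0.95
    print(round(detection_probability(p, 423), 2))  # 0.99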

What most of Ms. LePore's critics are ignoring is that more than 99% of the voters had no trouble when using her ballot: those who intended to vote for Mr. Gore overwhelmingly did vote for him. Of significant interest to us is what was different about the 1% of people who had problems. Taken further, what could be done to change the ballot so that virtually everyone voted without problems?

There are several possibilities about those who had problems:

  • Were they using ballots that had been printed differently?
  • Were they much older or much younger?
  • Were they more or less intelligent?
  • Were they more or less educated?
  • Were they first-time or long-time voters?
  • Were they much taller or much shorter?
  • Did they forget their glasses?
  • Did they have trouble reading English?
  • Did they vote early in the day or late?
  • Did they vote when they were "fresh" or when they were very tired?
  • Did they have difficulty following instructions?
  • Did they receive no help or instructions, or special instructions from the voting staff?
  • Did they have accessibility problems (low vision, movement control, etc.)?
  • Did they have a condition that would hamper their voting, such as Parkinson's disease?
  • Were they taking a prescription (or illegal) drug that affected their concentration?
  • Were they very nervous (did they have high anxiety)?
  • Were they motivated just to vote, not to vote correctly?

A good usability tester would have tried to determine which of the above reasons (and possibly others) most affected the voters. Where possible, the ballot would have been changed to better accommodate the users that had problems.

The "multiple votes" problem

The same reasoning and calculations can be used with the other major problem of multiple votes. In Palm Beach County there were 19,020 other ballots that were not considered valid (they were disqualified) because the voters had punched holes for more than one presidential candidate. In the official results, there were 432,286 ballots completed in Palm Beach County. This means that 4.4% of the ballots were considered invalid (19,020/432,286).

The question is: how many test participants would have been required to detect this problem almost certainly? The same formula can be applied. I calculate that 65 participants completing a sample ballot would have been required to find 95% of the problems (94 subjects to detect 99%). This is far fewer than were required for the "Buchanan" problem because a higher percentage of voters actually made the "multiple votes" error.
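
The same arithmetic, with the higher error rate, shows why so many fewer participants suffice (again a sketch in Python; the two-decimal rounding is mine):

    p = 19020 / 432286  # about 0.044 of all ballots had the "multiple votes" error

    # chance that at least one of n test participants makes (and thereby reveals) the error
    print(round(1 - (1 - p) ** 65, 2))  # 0.95
    print(round(1 - (1 - p) ** 94, 2))  # 0.99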

The "dimpled ballot" problem

Even the highly publicized "dimpled ballot" problem could have been identified before the election.

Palm Beach County had the initial machine count on November 7, then a machine recount on November 8, and then the absentee ballots were added. The county then manually counted all 432,286 ballots cast. After the manual recount, Gore had gained about 215 more votes than Bush had. The manual recount was complicated by about 3,300 ballots that did not have clear punches for either candidate. These included ballots that were mispunched (hole in the wrong place), partially punched (the chad was still hanging), pin-hole punched (some light could be seen through the hole), almost punched (dimpled), etc. Each of these ballots was closely reviewed by the three-member canvassing board.

Would it have been possible to do a usability test that would have identified these punch-card variations before the election? It would have required highly experienced usability testers. They would have needed truly representative test participants, the actual ballots (not samples), some of the actual Votomatic punch-card machines and styluses, and test items that were truly representative of the voting experience (including the ability to not vote for certain candidates). The number of subjects needed to detect 95% of the errors would have been 115, and to detect 99% would have been 166.

Problems associated with using the punch-card machines have been known for many years. Many changes have been made to the machines to reduce these problems. In addition, a set of instructions on how to vote is provided on (a) the sample ballots, (b) the actual ballots, and (c) the walls of the voting booth itself (in large letters). The instructions say:

"STEP 3 – To vote, hold the voting instrument straight up. Punch straight down through the ballot card for the candidates of your choice." (The bolding was on the voter's instructions in the ballot.)

One final point should be made. To help shift some of the responsibility for having each ballot counted onto the voter, a final instruction, in all capital letters, is shown at the bottom of the "instructions" page:

"AFTER VOTING, CHECK YOUR BALLOT CARD TO BE SURE YOUR VOTING SELECTIONS ARE CLEARLY AND CLEANLY PUNCHED AND THERE ARE NO CHIPS LEFT HANGING ON THE BACK OF THE CARD."

Conclusion

My conclusion is that Theresa LePore should not be so severely criticized for making design decisions that led to the "Buchanan" and "Multiple votes" problems. In the past, few (if any) ballots (and their related instructions) have received the kind of rigorous usability testing that would have identified these problems before the actual election. Having a certain number of voter problems and uncounted votes has been more or less considered an acceptable part of holding elections with millions of voters. For elections that were not too close, the traditional ways of casting and counting votes have been "good enough."

Generally, usability testing has been considered too expensive. I figure that it would have cost about $20,000 to run the necessary performance tests on LePore's Palm Beach ballot. These usability tests would have enabled ballot designers to find and rectify the "Buchanan" problem, the "Multiple votes" problem, and maybe even the "dimpled ballot" problem. By comparison, the two presidential candidates spent about one billion dollars trying to get elected.

Footnote

*Calculation of required number of test participants:
A reasonable estimate of the number of participants required to detect a problem can be made by using the formula 1-(1-p)^n, where p = the probability of the usability problem occurring and n = the number of test participants. The result is the probability that at least one of the n participants will encounter the problem.

Calculating the Number of Test Subjects

Many people have requested an explanation of how to use the binomial probability formula for calculating the number of subjects needed. Hopefully, the following information will help clarify the major issues.

The original reference for the formula, as it relates to usability testing, goes back to Bob Virzi at GTE in 1990. Virzi's article was followed by one from Jim Lewis at IBM in 1993, and another one by Lewis in 1994. Many statistics books contain the formula for calculating a binomial probability, but these two sources have usability-related examples. I have taken their original write-ups and added new information plus some examples in the third edition of my Human Performance Engineering textbook (pp. 210-215).

The actual formula is 1-(1-p)^n, where p is the probability of the usability problem occurring, and n is the number of test participants required. Based on the Palm Beach County voting returns, we know that "p" is 0.01, and we are interested in finding out "n." In other words, we are trying to find a problem that is only a difficulty for one out of 100 people ("p"), and we want to estimate the number of subjects necessary to feel confident that we can find this problem (or problems).

Generally, we apply this formula to determine the minimum number of test subjects needed to find a certain percentage of the usability problems in a system or in a Web site. Unfortunately, we never know how many usability problems actually exist in a new system, and we do not know what percent of the actual problems each test subject (or heuristic evaluator) will help us find. Virzi originally proposed that it was .40 (Virzi, 1990) and Nielsen has been advocating .31 (Alertbox: March 19, 2000).

The major problem with either the .40 or the .31, or any similar number, is that it represents the proportion of usability problems found by one evaluator (or one test subject) out of the total found by all evaluators (or all test subjects). The number of usability problems found by all evaluators is not the actual number of usability problems in a system (see Bailey et al., 1992). The evaluators will miss finding or experiencing certain problems, and they will judge a relatively large number of issues to be usability problems when they are not. We usually refer to these latter issues as "false alarms" (Catani and Biers, 1998; Stanton and Stevenage, 1998; Rooden et al., 1999). Based on the studies just referenced, there can be as many as two false alarms for every true problem.

Lewis at IBM (1994) reported on a study in which the participants were usability test subjects. They used a system in which he had deliberately created numerous usability problems ("salted the mine"). The subjects experienced a combined total of 145 problems. He calculated that the average likelihood of any one subject experiencing a problem was .16 (obviously far less than .40 or .31).

If a system truly contained 145 usability problems, and if each person experienced only about 16% of all the problems, and if we had five participants, we could use the formula to calculate what percent of the problems all five subjects would be expected to uncover.

1-(1-.16)^5 = 1-(.84)^5 = 1-.42 = .58

Using the five test subjects, we would expect to find about 58% of the problems. If we used ten test subjects, what percent of the 145 problems would we expect them to uncover?

1-(1-.16)^10 = 1-(.84)^10 = 1-.17 = .83

Using the ten test subjects, we would expect to find about 83% of the usability problems. To put it another way, we would expect to find (and, hopefully, fix) about four out of five of the problems. The major assumption here is that each subject will, on average, experience about 16% of the problems.
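
A small Python sketch of this calculation, using Lewis's .16 per-subject rate (the subject counts beyond 5 and 10 are added here only for illustration):

    def expected_problem_coverage(per_subject_rate, n_subjects):
        """Expected proportion of all usability problems experienced by at least
        one of n subjects, assuming each subject independently hits
        per_subject_rate of them."""
        return 1 - (1 - per_subject_rate) ** n_subjects

    for n in (5, 10, 15, 20):
        print(n, round(expected_problem_coverage(0.16, n), 2))
    # 5 -> 0.58, 10 -> 0.83, 15 -> 0.93, 20 -> 0.97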

Usability professionals try to use the appropriate number of subjects that will enable them to accomplish the goals of a usability test as efficiently as possible. If we use too many, we can increase the cost and development time of a system. If we use too few, we may fail to detect some serious problems, and could reduce the overall usability of the product. When designing Palm Beach County's ballot, Ms. LePore used far too few.

The Buchanan (butterfly ballot) problem provided a unique experience for usability professionals. It provided us with one of the numbers we usually do not have: the actual (true) proportion of people who had difficulty voting because of one or more usability problems related to the ballot. There were 2,800 "erroneous" votes made by 272,532 actual and potential Gore voters. This was about one out of 100, or 0.01. In other words, 99% of the users (voters) dealt effectively with the ballot's usability-related problems, but 1% did not.

The question then becomes, how many test subjects would have been needed to find (identify) the usability problems that posed difficulties to this relatively small number (1%) of users? Most usability testers never worry about these problems because the cost (in terms of the time needed to conduct the tests, and the large number of test subjects needed) for finding these difficulties is too great for most systems. Obviously, if the penalty for making errors was serious injury, loss of life, huge "support" costs, losing millions of dollars in sales, or a lost presidential election, then it may be worth the money to find and fix the problems.

I applied the binomial probability formula to estimate the number of usability test subjects Ms. LePore would have needed. In this case, "p" is .01, which is the probability of the usability problem occurring. Without building a special program to solve for "n," we simply increased "n" in the formula until we found the number of subjects needed to find either 95% or 99% of the ballot problems using a usability test.

  • Showing the ballot to two co-workers: 1-(1-.01)^2 = 1-(.99)^2 = 1-.98 = .02
  • Showing the ballot to 10 people in the parties (12 reviewers in total): 1-(1-.01)^12 = 1-(.99)^12 = 1-.89 = .11
  • Using "Jakob's 15 subjects:" 1-(1-.01)^15 = 1-(.99)^15 = 1-.86 = .14
  • Using 50 subjects: 1-(1-.01)^50 = 1-(.99)^50 = 1-.61 = .39
  • Using 100 subjects: 1-(1-.01)^100 = 1-(.99)^100 = 1-.37 = .63
  • Using 200 subjects: 1-(1-.01)^200 = 1-(.99)^200 = 1-.13 = .87
  • Using 250 subjects: 1-(1-.01)^250 = 1-(.99)^250 = 1-.08 = .92
  • Using 289 subjects: 1-(1-.01)^289 = 1-(.99)^289 = 1-.05 = .95
  • Using 423 subjects: 1-(1-.01)^423 = 1-(.99)^423 = 1-.01 = .99

Another way of thinking about the problem: if any one participant has a low probability of having difficulty with the ballot, which the actual numbers show, then the total number of participants needed to find difficulties like the Buchanan (butterfly ballot) problem becomes very high. In this case, 289 subjects would be needed to find 95% of those problems that are difficulties for only a very small proportion (1%) of voters. Four hundred and twenty-three would be needed to find 99%.
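
The table above can be reproduced in a few lines of Python (a minimal sketch; the two-decimal rounding is mine):

    p = 0.01  # proportion of voters affected by the "Buchanan" problem

    for n in (2, 12, 15, 50, 100, 200, 250, 289, 423):
        probability_of_detection = 1 - (1 - p) ** n
        print(n, "reviewers or subjects:", round(probability_of_detection, 2))
    # 2: 0.02, 12: 0.11, 15: 0.14, 50: 0.39, 100: 0.63,
    # 200: 0.87, 250: 0.92, 289: 0.95, 423: 0.99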

References

Bailey, R.W. (1996), Human Performance Engineering: Designing High Quality, Professional User Interfaces for Computer Products, Applications and Systems, Prentice Hall: Englewood Cliffs, NJ.

Bailey, R.W., Allen, R.W. and Raiello, P. (1992), Usability testing vs. heuristic evaluation: A head-to-head comparison, Proceedings of the Human Factors Society 36th Annual Meeting, 409-413.

Catani, M. B. and Biers, D. W. (1998), Usability evaluation and prototype fidelity: Users and usability professionals, Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting, 1331-1335.

Lewis, J.R. (1994), Sample sizes for usability studies: Additional considerations, Human Factors, 36(2), 368-378.

Lewis, J.R. (1993), Problem discovery in usability studies: A model based on the binomial probability formula, Proceedings of the 5th International Conference on Human-Computer Interaction, 666-671.

Rooden, M.J., Green, W.S. and Kanis, H. (1999), Difficulties in usage of a coffeemaker predicted on the basis of design models, Proceedings of the Human Factors and Ergonomics Society 43rd Annual Meeting, 476-480.

Stanton, N.A. and Stevenage, S.V. (1998), Learning to predict human error: Issues of acceptability, reliability and validity, Ergonomics, 41(11), 1737-1747.

Virzi, R.A. (1990), Streamlining the design process: Running fewer subjects, Proceedings of the Human Factors Society 34th Annual Meeting, 291-294.



