Of all the research instruments out there, the Likert scale has to be one of the most popular. In the world of elite sport, Likert scales play a prominent role in wellness monitoring. Many if not most of our clients fill in or analyse a wellness survey every day.
Yet despite the Likert scale’s ubiquity (it appears not only in sports science but also in medicine, psychology and the social sciences), what is it that makes a Likert scale valid and reliable?
In this blog post, I am going to explore just one aspect of Likert scale quality: the number of response categories.
Since psychologist Rensis Likert (actually pronounced Lick-ert) invented the Likert scale in 1932, people have tried to optimise the number of scale points.
Scales from anywhere between just two points up to a whopping 101 points have been suggested at one time or another.
Below I have very briefly summarised some of the key take-aways from the review Optimal number of response categories in rating scales (Preston & Colman, 2000) and from the chapter Question and Questionnaire Design in the Handbook of Survey Research (Krosnick & Presser, 2010).
Most studies use the reliability of a Likert scale as the metric for determining the optimal number of scale points, where a Likert scale is reliable if it gives the same results when measuring the same unchanged object or event.
• Schutz and Rucker (1975) reported that the number of response categories has little to no effect on the results. This finding is the exception in the literature: most studies assert that the number of response categories does matter, but Schutz and Rucker’s results are nonetheless worth mentioning.
• Garner (1960) argued that having more response categories maximises the amount of information obtained; in other words, more categories allow the researcher to better discriminate between participants. He suggested that maximum information is obtained with more than twenty response categories.
• Conversely, Green and Rao (1970) found that seven response categories maximised the information obtained, with little extra information gained beyond seven.
• Jones (1968) examined respondent preferences for scales with two or seven response categories and reported that the dichotomous scale was seen as less “accurate”, less “reliable”, less “interesting” and more “ambiguous” than the seven-point scale. Respondents clearly preferred multiple-category scales over dichotomous ones.
• Cicchetti, Showalter and Tyrer (1985) showed, using Monte Carlo simulations, that reliability increased when moving from two-point to seven-point scales, but that beyond seven points (even up to 100 response categories) no further increase in reliability was found. They concluded that eight-, nine-, ten- or even 100-point scales are no more reliable than a seven-point scale.
• Also using simulations, Lissitz and Green (1975) found the same effect when comparing two-, three-, five-, seven-, nine- and fourteen-point scales.
• Oaster (1989), Finn (1972), Nunnally (1967) and Ramsay (1973) reported that reliability is maximised with seven-point scales, whereas Jenkins and Taber (1977), McKelvie (1978) and Remmers and Ewart (1941) found evidence of higher reliability for five-point scales.
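To make the simulation findings above concrete, here is a minimal Monte Carlo sketch (my own illustrative assumptions, not the actual procedure from Cicchetti et al. or Lissitz and Green): simulate a continuous latent trait, add independent measurement noise on two occasions, discretise each observation onto a k-point scale, and estimate test-retest reliability as the correlation between the two sets of ratings.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulated_reliability(k, n=10_000, noise_sd=0.5):
    """Estimate test-retest reliability of a k-point scale via simulation."""
    latent = rng.standard_normal(n)              # true scores
    obs1 = latent + rng.normal(0, noise_sd, n)   # measurement, occasion 1
    obs2 = latent + rng.normal(0, noise_sd, n)   # measurement, occasion 2
    # Equal-width bins spanning roughly +/- 3 SD, mapped to categories 1..k
    edges = np.linspace(-3, 3, k + 1)[1:-1]
    r1 = np.digitize(obs1, edges) + 1
    r2 = np.digitize(obs2, edges) + 1
    # Pearson correlation of the two ratings = test-retest reliability
    return np.corrcoef(r1, r2)[0, 1]

for k in (2, 3, 5, 7, 9, 14, 100):
    print(f"{k:>3}-point scale: reliability ~ {simulated_reliability(k):.3f}")
```

With these (hypothetical) noise settings, the estimated reliability rises sharply from two to around seven categories and then plateaus, mirroring the pattern reported in the simulation studies.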
Preston and Colman (2000) found that scales with seven, eight, nine or ten categories were more reliable than scales with two, three or four categories. Furthermore, they found that respondents preferred the ten-point scale, followed by the seven-point scale and the nine-point scale.
The evidence above comprises snippets from only two reviews and is not sports-science specific. Nonetheless, in my experience it is safe to say that the literature supports Likert scales with somewhere between five and ten scale points.
In most situations, I believe the optimal number of categories is seven, for the following reasons:
1. Miller (1956) showed that the average human mind has a span of apprehension capable of distinguishing about seven (plus or minus two) different items.
2. Dickinson and Zellinger (1980) and Krosnick and Berent (1993) argue that respondent preference and reliability are higher when all points are labelled with words as opposed to numbers. When numbers are used, respondents have to translate the scale points into verbal definitions anyway (Krosnick & Presser, 2010). It is possible to provide an effective verbal label for each point on a scale containing up to seven points, but doing so becomes more difficult as the number of scale points increases beyond that length.
3. O’Muircheartaigh, Krosnick and Helic (1999) found that adding midpoints to rating scales improved the reliability and validity of ratings. Structural equation modelling of error structures revealed that omitting the middle alternative led respondents to randomly select one of the moderate scale points closest to where a midpoint would appear, which suggests that adding a midpoint is desirable. Thus, Likert scales should have an odd number of response categories in order to include a midpoint, which, together with the above arguments, points to seven being the optimal number of response categories.
Likert scales containing either five, seven, nine or ten response categories have the most backing in the literature in terms of validity, reliability, discriminatory power and respondent preferences. However, despite more than 80 years of research, different researchers may still have different opinions as to what constitutes the optimal number of scale points.
Have Your Say!
ATTENTION FOLLOWERS: We need your input! In your professional opinion, what is the optimal number of scale points in a Likert scale?
— Fusion Sport (@FusionSport) June 2, 2017
1. Cicchetti, D. V., Showalter, D., & Tyrer, P. J. (1985). The effect of number of rating scale categories on levels of inter-rater reliability: A Monte Carlo investigation. Applied Psychological Measurement, 9, 31-36.
2. Dickinson, T. L., & Zellinger, P. M. (1980). A comparison of the behaviorally anchored rating and mixed standard scale formats. Journal of Applied Psychology, 65, 147-154.
3. Finn, R. H. (1972). Effects of some variations in rating scale characteristics on the means and reliabilities of ratings. Educational and Psychological Measurement, 34, 885-892.
4. Garner, W. R. (1960). Rating scales, discriminability, and information transmission. Psychological Review, 67, 343-352.
5. Green, P. E., & Rao, V. R. (1970). Rating scales and information recovery: How many scales and response categories to use? Journal of Marketing, 34, 33-39.
6. Jenkins, G. D., Jr., & Taber, T. D. (1977). A Monte Carlo study of factors affecting three indices of composite scale reliability. Journal of Applied Psychology, 62, 392-398.
7. Jones, R. R. (1968). Differences in response consistency and subjects’ preferences for three personality inventory response formats. In Proceedings of the 76th Annual Convention of the American Psychological Association (pp. 247-248).
8. Krosnick, J. A., & Berent, M. K. (1993). Comparisons of party identification and policy preferences: The impact of survey question format. American Journal of Political Science, 37, 941-964.
9. Krosnick, J. A., & Presser, S. (2010). Question and questionnaire design. In Handbook of Survey Research (2nd ed.). Elsevier.
10. Lissitz, R. W., & Green, S. B. (1975). Effect of the number of scale points on reliability: A Monte Carlo approach. Journal of Applied Psychology, 60, 10-13.
11. McKelvie, S. J. (1978). Graphic rating scales: How many categories? British Journal of Psychology, 69, 185-202.
12. Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81-97.
13. Nunnally, J. C. (1967). Psychometric theory. New York: McGraw-Hill.
14. Oaster, T. R. F. (1989). Number of alternatives per choice point and stability of Likert-type scales. Perceptual and Motor Skills, 68, 549-550.
15. O’Muircheartaigh, C., Krosnick, J. A., & Helic, A. (1999, May). Middle alternatives, acquiescence, and the quality of questionnaire data. Paper presented at the American Association for Public Opinion Research Annual Meeting, St. Petersburg, FL.
16. Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104, 1-15.
17. Ramsay, J. O. (1973). The effect of number of categories in rating scales on precision of estimation of scale values. Psychometrika, 38, 513-533.
18. Remmers, H. H., & Ewart, E. (1941). Reliability of multiple-choice measuring instruments as a function of the Spearman-Brown prophecy formula. Journal of Educational Psychology, 32, 61-66.
19. Schutz, H. G., & Rucker, M. H. (1975). A comparison of variable configurations across scale lengths: An empirical study. Educational and Psychological Measurement, 35, 319-324.