Criterion-Referenced Assessment
Criterion-referenced testing (CRT) was introduced in the United States in the 1960s in response to the need for assessments that could determine what persons knew and could do in relation to a well-defined domain of knowledge and skills, rather than in relation to other persons [3, 15]. With CRT score information, the level of proficiency of candidates can be determined, and in many cases diagnostic information can be provided that helps candidates work on their weaknesses. Today, CRTs are widely used in education, the military, and industry [9]. This entry begins with a brief description of the differences between norm-referenced tests (NRTs) and CRTs. It is because of these fundamental differences that a number of challenges have arisen for CRTs, standard setting and reliability estimation among them, and it is these technical challenges that are the focus of this entry. NRTs, on the other hand, have received extensive research and development over the years, and from a technical perspective few challenges remain for their effective use.

Differences Between Norm-referenced and Criterion-referenced Tests

Criterion-referenced tests and norm-referenced tests serve different purposes, and these differences have implications for test development and evaluation. Norm-referenced tests are primarily intended to distinguish or compare examinees on the construct measured by the test. Examinees are essentially rank-ordered on the basis of their test scores. For the rank ordering to be reliable, the test needs to spread out the examinees so that the ever-present measurement errors do not unduly distort the ranking that would have been obtained had true scores been used. A good norm-referenced test therefore spreads out the examinee scores, and items of middle difficulty and high discriminating power are usually best for this purpose: test score variability needs to be maximized to the extent possible, given constraints on such things as test content and test length. Test score reliability is judged by the stability of the examinee rankings or scores over parallel-form or test-retest administrations. Proxies for a double administration of the test, or for the administration of parallel forms, come from single-administration reliability estimates such as the corrected split-half estimate (the correlation between scores derived from two halves of the test, adjusted upward by the Spearman–Brown formula to predict the reliability of the full-length test) and internal-consistency estimates (coefficient alpha for polytomous response data, and the KR-20 and KR-21 formulas for binary data). Validity is established by how well the scores serve their intended purpose; the evidence might come from criterion-related or construct validity studies (see Validity Theory and Applications).
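The single-administration reliability estimates just described can be computed directly from an item-score matrix. The following is a minimal sketch under the assumption of binary (0/1) item scores; the array name `scores` and the odd/even item split are illustrative choices, not part of the entry.

```python
import numpy as np

def corrected_split_half(scores: np.ndarray) -> float:
    """Correlate odd- and even-item half-test scores, then apply the
    Spearman-Brown correction to predict full-length reliability."""
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)               # Spearman-Brown for doubled length

def kr20(scores: np.ndarray) -> float:
    """KR-20 internal-consistency estimate for binary item data."""
    k = scores.shape[1]                            # number of items
    p = scores.mean(axis=0)                        # item difficulties
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)
```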
CRTs, on the other hand, are intended to indicate an examinee's level of proficiency in relation to a well-defined domain of content. Usually, scores are interpreted in relation to a set of performance standards that are set on the test score reporting scale. In item selection, the primary focus is not on item statistics, as it is when building an NRT, although these remain of some concern (for example, items with negative point-biserial correlations would never be selected); rather, the primary focus is on the match of the items to the content domain being measured by the test. Test items are needed that ensure the content validity of the test, and so content is a primary consideration in item selection. That there may be limited score variability in the population of examinees is of no consequence, since examinee scores, independent of those of other examinees, are compared to the content domain covered by the test and to the performance standards in place for test score interpretation and use. Today, with many state criterion-referenced tests, examinees are assigned, on the basis of their test scores, to one of four performance levels: Failing, Basic, Proficient, and Advanced. Performance standards are the points on the reporting scale that are used to sort examinees into the performance levels. For criterion-referenced credentialing exams, normally only two performance levels are used: passing and failing. Reliability is established not by correlational statistics, as is the case with NRTs, but by assessing the consistency of the performance classifications of examinees over retests and parallel forms. Proxies for decision consistency estimated from a single administration are also possible and will be discussed later in this entry. Validity is typically assessed by how well the test items measure the content domain to which the test scores are referenced. Validity also depends on the performance standards that are set for sorting candidates into performance categories. If they are set improperly (perhaps too high or too low because of a political agenda of the panelists who set them), then examinees will be misclassified (relative to how they would be classified if true scores were available and a valid set of performance standards were in place), and the validity of the resulting performance classifications is reduced. What is unique about CRTs is the central focus on the content measured by the test and, subsequently, on how the performance standards are set and on the decision consistency and accuracy of the resulting examinee classifications. These technical problems are addressed next.

Setting Performance Standards

Setting performance standards on CRTs has always been problematic (see [1]) because substantial judgment is involved in designing a process for setting them, and no agreement exists in the field about the best choice of methods (see Setting Performance Standards: Issues, Methods). A single instructor may be acceptable for setting performance standards on a classroom test (consequences are usually low for students, and the instructor is normally the person best qualified to set the standards), but when the stakes of the testing are higher (e.g., deciding who will receive a high school diploma or a certificate to practice in a profession), multiple judges or panelists are needed to defend the resulting performance standards. Of course, with multiple panelists, each with their own opinion, the challenge is to take them through a process that will converge on a defensible set of performance standards. In some cases, two or more randomly equivalent panels are even set up so that the replicability of the performance standards can be checked. Even multiple panels may not appease the critics: the composition of the panel or panels, and the number of panel members, can become a basis for criticism.
Setting valid performance standards involves many steps (see [4]): choosing the composition of the panel or panels and selecting a representative sample of panel members; preparing clear descriptions of the performance levels; developing clear and straightforward materials for panelists to use in the process; choosing a standard-setting method that is appropriate for the characteristics of both the test and the panel (for example, some methods can be used only with multiple-choice test items, and other methods require item statistics); ensuring effective training (normally, this is best accomplished with field testing in advance of the actual standard-setting process); allowing sufficient time for panelists to complete their ratings, participate in discussions, and revise their ratings (this last activity is not always part of a standard-setting process); compiling the panelists' ratings and deriving the performance standards; collecting validity evidence from the panelists; analyzing the available data; and documenting the process itself.

Counting variations, there are probably over 100 methods for setting performance standards [1]. Most of the methods involve panelists making judgments about the items in the test. For example, with one widely used method, panelists predict the expected performance of borderline candidates on every item on the test, separately at the Basic, Proficient, and Advanced cut scores. For a given cut score, these expected item scores are summed to arrive at a cut score for each panelist and then averaged across panelists to arrive at an initial cut score for the panel; the process is repeated to arrive at each of the cut scores (see the sketch at the end of this section). Normally, discussion follows, panelists have an opportunity to revise their ratings, and the cut scores are then recalculated. At some point during the process, panelists may be given item statistics or information about the consequences of the cut scores they have set (e.g., that with a particular cut score, 20% of the candidates would fail). This is known as the Angoff method.

In another approach to setting performance standards, persons who know the candidates (called 'reviewers') and who know the purpose of the test might be asked to sort candidates into the four performance categories: Failing, Basic, Proficient, and Advanced. A cut score to distinguish Failing from Basic on the test is determined by examining the actual test score distributions of candidates who were assigned to the Failing and Basic categories by the reviewers. The cut score is chosen to maximize the agreement between the classifications based on the test and those made by the reviewers. The process is then repeated for the other cut scores. This is known as the contrasting groups method. Sometimes, other criteria for placing the cut scores might be used, such as treating one type of classification error (e.g., false-positive errors) as twice as serious as another (e.g., false-negative errors).

Many more methods exist in the measurement literature: Angoff, Ebel, Nedelsky, contrasting groups, borderline group, bookmark, booklet classification, and so on. See [1] and [5] for complete descriptions of many of the current methods.
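As a small illustration of the Angoff aggregation described above, the following sketch computes one panel's initial cut score for a single performance level; the ratings are invented for the example, and the calculation would be repeated for each cut score.

```python
import numpy as np

# ratings[p, i]: panelist p's judged probability that a borderline examinee at this
# performance level answers item i correctly (invented values for illustration).
ratings = np.array([
    [0.6, 0.7, 0.4, 0.8, 0.5],   # panelist 1
    [0.5, 0.8, 0.5, 0.7, 0.6],   # panelist 2
    [0.7, 0.6, 0.3, 0.9, 0.5],   # panelist 3
])

panelist_cut_scores = ratings.sum(axis=1)     # expected borderline test score per panelist
panel_cut_score = panelist_cut_scores.mean()  # initial cut score for the panel

print(panelist_cut_scores)   # per-panelist sums: 3.0, 3.1, 3.0
print(panel_cut_score)       # about 3.03 raw-score points on this 5-item test
```

After discussion, panelists would revise their ratings and the same aggregation would simply be rerun.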
Assessing Decision Consistency and Accuracy

Reliability of test scores refers to the consistency of test scores over time, over parallel forms, or over the items within the test. It follows naturally from this definition that the calculation of reliability indices would require a single group of examinees taking two forms of a test, or even a single test a second time, but this is often not realistic in practice. Thus, it is routine to report single-administration reliability estimates such as corrected split-half estimates and/or coefficient alpha. Accuracy of test scores is another important concern, often checked by comparing test scores against a criterion score, and this constitutes a main aspect of validity [8]. With CRTs, examinee performance is typically reported in performance categories, and so the reliability and validity of the examinee classifications are of greater importance than the reliability and validity associated with the test scores themselves. That is, with CRTs the consistency and accuracy of the decisions based on the test scores outweigh the consistency and accuracy of the test scores.

As noted by Hambleton and Slater [7], before 1973 it was common to report a KR-20 or a corrected split-half reliability estimate to support the use of a credentialing examination. Since these two indices only provide estimates of the internal consistency of examination scores, Hambleton and Novick [6] introduced the concept of the consistency of decisions based on test scores and suggested that the reliability of classification decisions be defined in terms of the consistency of examinee decisions resulting from two administrations of the same test or of parallel forms, that is, an index of reliability that reflects the consistency of classifications across repeated testing. Compared with the definition of decision consistency (DC) given by Hambleton and Novick [6], decision accuracy (DA) is the 'extent to which the actual classifications of the test takers agree with those that would be made on the basis of their true scores, if their true scores could somehow be known' [12].

Methods of Estimating DC and DA

The introduction of the definition of DC by Hambleton and Novick [6] pointed to a new direction for evaluating the reliability of CRT scores: the focus was to be on the reliability of the classifications or decisions rather than on the scores themselves. Swaminathan, Hambleton, and Algina [19] extended the Hambleton–Novick concept of decision consistency to the case of more than two performance categories:

$p = \sum_{i=1}^{k} p_{ii}$,

where $p_{ii}$ is the proportion of examinees consistently assigned to the i-th performance category across the two administrations, and k is the number of performance categories. To correct for chance agreement, and building on the kappa coefficient of Cohen [2] (see Rater Agreement – Kappa), a generalized proportion-agreement index frequently used to estimate inter-judge agreement, Swaminathan, Hambleton, and Algina [20] put forward the kappa statistic, defined by

$\kappa = \frac{p - p_c}{1 - p_c}$,

where p is the proportion of examinees classified into the same categories across administrations, and $p_c$ is the agreement expected on the basis of chance alone. The concepts of decision consistency and kappa were quickly accepted by the measurement field for use with CRTs, but the requirement of a double administration was impractical, and a number of researchers introduced single-administration estimates of decision consistency and kappa, analogous to the corrected split-half reliability that was often the choice of researchers working with NRTs.
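The following minimal sketch computes these two double-administration indices from a cross-classification of examinees over two administrations; the counts are invented for illustration, and the three categories are arbitrary.

```python
import numpy as np

# counts[i, j]: number of examinees placed in category i on the first administration
# and category j on the second (invented data for a three-category example).
counts = np.array([
    [40,  8,  2],
    [10, 55,  5],
    [ 2,  6, 22],
])

props = counts / counts.sum()                          # joint proportions p_ij
p = np.trace(props)                                    # decision consistency: sum of p_ii
p_c = (props.sum(axis=1) * props.sum(axis=0)).sum()    # chance agreement from the margins
kappa = (p - p_c) / (1 - p_c)                          # chance-corrected agreement

print(f"DC = {p:.3f}, kappa = {kappa:.3f}")
```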
Huynh [11] put forward his two-parameter 'bivariate beta-binomial model'. His model relies on the assumption that the ability scores of a group of examinees follow the beta distribution with parameters α and β, and that the frequency of the observed test scores x follows the beta-binomial (or negative hypergeometric) distribution with the same parameters:

$f(x) = \binom{n}{x} \, \frac{B(\alpha + x, \; n + \beta - x)}{B(\alpha, \beta)}, \qquad x = 0, 1, \ldots, n,$

where n is the total number of items in the test and B is the beta function. The parameters α and β can be estimated either by the moment method, making use of the first two moments of the observed test scores, or by the maximum likelihood (ML) method described in his paper. The probability that an examinee has been consistently classified into a particular category can then be calculated by using the beta-binomial density function. Hanson and Brennan [10] extended Huynh's approach by using the four-parameter beta distribution for true scores.

Subkoviak's method [18] is based on the assumptions that observed scores are independent and binomially distributed, with two parameters: the number of items and the examinee's proportion-correct true score. His procedure estimates the true score for each individual examinee without making any distributional assumptions about true scores. When combined with the binomial or compound binomial error model, the estimated true score provides a consistency index for each examinee, and averaging this index over all examinees gives the DC index.

Since the previous methods all deal with binary data, Livingston and Lewis [12] proposed a method that can be used with dichotomous data, polytomous data, or a combination of the two. It involves estimating the distribution of the proportional true scores $T_p$ using strong true score theory [13]. This theory assumes that the proportional true score distribution has the form of a four-parameter beta distribution with density

$f(T_p) = \frac{(T_p - l)^{\alpha - 1}\,(u - T_p)^{\beta - 1}}{B(\alpha, \beta)\,(u - l)^{\alpha + \beta - 1}}, \qquad l \le T_p \le u,$

where B is the beta function, l and u are the lower and upper limits of the true scores, and the four parameters can be estimated by using the first four moments of the observed scores for the group of examinees. The conditional distribution of scores on an alternate form (given true score) is then estimated using a binomial distribution.
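A minimal sketch of a Huynh-style single-administration estimate of DC under the two-parameter beta-binomial model is given below. It fits α and β by the method of moments, builds the implied joint distribution of scores on two hypothetical parallel forms (obtained by integrating the product of two binomial likelihoods over the beta ability distribution), and sums the probability in the cells where both scores fall in the same category. The synthetic data, cut scores, and function names are illustrative assumptions, not taken from Huynh's paper.

```python
import numpy as np
from scipy.special import betaln, gammaln

def fit_beta_binomial(scores, n_items):
    """Method-of-moments estimates of alpha and beta from observed total scores."""
    mean, var = scores.mean(), scores.var(ddof=1)
    p_bar = mean / n_items
    m = (n_items**2 * p_bar * (1 - p_bar) - var) / (var - n_items * p_bar * (1 - p_bar))
    return m * p_bar, m * (1 - p_bar)                 # alpha, beta

def log_choose(n, k):
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def decision_consistency(n_items, alpha, beta, cuts):
    """P(scores on two parallel forms fall in the same category), categories set by cuts."""
    x = np.arange(n_items + 1)
    x1, x2 = np.meshgrid(x, x, indexing="ij")
    log_joint = (log_choose(n_items, x1) + log_choose(n_items, x2)
                 + betaln(alpha + x1 + x2, beta + 2 * n_items - x1 - x2)
                 - betaln(alpha, beta))
    joint = np.exp(log_joint)                         # joint pmf of the two form scores
    cat1, cat2 = np.digitize(x1, cuts), np.digitize(x2, cuts)
    return joint[cat1 == cat2].sum()

# Example with synthetic data: a 40-item test with cut scores at 20 and 30.
rng = np.random.default_rng(0)
observed = rng.binomial(40, rng.beta(6, 4, size=500))  # simulated observed scores
a, b = fit_beta_binomial(observed, 40)
print(decision_consistency(40, a, b, cuts=[20, 30]))
```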
All of the previously described methods operate within the framework of classical test theory (CTT). With the popularization of item response theory (IRT), the evaluation of decision consistency and accuracy under IRT has attracted the interest of researchers. For example, Rudner [16] proposed a procedure for computing expected classification accuracy for tests consisting of dichotomous items and later extended the method to tests including polytomous items [17]. It should be noted that Rudner referred to θ and $\hat{\theta}$ as 'true score' and 'observed score', respectively, in his papers. He pointed out that, because for any given true score θ the corresponding observed score is expected to be normally distributed with mean θ and standard deviation se(θ), the probability that an examinee with true score θ has an observed score in the interval [a, b] on the theta scale is given by

$P(a \le \hat{\theta} \le b \mid \theta) = \Phi\!\left(\frac{b - \theta}{se(\theta)}\right) - \Phi\!\left(\frac{a - \theta}{se(\theta)}\right),$

where Φ is the cumulative normal distribution function. He noted further that multiplying this probability by the expected proportion of examinees whose true score is θ yields the expected proportion of examinees with true score θ and an observed score in [a, b], and summing or integrating over all true scores in an interval [c, d] gives the expected proportion of all examinees that have a true score in [c, d] and an observed score in [a, b]. If we are willing to assume that the examinees' true scores (θ) are normally distributed, the expected proportion of all examinees that have a true score in the interval [c, d] and an observed score in the interval [a, b] is given by

$P = \int_c^d \left[ \Phi\!\left(\frac{b - \theta}{se(\theta)}\right) - \Phi\!\left(\frac{a - \theta}{se(\theta)}\right) \right] f(\theta)\, d\theta,$

where se(θ) is the reciprocal of the square root of the test information function at θ (the sum of the item information functions in the test), and f(θ) is the standard normal density function [16]. The problem with this method, of course, is that the normality assumption is often questionable.
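A minimal sketch of this calculation is given below, assuming a short 2PL test with made-up item parameters and a standard normal ability distribution; the item parameters, cut score, and integration limits are illustrative, not values from Rudner's papers.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

a_params = np.array([1.2, 0.8, 1.5, 1.0, 0.9])     # assumed 2PL discriminations
b_params = np.array([-0.5, 0.0, 0.3, 1.0, -1.2])   # assumed 2PL difficulties

def se_theta(theta):
    """Conditional standard error: reciprocal square root of the test information."""
    p = 1.0 / (1.0 + np.exp(-a_params * (theta - b_params)))
    return 1.0 / np.sqrt(np.sum(a_params**2 * p * (1 - p)))

def p_observed_in(theta, lo, hi):
    """P(observed score falls in [lo, hi] | true score theta), normal error model."""
    se = se_theta(theta)
    return norm.cdf((hi - theta) / se) - norm.cdf((lo - theta) / se)

def expected_proportion(true_interval, obs_interval):
    """Expected proportion with true score in true_interval and observed score in
    obs_interval, integrating over a standard normal true-score distribution."""
    integrand = lambda t: p_observed_in(t, *obs_interval) * norm.pdf(t)
    return quad(integrand, *true_interval)[0]

# e.g., proportion with true ability above a cut of 0.5 but classified below it
print(expected_proportion(true_interval=(0.5, 5.0), obs_interval=(-5.0, 0.5)))
```

Summing the proportions for the cells in which the true and observed categories agree gives the overall DA index.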
Reporting of DC and DA

Table 1 presents a typical example of how the DC of performance classifications is reported. Each of the diagonal elements represents the proportion of examinees in the total sample who were consistently classified into a given category on both administrations (the second one being hypothetical), and summing all the diagonal elements yields the total DC index. It is now common practice to report kappa in test manuals to provide information on the degree of agreement in performance classifications after correcting for the agreement due to chance. Also reported is the 'conditional error', the measurement error associated with test scores at each of the performance standards; it is helpful because it indicates the size of the measurement error for examinees close to each performance standard. The values of DA are usually reported in the same way as in Table 1, except that the cross-tabulation is between 'true score status' and 'test score status'. Of course, it is highly desirable that test manuals also report other evidence to support the score inferences from a CRT, for example, evidence of content, criterion-related, construct, and consequential validity.

Appropriate Levels of DC and DA

A complete set of approaches for estimating decision consistency and accuracy is contained in Table 2. Note that the value of DA is higher than that of DC, because the calculation of DA involves one set of observed scores and one set of true scores, which are assumed to be free of the measurement error that arises from improper sampling of test questions, flawed test items, problems with the test administration, and so on, whereas the calculation of DC involves two sets of observed scores. The levels of DC and DA required in practice will depend on the intended uses of the CRT and the number of performance categories. No established rules exist to help determine the levels of decision consistency and accuracy needed for different kinds of educational and psychological assessments; in general, the more important the educational decision to be made, the higher the consistency and accuracy need to be.

References

[1] Cizek, G., ed. (2001). Setting Performance Standards: Concepts, Methods, and Perspectives, Lawrence Erlbaum, Mahwah.
[2] Cohen, J. (1960). A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20, 37–46.
[3] Glaser, R. (1963). Instructional technology and the measurement of learning outcomes, American Psychologist 18, 519–521.
[4] Hambleton, R.K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process, in Setting Performance Standards: Concepts, Methods, and Perspectives, G. Cizek, ed., Lawrence Erlbaum, Mahwah, pp. 89–116.
[5] Hambleton, R.K., Jaeger, R.M., Plake, B.S. & Mills, C.N. (2000). Setting performance standards on complex performance assessments, Applied Measurement in Education 24(4), 355–366.
[6] Hambleton, R.K. & Novick, M.R. (1973). Toward an integration of theory and method for criterion-referenced tests, Journal of Educational Measurement 10(3), 159–170.
[7] Hambleton, R.K. & Slater, S. (1997). Reliability of credentialing examinations and the impact of scoring models and standard-setting policies, Applied Measurement in Education 10(1), 19–38.
[8] Hambleton, R.K. & Traub, R. (1973). Analysis of empirical data using two logistic latent trait models, British Journal of Mathematical and Statistical Psychology 26, 195–211.
[9] Hambleton, R.K. & Zenisky, A. (2003). Advances in criterion-referenced testing methods and practices, in Handbook of Psychological and Educational Assessment of Children, 2nd Edition, C.R. Reynolds & R.W. Kamphaus, eds, Guilford, New York, pp. 377–404.
[10] Hanson, B.A. & Brennan, R.L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models, Journal of Educational Measurement 27, 345–359.
[11] Huynh, H. (1976). On the reliability of decisions in domain-referenced testing, Journal of Educational Measurement 13, 253–264.
[12] Livingston, S.A. & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores, Journal of Educational Measurement 32, 179–197.
[13] Lord, F.M. (1965). A strong true score theory, with applications, Psychometrika 30, 239–270.
[14] Massachusetts Department of Education. (2001). 2001 Massachusetts MCAS Technical Manual, Author, Malden.
[15] Popham, W.J. & Husek, T.R. (1969). Implications of criterion-referenced measurement, Journal of Educational Measurement 6, 1–9.
[16] Rudner, L.M. (2001). Computing the expected proportions of misclassified examinees, Practical Assessment, Research & Evaluation 7(14).
[17] Rudner, L.M. (2004). Expected classification accuracy, Paper presented at the meeting of the National Council on Measurement in Education, San Diego.
[18] Subkoviak, M.J. (1976). Estimating reliability from a single administration of a criterion-referenced test, Journal of Educational Measurement 13, 265–276.
[19] Swaminathan, H., Hambleton, R.K. & Algina, J. (1974). Reliability of criterion-referenced tests: a decision-theoretic formulation, Journal of Educational Measurement 11, 263–267.
[20] Swaminathan, H., Hambleton, R.K. & Algina, J. (1975). A Bayesian decision-theoretic procedure for use with criterion-referenced tests, Journal of Educational Measurement 12, 87–98.

RONALD K. HAMBLETON AND SHUHONG LI