Giving meaning to test scores
By Catherine Close, PhD, Psychometrician
Test scores are of much interest to parents and educators. We all want our children to achieve their best, so we frequently use tests to measure what a student has learned and can do as a result of instruction. Sometimes we want to gauge progress, but often we make important decisions such as placement in intervention or advanced classes, grade promotion, and so on.
If we are going to use test scores for these types of decisions, we need to ensure that the scores are meaningful. So, let’s talk about how we give meaning to scores.
First, consider the quality of the test. As we discussed in a previous blog, we look for high reliability of scores and convincing evidence of validity.
Second, even with a good test, scores are inherently meaningless. Yes, that’s true. Take something commonplace such as taking your temperature. Whatever your temperature is, you wouldn’t know whether it’s good or bad without something to compare it to. That something is the knowledge that 98.6° F is considered the normal body temperature.
In this case, 98.6° F is what we call a standard or a norm to which we compare our temperature measurements. Knowing that standard immediately gives meaning to your number as low or high or whatever the case might be.
The same is true for educational test scores. For scores to have meaning, we must have a well-defined standard or norm to which we compare the scores. We define that standard or norm using two main approaches: a criterion-referenced approach and a norm-referenced approach. Let me say here that you couldn’t tell the difference between a criterion-referenced test and a norm-referenced test just by looking at one. This is because the difference is in the scores. We report different scores based on the interpretations we want to make.
Let’s take an example of a third grade student. This student has tested in Renaissance Star Reading® in the first month of the school year and scored at 365. What are some of the possible interpretations of that score?
For criterion-referenced interpretations, we look for scores that describe the specific knowledge and skills that the student has most likely achieved. The standard against which the 365 score is judged can be as simple as percentage of tasks performed correctly (e.g., answering 80% of the items correctly); or performance levels such as Below Basic, Basic, Proficient, and Advanced. The standard in this case is commonly referred to as a criterion. Whatever the criterion, the goal is to assess where the student’s level of knowledge and skills is in relation to that criterion. We then simply state whether the student has mastery or not, is proficient or not, and so on.
In computerized adaptive tests, students don’t see exactly the same questions, so the percentage of questions answered correctly is not a meaningful metric. However, we can use item response theory to compute what we call domain scores. Domain scores range from 0 to 100 and show the average level of skill mastery for students who obtain a given score. If our third grader, based on the 365 score, has a domain score of 84 in language, this means that he or she has likely mastered 84% of the content in the language domain. A similar interpretation holds across the other domains tested in Star Reading, and you can see the profile of this student’s expected content mastery, domain by domain. If your criterion is, say, 70% mastery in all domains, then this student is clearly above that in the language domain, and you might look for more challenging materials to keep the student engaged.
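To make the idea of a domain score concrete, here is a minimal sketch in Python. It is an illustration only, not how Star Reading actually computes its scores. It assumes a simple Rasch (one-parameter) item response model, a made-up list of item difficulties for a language domain, and a made-up ability estimate `theta` on the logit scale (not the Star scaled score of 365). The domain score is taken to be the expected percentage of the domain's items that a student at that ability level would answer correctly.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct answer under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def domain_score(theta, item_difficulties):
    """Expected percent correct over all items in a domain (0 to 100)."""
    probs = [rasch_probability(theta, b) for b in item_difficulties]
    return 100.0 * sum(probs) / len(probs)


# Hypothetical item difficulties (in logits) for an illustrative language domain.
language_items = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5]

# Hypothetical ability estimate for our third grader, on the same logit scale.
theta = 1.0

score = domain_score(theta, language_items)

# Compare against the example criterion of 70% mastery.
meets_criterion = score >= 70.0
print(f"Domain score: {score:.0f}, meets 70% criterion: {meets_criterion}")
```

Because the score is computed from the whole pool of domain items rather than from the particular questions a student happened to see, two students who took different adaptive tests can still be compared against the same criterion.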
Another type of score that may be of interest is the student’s performance level. If the Star Reading cut-score that defines the proficient category for third graders is 375, then this student falls short by 10 score points. But, since this is the first month of the school year, you might feel confident that they’ll be proficient in reading at the end of third grade.
The Star assessments also provide criterion-referenced benchmark scores at the state level. A benchmark is the lowest level of performance considered acceptable. If Star Reading is linked to your state test, the state benchmarks show you the reading proficiency level on the state test that corresponds to the 365 Star Reading score. In this sense, the Star state benchmarks provide a glimpse into the third grader’s most likely proficiency level on the state test at the end of the year.
For norm-referenced interpretations, we look for scores that compare the performance of our third grader to the performance of other third grade students across the nation. These other third grade students form what we call a norm group. That norm group doesn’t always have to be other students in the same grade. It can also be students of the same age, same special status such as English language learners (ELL) or special education, among others.
However, it’s more common in the United States to compare students with other students in the same grade across the nation. Some of the scores that we look for here are percentile ranks (PR) and grade equivalents (GE). Suppose our third grader’s score of 365 corresponds to, say, a PR of 50 and a GE of 3.1. The PR of 50 means that this student did better than 50% of the third grade students in the national norm group who tested in the first month of the school year. The GE score of 3.1 means that this student is performing like the average student in the first month of third grade. There might also be standard scores, such as normal curve equivalents (NCE), that are reported because they are convenient for algebraic manipulations, but PRs and GEs are the most widely used norm-referenced scores.
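Here is a small Python sketch of how a percentile rank and an NCE relate to a norm group. The norm sample is invented for illustration and is far smaller than any real norm group, and the function names are my own. The percentile rank follows the "did better than" definition used above (other definitions handle tied scores differently), and the NCE uses the standard conversion of 50 plus 21.06 times the normal deviate of the percentile rank.

```python
from statistics import NormalDist


def percentile_rank(score, norm_scores):
    """Percent of the norm group that scored below the given score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


def nce_from_pr(pr):
    """Convert a percentile rank (between 1 and 99) to a normal curve equivalent."""
    z = NormalDist().inv_cdf(pr / 100.0)  # standard normal deviate for this PR
    return 50.0 + 21.06 * z


# Hypothetical scores for a tiny third grade norm group tested in month one.
norm_group = [300, 320, 340, 350, 360, 370, 380, 400, 420, 450]

pr = percentile_rank(365, norm_group)
nce = nce_from_pr(pr)
print(f"PR = {pr:.0f}, NCE = {nce:.1f}")
```

PRs and NCEs agree at 1, 50, and 99, but NCEs are spaced evenly along the normal curve, which is why they can be averaged while PRs generally should not be.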
Star also provides norm-referenced benchmark scores at the district and at the school level. These benchmarks are based on the existing Star national norms and the nationally accepted recommendations defined as follows:
- At/Above Benchmark = At/above 40th percentile
- On Watch = 25th to 39th percentile
- Intervention = 10th to 24th percentile
- Urgent Intervention = Below 10th percentile
These norm-referenced benchmarks can be modified to be situation specific, but the default percentile ranks presented above are helpful in determining who needs intervention and who is clearly above benchmark. Our third grader’s score has a PR of 50, which is above benchmark.
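The default categories above amount to a simple lookup on the percentile rank. The sketch below shows one way to write it in Python. The function name is hypothetical, and the cut points are the defaults listed above, which your district or school may have changed.

```python
def benchmark_category(percentile_rank):
    """Map a percentile rank to the default norm-referenced benchmark category."""
    if percentile_rank >= 40:
        return "At/Above Benchmark"
    if percentile_rank >= 25:
        return "On Watch"
    if percentile_rank >= 10:
        return "Intervention"
    return "Urgent Intervention"


# Our third grader has a PR of 50.
print(benchmark_category(50))
```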
How to Determine the Best Score Interpretation for your Program
Because the norm-referenced and the criterion-referenced interpretations serve different purposes, neither is better than the other. You choose based on your testing needs. If the goal is to pin down mastery of content, then the criterion-referenced approach is the fitting choice. If comparing a student’s performance to a norm group is important, then the norm-referenced approach rules the day. If you want both the norm-referenced and the criterion-referenced interpretations, you can certainly have both! Whatever the need, it’s good to keep two things in mind. Norm-referenced scores don’t really tell you what the student has learned in terms of content mastery, because they are used to compare students against some norm group of interest. Criterion-referenced scores don’t allow us to compare performance across students, because they focus on the skills each student has mastered in relation to some criterion.
I hope I’ve provided you with insight into why scores mean different things, the reasoning behind those differences, and how your needs for score use determine which interpretation is best for you. We’ve also written previous posts about other facets of test scores, which can be accessed using the links below: