The basics of test score reliability for educators

By: Catherine Close, Ph.D., Psychometrician
 
Reliability is a broad topic, broad enough to be thesis-worthy. So, when I set out to summarize “all things reliability”—in two pages, no less—I didn’t know where to start. Naturally, I procrastinated. When my problem didn’t magically go away on its own, I realized I only needed to reflect on the interactions I’ve had with educators, and here we are!

In my previous blog post on reliability and validity, we defined reliability as the consistency of test scores across multiple test forms or multiple testing occasions. Because there are various types of reliability, it follows that educators want to know which type is relevant to their testing scenario as well as what degree of reliability is considered acceptable. Simple enough, right?

Before we proceed, I’ll introduce a statistic that we psychometricians use to quantify reliability. Don’t worry, you’ll like this one! Once you get it, you’ll have the key to understanding the numbers people cite when describing the degree of reliability of any test’s scores. I’ll simply refer to this statistic as a correlation coefficient. Let’s spell out the two ideas contained in this term, with a quick example to follow:

  • Correlation – When it comes to reliability, think of the correlation as the way we determine whether a test ranks students in a similar manner based on their scores across two test forms or two separate measurement occasions.
  • Coefficient – The coefficient comes in when we assign a number to the correlation. The reliability coefficient is the number we use to quantify just how reliable test scores are.
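
To make the coefficient concrete, here is a small sketch in Python (with made-up scores, not data from any real assessment) of how a correlation coefficient between two sets of scores might be computed, whether those scores come from two test forms or from two testing occasions:

    import numpy as np

    # Made-up scores for the same five students on two testing occasions
    # (or, equivalently, on two parallel test forms)
    scores_first = np.array([480, 512, 435, 560, 498])
    scores_second = np.array([474, 520, 441, 552, 505])

    # Pearson correlation coefficient: values near 1 mean the test
    # ranks students in nearly the same order both times
    r = np.corrcoef(scores_first, scores_second)[0, 1]
    print(round(r, 2))

With these invented numbers, the coefficient comes out near 1 (about .99), meaning the two sets of scores rank the students almost identically.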

What is an acceptable level of reliability?

Reliability is a matter of degree, with values (correlation coefficients) ranging from 0 to 1. Recall from my previous blog post that reliable weight readings might look like the dots in this wheel.

Reliability Circle

The dots are slightly scattered because no measurement is without some degree of uncertainty, not even your bathroom scale! As a result, perfect reliability (a correlation coefficient of 1) doesn’t exist. That said, higher reliability values are preferred: the psychometric literature generally cites .70 or greater as acceptable, and you may want values greater than, say, .85 for high-stakes decisions such as grade promotion or placement for special education services.

Understanding correlation coefficients and their acceptable levels is a good start toward understanding reliability. Next, you should know which type of reliability is most relevant to the kind of test in question and to your testing situation.

Types of reliability

Suppose you want to test algebra skills, and you have two algebra test forms, each made up of 34 multiple-choice items. The two forms are designed to be as similar as possible, and it shouldn’t matter which form you use. One way psychometricians determine the consistency of scores across the two forms is to test the same students with both forms and compute the correlation coefficient for the two sets of scores. This correlation is called parallel-forms reliability.

Reliability Types

Now suppose you have only one form of the algebra test. To determine whether scores from this single test form are consistent, we administer the test twice to the same group of students. The correlation coefficient relating these scores is called test-retest reliability. To put this in the context of the Star assessments, both Renaissance Star Math® and Renaissance Star Reading® have aggregate test-retest reliability greater than .90.

In our busy classrooms, teachers may have time to administer a single algebra test form once, not twice! It may seem paradoxical to judge score consistency from a single test administration, but we can do this by using one of two approaches: split-half reliability or internal consistency reliability.

The split-half reliability approach splits student responses to the 34 algebra items into two halves, scores each half, and computes a correlation coefficient between the two sets of half-test scores.
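
For readers who like to see the arithmetic, here is a minimal Python sketch of the split-half idea using simulated (made-up) item responses; the odd/even split and the Spearman-Brown step-up shown here are common conventions, not necessarily the exact procedure any particular publisher follows:

    import numpy as np

    # Simulate made-up right/wrong responses to 34 algebra items for 200 students:
    # higher-ability students are more likely to answer each item correctly
    rng = np.random.default_rng(0)
    ability = rng.normal(size=(200, 1))
    difficulty = rng.normal(size=(1, 34))
    p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
    responses = (rng.random((200, 34)) < p_correct).astype(int)

    # Split the items into two halves (odd vs. even positions) and score each half
    odd_half = responses[:, 0::2].sum(axis=1)
    even_half = responses[:, 1::2].sum(axis=1)

    # Correlate the half-test scores, then apply the Spearman-Brown step-up,
    # because each half is only 17 items long and shorter tests are less reliable
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    split_half_reliability = 2 * r_half / (1 + r_half)
    print(round(split_half_reliability, 2))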

Internal-consistency reliability, the other approach, treats each item as if it were its own tiny test. So our 34 algebra items would be viewed as 34 separate administrations! We want all 34 items to correlate highly with one another, so that high-ability students tend to score high on each item and low-ability students tend to score low on each item. The Star assessments have a high degree of internal consistency: overall, it’s .85 for Renaissance Star Early Literacy® and .97 for both Star Math and Star Reading.
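
For fixed-form tests, the internal-consistency statistic you will most often see reported is Cronbach’s alpha. As a brief sketch (not tied to any particular assessment), it can be computed from a students-by-items matrix of item scores, such as the simulated responses array in the previous snippet:

    import numpy as np

    def cronbach_alpha(item_scores):
        # item_scores: rows are students, columns are items (e.g., 200 x 34 of 0/1 values)
        k = item_scores.shape[1]
        sum_of_item_variances = item_scores.var(axis=0, ddof=1).sum()
        total_score_variance = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - sum_of_item_variances / total_score_variance)

Higher values mean the items hang together well. The .85 and .97 figures above are internal-consistency estimates of this general kind, although, as noted later in this post, adaptive tests call for a different computation than the classic Cronbach’s alpha.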

Finally, not all tests are made up of multiple-choice items. Some tests include essay prompts or ask students to perform tasks that demonstrate certain abilities. A human judge is needed to score an essay or rate the quality of a performance, and two judges will usually score the same task. In these cases, we are concerned with the degree to which the judges agree in the scores they assign; this agreement is what inter-rater reliability captures.
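
As a simple, hypothetical illustration of inter-rater reliability (the rubric scores below are made up, and real programs often use more refined agreement statistics, such as Cohen’s kappa):

    import numpy as np

    # Made-up essay scores on a 0-4 rubric assigned by two judges to ten students
    judge_a = np.array([3, 2, 4, 1, 3, 2, 4, 0, 3, 2])
    judge_b = np.array([3, 2, 3, 1, 3, 2, 4, 1, 3, 2])

    # Exact agreement: the proportion of essays where the judges gave the same score
    exact_agreement = (judge_a == judge_b).mean()

    # The correlation between the judges' scores is another common summary
    r = np.corrcoef(judge_a, judge_b)[0, 1]
    print(round(exact_agreement, 2), round(r, 2))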

From our discussion, you can probably see that the reliability measure of interest depends on your specific testing plan. You don’t need all of the reliability types discussed above to judge the consistency of test scores. For example, if your program administers multiple-choice tests in a single administration, then internal-consistency reliability should be enough. On the other hand, if you use assessments that require human scorers, you should look at inter-rater reliability.

Reliability and computerized adaptive tests

One more note: computerized adaptive tests (CATs) such as Star are different from fixed-form tests such as paper-and-pencil tests. Students see different items with each CAT administration, whereas the items in a fixed form stay the same. This creates challenges when applying reliability labels that were originally designed for traditional, non-adaptive fixed forms. For example, test-retest reliability for an adaptive assessment means giving the “same” test twice, even though the retest contains mostly different items. You might want to call test-retest for CATs “alternate-forms reliability,” but that term also happens to be a synonym for parallel-forms reliability! In addition, psychometricians have to compute a different type of internal-consistency measure for adaptive tests, because the usual Cronbach’s alpha approach requires the same items in each test. I hope curious readers will appreciate knowing about the labeling dilemma surrounding CATs and reliability, as well as the need to compute reliability values specific to CATs.
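
To give a sense of what a reliability value for a CAT can be based on, here is one common IRT-style approach, shown purely as an illustration and not necessarily the method used for the Star assessments: a “marginal reliability” that compares the average measurement error across students with the spread of their score estimates.

    import numpy as np

    # Made-up ability estimates and their standard errors from an adaptive test
    ability_estimates = np.array([-0.8, 0.3, 1.1, -0.2, 0.7, 1.9, -1.4, 0.0])
    standard_errors = np.array([0.31, 0.28, 0.30, 0.27, 0.29, 0.33, 0.32, 0.28])

    # Marginal reliability: 1 minus the average error variance
    # divided by the variance of the score estimates
    error_variance = np.mean(standard_errors ** 2)
    score_variance = np.var(ability_estimates, ddof=1)
    marginal_reliability = 1 - error_variance / score_variance
    print(round(marginal_reliability, 2))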

Do educators play a role in ensuring the reliability of test scores?

Yes, they do! Test publishers report reliability values obtained under standardized administration conditions that are kept as free of distractions as possible. Departing from these conditions can, and does, introduce unwanted uncertainty that makes scores less consistent. By ensuring fidelity of assessment administration, educators can be confident that they are doing their best to maintain test-score consistency.

Now that you know what to look for in test-score consistency, you have a solid foundation in reliability. Next up, we will learn how validity evidence provides further assurance of test score quality. I hope you look forward to that final post in my series on the basics of reliability and validity.

Curious to learn more? Click the button below to explore Renaissance Star 360®.

Catherine Close, PhD, Psychometrician
Catherine Close, Ph.D., is a psychometrician at Renaissance who primarily works with the Star computerized adaptive tests team.

10 Comments

  1. Randy Hoover says:

    Well done, but it needs to be explicitly noted that when a test is not valid, reliability is moot.

    • Catherine Close says:

      Thank you, Randy. You make a great point in that validity is what we are aiming for, but to get there we need to do some things, and ensuring test scores are consistent is one of those things. Hence, reliability is a necessary (though not sufficient) requirement for validity. If you are interested in reading more, I briefly addressed this in my previous blog post, “Understanding the Reliability and Validity of Test Scores.”

  2. Andrea Quinn says:

    Catherine, thanks for your explanations – where does the standard error of measurement fall into this story? Isn’t that important as well?

  3. Tristan says:

    Where does the standard error of measurement fit into this conversation?

    • Catherine Close says:

      That is an excellent question, and I could write a full blog post on just that! Briefly, the standard error of measurement is what we use to quantify measurement error. Recall that no measurement is without some degree of uncertainty, and that uncertainty is due to measurement error. It follows that as measurement error increases, reliability decreases. We want the standard error of measurement to be as small as possible to maximize reliability.
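
      To give a rough sense of the relationship, using made-up numbers from classical test theory rather than any specific Star scale: SEM = SD × √(1 − reliability). So if scores on some scale have a standard deviation of 100 points and a reliability of .91, the SEM is about 100 × √0.09 = 30 points; pushing reliability up to .97 would shrink the SEM to roughly 17 points.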

  4. G Gordon says:

    If the STAR Reading tests are so reliable, why do I have students whose scores look like a heartbeat when graphed? When a student scores a 12.9 GE, does the next STAR test start that student at that level and then decrease the difficulty of the questions if the student misses some? I get very concerned when I get students whose GE scores jump around: 12.9 to 6.4 to 10.6 to 8 and so on.

    • Catherine Close says:

      I completely agree with you about GE scores. As a matter of fact, fluctuations in GE scores are not unique to the STAR assessments! The GE score is simply used to show what the average student would score at a particular testing time in a specific grade. It doesn’t really tell you what the student knows, and it’s not the best score to look at for graphing trends. The scaled score is better for that purpose, and in STAR, it’s the scaled score that drives the trend line in the Student Progress Monitoring Report.

      Sometimes fidelity issues with test administration are to blame for wild fluctuations in any student scores, but assuming you’re sure that’s not the issue in your case, there are other potential reasons, such as measurement error, fluctuations in student performance, and regression to the mean. For more about these reasons, you might find this guidance document helpful.

  5. Bryan says:

    I’m curious to hear more about test-retest or alternative forms reliability given the adaptive nature of the test. When you retest, are they simply starting from scratch or does the retest pick up from where the first test ended?

    Here’s a real-world example – suppose I have two 5th graders who are on exactly the same skill level. Student A is taking the test for the first time, student B is taking the test for the 5th time. Assuming they are exactly equal in skill level, how close will their results be? Does a student who has a historical test record of multiple data points have more precise data compared to a similar student taking the test for the first time?

    • Catherine Close, PhD, Psychometrician Catherine Close says:

      The short answer is no: students who have tested previously with STAR will have slightly more efficient test sessions, but the reliability of those tests should be similar to the reliability for students taking their first STAR test.

      To explain according to your example, student A is testing for the first time, so STAR does not “know” the student’s achievement level. In this case, the software automatically assigns a starting point that is typical/average for fifth graders. As the student progresses through the test, the software scores each item response and—via the efficient CAT engine—homes in on his achievement level with great precision by the end of the test. In student B’s case, a previous STAR test gives the software a starting achievement level, and this is where the new test starts off. So instead of assuming student B is average, we start her off at her estimated achievement level. The difference here is not one of precision (both students will have precise scores) but rather of efficiency in terms of the number of items needed to fully capture achievement levels. Student A might need a few more items for the software to converge on his achievement level, whereas student B’s achievement estimate is captured sooner based on her prior testing history.

      Both STAR Math and STAR Reading contain 34 items per test to ensure that we can reliably measure all students, even the student testing for the first time. At the end of the test, all things being equal, the two students should score similarly if they have matched achievement levels.
