Understanding the reliability and validity of test scores

By: Catherine Close, PhD, Psychometrician

Toward the end of my master’s degree program, I called my parents and excitedly announced that I was entering a doctoral program in psychometrics. A few uncomfortable moments later, my father responded, “That sounds…interesting.” Encouraged, I explained that a big part of my job would be assessing test scores for reliability and validity. More silence, followed by my mother piping up about her latest adventure. I knew I had lost them at “psychometrics,” and you may ask, “So?” Well, I admit that the word psychometrics hardly makes people jump for joy, but it can actually be quite fun! It’s also becoming an area more and more educators are expected to understand as they choose the best assessments for accountability and informing instruction.

Let’s talk reliability and validity as crucial considerations in determining the quality of tests. What first comes to mind when you think of the words reliability and validity in general? You might think of your reliable car, and that valid argument you made while discussing something over dinner with friends. In fact, reliability and validity are just ordinary words to most people. With the important role that assessments continue to play in K–12 education, most educators are now more familiar with the technical underpinnings of these terms, but throw in mathematical equations with squiggly notation and that sense of calm is sucked out of the room. Equations aside, the message around reliability and validity is surprisingly clear.


Reliability and validity, demystified

Take the widely used example of the bathroom scale. If you repeatedly step on the scale, you get the same reading. You could say the weight measurements are consistent—or reliable—because the scale shows the same weight each time you step on it. With educational tests, we say that test scores are reliable when they are consistent from one test administration to the next. By definition, reliability is the consistency of test scores across multiple test forms or multiple testing occasions.


Now, suppose this same bathroom scale is off by 5 pounds. Because the scale is reliable, you still get consistent weight measurements every time you weigh yourself, but the measurements are not accurate because they are off by 5 pounds! In this case, although the recorded weights are reliable, they are not valid measures of how much you weigh. Conversely, if the scale were calibrated just right, you’d get a weight measurement that is both reliable and valid, each time. In the context of educational testing, validity refers to the extent to which a test accurately measures what it is intended to measure for the intended interpretations and uses of the test scores.
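The scale analogy can be made concrete with a small sketch. The numbers below are invented for illustration: a scale that is consistently 5 pounds heavy produces readings with essentially no spread (reliable) whose average is nowhere near the true weight (not valid).

```python
import statistics

# Hypothetical illustration: a true weight of 150 lb measured on a
# scale with a constant +5 lb calibration error.
true_weight = 150.0
bias = 5.0

# The miscalibrated scale is perfectly consistent: every reading is
# the true weight plus the same bias.
readings = [true_weight + bias for _ in range(5)]

mean_reading = statistics.mean(readings)   # accuracy: how close to 150?
spread = statistics.pstdev(readings)       # consistency: how much scatter?

print(mean_reading)  # 155.0 -> consistent, but 5 lb off the true weight
print(spread)        # 0.0   -> no scatter at all: highly reliable
```

In measurement terms, the spread of repeated readings speaks to reliability, while the distance between the average reading and the true value speaks to validity; the bias here leaves the first perfect and the second poor.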

How are reliability and validity related?

Simply stated, reliability is concerned with how precisely a test measures the intended trait; validity has to do with accuracy, or how closely you are measuring the targeted trait.

In order to be valid, a score must be reliable. However, just because a score is reliable does not mean it’s valid. The three wheels below help to drive this point home. If you think of the innermost circle as your true weight measurement, you’ll notice that the first wheel has weight recordings that vary wildly each time we step on our bathroom scale; it’s clear the weight measurements are not reliable—and thus not valid. The second wheel shows reliable but not valid weight measurements that might come from that sneaky scale that is off by 5 pounds. Lastly, only the properly calibrated scale will give us both precise and accurate weight measurements as shown in the last wheel. Because reliability is a necessary requirement for validity, we commonly confirm reliability before collecting validity evidence.


If reading this has piqued your interest, you’ll be glad to know that defining reliability and validity is only the beginning. In future blog posts I will use examples from the STAR assessments to delve into different types of reliability and validity you may have heard or read about.

Catherine Close, PhD, Psychometrician
Catherine Close, PhD, is a psychometrician at Renaissance Learning who primarily works with the STAR computerized adaptive tests team.


  1. Maria Austin says:

    This is a simple and enlightening article that offers excellent conversation about the reliability and validity of test scores. You found common ground that relaxes the audience and makes us want to learn more about the topic.

    I will feel comfortable directing instructors and administrators to the article, which will allow them to gain a depth of knowledge required to use the data effectively.

  2. Catherine Close, PhD says:

    Thank you, Maria. I am happy to hear that this resonated with you and will find its way into meaningful conversations around test scores.

  3. Mark L. Davison says:

    I will use this in our introductory course on measurement.

  4. Liz Owen, PhD says:

    Catherine, this is a wonderful, clear explanation of reliability and validity. The first really crystal clear explanation I’ve read, in fact, that boils reliability and validity down to their simple essence without distortion or oversimplification. As a learning scientist and educational data miner, I know this is a very rare phenomenon among academics…and I believe it is a sign of true mastery. In other words, thank you! I look forward to reading more.

  5. R Collins says:

    Can you tell me what the reliability and validity scores are for your tests at Renaissance Learning (elementary level) please?

  6. Catherine Close, PhD says:

    We have summarized the reliability and validity of the STAR assessments in The Research Foundation for STAR Assessments, which can be accessed online at http://doc.renlearn.com/KMNet/R001480701GCFBB9.pdf. See the “Psychometric Properties” chapter on pp. 19-27 for details. Thanks for your question!

  7. Craig Stoffberg says:

    Hi Catherine, thank you for your post. I have recently re-tested two Year 5 students on STAR Reading. The first student was re-tested a week after the initial test, while the second student was re-tested six weeks after the initial test. In both cases, the scaled scores increased, raising each student’s reading age by two years. How do I explain these scores, given the doubt these differences cast on validity?

  8. Renaissance Learning says:

    Hi Craig, Thanks for your question. Tests are imperfect. As a result, every educational test score contains some degree of measurement error. STAR test scores can fluctuate from one administration to the next for a variety of reasons, which are explained in a document you can access online at http://doc.renlearn.com/KMNet/R001355624GC3A4D.pdf. I’m happy to help if you have any further questions. Catherine

  9. Linda Schoen says:

    I am questioning the accuracy of the STAR test that provides the grade level the student scored. I’ve seen students score 2–3 years above their current grade level, yet when given reading assessments such as a DRA they are at or below grade level. How accurate are the grade levels assigned to the scores? The test almost seems dumbed down to have levels so high. In return, the school takes away services because they say, per the STAR test, the students are at grade level now. Help

    • Catherine Close, PhD, Psychometrician says:

      Thank you for your question. Based on your comment, it seems that you are referencing the grade equivalent scores, also known as GE scores. A GE score tells you how a student’s scale score compares to the average performance of other students in the nation. As an example, if a student is in sixth grade and receives a GE score of 7.2, this means that the student’s score is as high as the average score of a seventh grader in the second month of the school year. While that may be indicative of superior performance by the sixth grader, a GE score does not tell you what the student knows or can do; GE scores simply compare students to other students in a norm group, and they shouldn’t be used to show what the student has learned to do.

      There are other types of scores that would be relevant for showing the student’s instructional reading level. An example is the Star Reading Instructional Reading Level (IRL) score that gives an estimate of the grade level of written material that the student can most effectively be taught. A Star Reading IRL of 5.0 indicates that the student will most likely learn using materials written at the fifth-grade level.

      The following link provides a summary of scores reported in Star Reading and how to interpret them: http://doc.renlearn.com/KMNet/R001316312GB442F.pdf.

      I have also written a blog on different types of scores and how to interpret them that you might find useful: http://www.renaissance.com/2016/05/12/giving-meaning-to-test-scores/.

      If you still have questions, don’t hesitate to ask us!
