Explore 170 years of American assessments
Today is test day.
In a quiet classroom, children stare at the preprinted sheets in front of them. Some students squirm; they’re nervous. The assessment is timed and they have only 60 minutes to answer all of the questions before them. It’s a lot of pressure for kids who are, for the most part, only 12 or 13 years old. Over the next several weeks, children all over the city will experience similar levels of anxiety as they take identical tests.
Their teacher is nervous too. This test will be used to judge not only students’ abilities but also the quality of their schooling. The results will be public. Parents will discuss them—and so will administrators, legislators, and other school authorities. In short, there’s a lot riding on this test.
When the reports finally roll in that summer, the scores are dismal. On average, students answer only 30% of the questions correctly. Citizens are in shock. Newspapers are packed with articles and letters to the editor, some attacking and some praising the results. People passionately debate the value of the assessment.
Pop quiz: What year is it?
You might think it was 2015. That year, thousands of eighth-grade students across the country took the National Assessment of Educational Progress (NAEP) reading assessment, a paper-and-pencil assessment that took up to 60 minutes for students to complete.
In 2015, only 34% of eighth graders scored at or above proficient on the NAEP reading assessment, triggering an onslaught of media coverage debating the quality of American education, as well as the quality of the tests themselves.
In reality, it’s 1845, and these children are taking America’s first mandated written assessment. It’s the first time external authorities have required that students take standardized written exams in order to measure their ability and achievement levels, but it won’t be the last. Over the next 170 years, standardized testing will become a widespread and, well, standard part of American education.
Let’s briefly explore how assessment has evolved—and, in many ways, stayed the same—between 1845 and today. As a comparison, we’ll note major innovations in transportation along the way.
In 1845, the first reported mandated written assessment in the United States takes place in Boston, Massachusetts. While only 530 students take this first assessment, thousands follow in their footsteps as standardized written assessments spread across the country in the decades following.
The same year, Robert William Thomson patents the first vulcanized rubber pneumatic tire—the type of tire now used on cars, bicycles, motorcycles, buses, trucks, heavy equipment, and aircraft. At this point, however, only the bicycle has been invented—and it still uses wooden wheels banded with iron.
While educators have always adapted to meet the needs of their students (consider the Socratic method of tailoring questions according to a student’s specific assertions, which has been around for more than two thousand years), the first formal adaptive test does not appear until 1905. Called the Binet-Simon intelligence test—and commonly known today as an intelligence quotient, or IQ, test—it features a variable starting level. The examiner then selects item sets based on the examinee’s performance on the previous set, providing a fully adaptive assessment.
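The adaptive idea described above (start at a variable level, then choose the next item set based on how the examinee did on the previous one) can be illustrated with a minimal sketch. This is not the Binet-Simon procedure itself: the function names, the 1 to 10 level range, the 50% pass threshold, and the one-level step rule are all assumptions made for illustration.

```python
def next_level(level, correct, total, lowest=1, highest=10, pass_ratio=0.5):
    """Pick the next item-set level from performance on the previous set.

    All parameters are hypothetical: a passed set moves one level up,
    a failed set moves one level down, clamped to [lowest, highest].
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if correct / total >= pass_ratio:
        return min(level + 1, highest)
    return max(level - 1, lowest)


def run_adaptive(start_level, results):
    """Return the sequence of levels visited.

    results: list of (correct, total) pairs, one per administered item set.
    """
    levels = [start_level]
    for correct, total in results:
        levels.append(next_level(levels[-1], correct, total))
    return levels


# Variable starting level of 5, then three item sets of five items each.
path = run_adaptive(start_level=5, results=[(4, 5), (3, 5), (1, 5)])
print(path)  # [5, 6, 7, 6]
```

In this sketch, two passed sets move the examinee up to level 7, and a failed set moves them back down to level 6.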
Just two years earlier, in 1903, Orville and Wilbur Wright made history with the first flight of their airplane in Kitty Hawk, North Carolina. By 1905, the brothers are already soaring around in the Wright Flyer III, sometimes called the first “practical” fixed-wing aircraft.
Exactly seven decades after the first mandated written assessment in Massachusetts, the first multiple-choice tests are administered in Kansas in 1915. These three tests—one for grades 3–5, one for grades 6–8, and one for high school—are collectively known as the “Kansas Silent Reading Tests.” Devised the year prior by Dr. Frederick J. Kelly, each test consists of 16 short paragraphs and corresponding questions. Students have five minutes to read and answer as many questions as possible.
Standardization and speed seem to be hot topics in this decade. Only a few years earlier, in 1913, Henry Ford installed the first moving assembly line for the mass production of cars.
One of the most famous standardized academic assessments in the world is born: the Scholastic Aptitude Test (SAT). The first administration is in 1926. Students have a little more than an hour and a half—97 minutes to be exact—to answer 315 questions about nine subjects, including artificial language, analogies, antonyms, and number series. Interestingly, the SAT comes after the similarly named Stanford Achievement Test, which was first published in 1922 (to differentiate the two, the Stanford tests are known by their edition numbers, the most recent version being the “Stanford 10” or “SAT-10”).
In 1927, just one year after the first SAT, the Sunbeam 1000 HP Mystery becomes the first car in the world to travel over 200 mph. The same year, production of the iconic Ford Model T comes to an end after more than 15 million cars have rolled off the assembly line.
Although multiple-choice tests were invented two decades earlier, it’s not until 1936 that they can be scored automatically. This year, the IBM 805 Test Scoring Machine sees its first large-scale use for the New York Regents exam. Experienced users can score around 800 answer cards per hour—the speed limited not by the machine itself but by the operator’s ability to insert cards into the machine and record the scores.
Meanwhile, in transportation, the world is introduced to the first practical jet aircraft. In 1939, the Heinkel He 178 becomes the first turbojet-powered aircraft to take flight.
The SAT’s main rival is born in 1959, when the first American College Testing (ACT) exam is administered. Each of its four sections (English, mathematics, social studies, and natural sciences) takes 45 minutes to complete, for a total test time of three hours.
That same year, in the skies above, the turbojet powers a new airspeed record as the Convair F-106 Delta Dart becomes the first aircraft to travel faster than 1,500 mph.
Although planning began in 1964, the first National Assessment of Educational Progress (NAEP) takes place in 1969. Instead of today’s more well-known reading and math assessments, the first NAEP focuses on citizenship, science, and writing. It combines paper-and-pencil tests with interviews, cooperative activities, and observations of student behavior. There are no scores; NAEP only reports the percentage of students who could answer a question or complete an activity.
Also in 1969, Neil Armstrong and Edwin “Buzz” Aldrin become the first humans to set foot on the moon. A few months later, Charles Conrad and Alan Bean become the third and fourth individuals to take a stroll on the lunar surface. Back on Earth, the first Boeing 747 takes flight.
It’s hard to pinpoint the very first computerized adaptive test (CAT): A few claim David J. Weiss develops the first one in either 1970 or 1971; others give this honor to Abraham G. Bayroff of the US Army Behavioral Research Laboratory, who experimented with “programmed testing machines” and “branching tests” in the 1960s; and some point earlier still to the work of the Educational Testing Service (ETS) in the 1950s. Regardless, computerized adaptive testing gains great momentum in the 1970s. In 1975, the first Conference on Computerized Adaptive Testing takes place in Washington, DC. By the end of the decade, the US Department of Defense has started investigating the possibility of large-scale computerized adaptive testing.
In the middle of the decade, in July 1976, the Lockheed SR-71 Blackbird shoots across the sky at a whopping 2,193 mph—setting an airspeed record that has yet to be broken.
Computerized adaptive tests start moving out of the laboratory and into the real world. One of the first operational computerized adaptive testing programs in education is the College Board’s ACCUPLACER college placement tests. In 1985, the four tests—reading comprehension, sentence skills, arithmetic, and elementary algebra—are used in a low-stakes environment to help better place students into college English and mathematics courses.
For some students, the ACCUPLACER might be their first experience with a computer—but not for all of them. In 1981, IBM introduced its first personal computer, the IBM 5150. A few years later, in 1984, Apple debuted the first Macintosh. This decade also sees the Rutan Voyager fly around the globe without stopping or refueling, making it the first airplane to do so. The 1986 trip takes the two pilots nine days and three minutes to complete.
After nearly 20 years of research and development, the computerized adaptive version of the Armed Services Vocational Aptitude Battery—more commonly known as the CAT-ASVAB—earns the distinction of being the first large-scale computerized adaptive test to be administered in a high-stakes setting.* First implemented in 1990 at a select number of test sites, the CAT-ASVAB goes nationwide in 1996 in part thanks to its reduced testing time and lower costs in comparison to the paper-and-pencil version (called the P&P-ASVAB). Today, the CAT-ASVAB takes about half the time (1.5 hours) of the P&P-ASVAB (3 hours).
(1996 also sees the advent of the first Renaissance Star Reading® assessment, a computerized adaptive test that quickly measures students’ reading levels.)
Another brainchild of the 1970s also comes to fruition in this decade: The Global Positioning System (GPS) is declared fully operational in 1995, with 27 satellites orbiting the globe.
The new millennium ushers in a new era of American testing. The No Child Left Behind Act of 2001 (NCLB) mandates state testing in reading and math annually in grades 3–8 and once in high school. While the Improving America’s Schools Act of 1994 (IASA) had previously required all states to develop educational standards and assess students, not all states were able to comply—and those that did not faced few consequences. This time, things are different and states must comply with NCLB or risk losing their federal funding.
The new millennium also sees GPS come to consumer electronics when the United States decides to stop degrading GPS signals used by the public. For the first time, turn-by-turn navigation is possible for civilians. This decade also sees the introduction of Facebook (2004), YouTube (2005), the iPhone (2007), and the Tesla Roadster (2008).
Have we reached testing overload? A 2014 report titled Testing Overload in America’s Schools finds that students in grades 3–5 spend an average of 15 hours taking district and state exams each year. Students in grades 6–8 spend even more time, averaging 16 hours each year on standardized assessments. On average, students in grades 3–8 take 10 standardized assessments each year, although some are found to take as many as 20 standardized tests in a single year. Their younger and older counterparts generally take 6 standardized tests per year, totaling four hours per year in grades K–2 and nine hours per year in grades 9–12.
This means a typical student may take 102 standardized tests before graduating high school, and some will take many more than that!
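For readers who want to check the arithmetic, here is a short sketch that reproduces the 102-test figure from the per-year averages quoted above. It assumes kindergarten counts as one school year (so K–2 spans three years) and that each band's average applies to every grade in that band; the total-hours figure is derived the same way and does not appear in the report summary above.

```python
# (grade band, years in band, tests per year, testing hours per year)
# Per-year figures are the averages quoted above; the year counts assume
# K-2 = 3 years, 3-5 = 3 years, 6-8 = 3 years, 9-12 = 4 years.
bands = [
    ("K-2", 3, 6, 4),
    ("3-5", 3, 10, 15),
    ("6-8", 3, 10, 16),
    ("9-12", 4, 6, 9),
]

total_tests = sum(years * tests for _, years, tests, _ in bands)
total_hours = sum(years * hours for _, years, _, hours in bands)

print(total_tests)  # 102
print(total_hours)  # 141
```

Under these assumptions, the same averages imply roughly 141 hours of district and state testing between kindergarten and graduation.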
But things are changing. The passage of the Every Student Succeeds Act (ESSA) in 2015—which replaces NCLB—doesn’t eliminate mandated assessments, but it does offer states new levels of flexibility and control over their assessments. Around the same time, states across the nation reconsider the benefits and drawbacks of mandated assessments. Several eliminate high school graduation exams. Some limit the amount of time districts can devote to testing. Others discontinue achievement tests for specific grades or subjects. A few allow parents and guardians to opt their children out of some or even all standardized exams.
Meanwhile, self-driving cars navigate city streets, flying drones deliver groceries to customers’ doors, the Curiosity rover takes selfies on Mars, and you can order almost anything, from almost anywhere in the world, right from your phone.
Over 170 years ago, it took more than three weeks to get from New York to Los Angeles by train and one hour to finish the country’s first mandated written exam. Today the trip requires less than six hours in an airplane, but many assessments still take an hour or longer—and students take many more tests than they used to.
But do they need to be so long? With all of the technological innovations over the years and the great leaps in learning science, is it possible to create shorter tests that still provide educators with meaningful data?
It can be done—and it has been done. In our next post, we explore the technology and test designs that make it possible to get reliable, valid data in 20 minutes or less.
*Some claim this honor should go to the Novell corporation’s certified network engineer (CNE) examination, the Educational Testing Service’s (ETS) Graduate Record Examination (GRE), or the National Council of State Boards of Nursing’s (NCSBN) NCLEX nursing licensure examination, all of which debuted computerized adaptive tests in the early 1990s.
Assessment Systems. (2017, January 11). A History of Adaptive Testing from the Father of CAT, Prof. David J. Weiss [Video file]. Retrieved from: https://www.youtube.com/watch?v=qb-grX8oqJQ
Bayroff, A. G. (1964). Feasibility of a programmed testing machine (BESRL Research Study 64-3). Washington, DC: US Army Behavioral Research Laboratory.
Bayroff, A. G. & Seeley, L. C. (1967). An exploratory study of branching tests (Technical Research Note 188). Washington, DC: US Army Behavioral Research Laboratory.
Beeson, M. F. (1920). Educational tests and measurements. Colorado State Teachers College Bulletin, 20(3), 40-53.
Fletcher, D. (2009, December 11). Brief history: Standardized testing. Time. Retrieved from: http://content.time.com/time/nation/article/0,8599,1947019,00.html
Gamson, D. A., McDermott, K. A., & Reed, D. S. (2015). The Elementary and Secondary Education Act at fifty: Aspirations, effects, and limitations. RSF: The Russell Sage Foundation Journal of the Social Sciences, 1(3), 1-29. Retrieved from: https://muse.jhu.edu/article/605398
IACAT. (n.d.). First adaptive test. Retrieved from: http://iacat.org/node/442
IBM. (n.d.). Automated test scoring. Retrieved from: http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/testscore/
Jacobsen, R. & Rothstein, R. (2014, February 26). What NAEP once was, and what NAEP could once again be. Economic Policy Institute. Retrieved from: https://www.epi.org/publication/naep-naep/
Lazarin, M. (2014). Testing overload in America’s schools. Washington, DC: Center for American Progress.
Luecht, R. M. & Sireci, S. G. (2011). A review of models for computer-based testing. College Board. Retrieved from: https://files.eric.ed.gov/fulltext/ED562580.pdf
McCarthy, E. (2014, March 5). Take the very first SAT from 1926. Mental Floss. Retrieved from: http://mentalfloss.com/article/50276/take-very-first-sat
McGuinn, P. (2015). Schooling the state: ESEA and the evolution of the US Department of Education. RSF: The Russell Sage Foundation Journal of the Social Sciences, 1(3). Retrieved from: https://www.rsfjournal.org/doi/full/10.7758/RSF.2015.1.3.04
National Center for Education Statistics (NCES). (2012). NAEP: Measuring student progress since 1964. Retrieved from: https://nces.ed.gov/nationsreportcard/about/naephistory.aspx
National Center for Education Statistics (NCES). (2017). Timeline for National Assessment of Educational Progress (NAEP) Assessments from 1969 to 2024. Retrieved from: https://nces.ed.gov/nationsreportcard/about/assessmentsched.aspx
Pommerich, M., Segall, D. O., & Moreno, K. E. (2009). The nine lives of CAT-ASVAB: Innovations and revelations. In D. J. Weiss (Ed.), Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing. Retrieved from: https://pdfs.semanticscholar.org/dab4/36470022fa4819d8d6256727ff869aaf58cb.pdf
Reese, W. J. (2013). Testing wars in the public schools: A forgotten history. Cambridge, MA: Harvard University Press.
Seeley, L. C., Morton, M. A., & Anderson, A. A. (1962). Exploratory study of a sequential item test (BESRL Technical Research Note 129). Washington, DC: US Army Behavioral Research Laboratory.
Strauss, V. (2017, December 6). Efforts to reduce standardized testing succeeded in many school districts in 2017. Here’s why and how. The Washington Post. Retrieved from: https://www.washingtonpost.com/news/answer-sheet/wp/2017/12/06/efforts-to-reduce-standardized-testing-succeeded-in-many-school-districts-in-2017-heres-why-and-how/?utm_term=.bf6c68cbe156
US Congress Office of Technology Assessment. (1992). Testing in American schools: Asking the right questions (Publication No. OTA-SET-519). Washington, DC: US Government Printing Office.
Van der Linden, W. J., & Glas, C. A. W. (2000). Computerized adaptive testing: Theory and practice. Dordrecht, Netherlands: Kluwer Academic Publishers.
Winship, A. E. (Ed.). (1917). Educational news: Kansas. New England and National Journal of Education, 85(21), 582-586.