Can Essay Tests Really Make the Grade? - Los Angeles Times
Can Essay Tests Really Make the Grade?

TIMES EDUCATION WRITER

Wanted, said the classified ad: 100 temps. College degree needed. Shift work. No benefits. $8/hour.

The task? Grading tests.

To be specific: Sitting at a desk for 5 1/2 hours at a stretch, deciphering paragraphs scrawled by 9-year-olds. Or watching computer screens display, one after another, eighth-graders’ reflections on a short story. The question? “Why did Eugene give Billy comics to keep?”

It’s often dreary work, like any pressured chore on an assembly line, say people who do the grading for a testing company with offices here and in Sacramento.

Yet this assembly line plays a central role in a highly touted trend in standardized testing--the use of what educators term the “open-ended response item.”

Or, in plain language, the essay question.

For decades, standardized testing in America meant multiple-choice, use of a No. 2 pencil to shade in A, B, C or D.

But today, tests given by 39 states, including California, ask students to do more than “bubble in” answers. They must put down their thoughts in longhand as well, analyzing the motivation of characters or explaining scientific concepts.

When President Clinton pushed this year for national tests for fourth-graders in reading and for eighth-graders in math, he called for the reading exams to include two essays and the math exams to include two “extended” responses.

The reasoning is simple: Life’s big tests are not multiple choice, so why should we only test that way?

“A test that’s all multiple choice would send the wrong message about what kids need to move into society and the job market and higher education,” said Illinois State University professor John A. Dossey, who helped draw up Clinton’s national testing proposal.

But the abandonment of pure multiple-choice exams poses fundamental questions about standardized testing:

Is the primary purpose to accurately compare students and schools--or to communicate what should be taught? And can the answers to essay questions be graded objectively?

It’s one thing when a computer can process millions of multiple-choice tests. But how do we guarantee that the essay of a child in Montana is judged by the same standard as one in Maine?

Consider the arithmetic with a national test: 8 million kids x 2 essays = 16 million essays.

Exactly who will grade them all?

To do it before school lets out for the summer would take 14,000 “readers.” Half would need to have math degrees . . . and be willing to work for $5.50 to $10 per hour.

Which brings us back to that classified ad, used recently to line up a crew to grade essays for testing giant CTB/McGraw-Hill. Workers were needed for tests from Wisconsin, one of 16 states whose exams are trucked to California for scoring.

The part-timers who sit at folding tables and 36 computer stations here--largely retirees and former teachers--are far removed from the lofty debates over testing that have raged from Sacramento to Washington. In those places, officials argue about whether standardized tests stigmatize minorities or whether more testing is essential to hold America’s schools accountable.

Here, the issue on a typical day is whether to give a point for the word “chips.” Students had been asked to speculate about things from the modern world that could not have been found in a garbage dump 100 years ago.

But in mentioning “chips,” did a student mean modern computer chips--which obviously could not have been in a dump in 1898--or potato chips? Or wood chips?

The scorers awarded the point. “We give them the benefit of the doubt,” said Aliton Fairchild, a white-haired retired teacher.

“The process is imperfect,” said Stephen B. Dunbar, a University of Iowa testing expert. “It’s an art.”

It’s also not quite what most people have in mind when they think of standardized tests.

American educators pioneered the use of “objective” tests. In 1845, Horace Mann, who fostered the Massachusetts common school, the forerunner to today’s public schools, wasn’t satisfied with the oral exams customary back then.

Mann ordered written tests to compare schools and teachers and “determine, beyond appeal or gainsaying, whether the pupils have been faithfully and competently taught.”

Decades later, such exams--including true-false and fill-in-the-blank questions--were seen as a cheap way to show what taxpayers were getting for their money, notes Robert Rothman in the book “Measuring Up.”

In 1929, the University of Iowa developed a “basic skills” test, which later spawned electronic machines that could quickly grade exams from an entire county--or beyond.

The 1920s also saw the Scholastic Aptitude Test created, based on intelligence tests given to World War I Army recruits. After World War II, with vets flooding colleges, it replaced the essays that had been used for admission to elite Eastern colleges. The SAT seemed a model of democracy, as Iowa farm boys faced the same questions as New England preppies.

The 1950s brought statewide exams in reading and math across the country. Today, U.S. public schools administer more than 100 million standardized tests a year.

From Mann’s time, however, not everyone has been enamored with such exams. Some teachers have always complained that they measure memorization and “test-taking ability” rather than wisdom--or the quality of lessons.

In recent years, the most vocal critics formed a national arm, FairTest, to push for abolition of “mass testing of young children” as a basis for making decisions about their lives, such as whether they will graduate from high school.

At the heart of their indignation: the fact that different groups perform differently on such tests. That girls score lower in math, for instance, and blacks and Latinos score more poorly overall.

Such differences in scores are condemned as evidence of bias--not of a lack of ability, preparation or effort.

Testing is also caught in the philosophical debate that pits “basic skills” advocates against those who contend that traditional approaches hamper creativity and “true understanding.” In testing, the reformers believe it is misguided to use only questions that assume there is “one right answer.”

So in the late 1980s, some states developed tests that required students to perform--to compose persuasive letters or write lab reports on science experiments.

By 1993, California--characteristically--had taken the approach to new levels. Its eighth-grade CLAS test (California Learning Assessment System) in math had only eight multiple-choice questions. There were an equal number of “constructed response” questions--in which students wrote out their calculations--and two “open-ended” questions, in which they drew graphs or wrote essays.

One asked students to write to their principal, explaining how long it would take to replant 3,000 trees that burned. They were to assume that one student planted two trees the first day, two students planted two the second, etc.

Students who got the math wrong, but wrote sophisticated letters, got better scores than those who got the answer right, but were less articulate.

It was an early hint that a well-intentioned idea, put in practice, might not add up.

CLAS tests died a merciful death after two years, not merely for rewarding wrong answers.

Some parents said an essay made students reveal personal family details. There also were too few questions to show reliably what students knew. And to save money, not all the essays were scored.

But California’s stumble did not slow what the U.S. Office of Technology Assessment called “nothing short of a revolution” in testing.

Vermont’s new tests had students compile a “portfolio” of writing, including a poem and an entry from a science journal. In Kentucky, students worked in groups and the group was scored, rather than individuals.

Again, there were problems. In Vermont, teachers graded their own students--and gave them higher marks than backup readers did. In Kentucky, the group work was scrapped as being impractical to grade.

In 1995, six testing experts reported to the Kentucky General Assembly that abandoning multiple-choice tests had had severe “negative consequences.”

Certainly, some standardized testing programs have found workable ways to go beyond multiple choice. England and other European countries have long required students to write “papers”--two- or three-hour examinations that include several lengthy essays on various subjects--in certain compulsory subjects as well as others that are optional.

England has dramatically scaled back a recent effort to test younger students through weeks of projects because the exams consumed too much teaching time. It is still not unusual, however, for university-bound students to expound on topics such as “Can one say: ‘To each his own truth’?”

In the United States, essays are part of Advanced Placement tests, which measure knowledge of the most rigorous high school material. Even the National Assessment of Educational Progress--the “Nation’s Report Card” of fourth-, eighth- and 12th-graders--requires some written answers.

But those answers are short, and the tests are given only to a sampling of students. What’s more, they yield no individual or even school district scores--simply rating whole states.

Clinton, in contrast, wants a national test to grade each student, with half the 90-minute exams devoted to essay-type questions. He also wants to return each student’s paper, creating the prospect of millions of mailboxes being opened and parents comparing their children’s work--and grades--with their neighbors’.

Testing experts like Ed Roeber of the Council of Chief State School Officers relish the prospect of newspapers publishing the problems that require more than “bubbled in” answers. “People will be amazed at the math we expect,” he said.

But he and others acknowledge the practical problems. Start with the cost once a computer can’t do all the scoring: estimates for the national test run as high as $30 per pupil. Then there’s the staffing needed to read up to 16 million essays in a matter of weeks.

“You’d have to round people up off the street to get warm bodies to see if they could qualify as scorers,” Roeber said.

Only a handful of professional testing companies compete for the jobs grading the standardized tests given across the nation. Their officials are upfront about the limits of scoring done by humans.

“If you give an essay to 100 teachers nationwide and even if you give them a scoring guide, they’re going to approach it differently,” said Brad Thayer, an executive of Iowa-based National Computer Systems, the largest scoring firm in the country. “Some will grade it more heavily on grammar. Others on the content.”

Testing researchers say that even with training, two readers will agree only 60% to 80% of the time on an essay graded on a 1-to-5 scale.

But if essays are what educators want, companies such as Thayer’s will grade them.

To “standardize” the process as much as possible, they prepare their scorers using written guidelines (a “rubric”) and sample (“anchor”) papers, which exemplify each possible score. Once the work starts, scorers are checked by “backreaders.” Some contracts call for two gradings of every paper, with disputes settled by a supervisor.

But the ideal is elusive.

That was the theme as 10 current or former “readers” for the company in Monterey, CTB/McGraw-Hill, shared war stories recently.

There was the tale of a co-worker, “an older guy,” who wouldn’t give fourth-graders a break on their essays. “His feeling was that if they don’t put a comma in the right place, the student shouldn’t get any points,” said a fellow reader, a moonlighting lawyer.

Another reader--a single mother putting her daughter through medical school--bristled at instructions for grading a short essay on the 1-to-3 scale. Any response written in capital letters, no matter how brilliant, could not get a 3. But the only way to get the lowest score, a 1, was to write an “unintelligible” answer. The result? Answers differing widely in quality earned a 2.

“Someone who hardly knows the alphabet and a Hemingway who doesn’t know the capitalization rules will score the same,” she said.

About 500 readers work off and on at CTB/McGraw-Hill’s Sacramento facility; most are hired through a temp agency. A bachelor’s degree is the main criterion, and it can be a challenge for the testing company to get such temps to docilely accept the rigors of assembly-line grading.

“You might find an educated individual who might . . . react fairly problematically to a production process like this,” said Brenda Williams, CTB/McGraw’s vice president for scoring. “Sometimes they feel like they are being treated like children.”

There is no getting around the tension inherent in the mission: How can you formularize grading when students are supposed to be as creative as possible?

Take the attempt to draft a way to score an essay on a Florida exam given in November.

Eighth-graders were asked to discuss the verse of a poem: “When you say FISH/What do you see?/Scribblings of light/quick as waves?”

Readers were having trouble agreeing. Their supervisors’ solution? Students who used three words--“light,” “quick” and “waves”--would earn two points, no matter what else they wrote.

It had become, in effect, a glamorized version of multiple choice.

Clearly, the hopes invested in essay questions as a means of improving education--and making tests fairer--have not yet been realized.

Nor has the goal of helping low-scoring groups. Several studies have found that minority students actually do worse when math and reading tests require essays.

“There are no magic bullets in the field of education,” said Kathy Christie, an analyst for the Education Commission of the States, a publicly funded information clearinghouse in Colorado.

So there has been some reconsideration of--and appreciation for--multiple-choice tests.

Even Vermont and Kentucky have started using some of the old-style exams.

In response to criticism that girls consistently scored lower on its PSAT tests, the College Board this fall decided to add a writing portion. But the testing service decided the most practical way to measure writing skills was through multiple-choice questions. Test takers must identify errors in sentence structure, for instance, or agreement of subjects and verbs.

“We would like to do an [essay test], but it’s very costly,” said Wayne Camara, the board’s executive director for research.

In California, where the CLAS test has not been replaced, the only state exams given now are voluntary, to qualify high school students for a special Golden State Diploma. Like other honors tests, they include writing--on a historical subject, for instance.

The state seems headed for a middle ground. In the spring, second- through 11th-graders will take a traditional standardized test advocated by Gov. Pete Wilson. In a few years, students also will take a test with more “open-ended” components, based on the state’s first standards, now being written, specifying what they should know in math, reading, social studies and science.

But that second test will not yield scores for individual students. It is designed mostly to get schools to take the standards seriously.

In this sense, California may be part of a trend again--one recognizing the limits of essay questions.

Hawaii doesn’t give individual grades on its history exams, but invites parents to school to read--and comment on--student essays.

To let the public judge the impact of its grants to rural schools, the Annenberg Foundation asks that student work be displayed in auditoriums. Community members then help judge it--much like the judging of jam at a county fair.

For the last four years, the Long Beach Unified School District has tested students in grades 2 through 10 in math and reading using essay-type exams. The goal is not to compare students but to help teachers.

Each spring, thousands gather for a one-day scoring bee.

Two teachers read each test. They agree 80% of the time, said Lynn Winters, the district’s head of research. When they don’t, a third person weighs in.

The teachers record on index cards any patterns they observe. Perhaps third-graders don’t understand complete sentences. Or they’re rusty on capitalization. Throughout the grading, the teachers talk.

“They tend to come to a consensus . . . about what does a good, a mediocre and a poor paper look like,” said John McVeigh, the district’s middle school writing coach.

Back in the classroom, that consensus helps guide their teaching and how they evaluate students’ day-to-day work.

But along with its enthusiasm for essays, Long Beach gives other tests: timed, 10-minute math quizzes along with standardized multiple-choice exams--in reading and other basic skills--also taken by students across the country.

“We do it because outside benchmarks are valued,” Winters said. “It’s OK that we think we’re doing a good job. But we’d also like to see how we do compared to the best of the best of them, and multiple-choice tests are great for taking the temperature of a system.”

(BEGIN TEXT OF INFOBOX / INFOGRAPHIC)

Grading Essay Answers

A Florida standardized test this year had eighth-graders read a short story called “A Rupee Earned: An Armenian Tale.” It told of a “lazybones” son who had not inherited his father’s affection for hard work and thrift. The boy’s mother protects him from his father’s demands that he earn a single rupee--giving her pampered son one instead. On his deathbed, however, the father is able to teach his son that honest work carries its own reward.

Students had to write answers to 12 questions based on the story. Teams of readers hired by CTB/McGraw-Hill to grade the exams were given scoring guidelines and sample answers to help standardize their grading.

This question, for example, was worth a maximum of 2 points:

What are the differences between the ways the mother and father treat their son? Use details and information from the story to support your answer.

****

Instructions for graders:

“A top score-point response describes the differences between the mother’s and father’s treatment of their son, and how that treatment affects him.”

****

Example of an essay that earned the maximum score of 2 points:

****

Further instructions to graders list possible supporting details:

*--*

Treatment -> Result

Mother:
Pampers/protects son -> Laziness
Encourages idleness -> Lack of appreciation for value of money
Encourages pursuit of leisure -> Lack of appreciation for value of hard work

Father:
Requires son to work hard -> Willingness to work
Requires son to earn his own money -> Appreciation for value of money
Requires son to prove himself -> Appreciation of value of hard work

*--*

****

2 points: Student shows “complete understanding” of passage and writes short essay that is “accurate, complete and fulfills all the requirements of the task.”

1 point: Student shows “partial understanding” of passage. Information may be “essentially correct” but is too general and simplistic.

No points: Essay is “inaccurate, confused and/or irrelevant.”
