Op-Ed: Why the new SAT isn’t as transparent as the College Board wants you to believe
Several hundred thousand high school students will take the second administration of the new SAT on May 7. Students who took the first offering on March 5 are still waiting for their scores, which will not be sent to them until later this month.
While the long wait may bother students, there are more significant issues regarding the public’s access to critical SAT data. The College Board calls the new SAT “profoundly transparent,” but it won’t release so-called item-level data — information about how students nationwide fared on particular questions — to the public. In fact, it hasn’t released such statistics since 2000. That makes it difficult for the public to scrutinize why certain demographic groups perform so much better on the SAT than others.
On average, we know that boys outscore girls by a few points on the verbal section (recently renamed reading and writing) and by more than 30 points on the math section. We also know that whites outscore blacks and Latinos on verbal (by 98 points and 80 points, respectively) and on math (by 106 points and 76 points, respectively). These gaps have been constant for decades. In recent years, Asian American students have been outscoring white students considerably on math and have reduced their verbal shortfall to a few points — the only gaps to have changed significantly in recent memory.
There are well-known external factors that contribute to these imbalances. Our culture discourages girls from excelling in math, and black and Latino children often attend weaker schools. But what if test design is also to blame?
Educational Testing Service, which writes exams for the College Board, pretests all potential questions before finalizing a given SAT. It assumes that a “good” question is one that students who score well overall tend to answer correctly, and vice versa.
That’s problematic because, as mentioned, girls score lower than boys on math, and black students score lower than white students. So if, on a particular math question, girls outscore boys or blacks outscore whites, it has almost no chance of making the final cut. This process therefore perpetuates disparities, virtually guaranteeing a test that’s ultimately easier for some populations than others.
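To make that selection logic concrete, here is a minimal sketch of the idea in code. The correlation cutoff, the function name and the data layout are my own illustrative assumptions, not ETS’s published procedure.

```python
# Illustrative sketch only, not ETS's actual method; the 0.3 cutoff is hypothetical.
# The rule described above: keep a pretest question only if answering it correctly
# tends to go along with a high overall score.
import statistics  # statistics.correlation requires Python 3.10+

def keep_question(item_correct, total_scores, min_corr=0.3):
    """item_correct: 1 or 0 per test taker; total_scores: their overall scores."""
    return statistics.correlation(item_correct, total_scores) >= min_corr
```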
ETS does have a system, called Differential Item Functioning, for ensuring that, in its words, “examinees of comparable achievement levels respond similarly” to each question. Basically, ETS separates students according to performance — those who scored in the 200s in one group, those who scored in the 400s in another. If, within those groups, boys do far better on a given question than girls, or white students surpass black students, then ETS eliminates it.
See the shortcoming? “Achievement level” is a euphemism for performance on the SAT; there’s no external metric. In order to believe DIF results in a test that’s fair to everyone, you have to believe the test is fair to everyone.
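For readers who want the mechanics spelled out, here is a rough sketch of that kind of screening; the band width, the gap threshold and the names are my hypothetical choices, and ETS’s real DIF statistics are more elaborate. Notice that the “achievement” bands are themselves buckets of SAT scores.

```python
# Illustrative sketch only, not ETS's actual DIF procedure; band width,
# threshold and names are hypothetical.
from collections import defaultdict

def flag_question(records, band_width=100, max_gap=0.10):
    """records: (total_score, group, answered_correctly) tuples, one per test
    taker, where group is, say, 'boys' or 'girls'. Returns True if one group's
    correct rate beats the other's by more than max_gap within comparable score
    bands, averaged across bands (i.e., the question would be thrown out)."""
    # band -> group -> [number correct, number answering]
    bands = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for score, group, correct in records:
        tally = bands[score // band_width][group]   # the 200s, 300s, 400s ...
        tally[0] += int(correct)
        tally[1] += 1

    gaps, weights = [], []
    for groups in bands.values():
        if len(groups) < 2:
            continue                                # need both groups in the band
        rates = [c / n for c, n in groups.values()]
        gaps.append(max(rates) - min(rates))
        weights.append(sum(n for _, n in groups.values()))

    if not gaps:
        return False
    avg_gap = sum(g * w for g, w in zip(gaps, weights)) / sum(weights)
    return avg_gap > max_gap
```

The circularity sits in the very first step: “comparable achievement” is defined by the SAT score the test itself produced.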
But evidence from the last time the College Board released item-level data gives me reason to doubt that’s the case.
Below are two SAT math questions from the same October 2000 test — the most recent test for which item-level data are publicly available. The questions were equally difficult; each was answered correctly by only 45% of test takers. Only the first, however, produced dramatically inequitable results in terms of race and gender.
1) When a coin is tossed in an experiment, the result is either a head or a tail. A head is given a point value of 1 and a tail is given a point value of -1. If the sum of the point values after 50 tosses is 14, how many of the tosses must have resulted in heads?
(A) 14 (B) 18 (C) 32 (D) 36 (E) 39
2) The sum of five consecutive whole numbers is less than 25. One of the numbers is 6. Which of the following is the greatest of the consecutive numbers?
(A) 6 (B) 7 (C) 8 (D) 9 (E) 10
More than half of boys, 55%, answered the first question correctly (by choosing C), but only 37% of girls did. Similarly, 47% of whites answered correctly, but only 24% of blacks did. That’s what I call a question with two “large skews.”
On question 2, 49% of boys and 41% of girls answered correctly (by choosing A). That’s what I call a “medium skew.” Meanwhile, 45% of whites and 35% of blacks answered it correctly. Roughly 7,000 more girls and 4,000 more black students picked the correct answer on this question than on the coin-toss question. (Remember, the questions were of equal difficulty overall.)
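(For anyone checking the arithmetic: on the coin-toss question, if h of the 50 tosses are heads, then h - (50 - h) = 14, so 2h = 64 and h = 32, choice C. On question 2, any run of five consecutive whole numbers that includes 6 and goes above 6 sums to at least 25, since 3 through 7 already totals exactly 25; the run must therefore be 2 through 6, and its greatest number is 6, choice A.)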
In all, 13 of the 60 math questions on the October 2000 test had large skews favoring boys over girls, and 22 of 60 had large skews favoring whites over blacks.
I can’t prove definitively that large-skew questions have appeared on the SAT in the years since the October 2000 test — because the data are not public. If ETS had eliminated them, however, we’d probably have seen at least a small change in girls’ and black students’ scores relative to boys’ and white students’ scores. We haven’t.
In my experience, most folks who are not psychometricians (those constructing bubble tests) consider large-skew questions unfair; that’s particularly true when they are shown other questions of similar overall difficulty that have smaller skews. Reasonable people can disagree on that point, or might object that a few large-skew questions on a long test don’t really matter. If there were no large-skew questions, disparities wouldn’t disappear entirely overnight. But why not release the data so we can have an open conversation?
And it’s not just item-level data that the College Board keeps from the public; it also rarely releases combined family income and race/ethnicity data that would allow researchers to make comparisons such as how high-income black students’ average math scores compare to those of low-income white students.
One might assume that affluent students of all races/ethnicities score higher than all low-income students. But the last time the College Board released such data, in a 2001 report, that wasn’t the case. On the math section, black students in the highest income group scored, on average, lower than white students in the lowest income group. Again, without transparency it’s difficult to identify the cause of this troubling disparity so that we can address the problem.
(The ACT, the other national college admissions test and the SAT’s competitor, has never publicly released either item-level data or combined income and race/ethnicity data; however, its overall group score gaps closely parallel those of the SAT, and it uses the same test construction methods.)
SAT scores matter: College admissions officers still weigh them heavily in making their decisions. Undoubtedly, there are many reasons behind persistent SAT achievement gaps, and the College Board insists that test design is not to blame. But what’s the harm in transparency so we don’t have to take the College Board at its word?
Jay Rosner is the executive director of The Princeton Review Foundation. He took the June 6 and Jan. 23 SATs.