At the June 27 Supplemental Educational Services (SES) and School Choice Summit, the Department of Education released the first results of a study commissioned from the RAND Corporation. The study is a meta-analysis of work done by other researchers examining the effects of both No Child Left Behind programs meant to help students in schools failing to meet adequate yearly progress requirements become proficient in math and reading, as measured by state tests.

This posting focuses on RAND's findings on SES regarding "the relationship between student achievement and participation in the Title I... supplemental educational services options." Specifically, "[h]ow does the achievement of students who use (SES) compare to their own prior levels of achievement and to the achievement of eligible students who do not choose to use (SES)?"

The researchers' conclusion, as expressed in RAND's press release and subsequently trumpeted by the Department and the SES providers' spokesperson, Education Industry Association Executive Director Steve Pines, appears straightforward.

The headline: "Study Finds Students in Underperforming Schools Benefit From Supplemental Educational Services Under No Child Left Behind."

The first line in the main body of text:
"Students in underperforming schools generally made statistically significant gains in math and reading after participating in supplemental educational services such as tutoring and remediation, according to a study conducted by the RAND Corporation for the U.S. Department of Education."

The political results were predictable.

The Secretary said:
"This report reinforces what I hear from parents from across the country—that SES is helping more students achieve.... SES is a lifeline for students who need more resources and parents who want more options. And research shows that students benefiting from SES are improving in both their reading and math skills."

Steve Pines' statement was similar: "This scientifically rigorous study indicates that the federally funded tutoring program had a statistically significant, positive effect on student achievement in reading and math in five out of seven large urban school districts examined during the past several school years.... The independent research firm, RAND Corporation, has documented what parents from all backgrounds have known for years -- tutoring works!"

In the report itself, we find a bit more detail on RAND's key finding: "Students participating in supplemental educational services in seven districts had, on average, a statistically significant math achievement gain of 0.09 of a standard deviation above the overall district mean ([with] significance at the 5 percent level).... These effect sizes are not easily translated into publicly understood metrics, such as the proportion of students achieving proficiency. Therefore, to give context to the results, readers might compare the effect sizes to the average black-white achievement gap across these districts, which is nine-tenths of a standard deviation in both reading and math..."

As your editor has explained before, the problem here is the significance of "statistically significant."  It sounds impressive, but all it says is that we can attribute a difference in achievement to tutoring. It says nothing about the size of that difference - it might be a one percent increase on a standardized test. Nor does it tell us how much we have to pay to get it, and SES providers charge the taxpayer fees in the vicinity of $1000 per child. 

Your editor is not an evaluator of educational programs, but has searched a bit and found one who describes the meaning of statistical significance in terms the average informed citizen can understand:

"It is an unfortunate circumstance that statistical methods used to test the null hypothesis are commonly called tests of statistical significance. Equally unfortunate is the tendency to make statements of the type, 'The difference between the experimental and control group was significant at the .05 level,' or 'the correlation between the two variables was significant at the .05 level.' The word 'significant' misleads professional practitioners and the lay public into thinking that the research results are important for this reason. In fact, even researchers and research journal editors might be swayed into thinking that a research result is important because it is statistically significant....

In fact, a statistically significant result only tells us that the null hypothesis can be rejected at some level of certainty - assuming that certain conditions (most importantly, random sampling from a defined population) have been satisfied. Rejecting the null hypothesis means that we accept the alternative, namely, that the difference between the experimental and control groups is not a consequence of sampling error, but rather that it is a real difference…. [T]ests of statistical significance say virtually nothing about the importance of a research result…." (Emphasis added.)
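
To see why significance alone says little about size, here is a minimal sketch in Python. It assumes a simple two-group comparison with equal group sizes and test scores already standardized to unit variance. The group sizes are hypothetical and are not taken from the RAND report; only the 0.09 standard deviation figure comes from the report quoted above. Under those assumptions, the same small difference fails the .05 test with 100 students per group and passes it easily with 5,000.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size, n_per_group):
    """p-value of a two-sample z-test on standardized scores (unit variance assumed)."""
    # Standard error of a difference in means between two groups of equal size.
    standard_error = sqrt(2.0 / n_per_group)
    z = effect_size / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical group sizes; 0.09 is the math gain quoted from the report.
for n in (100, 5000):
    p = two_sided_p(0.09, n)
    print(f"n per group = {n}: p = {p:.4f}, significant at .05: {p < 0.05}")
```

The size of the difference never changes in this sketch; only the number of students does. That is the sense in which "significant" describes confidence that a difference exists, not how big it is.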

And the meaning of effect size:

Effect size (ES) is a statistic used to determine the magnitude of a research result.… The alleged advantage of ES is that it says something about the practical significance of a research result. The term "practical significance" implies a research result that will be viewed as… important by teachers, school administrators, policy makers, and others concerned about the day-to-day workings of education and efforts to improve it.

The claim that ES is a good measure of the practical significance of a research result, or at least a better measure than tests of statistical significance, involves an assumption that matters of magnitude are important to practitioners. The assumption is doubtless warranted. Practitioners like to find that students' test scores are going up or that some intervention is working to produce measurable gains in student learning. Judgments of "going up" and "is working" require a discernible magnitude of difference between two groups or gain over time within one group.

If a discernible magnitude of difference is good and therefore important, this does not mean that all discernible magnitudes of difference are equally good and therefore equally important. Herein lies a problem in using ES as a measure of the importance of a research result. How much more important is, for example, an ES of .81 than an effect size of .33?

To determine the importance of a particular ES statistic, one must first comprehend what an ES signifies. It is an abstract statistic typically derived from mean scores and standard deviations. I doubt that effect sizes are comprehensible to the vast majority of practitioners.

Researchers evidently agree with this view, because they sometimes express an ES statistic in terms of percentile differences. For example, an ES of 1.00 means that the average individual in one group scored at the 84th percentile of the other group's score distribution. This is a slightly more comprehensible expression of magnitude for practitioners. However, I think they - and perhaps researchers as well - would be hard pressed to explain how much difference is encompassed between the 50th percentile and 84th percentile of a score distribution, especially when the characteristics of the outcome measure and the population being studied are under-specified - as they often are - in research reports.
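
The percentile translation described above can be sketched in a few lines of Python. This is an illustration only, and it assumes test scores in the comparison group follow a normal distribution, which real score distributions may not. The function name is your editor's invention. The 0.33 and 0.09 values are the Glass threshold and the RAND math gain mentioned elsewhere in this posting.

```python
from statistics import NormalDist


def percentile_of_average_student(effect_size):
    """Percentile, within the comparison group, of the other group's average student.

    Assumes normally distributed scores; effect_size is in standard deviation units.
    """
    return 100.0 * NormalDist().cdf(effect_size)


for es in (1.00, 0.33, 0.09):
    print(f"ES {es:.2f} -> percentile {percentile_of_average_student(es):.1f}")
```

Under the normality assumption, an ES of 1.00 comes out at about the 84th percentile, matching the example in the text, while an ES of 0.09 moves the average student from the 50th percentile to roughly the 54th.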

Setting aside the issue of whether ES is a comprehensible statistic, researchers are faced with the problem of judging the importance of a research result expressed as an ES. Glass proposed that an ES of magnitude .33 or greater has practical significance for education. However, there are two reasons why an ES of .33 - or any other proposed ES - is suspect as a criterion for judging the importance of a research result.

The first reason is that ES treats all measurement scales alike. For example, an ES does not signify whether two groups differed by 3 points on a 50-point scale or by 3 points on a 10-point scale; yet practitioners are likely to view the latter difference as potentially more important than the former difference.
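
A small Python sketch makes the scale point concrete. The numbers are hypothetical: it assumes both tests happen to have a standard deviation of 3 points, so that the 3-point gap yields the same ES on each. ES is computed here as the plain standardized mean difference (gap divided by standard deviation).

```python
def effect_size(mean_difference, standard_deviation):
    """Standardized mean difference: the gap between groups in standard deviation units."""
    return mean_difference / standard_deviation


GAP = 3.0  # hypothetical gap between the two groups, in test points
SD = 3.0   # hypothetical standard deviation, assumed equal on both tests

for scale_max in (50, 10):
    es = effect_size(GAP, SD)
    share_of_scale = GAP / scale_max
    print(f"{scale_max}-point scale: ES = {es:.2f}, gap = {share_of_scale:.0%} of the scale")
```

Both tests report the same ES, yet the gap covers 6 percent of one scale and 30 percent of the other, and nothing in the ES itself reveals which case a reader is looking at.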

The second reason is that ES does not express the shape or variability of the score distributions of the two groups... While shape or variability of score distributions typically are not important to researchers (judging by my reading of many published studies), they are - or should be - important to practitioners. As a public good, education must serve all children, and therefore the importance of an intervention should be judged not only by whether it raises the mean score of a group, but also by whether it raises the scores of students at all points in the distribution.
(Your editor's note: NCLB requires all students to achieve proficiency, not that a school on average achieves student proficiency.) ES does not provide the information for judging the importance of a research result by this criterion.
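
The distribution point can also be sketched in Python. Everything here is hypothetical: five invented students, an invented proficiency cut score of 55, and two invented tutoring outcomes that raise the average by the same 5 points. The ES is standardized by the baseline standard deviation, which is one common choice among several.

```python
from statistics import mean, stdev

PROFICIENCY_CUTOFF = 55  # hypothetical cut score

baseline = [30, 40, 50, 60, 70]        # hypothetical scores before tutoring
lift_everyone = [35, 45, 55, 65, 75]   # every student gains 5 points
lift_top_only = [30, 40, 50, 70, 85]   # only the two highest scorers gain


def summarize(after):
    """Return (ES relative to the baseline SD, number of proficient students)."""
    es = (mean(after) - mean(baseline)) / stdev(baseline)
    proficient = sum(score >= PROFICIENCY_CUTOFF for score in after)
    return es, proficient


for label, scores in (("lift everyone", lift_everyone), ("lift top only", lift_top_only)):
    es, proficient = summarize(scores)
    print(f"{label}: ES = {es:.2f}, proficient students = {proficient} of {len(scores)}")
```

Both outcomes produce an identical ES, but only the first moves an additional student over the proficiency line. A report that gives the ES alone cannot distinguish between them.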

What your editor finds interesting about the RAND study is the result not reported. Just what kind of test gain did students who took advantage of SES experience - on average, or "at all points in the distribution"? Reading the RAND report, your editor can't figure it out. Maybe a knowledgeable evaluator could make some inferences. It's unlikely the average educator, legislator or legislative staffer, reporter, or informed citizen would be able to do so.

Your editor does know two things.

First, if the gains from tutoring had a huge impact on student test scores, this fact would not be one any reader would have to tease out of the report. It would be in the press release and one of the bold bullet points on page one of the executive summary. Absence in the text implies absence in the analysis.

Second, this is a meta study - a study of studies already done. What we know from those studies is that, in terms of raising test scores, the practical effects of SES are tiny. Steve Pines of EIA admits as much in other fora. So what we have here from RAND is "no news" packaged by the Department and Pines as "new news." The song remains the same: the improvements we find are hard to see, but they are there and they are attributable to SES.

RAND has no control over how its results are used by the Department. Having been on both sides of the negotiation of a RAND report's release (many as project leader, fewer as sponsor), your editor feels reasonably confident that unfavorable analysis, findings, commentary and recommendations not specifically related to the study questions were eliminated, and any that were directly related to the questions but less than complimentary to the SES program were polished down as close as possible to the point of banality without clearly contradicting whatever truth the study team found but did not reveal in the report.

As a one-time sponsor of RAND research, your editor can say that you learn more from the draft submitted by RAND for discussion than the one finally released to the public - but this is true of all research where a sponsor's review is an input to the final product. In this case, what's left adds to the SES research base. To the initiated, what is not said - and the way it is not said - provides important clues into the politics of SES research. But, taking the discussion to a higher level, no one should require training in what Cold Warriors examining the closed society of the Soviet Union called "content analysis" to make sense of government-sponsored education science. We ought to get everything the researchers found and concluded, with appropriate caveats on what's supported by the analysis and what's speculation. After all, it's the taxpayers' money, and it is the taxpayers whom the Secretary and Pines are trying to influence with their statements on the report's release.


Your editor has always taken the goal of 100% student proficiency by 2014 contained in the 2002 No Child Left Behind Act as seriously as the nation took President Kennedy's 1961 promise to place a man on the moon and return him safely to earth by the end of the decade. Both were audacious in the sense that they were impossible dreams when stated, but plausible given the pace of innovation in disciplines relevant to the objective. Where the two seem to differ is in the politicization of science and technology. Despite the beliefs of hoax theorists, the events of our race to the moon were not faked - and recognizing that "bad news doesn't get better with age" was essential to our success in that endeavor. In contrast, throughout the implementation of NCLB, the Administration appears unwilling to employ science objectively. Instead it has become a political tool. And this can only hurt a nation whose claim to international prowess and public well-being has relied on the best science. And it sure doesn't do much good if we are paying $1000-plus (10% of our average per pupil expenditure) to raise some kids' test scores a few points.

Your editor does not want to terminate SES - far from it. There is forward movement. Bravo! But tutoring services for disadvantaged students are not as ready for prime time as Pines and the Secretary suggest. It's one thing for parents to spend their own money on tutoring programs that may or may not work. Taxpayer funds for a nationwide program to improve the test scores of poor students in public schools that are not getting them to proficiency should only be spent on providers that add a lot more than a few points to test scores. By this standard, SES is not ready to be a consumer market subsidized by the taxpayer. SES firms are earning revenues - and there's nothing wrong with that per se - but the price charged to the taxpayer is just too high relative to the results we are getting for students. Until there is a real value proposition for the taxpayers as a consumer good, SES should be "sent back to the lab" and treated in NCLB II as a smaller-scale R&D effort.