Whether and how value-added models for teacher evaluation — which use student standardized test scores to assess teachers — only becomes more controversial in education as time goes on. Here is a new approach on the subject from Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, “Value-Added Measures in Education,” provides an accessible review of the technical and practical issues surrounding these models. This appeared on the blog of the non-profit Albert Shanker Institute.
Now that the election is over, the Obama administration and policymakers nationally can return to governing. Of all the education-related decisions that have to be made, the future of teacher evaluation has to be front and center.
In particular, how should “value-added” measures be used in teacher evaluation? President Obama’s Race to the Top initiative expanded the use of these measures, which attempt to identify how much each teacher contributes to student test scores. In doing so, the initiative embraced and expanded the controversial reliance on standardized tests that started under President Bush’s No Child Left Behind.
In many respects, The Race was well designed. It addresses an important problem – the vast majority of teachers report receiving limited quality feedback on instruction. As a competitive grants program, it was voluntary for states to participate (though involuntary for many districts within those states). The administration also smartly embraced the idea of multiple measures of teacher performance.
But they also made one decision that I think was a mistake. They encouraged—or required, depending on your vantage point—states to lump value-added or other growth model estimates together with other measures. The raging debate since then has been over what percentage of teachers’ final ratings should be given to value-added versus the other measures. I believe there is a better way to approach this issue, one that focuses on teacher evaluations not as a measure, but rather as a process.
The idea of combining the measures has some advantages. For example, as I wrote in mybook on about value-added measures, combined measures have greater reliability and probably better validity as well. But there is also one major issue: Teachers by and large do not like or trust value-added measures. There are some good reasons for this: The measures are not very reliable and therefore bounce around from year to year in ways that have nothing to do with actual performance. There is more debate about whether the measures are, in any given year, providing useful information about “true” teacher performance (i.e., whether they are valid).
The larger problem is that policymakers have tended to look at the teacher evaluation problem like measurement experts rather than school leaders. Measurement experts naturally want validity and reliable measures—ones that accurately capture teacher effectiveness. School leaders, on the other hand, can and should be more concerned about whether the entire process leads to valid and reliable conclusions about teacher effectiveness. The process includes measures, but also clear steps, checks and balances, and opportunities to identify and fix evaluation mistakes. It is that process, perhaps as much as the measures themselves, that instills trust in the system among educators. But the idea of combining multiple measures has short-circuited discussion about how the multiple measures—and especially value-added—could be used to create a better process.
One possible process comes from the medical profession. It is common for doctors to “screen” for major diseases, using procedures that can identify all the people who do have the disease, but some who do not (the latter being false positives). Those who are positive on the screening test are given another “gold standard” test that is more expensive but almost perfectly accurate. They do not average the screening test together with the gold standard test to create a combined index. Instead, the two pieces are considered in sequence.
Ineffective teachers could be identified the same way.
Value-added measures could become the educational equivalent of screening tests. They are generally inexpensive and somewhat inaccurate. As in medicine, a value-added score, combined with some additional information, should lead us to engage in additional classroom observations to identify truly low-performing teachers and to provide feedback to help those teachers improve. If all else fails, within a reasonable amount of time, after continued observation, administrators could counsel the teacher out or pursue a formal dismissal procedure.
The most obvious problem with this approach is that value-added measures, unlike the medical screening tests, do not capture all potential low-performers. They are statistically noisy, for example, and so many low-performers will get high scores by chance. For this reason, value-added would not be the sole screener. Instead, some other measure could also be used as a screener. If teachers failed on either measure, then that would be a reason for collecting additional information. (This approach also solves another problem discussed later.)
There is a second way in which value-added could be used as a screener – not of teachers, but of their teacher evaluators. To explain how, I need to say more about the “other” measures in an evaluation system. Almost every school system that has moved to alternative teacher evaluations has chosen to also use classroom observations by peers, master teachers, and/or school principals. The Danielson Framework, PLATO, and others are now household names among educators. Classroom observations have many advantages: They allow the observer to take account of the local context. They yield information that is more useful to teachers for improving practice. And we can increase their reliability by observing teachers more often.
The difficulty is that these measures, too, have validity and reliability issues. Two observers can look at the same classroom and see different things. That problem is more likely when the observers vary in their training. Also, some observers might know teachers’ value-added scores and let those color their views during the observations – they might think, “I already know this teacher is not very good so I will give her a low score.”
Value-added measures might actually be used to fix these problems with classroom observations. To see how, note that researchers have found consistent, positive correlations between value-added and classroom observations scores. They are far from perfect correlations (mainly because of statistical noise), but they provide a benchmark against which we can compare (validate, if you will) the scores across individual observers. Inaccurate classroom observation scores would likely show up as low correlations with value-added. Conversely, if observers were having their scores influenced by value-added, then the correlations might be very high, which might also be a red flag.
In these cases, an additional observer might be used to make sure the information is accurate. In other words, value-added can screen the performance of not only teachers, but observers as well. Used in these ways, value-added would be a key part of the system but without being the determining factor in personnel decisions.
0 komentar:
Posting Komentar