The Great Teacher Evaluation Evaluation: New York Edition
A couple of weeks ago, the New York State Education Department (NYSED) released data from the first year of the state’s new teacher and principal evaluation system (called the “Annual Professional Performance Review,” or APPR). In what has become a familiar pattern, this prompted a wave of criticism from advocates, much of it focused on the proportion of teachers in the state who received the lowest ratings.
To be clear, evaluation systems that produce non-credible results should be examined and improved, and that includes those that put implausible proportions of teachers in the highest and lowest categories. Much of the commentary surrounding this and other issues has been thoughtful and measured. As usual, though, there have been some oversimplified reactions, as exemplified by this piece on the APPR results from Students First NY (SFNY).
SFNY notes what it considers to be the low proportion of teachers rated “ineffective,” and points out that there was more differentiation across rating categories for the state growth measure (worth 20 percent of teachers’ final scores), compared with the local “student learning” measure (20 percent) and the classroom observation components (60 percent). Based on this, they conclude that New York’s “state test is the only reliable measure of teacher performance” (they are actually talking about validity, not reliability, but we’ll let that go). Again, this argument is not representative of the commentary surrounding the APPR results, but let’s use it as a springboard for making a few points (most of which are not particularly original).
First, and most basically, if estimates derived from the state tests are the “only reliable measure of teacher performance,” we’re in big trouble, since it means that most teachers simply cannot be evaluated (as they don’t teach in tested grades/subjects). In that case, it would seem that the only responsible recommendation would be to scrap the entire endeavor. I doubt that’s what SFNY is trying to argue here, and so it follows that they might want to be more careful with their rhetoric.
If, on the other hand, they had advocated for increasing the weight, or importance, assigned to the state growth model results, that would at least be a defensible suggestion (approached crudely though it sometimes is). But weighting, even putting aside all the substantive issues (e.g., value judgments, reliability), is a lot more complicated in practice than in theory. In reality, the “true weight” of any given measure depends on how much it varies compared with the other measures.
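To see why in concrete terms, here is a minimal sketch (in Python) of how nominal weights interact with the spread of each measure. The 20/20/60 split matches the APPR design described above, but the standard deviations are purely hypothetical, chosen only to illustrate the mechanism; the components are also simulated as independent, which real evaluation measures are not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical number of teachers

# Nominal APPR weights: state growth 20%, local measure 20%,
# classroom observations 60%.
weights = {"state_growth": 0.20, "local_measure": 0.20, "observations": 0.60}

# Hypothetical spreads: suppose state growth scores vary widely
# across teachers while observation scores are tightly clustered.
sds = {"state_growth": 15.0, "local_measure": 5.0, "observations": 2.0}

scores = {k: rng.normal(70, sds[k], n) for k in weights}
final = sum(weights[k] * scores[k] for k in weights)

# With independent components, each one's share of the variance in
# final scores is (weight * sd)^2 over the total.
total_var = sum((weights[k] * sds[k]) ** 2 for k in weights)
for k in weights:
    share = (weights[k] * sds[k]) ** 2 / total_var
    corr = np.corrcoef(scores[k], final)[0, 1]
    print(f"{k}: nominal {weights[k]:.0%}, variance share {share:.0%}, "
          f"corr with final score {corr:.2f}")
```

Under these made-up numbers, the state growth measure’s nominal 20 percent accounts for roughly 80 percent of the variation in final scores, while the observations’ nominal 60 percent accounts for under 15 percent. The point is not these particular figures, but that a measure’s effective influence is governed by how much it varies, not just by its assigned weight.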