AI is not responsible for the rise of bad writing assessment, but it is promising to provide the next step in that little journey to hell.
Let me offer a quick recap of bad writing assessment, much of which I experienced first hand here in Pennsylvania. The Keystone State a few decades back launched the Pennsylvania System of School Assessment (PSSA) writing assessment. Assessing those essays from across the state was, at first, a pretty interesting undertaking-- the state selected a whole boatload of teachers, brought them to a hotel, and had them spend a weekend scoring those assessments.
I did it twice. It was pretty cool (and somewhere, I have a button the state gave us that says "I scored 800 times in Harrisburg"). Not everything about it was impressive, though. Each essay was scored by two teachers, and for their scores to "count" they had to be identical or adjacent-- and on a five-point scale, the odds are good that you'll meet that standard pretty easily. We were given a rubric and trained for a few hours in "holistic grading", and the rubric was pretty narrow and focused, but still left room for our professional judgment.
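(A quick back-of-envelope sketch of that claim-- my own illustration, not anything from the state's scoring manual: even if two scorers each picked a score on the 1-5 scale independently and completely at random, they'd land identical or adjacent about half the time.)

```python
# Back-of-envelope check of the "identical or adjacent" standard.
# Assumption (mine, for illustration only): two scorers each pick a
# score from 1-5 independently and uniformly at random.
from itertools import product

scale = range(1, 6)                  # the five-point rubric scale
pairs = list(product(scale, scale))  # all 25 possible score pairs
agree = sum(1 for a, b in pairs if abs(a - b) <= 1)
print(f"{agree}/{len(pairs)} = {agree / len(pairs):.0%}")  # 13/25 = 52%
```

And real scorers trained on the same narrow rubric are nowhere near random, so the actual "agreement" rate looks even more flattering than that.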
But then the state, like many others, stopped using teachers. It was easier to put an ad on craigslist, hire some minimum wage workers, train them for half a day, and turn them loose. Cheaper, and they didn't bring up silly things like whether or not a student's argument made sense or was based on actual true facts. (Here’s a great first-person account of that work-- thanks, Seth K.)
Pennsylvania used this system for years, and my colleagues and I absolutely gamed it. We taught our students, when writing their test essays, to do a couple of simple things.
* Fill up the whole page. Write lots, even if what you're writing is repetitive and rambling.
* Use a couple of big words (I was fond of "plethora"). It does not matter whether you use them correctly or not.
* Write neatly (in those days the essays were handwritten).
* Repeat the prompt in your first sentence. Do it again at the end. Use five paragraphs.
Our proficiency rates were excellent, and they had absolutely nothing to do with our students' writing skills and everything to do with gaming the system.
The advent of computer scoring of essays has simply extended the process, streamlining all of its worst qualities. And here comes the latest update on that front, from Tamara Tate, a researcher at the University of California, Irvine, and an associate director of her university’s Digital Learning Lab. Her latest research-- "Can AI Prove Useful In Holistic Essay Scoring"-- is written up by Jill Barshay in Hechinger.
The takeaway is simple-- in a fairly big batch of essays, ChatGPT was identical or within a point (on a six-point scale) of human scorers (actually matching 40% of the time, compared to 50% for humans). This is not the first research to present this conclusion (though much previous "research" came from companies trying to sell their robo-scorer), with some claims reaching the level of absurdity.
The criticism of this finding is the same one some of us have been expressing for years-- it says essentially that if we teach humans to score essays like a machine, it's not hard to get a machine to also score essays like a machine. This seems perfectly okay to people who think writing is just a mechanical business of delivering probable word strings. Take this defense of robo-grading from folks in Australia who got upset when Dr. Les Perelman (the giant in the field of robograding debunkery) pointed out their robograder was junk:
He rightly suggested that computers could not assess creativity, poetry, or irony, or the artistic use of writing. But again, if he had actually looked at the writing tasks given students on the ACARA prompts (or any standardized writing prompt), they do not ask for these aspects of writing—most are simply communication tasks.
Yes, their "defense" is that the test only wants bad-to-mediocre writing anyway, so what's the big deal?
The search for a good robograder has been ongoing and unsuccessful, and Barshay reports this piece of bad news:
Earlier versions of automated essay graders have had higher rates of accuracy. But they were expensive and time-consuming to create because scientists had to train the computer with hundreds of human-graded essays for each essay question. That’s economically feasible only in limited situations, such as for a standardized test, where thousands of students answer the same essay question.
So, the industry will be trying to cut corners because it's too expensive to do the job even sort of well-ish.
Tate suggests that teachers could "train" ChatGPT on some sample essays, but would that not create the effect of requiring students to try to come close to those samples? One of Perelman's regular tests has been to feed a robograder big-word nonsense, which frequently gets top scores. Tate says she hasn't seen ChatGPT do that; she does not say that she's given it a try.
And Tate says that ChatGPT can't be gamed. But then later, Barshay writes:
The next step in Tate’s research is to study whether student writing improves after having an essay graded by ChatGPT. She’d like teachers to try using ChatGPT to score a first draft and then see if it encourages revisions, which are critical for improving writing. Tate thinks teachers could make it “almost like a game: how do I get my score up?”
Yeah, that sounds like gaming the system to me.
Tate has some other odd observations, like the idea that "some students are too scared to show their writing to a teacher until it's in decent shape," a problem more easily solved by requiring them to turn in a rough draft than by running it by ChatGPT.
There are bigger questions here, really big ones, like what happens to a student's writing process when they know that their "audience" is computer software? What does it mean when we undo the fundamental function of writing, which is to communicate our thoughts and feelings to other human beings? If your piece of writing is not going to have a human audience, what's the point? Practice? No, because if you practice stringing words together for a computer, you aren't practicing writing, you're practicing some other kind of performative nonsense.
As I said at the outset, the emphasis on performative nonsense is not new. There have always been teachers who don't like teaching writing because it's squishy and subjective and personal-- there is not, and never will be, a Science of Writing--plus it takes time to grade essays. I was in the classroom for 39 years--you don't have to tell me how time-consuming and grueling it is. There will always be a market for performative nonsense with bells and whistles and seeming-objective measurements, and the rise of standardized testing has only expanded that market.
But it's wrong. It's wrong to task young humans with the goal of satisfying a computer program with their probable word strings. And the rise of robograders via large language models just brings us closer to a future that Barshay hints at in her final line:
That does give me hope, but I’m also worried that kids will just ask ChatGPT to write the whole essay for them in the first place.
Well, of course they will. If a real human isn't going to bother to read it, why should a real human bother to write it? And so we slide into the Kafkaesque future in which students and teachers sit silently while ChatGPT passes essays back and forth between output and input in an endless, meaningless loop.
I absolutely agree with this post. Evaluations should not be done by AI if the real goal is to teach children to think independently & critically with real intelligence. This is also one of the reasons that I don’t use AI on my Substack publications: I want a real connection with real people.
Absolutely brilliant post! And on the topic of automated scoring of essays, I wrote this almost exactly 8 years ago regarding the RI state assessment (then the PARCC):
Unfortunately for us in RI, we have a Strategic Plan for Public Education 2015-2020 that emphasizes “innovative” teaching/learning via digital programming. In addition, our Commissioner of Education, Ken Wagner, is enthusiastically committed to the full monty of digitized learning, going so far as to praise the use of automated scoring for essay responses on the state assessment, the PARCC (Partnership for Assessment of Readiness for College and Careers). Commissioner Wagner’s remarks on this topic can be found on the video of the April 5, 2016 meeting of the RI Council on Elementary and Secondary Education.
When Wagner was making his glowing comments about how great automated scoring is, and how fortunate we are in RI to be able to participate in this for scoring the PARCC, he declared that scoring by algorithm is more efficient and just as good as if not better than scoring by teachers. Hasn’t Wagner read the accounts of college graduates without teaching degrees or experience being recruited on Craigslist by Pear$on to score the tests? The quantity of tests they are expected to score and the rigid criteria they are expected to use cannot possibly result in valid scores for students. So maybe, yes, it doesn’t matter if you switch to computer scoring and don’t have to bother with providing low wages and no benefits to temporary workers who have no idea what they’re doing. Please see this post by Leonie Haimson for further insight on automated scoring: https://www.washingtonpost.com/news/answer-sheet/wp/2016/05/05/should-you-trust-a-computer-to-grade-your-childs-writing-on-common-core-tests/?postshare=841462480189631
...
There was another comment by the Commissioner that was jaw-dropping. When the high school student on the Council described a loss of instructional time due to insufficient technology in some schools during test administration, so that schedules are interrupted for weeks, Wagner insisted that it’s necessary to get all schools to use the online version, rather than the paper and pencil version of the PARCC, as soon as possible. He insisted that this is important not only for the testing, but for an underlying instructional purpose. He stated:
“We can’t think about student engagement unless we have a serious strategy around digital learning.”
I can’t think of a more misguided understanding of student engagement, can you?
https://resseger.wordpress.com/2016/05/27/story-telling-species/