No, Software Still Can’t Grade Student Essays


One of the great white whales of computer-managed education and testing is the dream of robo-scoring, software that can grade a piece of writing as easily and efficiently as software can score multiple choice questions. Robo-grading would be swift, cheap, and consistent. The only problem after all these years is that it still can’t be done.

Still, ed tech companies keep claiming that they have finally cracked the code. One of the people at the forefront of debunking these claims is Les Perelman. Perelman was, among other things, the Director of Writing Across the Curriculum at MIT before he retired in 2012. He has long been a critic of standardized writing tests; he has demonstrated that he can predict an essay's score by looking at it from across the room (spoiler alert: it's all about the length of the essay). In 2007, he gamed the SAT essay portion with an essay about how "American president Franklin Delenor Roosevelt advocated for civil unity despite the communist threat of success."

He's been a particularly staunch critic of robo-grading, debunking studies and defending the very nature of writing itself. In 2017, at the invitation of Australia's teachers union, Perelman highlighted the problems with a plan to robo-grade that country's already-faulty national writing exam.

This has annoyed some proponents of robo-grading (said one writer whose study Perelman debunked, “I’ll never read anything Les Perelman ever writes”). But perhaps nothing that Perelman has done has more thoroughly embarrassed robo-graders than his creation of BABEL.

All robo-grading software starts out with one fundamental limitation—computers cannot read or understand meaning in the sense that human beings do. So software is reduced to counting and weighing proxies for the more complex behaviors involved in writing. In other words, the computer cannot tell if your sentence effectively communicates a complex idea, but it can tell if the sentence is long and includes big, unusual words.
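To make the proxy idea concrete, here is a purely hypothetical sketch of such a scorer. The function name and the weights are invented for illustration and do not correspond to any vendor's actual algorithm; the point is that every feature it rewards is a surface proxy (sheer length, long sentences, big words), and nothing in it reads for meaning:

```python
import re

def proxy_score(essay: str) -> float:
    """Score an essay on surface proxies only: total length, sentence
    length, and average word length. A toy illustration of the
    counting-and-weighing approach, not any real product's algorithm."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    if not words or not sentences:
        return 0.0
    total_len = len(words)                                   # sheer length
    avg_sentence_len = len(words) / len(sentences)           # long sentences
    avg_word_len = sum(len(w) for w in words) / len(words)   # "big" words
    # Arbitrary weights; a real product would fit these to a training
    # set of human-scored essays.
    raw = 0.01 * total_len + 0.2 * avg_sentence_len + 0.5 * avg_word_len
    return min(6.0, raw)  # cap at a 6-point scale
```

Feed this scorer a short, clear statement and then a longer run of polysyllabic nonsense, and the nonsense wins, which is exactly the weakness BABEL exploits.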

To highlight this feature of robo-graders, Perelman, along with Louis Sobel, Damien Jiang and Milo Beckman, created BABEL (Basic Automatic B.S. Essay Language Generator), a program that can generate a full-blown essay of glorious nonsense. Given the key word “privacy,” the program generated an essay made of sentences like this:

Privateness has not been and undoubtedly never will be lauded, precarious, and decent. Humankind will always subjugate privateness.

The whole essay was good for a 5.4 out of 6 from one robo-grading product.

BABEL was created in 2014, and it has been embarrassing robo-graders ever since. Meanwhile, vendors keep claiming to have cracked the code; four years ago, the College Board, Khan Academy and Turnitin teamed up to offer automatic scoring of your practice essay for the SAT.

Mostly these software companies have learned little. Some keep pointing to research claiming that humans and robo-scorers get similar results when scoring essays, which is true when the humans are scorers trained to follow the same algorithm as the software rather than expert readers. And then there's this curious piece of research from the Educational Testing Service and CUNY. The opening line of the abstract notes that "it is important for developers of automated scoring systems to ensure that their systems are as fair and valid as possible." The phrase "as possible" is carrying a lot of weight, but the intent seems good. That, however, is not what the research turns out to be about. Instead, the researchers set out to see whether they could catch BABEL-generated essays. In other words: rather than try to do our jobs better, let's try to catch the people highlighting our failure. The researchers reported that they could, in fact, catch the BABEL essays with software; of course, one could also catch the nonsense essays with expert human readers.

Partially in response, the current issue of The Journal of Writing Assessment presents more of Perelman's work with BABEL, focusing specifically on e-rater, the robo-scoring software used by ETS. BABEL was originally set up to generate 500-word essays. This time, because e-rater treats length as an important quality of writing, longer essays were created by taking two short essays generated from the same prompt words and simply shuffling their sentences together. The findings were similar to those of the earlier BABEL research.
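The article does not spell out the exact shuffling procedure, but one plausible reading, interleaving the sentences of two BABEL outputs, can be sketched in a few lines. The helper name below is invented, and the real study's procedure may differ:

```python
import itertools

def merge_essays(essay_a: str, essay_b: str) -> str:
    """Interleave the sentences of two short essays into one longer one.
    A hypothetical sketch of the shuffling step described in the article;
    the study's exact procedure may differ."""
    def sentences(essay: str) -> list[str]:
        return [s.strip() + "." for s in essay.split(".") if s.strip()]
    merged = []
    for a, b in itertools.zip_longest(sentences(essay_a), sentences(essay_b)):
        merged.extend(s for s in (a, b) if s is not None)
    return " ".join(merged)
```

Since each input is already fluent-sounding nonsense, the merged essay stays just as incoherent while roughly doubling in length, the one feature the scorer rewards most.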

The software did not care about argument or meaning. It did not notice some egregious grammatical mistakes. It rewarded the length of essays, along with the length and number of paragraphs (which ETS calls "discourse elements" for some reason). It favored the liberal use of long and infrequently used words. All of this cuts directly against the tradition of lean and focused writing. It favors bad writing. And it still gives high scores to BABEL's nonsense.

The ultimate argument against Perelman's work with BABEL is that his submissions are "bad faith writing." That may be, but the use of robo-scoring is bad faith assessment. What does it even mean to tell a student, "You must make a good faith attempt to communicate ideas and arguments to a piece of software that will not understand any of them"?

ETS claims that the primary emphasis is on “your critical thinking and analytical writing skills,” yet e-rater, which does not in any way measure either, provides half the final score; how can this be called good faith assessment?

Robo-scorers are still beloved by the testing industry because they are cheap and quick and allow the test manufacturers to market their product as one that measures higher-level skills than simply picking a multiple-choice answer. But the great white whale, the software that can actually do the job, still eludes them, leaving students to deal with scraps of pressed whitefish.

source: forbes