
The Pesky Challenge of Evaluating AI Outputs

A man and a woman are sitting on a bench in a park and the man is holding a laptop.  He says “The AI Bot says I can make a million dollars just by drinking tomato juice twice a day.” The woman cheerfully responds “That sounds totally legit.”
Image source: Microsoft Office Stock Image Library

One of the things that has bothered me since the beginning of the AI conversation is that most discussions of using AI or LLM outputs contain some phrase about the importance of “evaluating the output to make sure it’s correct” or something along those lines.  Pretty much any responsible writing about AI contains a reference to the importance of not accepting the AI output at face value, but reading or viewing it to make sure it’s okay.

But here’s the thing.  They say that very casually LIKE IT’S NOT THE HARD PART.

First of all, you need the expertise to judge an output, and second you need the discipline to exert the effort required to assess an output.

One of the early ChatGPT efforts I saw was somebody using the AI tool to write a short textbook on the teaching method of Direct Instruction (this is not intended as a slight on that person – it was clearly an experiment at the same time people were experimenting with ChatGPT to write Shakespearean sonnets about their favorite dog breed). 

I have no reason to believe this was an effort to really create a textbook, but it gives us an example to work with.  If you were actually using it as a tool to help create an actual textbook, what questions would you need to ask (ethical issues aside)?

The first question would probably need to be “Is this an accurate resource about Direct Instruction?”  Possibly this person had the expertise to evaluate the output, but it seems like something that you would want an expert to review before distributing widely.

Second, you’d need to have the discipline to read the whole thing.

This scientific article was making the rounds on the internet last week.  I’m utterly unqualified to comment on the accuracy of this article, but even I know if you start your introduction with the phrase “Certainly, here is a possible introduction for your topic: Lithium-metal batteries are…” then it means somebody wasn’t being careful with the copy-and-paste function.  Academic writing can be a tedious process, and AI might be able to help with that, but if the authors of the article aren’t actually reading what the AI produces, that’s a problem.

It’s part of human nature to accept defaults.  This isn’t always the case, but it is common enough that we should be very concerned about people having the discipline to stop and review AI outputs.  I’m not convinced that it’s a realistic goal – quick copy and paste is too easy a behavior to admonish people out of – but that means we need to have other safeguards in place.  Either that means guardrails built into the AI itself, which are being developed, or it means having at least a spot check or audit process in a workplace context. 

I know one of the arguments has been that error rates in AI-generated material can be lower than error rates in human-generated material (Jennifer Solberg discusses a few good examples in this podcast), and should we hold the AI to a 100% standard when we don’t hold humans to that standard? First, YES WE SHOULD. Second, I suspect that there is more likely to be an internal logic with human errors, but that’s a complicated topic that is probably very context dependent.

Risk level should drive this level of vigilance. What is the consequence of this being wrong? Is it a wonky social media post, or is it a missed cancer diagnosis, or is it an inaccurately placed drone attack?

More thoughts to come on this, but for now, I think there are a few questions we should be asking:

  • Does this person have the knowledge and expertise to judge this output?
  • Is it reasonable to expect this person has the discipline to evaluate the outputs in detail?
  • What is the risk if output errors are not caught?
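
For anyone who wants to turn those three questions into a workplace process, here is a rough sketch of what a triage rule could look like. Everything in it is an assumption made for illustration: the names `Risk`, `ReviewContext`, and `review_plan` are hypothetical, the three risk tiers are arbitrary, and the order of the checks is just one reasonable choice, not a tested policy.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    """Hypothetical risk tiers; a real organization would define its own."""
    LOW = 1     # e.g. a wonky social media post
    MEDIUM = 2
    HIGH = 3    # e.g. a medical or safety-critical decision


@dataclass
class ReviewContext:
    """Answers to the three questions for one AI output (illustrative only)."""
    reviewer_has_expertise: bool    # can this person judge the output?
    full_review_is_realistic: bool  # will they realistically read all of it?
    risk: Risk                      # what happens if an error slips through?


def review_plan(ctx: ReviewContext) -> str:
    """Return a suggested review step. The ordering of checks is an assumption."""
    if ctx.risk is Risk.HIGH:
        # High stakes: do not rely on a single reviewer's diligence.
        return "full expert review plus independent audit"
    if not ctx.reviewer_has_expertise:
        return "route to a reviewer with subject expertise"
    if not ctx.full_review_is_realistic:
        return "add a spot-check or audit step"
    return "author reviews before use"


if __name__ == "__main__":
    ctx = ReviewContext(reviewer_has_expertise=True,
                        full_review_is_realistic=False,
                        risk=Risk.MEDIUM)
    print(review_plan(ctx))
```

The point of the sketch is only that the questions can be asked explicitly, before the output is used, rather than leaving "evaluate the output" as a casual afterthought.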

1 thought on “The Pesky Challenge of Evaluating AI Outputs”

  1. Great points! That scientific article missed my corner of the Internet, but oh dear, is it a good example of what you have described. AI will get better at vetting writing produced by other AIs—for instance, using Perplexity (which analyzes reliable sources all around the Internet) to vet text produced by powerful but Internet-blind AI models (such as Claude 3).

    Human supervision is still necessary to vet AI-produced writing. Networks of agents (such as the ones that are used here) may make human supervision less necessary, and my concern here would be deskilling in people who have not yet developed the academic skills necessary to separate the academic wheat from academic chaff. How will they develop such a skill if they didn’t work long and hard at it before very competent AI writers came along and made it seemingly unnecessary?

