AI, Testing and the False Paradise of Plausible Results

A few months ago, while filling out a routine form, I was confronted with an interesting question that had no immediately available answer: what was the first day of class for my freshman year of college? Usually, one would only need the year, or perhaps the month and year. This was the first time I had ever seen someone ask for the full date. (I suspect this is because the web form was built with a framework that had an easy-to-use input component for a full date, and nobody wanted to wire up a custom component for month and year, but that is beside the point.) As a card-carrying member of Generation X, I began my freshman year in the fall of 1988, well before the Internet made every document instantly searchable. I spent more than an hour Googling for a document or school newspaper from the period that would point me in the proper direction, all to no avail. Out of desperation and no small amount of curiosity, I asked ChatGPT, offering as much detail as I could about the university and the time frame. After a moment, I received an answer: August 31st of 1988. No citations, no explanations, just an answer. Clean, terse and definitive.

But how could I trust this with no source to back it up? I did recall that my school year started much later than those of other universities in the area, so such a late date made sense. Did ChatGPT know that? Was the real answer stuffed into some dark corner of its training data? Or was it just making up the whole thing, like a beleaguered parent fabricating an answer to get a child to quietly go away? In this case, I was caught in a trap of multi-dimensional ignorance. I didn't know enough to know what I didn't know, and so I had no reason not to trust what could very well be false information. After all, the result was highly plausible, and no one would blame me for accepting it as the truth.

The stakes being medium-to-low in this case, I was happy to use the information, dubious though it may have been. However, the result continued to bother me. This concern goes well beyond artificial intelligence, both the general and the generative sort. In fact, most people today happily confuse basic algorithms with AI because, to them, they are the same thing. Information goes in. Information comes out. It must be the truth, since it came from a computer.

More and more often, we see companies and governmental entities relying on both algorithms and AI in ways we never imagined a generation ago. In many ways, this can lead to greater efficiency and, thanks to pattern recognition and diffusion models, remarkable results. However, the question of veracity remains unresolved. As more of our software becomes a black box of self-assembling logic, how do we test the output? Do we test the output?

We live in the future of definite uncertainty. For nearly every question we can imagine, an answer is just a tap away. As long as the answer seems like it could be correct, we do not check to see if it is actually correct. Welcome to the false paradise of plausible results.

For example, we now know that the blood-testing devices produced by the now-discredited Theranos were deeply flawed, but until doctors began raising questions, no one asked why the machines kept telling patients that their potassium levels were out of whack. In some cases, people took action based on these results and did themselves harm, but who could blame them? The results seemed plausible. Thankfully, these devices are no longer installed in hundreds of pharmacies, doling out terrible advice to human patients. However, human gullibility in the face of digital authority figures remains a major cause for concern.

As software engineers, we can fall into the same trap, of course. We pride ourselves on being professional, thorough and detail-oriented. We build our applications with extensive unit and integration testing to ensure functionality in production. That all sounds great, but how do we test our tests?

Process improvement for testing is a major struggle for many development teams, including every team I've worked with in my career. When we run our tests, they all pass and the results are what we expect: plausible. With that in mind, we may think improving our tests is unnecessary. We can look at the coverage statistics and sleep soundly knowing that our lines and branches are being exercised, but those numbers don't tell us anything about the quality of the tests. When you look at your team's tests, ask yourself: are they written with positive and negative paths, or just the "happy path"? Are we just checking for output based on the mock input we provided?
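To make the distinction concrete, here is a minimal sketch in Python using pytest; the apply_discount function and its validation rules are hypothetical, invented purely for illustration. The first test is the happy path that keeps coverage numbers green; the other two exercise negative paths, checking that bad input is rejected rather than quietly producing a plausible-looking number.

    # test_discounts.py -- run with: pytest test_discounts.py
    import pytest

    def apply_discount(price: float, percent: float) -> float:
        """Return the discounted price, rejecting nonsense inputs."""
        if price < 0:
            raise ValueError("price cannot be negative")
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        return round(price * (1 - percent / 100), 2)

    def test_happy_path():
        # The kind of test that makes coverage look great and proves little:
        # known-good input, expected output.
        assert apply_discount(100.0, 10.0) == 90.0

    def test_negative_price_is_rejected():
        # Negative path: the function should refuse bad input outright.
        with pytest.raises(ValueError):
            apply_discount(-5.0, 10.0)

    def test_out_of_range_percent_is_rejected():
        # Another negative path: a discount over 100 percent makes no sense.
        with pytest.raises(ValueError):
            apply_discount(100.0, 150.0)

All three functions will be counted by a line-coverage tool, but only the last two would catch a regression that lets bad data slide through with a confident answer attached.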

A potential approach to this problem is similar to the advice you might give the average consumer about AI: adopt an attitude of practiced skepticism. Here's a scary new rule you can try adding to your team's working agreement: every new feature should include new or updated tests.

With that rule in place, the entire agile team is invested in making sure testing is included in your point estimates. For example, if the story itself has very little lift but the tests will be tricky, be sure to estimate for all of it. This is why I like agile teams where all engineers vote on estimates and come to a consensus. This can take a bit longer than simply having a senior engineer assign a number, but it ensures that if one member of the team has an incorrect notion of the scope of the story, the team can hash it out in real time. For remote teams, this can be especially helpful.

All of this plays out in a context where many developers just hate writing tests or don't see enough value in them to learn new patterns and mocking techniques. I've worked with engineers who ignore test files while reviewing pull requests because "they don't matter as long as they pass." Needless to say, I did not agree with that sentiment. Nor did I enjoy writing dozens of tests later on their behalf to fill the gap.

It's not enough that we get what we expect from our tests. If we look at our test scripts as nodding "yes men" to tell us what we want to hear, it won't be long before our code slips into the false paradise of plausible results. At the end of the day, tests have to be useful. If we can manage to build consensus around the value of testing and process improvement and dig a bit deeper into the subject with our colleagues, we will all be in better shape in sprints and years to come.