What is the Granularity of Plagiarism?

Is micro-plagiarism a thing?

Jan 09, 2023

I have a friend who once developed an original axiom that explains the entire internet search industry and, not incidentally, the stubborn existence of cluttered desks. I've always called it "Leeke's Law."

If you can find something fast enough, there's no reason to organize.

Early in the days of the internet, there was a push to manually tag all the content that people were creating, to make the information on the internet easier to organize and find. But that effort, though not gone entirely, was ultimately defeated by Google et al. because people could, in general, find what they were looking for without going to the trouble of explicitly organizing all their data.

What computers are good for, besides doing math, is performing repetitive micro tasks, and doing those tasks faster than any human being would or could possibly do them.

If you've been reading anything online of late, you have probably come across some reference to ChatGPT. ChatGPT is an online service created by the folks at OpenAI. It presents a chat-like human interface to access an underlying AI model trained using billions and billions of text documents. The AI model has the ability to generate coherent text in response to text input from a user. So you can, for example, ask it to write a limerick about some subject. And, lo and behold, you'll get back a limerick.

As it happens, ChatGPT is maxed out on users at the moment, but I'm a paying OpenAI user so I have some access to the underlying text model without going through the chat interface.

This morning, I decided to play around with the text model and asked it to do the following:

Write a sermon on substitutionary atonement.

Here is what I got back:

My dear brothers and sisters in Christ,
Today I would like to talk to you about the doctrine of substitutionary atonement. The doctrine of substitutionary atonement is the belief that Jesus died on the cross as a substitute for us, taking upon himself our sins and the punishment that we deserved. This doctrine is based on the teachings of scripture and it is a fundamental belief of Christianity.
The Bible teaches that we are all sinners and that the penalty for sin is death. We are unable to pay for our sins and so, God sent his son Jesus to be our substitute. Through his death on the cross, Jesus paid the penalty for our sins and thus made a way for us to be reconciled to God. Jesus died in our place, taking upon himself the punishment that we deserved, so that we could be forgiven and have eternal life.
The doctrine of substitutionary atonement is a profound and beautiful truth that has been integral to the Christian faith for centuries. It is a reminder of the incredible love and grace of God. We can look to the cross and be reminded of the great sacrifice that Jesus made for us out of love. We can be comforted in knowing that, through the death of Jesus, our sins are forgiven and...

What is fascinating, and even a little unnerving, about these emerging AI models is their ability to produce coherent textual responses to almost any question.

As I've written about before, they accomplish this by ingesting a massive corpus of text, far more than has been feasible until just this moment in technological history, and they compute subtle statistical associations observed within the corpus of text as a whole. This facilitates the ability to draw from the statistical model and regurgitate a body of text that is uncannily coherent and responsive to a user's input.

A large amount of digital ink has been spilled by people beginning to wring their hands about the implications of ChatGPT within education. After all, if trained AI models can generate coherent essays (narrator: "they can") then how will anyone ever know if a student paper is original?

I want to raise a larger and possibly thornier question: what is plagiarism in the 21st century?

I raise this question because my understanding of these text models is that they do not produce anything original (e.g. they don't invent new words) but, rather, they create statistically informed recombinations of fragments of the text within the documents used to train them.
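To make that intuition concrete, here is a toy sketch in Python. It is emphatically not how ChatGPT works: real models are neural networks operating on sub-word tokens, not lookup tables. The three-sentence corpus and the function names are made up for illustration. What it does show is the basic idea of recombination, since every adjacent pair of words in the output was seen somewhere in the training text.

```python
import random
from collections import defaultdict

def build_bigram_table(corpus):
    """Map each word to the list of words that followed it in the corpus."""
    table = defaultdict(list)
    for document in corpus:
        words = document.split()
        for current_word, next_word in zip(words, words[1:]):
            table[current_word].append(next_word)
    return table

def generate(table, start_word, max_words, seed=0):
    """Walk the table, picking each next word from the observed followers."""
    rng = random.Random(seed)
    output = [start_word]
    while len(output) < max_words:
        followers = table.get(output[-1])
        if not followers:  # the last word never had a follower in the corpus
            break
        output.append(rng.choice(followers))
    return " ".join(output)

# Hypothetical miniature "training corpus", invented for this example.
corpus = [
    "the penalty for sin is death",
    "the doctrine is a profound truth",
    "sin is a burden",
]

table = build_bigram_table(corpus)
text = generate(table, "the", 8)
print(text)
```

Nothing in the output is new at the level of word pairs, yet the sentence as a whole may appear in none of the source documents. That gap is exactly where the plagiarism question gets interesting.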

Let me illustrate with the text of the AI-generated sermon above. I decided to take a phrase that stood out to me from the text and search for it using Google. The phrase I chose was "atonement is a profound and beautiful truth".

The Google results I got back showed that this exact phrasing in regard to the atonement is contained within a book, indexed by Google Books, called The Kingdom of Cults Handbook.

So this raises more than a few questions for me. Most notably, are these textual AI models really just a form of micro, or fine-grained, plagiarism?
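One rough way to put a number on "fine-grained" is to count how many runs of n consecutive words a text shares with a candidate source, for different values of n. The sketch below does that in Python. The `candidate` string is taken from the sermon above; the `source` string is a sentence I made up for illustration and is not a quotation from any book. The function names are likewise my own.

```python
def word_ngrams(text, n):
    """Return the set of runs of n consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_fraction(candidate, source, n):
    """Fraction of the candidate's distinct n-word runs that also occur in the source."""
    candidate_runs = word_ngrams(candidate, n)
    if not candidate_runs:
        return 0.0
    return len(candidate_runs & word_ngrams(source, n)) / len(candidate_runs)

# From the generated sermon above.
candidate = "the doctrine of substitutionary atonement is a profound and beautiful truth"
# Hypothetical source sentence, invented for this example.
source = "some writers say the atonement is a profound and beautiful truth worth study"

for n in (1, 3, 7, 8):
    print(n, round(shared_fraction(candidate, source, n), 2))
```

With these two sentences the shared fraction shrinks as n grows and reaches zero at eight words. At the level of single words almost everything is "copied"; at the level of long runs almost nothing is. Where along that scale copying becomes plagiarism is the question.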

Historically, plagiarism has been understood to be the uncredited use of someone else's written work. But historically, humans have not had an incentive to piece together a plagiarized document from tiny pieces of many different sources. It is less labor-intensive just to write the document oneself than to plagiarize in tiny fragments. But recall the comment I introduced this post with: computers are really, really good at doing repetitive micro-tasks faster than human beings.

Lots of people are wondering whether people will use ChatGPT in lieu of doing their own work. But what if ChatGPT is itself just a giant digital plagiarist? Who, at the end of the day, actually owns the individual fragments of text within those documents that ChatGPT is mining for resources to re-order into a response?

What is the granularity of uncredited text, within a document, that makes the producer of that document guilty of plagiarism?

This is not a rhetorical question.

