Jay's blog

Computer Science Primer For Writers And Authors

It's a very tumultuous time for creative humans at the moment. Writers, musicians, and artists are being told they're at risk of being replaced by AI. The irony is not lost on me that rather than trying to automate away the drudgery of life, AI tech bros seem focused on automating the very things that are uniquely human, the things most of us would prefer to spend our time on. Those who seek to commercialize these AI tools must not put much value on art if they think it can be automated and mass-produced. It's not just an insult to those who produce art, but to those who consume it.

I read an article from Techdirt today that concerned me, however. A person named Benji Smith created a website called Prosecraft that analyzed commercial books and provided interesting statistics. A number of authors whose books' statistics appeared on the site were very unhappy about it. The resulting backlash led to Prosecraft being taken down.

The backlash, however, seems a bit misguided. If you're a computer scientist or a software developer, the reason it's misguided is probably obvious. If you're a layman, I completely understand your confusion. The Techdirt article doesn't really explain why lumping Prosecraft in with the AI that people are rightly concerned about is unfair.

In this post, I aim to break down the terminology so that even if you don't know a nibble from a byte, you can have an informed opinion about artificial intelligence.

What is Artificial Intelligence?

This question is at least as difficult as that classic philosophical quandary: what is art? The problem with defining artificial intelligence is rooted in the problem of defining intelligence. The answer is murky, at best.

If you hear the term "artificial general intelligence", it generally refers to a machine or computer program exhibiting human-like intelligence that can be applied across subjects and disciplines. Artificial general intelligence, or AGI, does not exist at this point in time. Whether or not it's even possible is still debated.

Besides being hard to define, artificial intelligence has historically been a moving goalpost. Someone will develop an algorithm (fancy talk for a set of instructions to solve a given problem) that does something previously thought to be difficult or near impossible. It's lauded as "artificial intelligence"! That is, until the solution is studied and becomes well understood. The field advances, and that algorithm is no longer considered AI.

This may be a cynical view, but I believe that AI, in practice, is any cutting-edge technology that provides human-like solutions to a problem that was, until recently, thought to be computationally difficult. As soon as it's no longer considered cutting edge, it's no longer considered AI.

Let's move away from AI as a term as it's all but meaningless.

What is currently meant by "AI"?

When you see "AI" being tossed around in the press, what's usually being discussed is the field of machine learning or ML. There are many different techniques being used right now, but you need to understand that this isn't magic. It's just statistical analysis.

No matter what techniques are being used, it all basically boils down to creating a large and complicated mathematical expression to accomplish a task. This is done through a process called "training". Training takes a set of data, referred to as training data, and uses it to adjust constants within the mathematical expression until the expression's accuracy improves.

Let's use a simple real-world example. Say I want an expression that will tell me whether or not there's a cat in a given image. My training data would be a large set of images composed of two smaller sets, one with cats and one without, and I've labeled each set accordingly. Training the model consists of turning each image into a sequence of numbers, running that sequence through the expression, and seeing what number comes out. That number can be thought of as the probability that there's a cat in the image. But we already know whether there's a cat in the image because we labeled it beforehand. If the model produces a result of, say, 0.62, or 62% confidence, for a cat picture, the training process will tweak constants in the expression to push that number closer to 1.
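To make "tweaking constants" concrete, here's a minimal sketch in Python. This is a toy logistic-regression classifier, nothing like a real image model, and the "images" are just made-up lists of four numbers, but the core loop is the same idea: run the numbers through the expression, measure how far off the answer is, and nudge the constants.

```python
import math
import random

# The "big mathematical expression": a weighted sum of the image's
# numbers, squashed into a 0-to-1 "cat probability".
def predict(weights, bias, pixels):
    score = bias + sum(w * x for w, x in zip(weights, pixels))
    return 1 / (1 + math.exp(-score))

def train(examples, steps=1000, learning_rate=0.1):
    n = len(examples[0][0])
    weights = [random.uniform(-0.1, 0.1) for _ in range(n)]
    bias = 0.0
    for _ in range(steps):
        for pixels, label in examples:  # label: 1 = cat, 0 = not a cat
            error = predict(weights, bias, pixels) - label
            # Nudge each constant a little, pushing the output toward the label.
            for i in range(n):
                weights[i] -= learning_rate * error * pixels[i]
            bias -= learning_rate * error
    return weights, bias

# Made-up training data: each "image" is just four numbers here.
labeled_images = [
    ([0.9, 0.8, 0.1, 0.2], 1),  # cat
    ([0.8, 0.9, 0.2, 0.1], 1),  # cat
    ([0.1, 0.2, 0.9, 0.8], 0),  # not a cat
    ([0.2, 0.1, 0.8, 0.9], 0),  # not a cat
]

weights, bias = train(labeled_images)
# After training, the constants are frozen. Classifying a new image just
# evaluates the expression; nothing about the model changes.
print(predict(weights, bias, [0.85, 0.75, 0.15, 0.25]))  # close to 1
```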

Only once training is complete can the model be used. In the example above, once my cat-picture model has been trained on all of my example pictures, I can start using it to classify images it has never seen before. Machine learning models are not like humans. When I use the model to classify pictures of cats it's never seen before, it's not "learning" anything new. The model no longer changes.1

That doesn't mean the model is set in stone. If I put together some new pictures of cats and, well, not cats, I can continue training where I left off; I don't necessarily need to start from scratch with a new model. But it wouldn't make sense to continue training the cat classifier on unknown pictures. If I give it a picture and it reports an 80% likelihood of a cat, feeding that result back in as training data would be a recipe for disaster. The meaning behind the training would lose cohesion. The model might turn into a classifier for furry things, or for contrasting objects, or just junk that doesn't really classify anything at all.

Generally speaking, the more training data we have to train the model, the more accurate the model will be.

Although breakthroughs are happening all the time in the field of machine learning as new techniques emerge, no breakthrough can change the fact that more training data corresponds to better results. There's an insatiable appetite for training data right now. That's a primary reason why services like Reddit and Facebook are getting protective over their data and who can access it, and why companies like Zoom, which technically have access to tons of valuable data that could be used for ML training, are trying to pivot to cash in on that data and the access to it.

It's also why individual creators are concerned about the implications of fair use laws. Just because I want to share my art with fans via the web, why should companies be able to use it for free to train computer programs that make art in my personal style and undercut my profits? It's an absolutely valid question.

Taken to a logical extreme, machine learning could be used as a form of intellectual property laundering. (Obligatory reminder that I am not a lawyer.) Machine learning models sometimes exhibit a problem called "overfitting", where a model memorizes its training data rather than generalizing from it, and can spit that training data back out verbatim. This is generally considered an undesirable flaw, but it happens quite often. If I train a model on a copyrighted work and it spits out a facsimile of it, is that covered under fair use? I honestly don't know.
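To make the memorization problem concrete, here's a toy sketch. A word-bigram "model" is nothing like the large models in the news, but it shows the failure mode in miniature: trained on a single sentence in which no word repeats, it can do nothing but reproduce its training data verbatim.

```python
import random
from collections import defaultdict

training_text = "the writer opened her notebook and began a brand new chapter"

# "Training": record which word follows which in the training data.
transitions = defaultdict(list)
words = training_text.split()
for current, following in zip(words, words[1:]):
    transitions[current].append(following)

# "Generation": start at the first word and repeatedly pick a recorded
# successor. Every word here has exactly one recorded successor, so the
# output is the training text, verbatim.
word = words[0]
output = [word]
while word in transitions:
    word = random.choice(transitions[word])
    output.append(word)

print(" ".join(output))  # the writer opened her notebook and began ...
```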

A similar legal issue was raised during the early days of online music piracy. If Sony sells me a license to the song "Toxic" by Britney Spears, otherwise known as buying music digitally, what are they actually licensing to me? Is it the literal numbers that represent the MP3 file in storage? If so, what if I re-encode the file using different settings or for a different file format like WAV? The numbers that represent the file itself may not have any substantial overlap with the numbers of the MP3 they licensed to me. But if I play it back, my human ears can't tell the difference.

I can't cite a court case, but that legal theory was shot down, of course. I'd bet it's only a matter of time before the legal meaning of fair use is changed to exclude machine learning training data. I'm still not a lawyer, though.

What did Prosecraft do?

Now that you have a better understanding of the technology that has everyone concerned, let's look at Prosecraft, the website that was shut down due to artificial intelligence concerns.

For a given book, Prosecraft would tell you how many words were in it. From there, it would give percentages for things like how many of the book's words were adverbs and, of those adverbs, how many ended in -ly.

So far, this is purely in the realm known as natural language processing, or NLP. NLP is a broad field concerned mostly with linguistic statistical analysis aided by computer science. Nothing we've seen Prosecraft do so far has touched anything related to machine learning.

If I wanted to replicate the abilities of Prosecraft we've discussed so far, I could do so by making a list of known adverbs from a dictionary. Then my theoretical computer program would process a book's text word by word. It would only have to keep track of how many words it processed, how many of them appear in my list of adverbs, and how many of the adverbs it finds end in -ly. There's no training, no learning, and no retention of the book's text after the analysis is complete.
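Here's roughly what that program could look like. The adverb list is a made-up stand-in for a proper dictionary-derived one:

```python
import string

# Hypothetical adverb list; a real tool would build this from a dictionary.
KNOWN_ADVERBS = {"quickly", "quietly", "very", "really", "almost", "suddenly"}

def adverb_stats(text):
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    adverbs = [w for w in words if w in KNOWN_ADVERBS]
    ly_adverbs = [w for w in adverbs if w.endswith("ly")]
    # Only the counts survive; the book's text isn't stored or "learned" from.
    return {
        "words": len(words),
        "adverb_pct": 100 * len(adverbs) / len(words),
        "ly_adverb_pct": 100 * len(ly_adverbs) / max(len(adverbs), 1),
    }

print(adverb_stats("She quickly and quietly closed the very old door."))
```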

Prosecraft also offered analysis of a book's vividness versus passiveness. It provided percentages for each, as well as out-of-context excerpts of what it termed the book's most vivid page and most passive page, color-coding the words it considered vivid or passive within those excerpts.

The analysis of vividness or passiveness gets a little murky because those can be considered subjective measures. My understanding is that this worked via sentiment analysis.

What is sentiment analysis?

Sentiment analysis is a subfield within natural language processing that seeks to automatically identify emotional meaning behind words. The sentence "Jeff Bezos was born in 1964" is purely factual and doesn't really have any sentiment behind it. The sentence "Jeff Bezos is disgusting and I hate him" has a very strong sentiment behind it.

But language is tricky. If I said "Jeff Bezos is not a great guy and I don't love him" you understand this is still a negative sentiment. But a simple computer algorithm might just see "Jeff Bezos", "great guy", and "love him" and come to the wrong conclusion. It's this fuzziness that has led people to employ machine learning techniques for sentiment analysis.
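Here's a deliberately naive scorer that makes the problem obvious. The word lists are made up for illustration:

```python
import string

# Made-up sentiment word lists.
POSITIVE = {"great", "love", "wonderful"}
NEGATIVE = {"disgusting", "hate", "awful"}

def naive_sentiment(sentence):
    score = 0
    for word in sentence.lower().split():
        word = word.strip(string.punctuation)
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score

print(naive_sentiment("Jeff Bezos is disgusting and I hate him"))
# -2: correctly negative
print(naive_sentiment("Jeff Bezos is not a great guy and I don't love him"))
# +2: scored as positive, which is exactly the wrong conclusion
```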

Prosecraft used sentiment analysis. I don't know which tools or algorithms were used, or how they were used. But my best guess is that they were used to classify words and phrases as "vivid" or "passive voice".

It's important to note that the books analyzed on Prosecraft were not used to train any algorithms to detect these characteristics.

Wrapping up

I hope I've made it clear that I understand the outrage and panic about AI. Don't believe the hype about artificial general intelligence. The more realistic threat from machine learning is its potential to devalue art and the people who make it.

However, I agree that Prosecraft was unfairly targeted. What Prosecraft did was provide statistical tools that could help writers and readers. Criticizing Prosecraft would be like criticizing spell check and grammar check tools, once considered AI in their own right!

Fred Rogers said, "I went into television because I hated it so." I read that recently and I was floored! I never thought I'd hear Mr. Rogers say he hated anything. But he had the right attitude: when you're strongly repelled by something, it's important to learn as much as you can about it. I hope this post has helped someone learn a little more about something they despise. Just like television, machine learning is a tool. It has the potential to empower us, but it has a lot of potential for misuse as well. I want to make sure we know what we're railing against, because there's a real threat of collateral damage from modern-day Luddites.

  1. I think this is an important point about things like ChatGPT as well. We're often told to use tools like that cautiously and skeptically. We should be skeptical of what ChatGPT tells us because it might not be telling the truth. (Not lying per se, as that implies intent.) But we should be cautious about what we tell ChatGPT, not because it might learn our personal details and tell them to someone else, but because we're throwing data onto someone else's computer. ChatGPT isn't the risk. Hackers, humans, are the risk.

#ai #machine learning