Artist Bob Ross smiling in front of a terrible painting of Beast Jesus.
"I would have been the greatest artist ever,
if I could just remember how many fingers humans have."

Ask ChatGPT a simple question: How many r’s are there in ‘cranberry’? It often answers 2 instead of 3.

ChatGPT can generate huge productivity gains, but can also generate huge reputational and career risks when it makes embarrassing mistakes. Users should understand when and why it might mess up. Gather round if you want to dodge a ChatGPT disaster!

There are plenty more simple prompts like this that you might expect to work but that don’t.

Here are prompts that work well:

  • Write a limerick about hedge funds where the first letter of each line spells ‘hedge’. ChatGPT is incredibly fluent with language understanding and generation, including complex nuances and connotations. There’s a reason it’s called a ‘language model’.

  • Ask ChatGPT about a specific fact on a random Wikipedia page from WikiRoulette. It will probably be correct if asked in a straightforward way, because it was trained on Wikipedia. But it can also make things up if you stray from questions it has a well-grounded basis to answer, whether from training or from the prompt. Also, a small variation may make it fail: if it knows Tom Cruise’s mother is Mary Lee Pfeiffer, it might not know who Mary Lee Pfeiffer’s son is.

  • “What is this news story about? Classify it as belonging to national, local, business, politics, tech, sports.” Very good: classification tasks involving deep text understanding. Not as good: classifying over a very large list of classes, or doing regression to predict a numerical value, or unsupervised learning like clustering. ChatGPT’s training has nothing to do with these tasks.

ChatGPT is a poet, not a quant. It’s a wordcel, not a shape rotator. It understands ‘rhetorical logic’ but not causal reasoning. Its understanding of the world is skin deep, notwithstanding its sophisticated, even beautiful language. It knows what it was told, and it does not have the ability to answer questions from first principles or data. Some demos of it performing complex tasks may be cherry-picked, overhyped, or due to training data contamination with similar examples, and there may be less there than meets the eye.

Endowing a computer system with a gift for language and art and creativity somehow makes it less reliable and conscientious. It ‘hallucinates’. When it doesn’t know, it makes up something that sounds plausible. It is a BS artist. It is trained to make labelers happy with plausible-sounding answers, to speak for effect and not underlying truth, which is the definition of BS.

So:

  • Use generative AI for looking up, summarizing, and explaining reliable texts in your firm’s knowledge base, or obtained online from knowledgeable authors and reliable sources.

  • Danger lurks if you ask it to follow a process, algorithm, or chain of logic on its own that requires more than a couple of steps. If it’s a task that would benefit from writing a program, use the code interpreter or advanced data analysis functionality (see the first sketch after this list).

  • Improve ChatGPT’s reasoning and reduce hallucinations by using advanced prompting patterns like reflection (prompt ChatGPT to explain or criticize its own thought process and improve it) and chain-of-thought (give it a thought process to follow). Chain-of-thought prompting helps ChatGPT get the ‘cranberry’ question correct (see the second sketch after this list). Use prompt helpers from the GPT Store to create and improve prompts. However, better prompts only go so far. Test your prompts systematically. If small variations change results dramatically during validation, beware; it will only get worse on unseen inputs in production.

  • Read Ethan Mollick. There is a ‘jagged frontier’ where ChatGPT performs superbly at some tasks but shockingly poorly at seemingly similar tasks that slightly exceed its grasp. Lose track of where you stand on the ‘jagged frontier’ at your peril.

  • Use ChatGPT as a super hard-working, quick, but not-very-bright assistant who needs specific and detailed instructions, and sometimes might still go off the rails. Supervise it closely, and give it foolproof tools like the code interpreter for tasks that fall outside its reach.

  • Trust nothing, and verify everything. Review anything generated with ChatGPT closely. Don’t be the person who pastes ‘as a large language model’ or ‘regenerate response’ into client-facing work product. (There used to be even more of these in Google Scholar but authors and publications have cleaned them up!)

  • Use generative AI when you need infinite indefatigable interns, even if they might be stoner interns. Use it especially for language tasks, for generating a first draft of anything creative, ideating, editing and proofreading.
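
To make the code-interpreter point concrete: for the ‘cranberry’ question, this is roughly the kind of throwaway program it can write and run, a minimal sketch that counts deterministically instead of guessing.

    # Count letters the way a program does: exactly, not from a hazy
    # statistical memory of text.
    word = "cranberry"
    print(word.count("r"))   # prints 3
    # The same idea generalizes: push counting, arithmetic, and multi-step
    # bookkeeping into code, and keep the language model for the language.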
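
And here is one illustrative way to wire the chain-of-thought pattern into the same question programmatically. This is only a sketch: it assumes the OpenAI Python client (openai >= 1.0) with an API key in the environment, and the model name and prompt wording are placeholder assumptions, not a tested recipe.

    # Sketch only: assumes `pip install openai` (>=1.0) and OPENAI_API_KEY set.
    from openai import OpenAI

    client = OpenAI()

    # Chain-of-thought prompt: give the model an explicit process to follow,
    # rather than asking for the answer in one shot.
    cot_prompt = (
        "Spell out the word 'cranberry' one letter at a time, numbering each letter. "
        "Then count how many of those letters are 'r' and state the total."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": cot_prompt}],
    )
    print(response.choices[0].message.content)

Prompting the model to enumerate the letters before counting them is chain-of-thought in miniature; when determinism really matters, the ‘just run code’ approach in the previous snippet is the safer bet.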

This is the most important chart on applying AI in your daily work:

Distribution of output quality scores across a set of consulting tasks, with and without AI. The blue group did not use AI; the green and red groups used AI; the red group got additional training on how to use AI.

This figure is from a randomized controlled trial of Boston Consulting Group consultants, who were asked to perform a set of consulting-related tasks. Most users obtained significant benefits from AI tools despite their limitations. For many tasks, language understanding alone is enough to unlock big gains.

But beware of the left tail of this distribution. Some users who got trained probably became overconfident and put too much stock in bad, possibly hallucinatory responses, and scored much worse. Don’t get conned by AI and dumped on the left side of this distribution.

I don’t know who needed to hear this, but I have come to realize not everyone understands ChatGPT’s limitations. Every day, I see news stories where foolish editors and reporters ask ChatGPT to perform tasks beyond its capabilities, like predicting the price of Apple stock. They are just using a modern-day version of reading tea leaves, or tarot cards, or the I Ching. Please, just stop! You are not only making fools of yourselves, but also teaching other people to be foolish.

The ability of computers to understand and generate human language is a huge advance. It makes computers great at tasks that experience tells us they shouldn’t be good at, like creating poetry and music and art and movies and code. This ability is completely orthogonal to everything computers did well until now. There is work in progress to better understand how LLMs work and how their limitations arise, and to resolve the impedance mismatch between woolly right-brain poet LLMs and hard-boiled left-brain Mr. Spock data stores and coders. As Sam Altman says, this is the dumbest AI you will ever have to use. But right now, ChatGPT is pretty terrible at some tasks we expect computers to be very good at.

Don’t become a bad example to others. Don’t fall victim to the Eliza Effect, where people mistake natural language algorithms for human-level understanding. It’s an easy mistake to make, because the only entity we ever encountered that could write as well as a human…was a human. So it’s jarring to see ChatGPT write an Elizabethan sonnet in a flash, and then stumble on problems a small child could solve.

If even Google engineers, who should know how to make reliable systems using unreliable components, got caught out by how much their Gemini LLM gets wrong, how careful do the rest of us have to be?

Move fast, don’t miss out on the opportunity for massive productivity gains from AI, but follow common-sense rules, and stay safe, my friends!

Further reading:

Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models. See the References section of this paper for work showing LLMs failing to reason, to understand causality, to understand that ‘a is b’ implies ‘b is a’. Many good results in problem-solving tests turn out to depend on memorizing training data similar (or even identical) to test data.

Sober AI is the Norm. Sober, boring AI: doing what you’re already good at, faster and better, like Copilot, RAG, summarization. Pie-in-the-sky, high-as-a-kite AI: expecting AI to do your job for you, like Devin. Even self-driving vehicles might fall into the latter category. They might be OK in a Disney parking lot or the Phoenix suburbs, but not in an adversarial environment like NYC or New Delhi, where people will just walk in front of a car that will definitely stop, or exploit hacks, or hijack driverless truck cargo. You need different infrastructure and legal frameworks. Hype, one-upmanship dynamics, and competitive accelerationism can get a little out of control, and people can release things that aren’t ready, or whose implications they haven’t fully thought through.