Artist Bob Ross smiling in front of a terrible painting of Beast Jesus.
"I would have been the greatest artist ever,
if I could just remember how many fingers humans have."

Ask ChatGPT a simple question: how many r’s are there in ‘cranberry’? It often answers 2 instead of 3.

ChatGPT can generate huge productivity gains, but can also generate huge reputational and career risks when it makes embarrassing mistakes. Users should understand when and why it might mess up. Gather round if you want to dodge a ChatGPT disaster!

Here are more simple prompts that you might expect to work but do not:

Here are prompts that work well:

  • Write a limerick about hedge funds where the first letter of each line spells ‘hedge’. It’s incredibly fluent with language understanding and generation, including complex nuances and connotations. There’s a reason it’s called a ‘language model’.

  • Ask ChatGPT about a specific fact on a random Wikipedia page from WikiRoulette. It will probably be correct if asked in a straightforward way, because it was trained on Wikipedia. But it can also make stuff up if you stray from questions where it has a well-grounded basis to answer, either from training or from the prompt. Also, a small variation may trip it up: if it knows Tom Cruise’s mother is Mary Lee Pfeiffer, it might not know who Mary Lee Pfeiffer’s son is.

  • “What is this news story about? Classify it as one of: national, local, business, politics, tech, or sports.” Very good: classification tasks involving deep text understanding. Not as good: classifying over a very large list of classes, doing regression to predict a numerical value, or unsupervised learning like clustering. ChatGPT’s training has nothing to do with these tasks. (A minimal sketch of this kind of classification prompt follows.)
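For illustration, a prompt like that can also be sent programmatically. Here is a minimal sketch using the openai Python client; it assumes the openai package is installed, an OPENAI_API_KEY is set in the environment, and a model name like gpt-4o-mini, and the category list and story text are placeholders.

```python
# Minimal sketch of the news-classification prompt above, assuming the openai
# Python client, an OPENAI_API_KEY in the environment, and a model like
# gpt-4o-mini (placeholder). The category list and story text are placeholders too.
from openai import OpenAI

client = OpenAI()

CATEGORIES = ["national", "local", "business", "politics", "tech", "sports"]

def classify_story(story_text: str) -> str:
    """Ask the model to label a news story with exactly one category."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,        # keep the output as deterministic as possible
        messages=[
            {
                "role": "system",
                "content": "You are a news classifier. Reply with exactly one "
                           f"word from this list: {', '.join(CATEGORIES)}.",
            },
            {"role": "user", "content": story_text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(classify_story("The city council approved a new budget for road repairs."))
```

Pinning the temperature at 0 and constraining the reply to one word from the list makes the output easier to validate downstream.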

ChatGPT is a poet, not a quant. It’s a wordcel, not a shape rotator. It understands ‘rhetorical logic’ but not causal reasoning. Its understanding of the world is skin deep, notwithstanding sophisticated, even beautiful language. It knows what it was told and does not have any ability to answer questions from first principles and data. Some demos of it performing complex tasks may be cherry-picked or overhyped or due to training data contamination with similar examples, and there may be less there than meets the eye.

Endowing a computer system with a gift for language and art and creativity somehow makes it less reliable and conscientious. It ‘hallucinates’. When it doesn’t know, it makes up something that sounds plausible. It is a BS artist. It is trained to make labelers happy with plausible-sounding answers, to speak for effect and not underlying truth, which is the definition of BS.

So:

  • Use generative AI for looking up, summarizing, and explaining reliable texts in your firm’s knowledge base, or obtained online from knowledgeable authors and reliable sources. Ensure responses stay well-grounded by specifically instructing the AI to use only information from the sources you provide (a minimal sketch of such a grounding instruction follows this list).

  • Danger lurks if you ask it to follow a process, or algorithm, or chain of logic on its own that requires more than a couple of steps. If it’s a task that would benefit from writing a program, use the code interpreter or advanced data analysis functionality.

  • Improve ChatGPT’s reasoning and reduce hallucinations by using advanced prompting patterns like reflection (prompt ChatGPT to explain or criticize its own thought process and improve it) and chain-of-thought (give it a step-by-step thought process to follow and use to structure a response). Chain-of-thought prompting helps ChatGPT get the ‘cranberry’ question right (see the chain-of-thought sketch after this list). Use prompt helpers from the GPT Store to create and improve prompts. But clever prompts only go so far. Test your prompts systematically. If small variations change results dramatically during validation, beware: it will only get worse on unseen inputs in production.

  • For retrieval tasks, use ChatGPT as a super hard-working, quick, but not-very-bright assistant who needs specific and detailed instructions, and sometimes might go off the rails anyway. Ask it to read a lot of text, and to retrieve or summarize specific things in a specific way. Make sure the questions you ask can be answered directly by the text. Make sure the task passes the ‘intern test’: something you could explain to an untrained, unreliable assistant who, if they just worked hard at it for long enough with the knowledge base you provide, would get it right. Use generative AI when you need infinite, indefatigable interns, even if they might be stoner interns.

  • For creative tasks, use AI to generate a first draft of anything, and for ideating, editing, and proofreading. But rely on it for no more than a first draft you can iterate on. Trust nothing, and verify everything. Ask it to provide sources, and check the sources. Do not be that person who cuts and pastes ‘as a large language model’ or ‘regenerate response’ into client-facing work product. (There used to be even more of these in Google Scholar, but authors and publications have cleaned them up!) Review anything generated with ChatGPT closely, and use it as an assistant and copilot, not as a primary or even secondary source. (Also review it for style, and edit it to say things you would say: eliminate ChatGPT’s tendency toward throat-clearing, roundabout participial phrases, and unusual words like ‘delve’.)

  • Read Ethan Mollick. There is a ‘jagged frontier’ where ChatGPT performs superbly at some tasks but shockingly poorly at seemingly similar tasks that slightly exceed its grasp. Lose track of where you stand on the ‘jagged frontier’ at your peril. Supervise AI closely, and give it foolproof tools like the code interpreter for tasks that fall outside its reach.
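Here is a minimal sketch of the grounding instruction mentioned above, again assuming the openai Python client, an OPENAI_API_KEY in the environment, and a model name like gpt-4o-mini; the source text, system wording, and question are placeholders to adapt to your own knowledge base.

```python
# Minimal sketch of a grounding instruction: answer only from the provided source.
# Assumes the openai Python client, an OPENAI_API_KEY in the environment, and a
# model like gpt-4o-mini (placeholder). The source text and question are placeholders.
from openai import OpenAI

client = OpenAI()

SOURCE_TEXT = """
[Paste the relevant excerpt from your firm's knowledge base here.]
"""

def grounded_answer(question: str) -> str:
    """Answer a question using only the supplied source text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Answer using ONLY the source text supplied by the user. "
                           "If the answer is not in the source, reply exactly: "
                           "'Not found in the provided source.' Do not use outside knowledge.",
            },
            {
                "role": "user",
                "content": f"Source:\n{SOURCE_TEXT}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content

print(grounded_answer("What does the source say about the refund policy?"))
```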
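And here is a minimal chain-of-thought sketch for the ‘cranberry’ question, under the same assumptions; the exact step-by-step wording is illustrative rather than a guaranteed recipe.

```python
# Minimal chain-of-thought sketch for the 'cranberry' question, assuming the
# openai Python client, an OPENAI_API_KEY in the environment, and a model like
# gpt-4o-mini (placeholder). The step-by-step wording is illustrative.
from openai import OpenAI

client = OpenAI()

prompt = (
    "How many r's are there in 'cranberry'?\n"
    "Think step by step: write out the word one letter at a time, "
    "mark each letter that is an 'r', count the marks, "
    "and only then state the final answer."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    temperature=0,
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```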

This is the most important chart on applying AI in your daily work:

Figure: distribution of output quality scores on a set of consulting tasks, with and without AI. The blue group did not use AI; the green and red groups used AI; the red group received additional training on how to use AI.

This figure is from a randomized controlled trial of Boston Consulting Group consultants, who were asked to perform a set of consulting-related tasks. Most users obtained significant benefits from AI tools despite their limitations. For many tasks, language understanding is enough to unlock big gains.

But beware of the left tail of this distribution. Some users who got the training probably became overconfident, put too much stock in bad, possibly hallucinated responses, and scored much worse. Don’t get conned by AI and dumped on the left side of this distribution.

I don’t know who needed to hear this. But every day I see posts by people who are oblivious to ChatGPT’s limitations and are surprised by terrible answers to counting or logic problems. Every day, I see news stories where foolish editors and reporters ask ChatGPT to perform tasks beyond its capabilities, like predicting the price of Apple stock. Please, don’t do this. These people are not only making fools of themselves, but also teaching other people to be foolish. What they are doing is a modern-day version of reading tea leaves, or tarot cards, or the I Ching. People need to get real about what ChatGPT can and cannot do.

The ability of computers to understand and generate human language is an incredible advance. It makes computers great at tasks which experience tells us they shouldn’t be good at, like creating poetry and music and art and movies and code. This ability is completely orthogonal to everything computers did well until now. There is work in progress to better understand how LLMs work and how their limitations arise, and to resolve the impedance mismatch between woolly right-brain poet LLMs and hard-boiled left-brain Mr. Spock data stores and coders. As Sam Altman says, this is the dumbest AI you will ever have to use. But right now, ChatGPT is really bad at some tasks we normally expect computers to be really good at.

Don’t become a bad example to others. Don’t fall victim to the Eliza Effect, where people mistake natural language algorithms for human-level understanding. It’s an easy mistake to make, because the only entity we ever encountered that could write as well as a human…was a human. So it’s jarring to see ChatGPT write an Elizabethan sonnet in a flash, and then stumble on problems a small child could solve.

If even Google engineers, who should know how to make reliable systems using unreliable components, got caught out by how much their Gemini LLM gets wrong, how careful do the rest of us have to be?

Move fast, don’t miss out on the opportunity for massive productivity gains from AI, but follow common-sense rules, and stay safe, my friends!

Further reading:

Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models. See the References section of this paper for work showing LLMs failing to reason, to understand causality, to understand that ‘a is b’ implies ‘b is a’. Many good results in problem-solving tests turn out to depend on memorizing training data similar (or even identical) to test data.

Sober AI is the Norm. Sober, boring AI: doing what you’re already good at, faster and better, like Copilot, RAG, summarization. Pie-in-the-sky, high-as-a-kite AI: expecting AI to do your job for you, like Devin. Even self-driving vehicles might fall into the latter category. They might be OK in a Disney parking lot or Phoenix suburbs, but not in an adversarial environment like NYC or New Delhi, where people will just walk in front of a car that will definitely stop, or exploit hacks, or hijack driverless truck cargo. You need different infrastructure and legal frameworks. Hype, one-upmanship dynamics, and competitive accelerationism can get a little out of control, and people can release things that aren’t ready or whose implications they haven’t thought through.