AI chatbots don't know why they did it
They can't remember what they were thinking, so they make something up
Chatting with an AI seems like chatting with a person, so it’s natural to ask an AI chatbot to explain its answers. But it doesn’t work how you’d expect. An AI chatbot doesn't know how it decided what to write, so its explanation can only match its previous reasoning by coincidence.1
How can I be sure of this? Based on how the API works, we know that a chatbot has no short-term memory.2 The only thing a chatbot remembers is what it wrote. It forgets its thought process immediately.
Yes, this is weird. If someone asked you why you did something yesterday, it’s understandable that you might not remember how you decided to do it. But if it were something you did a minute ago, you probably have some idea what you were thinking at the time. Forgetting your own thoughts so quickly might occasionally happen (“senior moment”), but having no short-term memory at all is a severe mental deficiency.
But it seems to work. What happens when you ask a chatbot to explain its answer?
It will make up a plausible justification. It will try to justify its answer the way a human would,3 because it’s trained on writing justifications like people. Its justification will have little to do with how the language model really chooses what to write.4 It has no mechanism or training to do real introspection.
Often, chatbots are able to give reasonable justifications in reasonable situations, when there are likely similar justifications in their training set. But if you ask it to explain an off-the-wall answer, it may very well say something surreal.
How should this be fixed? The bug isn’t giving a bad justification, it’s trying to answer the question at all. Users will be misled by their intuitions about how people think.
OpenAI has trained ChatGPT to refuse to answer some kinds of questions. It could refuse these questions too. The chatbot could say "I don't remember how I decided that" whenever someone asks it for a justification, so that people learn what’s going on. This might seem like unhelpful evasion, but it’s literally true!
Maybe it could speculate anyway? “I don’t remember how I decided that, but a possible justification is…”
We’re on firmer ground when a chatbot explains its reasoning and then answers the question. Chain-of-thought reasoning improves benchmark scores, indicating that chatbots often do use their previous reasoning to decide on their answers. But we don't really know how LLM's think so there is no guarantee.5
Sometimes ChatGPT will use chain-of-thought reasoning its own. If it doesn’t, you can ask it to do that in a new chat session. (This avoids any possibility that it’s biased by its previous answer.)
By “reasoning” or “thought process,” I mean how a chatbot decides what to write. The details are mysterious, but hopefully researchers will figure it out someday.
ChatGPT’s API is stateless and requires the caller to pass in previous chat history. I wrote about some consequences of this before. It’s possible that the interactive version of ChatGPT works differently, but I think that’s unlikely and it would probably be discovered.
Specifically, it answers the way the character it’s pretending to be would.
Possibly some deductions might be done the same way as previously, but not the entire calculation? See discussion.
If you are using the API, you can even alter the discussion and write your own made-up response as the bot's one, and then ask it to explain why it said that.
The bot would do it's best to justify what you literally put in it's mouth, having no way of knowing that it wasn't actually it's own response.