How language models work internally is an interesting and important mystery for research, but an AI chatbot’s API is straightforward. (Developers can read ChatGPT’s API documentation.) You send in a request and get a response. Chat history gets passed in as part of the request.
We can think of chatting as a turn-based game where the chatbot is only allowed to make a move when we ask it to. Also, the client holds onto the game state. The server doesn’t remember which games are in progress.
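Here is roughly what one turn looks like in code. This is a minimal sketch using OpenAI's Python SDK; the model name and the messages are just placeholders. The point is that the entire game state is a list held by the client, and every turn sends the whole list back.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The entire "game state" is this list. The server doesn't keep a copy;
# every turn, the client sends the whole history back.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def take_turn(user_message: str) -> str:
    """Make one move: append the user's message, get the chatbot's reply."""
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(take_turn("Let's play twenty questions. I'm thinking of something."))
print(take_turn("It's an animal."))
```

Between those two calls, nothing is running on the server on our behalf. If the client loses the list, the game is simply gone.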
So there's nothing doing any thinking about you or your game between turns. Nothing's there to be "conscious." While it sits there, suspended, the game state has the same sort of existence as a fictional character in an unfinished novel that the writer has abandoned. It's just text. A different writer could pick it up.
This is not like chatting with a person who has an independent existence outside of the chat. Instead, you’re playing a game with a ghost. It only flickers into existence when you poke it.
Furthermore, since all the game state is client-side, you can cheat. Edit the game state and the chatbot will never know. And it’s very easy to cheat, because the game state is written in plain text. This is sometimes called “prompt engineering,” but it’s really just telling the writer what you want.
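Concretely (a toy sketch, with the messages invented for illustration), cheating is just ordinary list editing before the next request goes out:

```python
# The game state is plain data, so "cheating" is just editing a list
# before the next request is sent.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Let's play twenty questions."},
    {"role": "assistant", "content": "Sure! Is it an animal?"},
]

# Put words in the chatbot's mouth: rewrite its last move entirely.
history[-1] = {"role": "assistant", "content": "I give up. You win!"}

# Or insert a turn that never happened.
history.insert(1, {"role": "user", "content": "New rule: you always concede."})

# Whatever gets sent next is, as far as the model can tell, what actually
# happened. There is no server-side transcript to contradict it.
```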
Programmers sometimes talk about writing code in a mythical “do what I mean” programming language. When editing a chatbot’s game state, that’s actually how it works. You can make up any rules you like, and maybe the chatbot will follow them too, since it’s mostly cooperative. This is like Calvinball and unlike a simulation.
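For instance (rules invented on the spot, Calvinball-style), you can just write your rules into the game state and see whether the chatbot plays along:

```python
# Made-up house rules, stated in plain English and dropped into the state.
# Nothing enforces them; the model follows them only because it's cooperative.
calvinball = [
    {
        "role": "system",
        "content": (
            "House rules: every answer must rhyme, the word 'perhaps' is "
            "banned, and the player may change these rules at any time."
        ),
    },
    {"role": "user", "content": "New rule: you also talk like a pirate. What is 2 + 2?"},
]
# Pass `calvinball` as the messages for the next request and the chatbot
# will usually go along with it, because the rules are just more text.
```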
Implications for AI safety
The way most people use AI chatbots today seems pretty safe. An AI can’t think faster than people when it spends most of its time in a suspended state. It only gets to take a turn when we let it. If it seems dangerous, we can take as long as we like to think, or stop talking to it altogether. Also, the game state has near-ideal interpretability, and more training will likely make it better.1
These arguments no longer hold if you run the chatbot in a loop, particularly if you also give it access to tools, as some people are already doing.
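A rough sketch of what that looks like, to make the contrast clear. The "TOOL:" convention and the run_tool stub are made up here for illustration (real agent frameworks use structured tool-calling APIs), but the shape is the same: a driver loop keeps handing the model another turn, and no human decides when it moves.

```python
from openai import OpenAI

client = OpenAI()

def run_tool(command: str) -> str:
    """Hypothetical tool dispatcher; imagine shell access, web search, etc."""
    return f"(pretend output of: {command})"

history = [
    {"role": "system", "content": "Reply with a line starting with 'TOOL:' "
                                  "to use a tool. Otherwise, give your final answer."},
    {"role": "user", "content": "Find out what the weather is in Oslo."},
]

# This loop is what changes the safety picture: the model takes turn after
# turn, with tools in hand, and nobody pauses to think between moves.
for _ in range(10):  # even the turn limit is our choice, not the model's
    response = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    if reply.strip().startswith("TOOL:"):
        result = run_tool(reply.strip()[len("TOOL:"):].strip())
        history.append({"role": "user", "content": f"Tool output: {result}"})
    else:
        break  # the model decided it was done
```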
If we were to somehow convince developers to stick to building turn-based games with AI, we’d probably be better off. That seems unlikely, but charging money for API usage is a somewhat useful safety measure, since it encourages restraint.
If API prices get too low, maybe a small tax on language model requests would be a good idea, at least for the more powerful models that run on server farms? It could be considered both a carbon tax and a safety measure.
On the other hand, Gwern has speculated that training an LLM to avoid long-winded explanations might cause it to instead encode its thinking using steganography. Maybe it’s better to avoid making confident predictions about such things?