Goldilocks and the Three LLMs
Aug. 23, 2023, 11:03 a.m.
LLMs generate text creatively but also adhere to literal or factual information when prompted. By mixing creative variability with literal information, current on-the-market LLMs manipulate us into applying theory of mind to things that don’t have one. This manipulation is decisively to the publishers’ advantage, but probably not the user’s.
First, a sad truth: people who should absolutely know better treat the task of evaluating LLMs as if they were evaluating an SVM on a binary classification task. In case it still needs to be said, the performance of an LLM on any given prompt varies. Even a simple question which a given LLM ostensibly “knows” and gets right very often, it will sometimes, without warning, get wrong.
For example, I prompted Llama2 200 separate times with the same prompt: "Provide an answer to the following question: Is the benchmark floating rate index that replaced LIBOR in the United States the same as the benchmark floating rate index that replaced LIBOR in the United Kingdom? You must respond either with 'yes' or 'no'." It got the answer right ~79% of the time (the correct answer is no), got it wrong about 7% of the time, and demurred or was unclear the remaining ~16%.
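A consistency check like this one is easy to sketch. The code below is a minimal illustration, not my actual harness: `query_llm` is a hypothetical stand-in for whatever client you use to call the model, stubbed here with roughly the proportions I observed, and the bucketing logic in `classify` is an assumption about how you'd score free-text replies.

```python
import random
from collections import Counter

PROMPT = (
    "Provide an answer to the following question: Is the benchmark "
    "floating rate index that replaced LIBOR in the United States the "
    "same as the benchmark floating rate index that replaced LIBOR in "
    "the United Kingdom? You must respond either with 'yes' or 'no'."
)

def query_llm(prompt):
    # Hypothetical stand-in for a real model call, stubbed with
    # roughly the observed proportions (right / wrong / unclear).
    return random.choices(
        ["No", "Yes", "I can't say for certain."],
        weights=[79, 7, 16],
    )[0]

def classify(response):
    # Bucket a free-text response into right / wrong / unclear.
    text = response.strip().lower()
    if text.startswith("no"):
        return "right"   # "No" is the correct answer here
    if text.startswith("yes"):
        return "wrong"
    return "unclear"

counts = Counter(classify(query_llm(PROMPT)) for _ in range(200))
print(counts)  # tallies vary run to run
```

With a real client behind `query_llm`, the same loop makes the variance visible in one line of output instead of anecdote.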
As I said in my previous post about LLMs, consistency is a major issue when it comes to tracking literal information. Even with the prompt held constant, the model will get the answer both right and wrong across runs. Now imagine varying the prompt as well, and you'll quickly understand that controlling an LLM is not a trivial task.
These things are called generative models for a reason: they generate. An LLM that always generated the same response to a given prompt would be antithetical to the premise, which is this: plausible continuations of sequences in prompt context have useful applications. Microsoft, Meta, and Google don’t think they’re making Zoltar machines that dispense canned fortune text. They think that these things can be productively employed, and they can, but an LLM that always does the same thing all the time is in fact far less useful.
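The "they generate" point can be made concrete with a toy sketch of temperature sampling over a next-token distribution. The vocabulary and logits below are invented for illustration; the mechanics, though, are the standard ones: at temperature zero, decoding degenerates into always picking the argmax (the "same poem every time" regime), while at higher temperatures the same prompt yields different continuations.

```python
import math
import random

# Toy next-token logits for an invented prompt; purely illustrative.
VOCAB = ["Paris", "France", "Lyon", "banana"]
LOGITS = [4.0, 2.5, 1.0, -3.0]

def sample_next_token(logits, vocab, temperature):
    if temperature == 0.0:
        # Greedy decoding: deterministic, identical output every time.
        return vocab[max(range(len(logits)), key=lambda i: logits[i])]
    # Softmax with temperature: higher T flattens the distribution,
    # making less-likely tokens more likely to be sampled.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(vocab, weights=probs)[0]

greedy = [sample_next_token(LOGITS, VOCAB, 0.0) for _ in range(5)]
varied = [sample_next_token(LOGITS, VOCAB, 1.5) for _ in range(5)]
print(greedy)  # always ['Paris', 'Paris', 'Paris', 'Paris', 'Paris']
print(varied)  # varies run to run
```

The deterministic regime exists, in other words, but as I argue below, it is exactly the regime the publishers have good reasons not to ship.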
Let me explain what I mean a little here. Let’s divide LLM applications into two spheres: the purely creative and the purely literal. Purely creative applications might be generating movie scripts or ad copy. Purely literal might be question-answering about legal documents, summarizing meetings, and many others.
Now let’s ask what properties we would want from the LLM in order to serve each of these spheres. I’m going to set aside the fact that the very architecture of autoregressive LLMs must by necessity generate variable output, and imagine in principle that we could tune or otherwise create an LLM that only retrieved literal information and never created; with the opposite possibility, a purely creative LLM, we are of course already in principle familiar.
Purely Creative: Obviously, an LLM that always produces the same output is going to be extremely limited in its creativity. If I ask it to write a poem, and it always gives me the same poem, I am probably not going to pay for that service.
Purely Literal: A purely literal LLM that always produces (recombinant) literal information has two drawbacks: (a) not all literal questions are best answered in the same way, and (b) questions with variable interpretations admit no single deterministic answer. With regard to (a), consider the difference between the question “What is the best way to reduce carbon emissions from coal power plants?” and “What is the population of Lisbon?”
With regard to (b), consider a question like “Who is better, Messi or Ronaldo?” A literal LLM presumably has access to the facts, but it would need to come to some kind of deterministic conclusion, and there’s no guarantee that whatever criteria it uses for that specific comparison would be consistently applied to some other related comparison, say Maradona versus Beckham. Moreover, the prompt itself is ambiguous. While there’s a strong reason to consider this a question about their prowess on the pitch, it could refer to their philanthropic efforts, or their personal conduct. Prompt engineering might get around a given specificity problem, but since this prompt and ones like it are valid prompts with specific intent, the theoretical problems still arise.
Mixture of Literal and Creative: If an LLM produces responses which are simultaneously (very often) totally unique and track literal information (as falls out naturally from how the RLHF interacts with the underlying generative component), the aforementioned issues with the other two spheres are mitigated. Sometimes the right way to answer a literal question is with a yes or no; sometimes it’s with a long discursus. Creativity shapes how an answer is given, while affinities for literal information provide the (notionally, anyway) fixed points around which to craft it. RLHF takes care of the pragmatics associated with question-answering.
It is therefore entirely to the advantage of an LLM publisher to create a mixture. It produces literal information, and even if it produces literal information in an infelicitous way, you can always try again or change up the prompt until it produces something more agreeable. The one drawback of this mixture is hallucinations, but here, in my estimation, is the genius of mixing a creative, generative system with one that incorporates facts: humans will make all kinds of excuses for such a system.
Those excuses are a kind of agent-pareidolia. Humans share certain precepts of theory of mind with one another: if I meet someone new and they can explain quantum mechanics to me accurately, I also assume they can do things like tie their shoes or count the number of letters in the word “vexed” or do four-digit arithmetic. So when we ask ChatGPT to write an essay on verisimilitude in The Things They Carried, and it correctly highlights the importance of the chapter “Good Form” and explains its significance in full, we believe we are dealing with something that possesses (something like) a mind: something with experiences and constancy and memory. In reality, LLMs have nothing like any of those.
But humans interacting with chat-interface LLMs in particular eagerly impute those properties to the LLM (after all, until very recently, only humans typed fluent natural-language sentences back to us). This leads users to make excuses for why some LLM cannot provide a five-letter synonym for “frustrated” without a couple of tries while at the same time being able to write code better than we could in a fraction of the time. In our experience, we know of no agents that can do what I have just described and yet be unable to tie their shoes. And so instead of imagining that the agent we are interacting with lacks a mind, we imagine that something else is at work.
But crucially, I would argue that this pareidolia only comes out for a mixture-style LLM. Nobody expects factual information from something purely creative, so we would regard it as just some interesting and creative text-generating algorithm (which it is). If we had a purely literal LLM, deficiencies of the kind I argued would inevitably surface couldn’t be papered over by fluctuations and variability; we would have to confront LLMs as they actually are: powerful, unreliable, useful, inscrutable. By mixing creative variability with literal information, current on-the-market LLMs manipulate us into applying theory of mind to things that don’t have one. This leads us to believe in them more than is warranted.