LLMs: Mystery, Misconceptions, Love and Hate

Space Age vs. accounting statistics

I often hear that "LLMs are nothing but statistical engines, which happen to be trained on a huge corpus of text". Technically this is, to a great extent, a correct statement, except for the tiny "nothing but" that crept into it. The very fact that LLMs are statistics-based is sometimes frowned upon: it supposedly makes them illegitimate contenders for being considered an alternative intelligence rivaling human intelligence, with its inherent planning and reasoning capabilities.

This trivialization of statistics' role in LLM potency is very much misleading. For one, statistics as a foundation for accounting is quite different from the stochastic statistics used to explain complex physical systems, for example in astrophysics and quantum mechanics. Unlike accounting, LLMs use sophisticated statistical-physics methods to capture and express the intricacies of latent relationships between billions of factors of reasoning that are otherwise unattainable by humans.

I have elaborated on why LLMs go way beyond trivial use of statistics in my AI Intuition article and the series that followed it in Why. Here, though, I would like to go over some important sources of misconceptions about LLMs, dispel them, and further emphasize key points articulated in my previous articles, in light of persisting doubts about LLM value and potency.

Hopefully you will be more confident exploring AI after reading this post.

Linguistics vs. Physical system approach

One of the main reasons people doubt LLMs stems from the preconception that LLMs are just n-grams on steroids, a merely incremental evolution of statistical mechanisms. To appreciate the difference, let's quickly see how n-grams work. For starters, you indeed analyze billions of bytes of text and count frequencies: "what is the likelihood of the n-th word given the preceding sequence of n-1 words?". And, voila, the word "sandwich" is recommended to follow the prompt "I eat…": "I eat a sandwich". There is not much reasoning involved in that conclusion, because any context beyond the n-1 preceding words is not captured when deciding on the most likely prediction for the n-th word. "Really fast" could just as well have been recommended. So why did we end up with "I eat a sandwich" instead of "I eat really fast"? The context is definitely missing: is the focus of the prompt on food or on one's eating habits?

That's how n-grams work – there is some rudimentary statistics involved, but not really reasoning.

Side note: modern n-gram implementations are a bit more sophisticated than described above, but the essence is what you just read.

In all fairness, modern n-gram implementations for next-word prediction use more sophisticated statistical means, like Markov chains and hidden Markov models. But this does not make them sufficiently context-aware to claim any substantial presence of reasoning.
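To make this concrete, here is a minimal bigram (n = 2) sketch in Python; the toy corpus and the counting scheme are invented for illustration and are far simpler than any production n-gram model. The point is that the prediction is driven purely by frequency counts over the immediately preceding word, with no awareness of the wider context.

from collections import Counter, defaultdict

# Toy corpus; a real n-gram model would be trained on billions of words.
corpus = "i eat a sandwich . i eat a salad . i eat really fast . i eat a sandwich".split()

# Count bigram frequencies: follower counts for each preceding word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word` -- no wider context is used."""
    followers = bigrams[word]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("eat"))  # 'a' -- chosen purely by count, context-blind
print(predict_next("a"))    # 'sandwich'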

LLMs, unlike n-grams, do reason about next-word prediction because they are intimately aware of the context the prediction has to occur in. How do they reason? By:

  1.  Capturing and codifying knowledge properly
  2.  Building a sophisticated roadmap for reasoning
  3.  Making decisions by following the roadmap

And here lies the source of confusion and many misconceptions:

  • LLMs are compared to, and mused about as, some linguistics-laden, human-like cognitive machinery. Instead, they are complex physical systems, subject to universal physical laws and methodologies. In all, cutting-edge advancements in Information, Probability and Manifold theories underlie the manifestation of the LLMs' magic.

In fact, only the very early tokenization phase on the input (where sentences are broken into words, which are broken into parts of words and indexed as numbers in a vocabulary) and the translation of numeric decisions back into spoken words on the output involve some knowledge of linguistics. Apart from that, LLMs are complex physical (not linguistic!) systems that are agnostic to any particular natural language.
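As a rough illustration of that tokenization step, here is a hypothetical Python sketch; the vocabulary and the greedy longest-match splitting rule are invented for the example (real tokenizers, such as BPE-based ones, learn their sub-word pieces from data). The takeaway is that words become sequences of integer ids, and from this point on the model deals only with numbers.

# Hypothetical toy vocabulary of sub-word pieces mapped to integer ids.
vocab = {"un": 0, "believ": 1, "able": 2, "eat": 3, "ing": 4, "<unk>": 5}

def tokenize(word):
    """Greedy longest-match split of a word into known pieces (toy sketch)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                  # no known piece starts here
            pieces.append("<unk>")
            i += 1
    return pieces

print(tokenize("unbelievable"))                # ['un', 'believ', 'able']
print([vocab[p] for p in tokenize("eating")])  # [3, 4]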

Here is a rather refreshing insight from AI luminaries Geoff Hinton and Demis Hassabis on how misconceptions about the role of linguistics in AI slowed down overall progress in Natural Language Processing (NLP) and AI over the past decades:

Decision making in complex systems

The magic of LLMs' plausible competency on an endless number of topics is largely attributable to their decision-making abilities. Decision making, or "reasoning", in complex systems like LLMs is a very involved and complicated process. You need to capture all possible contexts and masterfully navigate the search space of possible solutions by gauging all kinds of trade-offs. As in real life, there are no perfect decisions or solutions to make or choose from. In order to make an educated guess, one needs to somehow:

  1. Define metrics – how close to the truth, and how close to optimal, is the decision?
  2. Navigate through very many intermediate decisions before arriving at the ultimate one.

In other words: how do we quantify the decision-making process, and how do we derive a qualitative essence from the resulting chain of quantities and metrics?
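One standard way to quantify it, offered here purely as an illustration rather than as the only possible metric, is to score a candidate continuation by its log-likelihood, accumulated over the chain of intermediate next-token decisions:

\log P(w_1, w_2, \ldots, w_n \mid \text{context}) = \sum_{t=1}^{n} \log P(w_t \mid \text{context}, w_1, \ldots, w_{t-1})

Each term in the sum is one intermediate decision; the total is the metric by which competing chains of decisions are compared, with no single chain ever being "perfect".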

Well, that is exactly what LLMs achieve by applying a sophisticated probabilistic framework that deals with inherent uncertainty. It is not just about counting ratios of word occurrence frequencies in a huge corpus of text. The real questions are:

  • What do you do with these basic ratios (or priors)? 
  • How do you reason about propagations and derivations of these ratios through a very long and widely branching chain of potentially millions of intermediate decisions?

It is not for the fainthearted, nor is it your accounting-grade statistics.
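To make "propagating ratios through a branching chain of decisions" slightly more tangible, here is a minimal beam-search sketch in Python; the next_token_probs function and its numbers are invented stand-ins for a real model. At each step every surviving candidate branches into its possible continuations, scores are propagated as accumulated log-probabilities, and only the best few branches are kept.

import math

def beam_search(next_token_probs, start, steps=3, beam_width=2):
    """Keep the `beam_width` highest-scoring branches at every step.

    `next_token_probs(sequence)` must return {token: probability} for the
    next position; here it is a stand-in for a real language model.
    """
    beams = [(start, 0.0)]                     # (sequence, accumulated log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Invented toy "model": a fixed distribution regardless of the sequence so far.
toy_model = lambda seq: {"a": 0.5, "dog": 0.3, "eats": 0.2}
print(beam_search(toy_model, ["<start>"], steps=2))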

Probability as frequency vs. intrinsic property

Building on the previous section, it is important to realize the difference between the frequentist view of probabilities and the way LLMs treat them.

A frequentist, which is how we all tend to reason on a daily basis, says: if a particular word occurred in a given context with a frequency of 0.7% over 1,001 tries, then the prediction for the 1,002nd try is that the same word will occur with a probability of 0.7%. We apply similar logic to many things around us and consider it a description of chance.

In contrast, the LLM machinery approaches this very differently and considers that 0.7% probability to be an "intrinsic property" of an object in a particular context (a specific word, in our case) or of the behavior of a complex system. It treats probabilities not as a description of chance but as a numeric representation that captures the quite plausible behavior of a complex system with so many factors in it, some understood and some not quite, that it is impossible to describe it in any other way, for example with rules.
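The contrast can be sketched in a few lines of Python; both functions are illustrative stand-ins, and the scores in the second one are invented. The frequentist estimate is literally a count ratio over observed tries, while the model-assigned probability is computed from the system's internal state (here, a softmax over scores) and exists even for sequences that were never observed.

import math

def frequentist_probability(occurrences, tries):
    """Probability as an observed frequency: count / number of tries."""
    return occurrences / tries

def model_probability(scores, word):
    """Probability as an intrinsic property: a softmax over the model's
    internal scores (the scores here are invented for illustration)."""
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[word]) / z

print(frequentist_probability(7, 1001))   # ~0.007: "seen 7 times in 1,001 tries"
print(model_probability({"salami": 2.1, "sausage": 1.9, "table": -3.0}, "salami"))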

During training, LLMs capture the qualitative nature of words, their meaning in a given context, through a rather involved codification of all kinds of intricate interrelationships with other words across many different contexts. This is done by calculating probabilistic inferences and is manifested as embeddings and the deep neural network's architecture and parameters. You can find more details about it in World Knowledge: Compression and Representation and LLM Training and Inference.

It is emphatically not "oh, I've seen this word 1,001 times, so the 1,002nd time it is very likely we will see it again". LLMs do not have to observe many words in a particular sequence to be able to reason plausibly about certain word sequence predictions. Just because an LLM never learned during training that "a dog eats salami" but was trained on "a cat eats sausage", one should not assume that, if asked to complete the idea "a dog eats…", the LLM would be unable to infer "a dog eats salami". The LLM successfully completes the sentence by association with its knowledge about cats' eating habits. It is a form of reasoning.
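Here is a hedged sketch of that kind of association; the vectors are made up, three-dimensional, and purely illustrative, whereas real embeddings are learned during training and have hundreds or thousands of dimensions. Because "dog" ends up close to "cat" in embedding space, and "salami" close to "sausage", knowledge about what cats eat transfers into a plausible completion for dogs.

import math

# Invented 3-dimensional embeddings; real ones are learned and much larger.
embeddings = {
    "cat":     [0.9, 0.1, 0.3],
    "dog":     [0.8, 0.2, 0.3],
    "sausage": [0.1, 0.9, 0.2],
    "salami":  [0.2, 0.8, 0.2],
    "table":   [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "dog" is close to "cat", so what cats eat is a plausible guess for dogs...
print(cosine(embeddings["dog"], embeddings["cat"]))        # high similarity
# ...and "salami" is close to "sausage", so it is a plausible thing to eat.
print(cosine(embeddings["salami"], embeddings["sausage"])) # high similarity
print(cosine(embeddings["dog"], embeddings["table"]))      # low similarity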

To draw those analogies and fill in the gaps, an LLM uses sophisticated probabilistic means, including the Transformer concept, which is an ingenious variation on probabilistic inference. All of this is covered in greater detail in the series on building LLM intuition. An in-depth article dedicated to Transformers is coming soon.
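Without pre-empting that article, the core Transformer operation, scaled dot-product attention as it appears in the literature, can be written in one line: each token's query is compared against all keys, the comparison scores are turned into a probability distribution via softmax, and the values are blended accordingly.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V

The softmax is where probabilistic weighting of context enters: every token's representation becomes a probability-weighted mix of the tokens it attends to.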

Below is a rather revealing dialog from the amazing "Oppenheimer" movie about the meaning of probabilities when dealing with impactful uncertainties. There is no such thing as absolute precision, but uncertainties described by properly calculated probabilities are often, for practical purposes, safe to use and actionable:

Precision and trade-offs

Discussions about the precision of LLM outcomes come up very often. Many debaters lament the unpredictable precision of LLMs, which, from their point of view, renders them unreliable and borderline useless. I think getting caught up in pedantic elaboration about LLM precision is wasteful. When reasoning about complex systems, we very often cannot guarantee 100% precision of the outcome and resort to acceptable uncertainty evaluations and approximations. Good enough is often better than having no solution at all or one that is intractable. LLMs are very good at this.

Recently, the acclaimed and highly respected AI professor Subbarao Kambhampati alluded to precision and trade-offs and brought up the following interesting analogy (verbatim):

"Style is a distributional property; correctness is instance level property. LLMs (and GenAI) learn and sample from a distribution (and can thus capture style). Databases store and retrieve instances (and can thus ensure correctness). Think twice before buying claims that LLMs can self-verify correctness (and ensure factuality); or that databases can get creative…"

I would elaborate further and add that style and correctness, in the context of LLMs, are not mutually exclusive. Style is a collection of basic invariants found in a set of possible solutions (instances). More honed, refined styles eventually increase precision. So ultimately, continuously nudging (fine-tuning/training) an LLM to calibrate its "styling" will lead to better and more precise outcomes.

Another way of looking at the precision aspect is from a somewhat different angle than is usual today: a holistic, high-impact viewpoint. It is understandable that you want a precise, factual outcome that you know makes sense; you have seen it, you expect it. In contrast, LLMs more often than not produce ingenious outcomes that you would not expect, nor would be able to comprehend on your own, given the vast field of plausible possibilities. Case in point: every year, in the US alone, there are 50,000 publications on cancer research. Imagine how many more precise therapies and drugs would successfully fight cancer if only health researchers could assimilate and infer valuable findings from this massive pile of insights. That is what LLMs seem to excel at delivering.

Having said that, we of course need to apply sound judgment and not deploy LLMs indiscriminately in mission-critical systems, like managing nuclear power stations. At least not today.

LLM reasoning and planning…or lack thereof

I finish this article with the hotly debated question: "Can LLMs be considered intelligent and reasoning, particularly given that they cannot plan?". The ability to plan is a key staple of human intelligence and is generally considered a precondition for anything, including AI, to be deemed intelligent. Planning is an important building block of successful reasoning. As often happens, when one is pedantically preoccupied with the precise definition of what constitutes some important concept or term, it takes far too long, if it ever happens, to converge on an agreed formulation. The collateral damage is that we lose the ability to think more broadly, outside the box, stay open-minded, and be receptive to changing conventional ways of thinking about the fundamental subject at hand. That stifles progress.

Intelligence

The advent of frontier LLMs, with their seemingly amazing "intelligent" capabilities, for the first time gave humanity a chance to admit that there could be alternative ways to manifest quite effective reasoning. I talk briefly about this in LLM foundations and Brain-DNA analogy. Still, some continue to adamantly defend the notion that as long as an alternative intelligence does not meet human intelligence standards, it falls short of being considered intelligence, period. I feel this is counterproductive to embracing the novel and perhaps more efficient ways of arriving at solutions and decisions that modern LLMs achieve. I talk about this at great length in the AI Intuition series.

Pragmatic approach to intelligence

I believe the argument should revolve around how effectively one can arrive at a plausible conclusion while (a) dealing with the complexity of the subject under consideration, (b) its associated uncertainties, and (c) a huge searchable space of solutions. If you self-reflect, you will realize that humans behave this way 95% of the time, day in and day out. That is how our pragmatic daily intelligence manifests itself: many things we do instinctively, intuitively, without explicit planning, and yet we arrive at quite effective decisions and solutions.

Planning and reasoning

Ah, the subject of planning keeps recurring. LLMs are often chastised for not being able to plan explicitly and hence get relegated to non-intelligent thingies. What the naysayers are missing, though, is that frontier LLMs are front-loaded: they have codified innumerable "what-if" scenarios, or mini-plans, thanks to training on billions of sources covering all kinds of contexts. Mind you, the textual information recorded in millennia's worth of sources is a reverse-engineering of human reasoning, which includes planning.

The LLM training process, with its abundant knowledge sources and sophisticated methods of codifying them, implicitly captures planning. During inference, the LLM traverses this massive landscape of knowledge, including the buried mini-plans, and optimally arrives at quite actionable outcomes.

"Yeah, but," you might say, "sometimes LLMs are mistaken, or they cannot make proper decisions because they lack real-time feedback to act on." To address that, I will draw one last analogy for today:

  • A toddler is a very promising human being. But they need to be educated and trained throughout their childhood, particularly the pre-teen and teenage years, and exposed to all kinds of situations that spur the acquisition of specific skills and the assimilation of knowledge. Eventually the toddler becomes a quite capable adult whose skills depend on the degree of continuous training.

LLMs are no different. They are only as good as the quality of their ongoing training. You will find more details on this subject in, of course, the AI Intuition series.

Parting words of wisdom and caution

Let's wrap up this article with wise words about LLM intelligence from the godfather of modern AI, Geoff Hinton, himself:

See you next time on these waves 🙂

