How Intelligent Is AI, Really?

The forecasting duties of meteorological services keep adding up as we speak. We regularly hear swift correlations drawn between the slightest weather deviations from historical norms and global warming. Now we get bombarded every day with prognoses about the imminent arrival of AGI – Artificial General Intelligence. No day passes without fear mongering or exaltation, depending on how you look at it: Are we there yet – has AGI arrived? Am I still relevant as a human? Are we inching ever closer to the end of humanity as we have known it?

In all this craze, let’s take a step back and unpack what intelligence means for AI and how worried we should be about where AI is heading. I have dedicated a series of bite-sized articles to the importance of getting familiar with AI. This article is about what it means for AI to achieve a certain degree of intelligence.

What is Intelligence

Intelligence has many nuanced definitions, but it is safe to say that, in broad strokes, it encompasses the ability of an agent, be it a living organism or a technical system, to:

  • adapt to familiar but varying situations within one domain,
  • adapt to completely unfamiliar situations within one domain,
  • generalize past experience from one domain and find possible applications of its solutions in a different domain.

Intelligence is on a spectrum. It is not an absolute binary merit – intelligent or not. This is important to keep in mind because, in light of LLMs, there is a tendency today either to outright glorify AI as approaching ultimate AGI or to dismiss it as lacking any sign of intelligence (relegating its capabilities to mere memorization and “stochastic parroting”) and hence useless.

Bottom line: adaptability is the key ingredient in the manifestation of intelligence.

How do you adapt

Adaptability, as the fundamental building block of intelligence, is achieved in different ways:

  • Mother Nature, through evolution, has designated task- and skill-specific areas of the brain responsible for particular cases of adaptation. It has also stored, in a very stingy and optimized way, worthy traits of adaptability (programming blocks, if you wish) in human DNA.
  • As you go through life experiences with implicit or explicit feedback – what works and what doesn’t – the respective areas of the brain get further developed and recalibrated, and, in the long run (hundreds to thousands of years), DNA mutates.


While human adaptability helps with the acquisition of specific skills in particular areas, its “by-product” is the ability to abstract, conceptualize and generalize – to transfer, intuitively or explicitly, what was learned in one domain and skill to another. This is the highest form of intelligence manifested in humans.

Generally, adaptability doesn’t and cannot start from a “clean slate”. Some prior knowledge must be frontloaded and encoded somewhere, somehow. In humans and other living organisms it is DNA and the embryo’s brain; in artificial systems it is neural networks with parameters and embeddings. More on this here.

Some organisms and systems adapt better than others; their adaptability characteristics vary greatly depending on circumstances.

This takes us to the next important question – how do you rate the level of intelligence, given its dependence on adaptability?

How do you gauge intelligence

For a fair comparison of intelligence one has to set a level playing field. Alas, comparisons are often made without adequate care to avoid comparing apples to oranges. This happens frequently in the AI field.

Thus, for intelligence comparisons to be meaningful and actionable, the following arrangements should be in place:

  1. Clearly defined domain(s) of intelligence – you can’t compare the intelligence of a living organism adapting to Venus’s acidic atmosphere with the ability to win a chess game against Deep Blue.
  2. An agreed-upon framework for measuring intelligence – metrics, reviews, benchmarks, etc. Researchers often carry one framework of intelligence measurement over to new domains. This leads to shortcuts and over-reliance on intuition, producing seemingly plausible measurement results; after scrupulous analysis, however, questionable methodologies and misleading outcomes become apparent. Generalizing intelligence metrics from trusted measurements in specific domain(s) should be approached very carefully.
  3. A level playing field for expressing intelligence should be set. There is an argument that the level of intelligence is a function of how efficiently the agent (human, living organism or artificially engineered system) can learn to produce solutions with minimal prior knowledge and exposure to new experiences, and how effectively it can apply these learnings to novel situations. Hence, frontloading with knowledge and incremental training should be closely controlled to avoid unfair judgments about an agent’s intelligence.

Unpacking the intelligence of LLMs

Armed with the sections above about:

  • what intelligence means,
  • how agents (people, living organisms, systems) adapt and use the results of adaptation, and 
  • how intelligence levels are measured,

let’s get back to the main subject of this article – what intelligence in LLMs means.

Today’s state-of-the-art LLMs amaze us with their fluent ability to:

  • produce plausible ideas on various topics of interest,
  • summarize,
  • analyze,
  • translate, and
  • more.

They certainly manifest above-average skills in specific domains (e.g. law, spoken language, healthcare).

What, then, is the level of their intelligence? Depending on whom you ask, you get different answers. Skeptics of current LLM intelligence evaluations state that:

  • Intelligence should not be judged by measuring outcomes (skills).
  • Intelligence should be gauged by the efficiency of learning (least effort with least experience).
  • Intelligence is manifested in the ability to abstract knowledge from one familiar domain (skill) and apply it to an entirely new domain (skill) with no or very minimal training.

In this camp, once again, intelligence is a function of efficiency: minimal frontloading of prior knowledge and limited exposure to relevant experiences.
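
To make that intuition concrete, here is one rough, informal way to express it – my simplification, not a formal definition from any particular paper:

    \text{intelligence} \;\propto\; \frac{\text{skill demonstrated across a scope of tasks}}{\text{prior knowledge} + \text{experience (training data)}}

In other words, the more skill an agent squeezes out of less frontloaded knowledge and less training experience, the more intelligent it is judged to be.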

An additional consideration is that traditional benchmarks suffer from leakage, also known as data contamination, where test data leaks into the training data of large language models (LLMs). This can inflate performance, making comparisons between LLMs unfair and providing an unreliable measure of their intelligence.
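
To make the leakage concern tangible, here is a minimal sketch of one common (and admittedly crude) way to flag contamination: checking n-gram overlap between benchmark items and a training corpus. The function names and the usage example are hypothetical, not taken from any specific benchmark toolkit.

    # Minimal sketch: flag benchmark items whose n-grams also appear in the
    # training corpus. A crude heuristic, not a definitive contamination test.

    def ngrams(text: str, n: int = 8) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def contamination_rate(test_items: list, training_corpus: str, n: int = 8) -> float:
        corpus_ngrams = ngrams(training_corpus, n)
        flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_ngrams)
        return flagged / max(len(test_items), 1)

    # Hypothetical usage:
    # rate = contamination_rate(benchmark_questions, crawled_training_text)
    # print(f"{rate:.1%} of test items share an 8-gram with the training data")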

Rigorous benchmarking

To address the above-mentioned concerns, the ARC tests (Abstraction and Reasoning Corpus) were introduced for better and fairer benchmarking. It is claimed that ARC tests are difficult to cheat on and that they better reflect intelligence.

ARC is based on François Chollet’s paper “On the Measure of Intelligence”. In essence, he argues that extensive experience (read: huge training data sets) influencing the acquisition and manifestation of skills masks the true intelligence of the system under consideration:

  • do you really have a phenomenal associative memory, or 
  • do you stand out thanks to your exceptional reasoning abilities? 

Hence, per Chollet, only Core Knowledge (essentially deposited in human DNA through evolution) should be assumed for a fair assessment of intelligence:

  • objectness and elementary physics,
  • agent-ness and goal-directedness,
  • natural numbers and elementary arithmetic,
  • elementary geometry and topology.

These are the traits that, according to Chollet, a child has been frontloaded with by evolution and that get further recalibrated by the age of 5-6. No more extensive training is needed to manifest basic intelligence skills – which, somewhat surprisingly, LLMs struggle to achieve despite advanced training on trillions of bytes of information.
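
To give a flavor of what an ARC-style task looks like, here is a toy example in the spirit of the corpus: a handful of input/output grid pairs encode a hidden transformation, and the solver must infer the rule and apply it to a fresh input. The specific task and grids below are invented for illustration.

    # Toy ARC-style task (invented for illustration): the hidden rule is
    # "mirror each grid horizontally". Grids are small 2D lists of color codes.
    toy_task = {
        "train": [
            {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
            {"input": [[5, 5, 0], [0, 7, 7]], "output": [[0, 5, 5], [7, 7, 0]]},
        ],
        "test": [
            {"input": [[4, 0, 6], [0, 6, 4]]},  # expected output: [[6, 0, 4], [4, 6, 0]]
        ],
    }

    def solve(grid):
        # A human infers "mirror horizontally" from two examples with no further
        # training; this kind of rule induction is exactly what ARC probes.
        return [list(reversed(row)) for row in grid]

    print(solve(toy_task["test"][0]["input"]))  # [[6, 0, 4], [4, 6, 0]]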

I don’t necessarily disagree about the fairness of the ARC benchmarking framework, but the restrictions it imposes don’t help with a fair judgment of the practical usefulness of systems like LLMs. After all:

  • Are we building AI systems to enhance our skills, or
  • In the quest for Artificial General Intelligence (AGI), are we out to prove that we can build a system that will surpass our abstraction abilities?

I honestly feel the focus should stay, and expand, on the former.

Generally speaking, I do not recommend reading too much into benchmark tests. They are not calibrated rigorously enough to guarantee the trustworthiness of their results.

Having said that, I am a firm believer in the value of LLMs and their promise in appropriate uses, which requires:

  • performance tests against your own real-life application and use-case benchmark (a minimal sketch of such a harness follows below), and,
  • most importantly – treating the LLM as an assistant to you.

You, the human in the loop, are the ultimate decision maker.
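
As a sketch of that first point, here is what a tiny use-case-specific evaluation harness might look like. The ask_llm function is a placeholder for whatever model client you actually use, the test cases are hypothetical, and the automatic score is only a first filter – the final sign-off remains with you, the human in the loop.

    # Minimal sketch of a use-case-specific LLM evaluation harness.
    # ask_llm is a placeholder for your actual model client (assumption).

    def ask_llm(prompt: str) -> str:
        return "REPLACE ME: call your LLM client here and return its answer"

    def keyword_score(answer: str, must_mention: list) -> float:
        # Crude automatic check: fraction of required facts the answer mentions.
        hits = sum(1 for kw in must_mention if kw.lower() in answer.lower())
        return hits / len(must_mention)

    # Your own real-life cases, not a public benchmark (examples are hypothetical).
    cases = [
        {"prompt": "Summarize our refund policy for a customer.",
         "must_mention": ["30 days", "original receipt"]},
        {"prompt": "Rewrite the warranty clause in plain English.",
         "must_mention": ["two-year warranty"]},
    ]

    for case in cases:
        answer = ask_llm(case["prompt"])
        print(f"score={keyword_score(answer, case['must_mention']):.2f}")
        print(answer)
        # Human in the loop: read the answer and make the final call yourself.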

How Large Language Models work

This is a bonus chapter that briefly explains how LLMs:

  • understand the world,
  • capture essence buried in data, and
  • navigate through it without verbatim memorization.

Here is a quick cheat sheet on how LLMs really work:

  • LLMs are globally trained (across all kinds of local contexts) on massive amounts of data.
  • The results of this training are captured in the form of:

  a) the parameters of the neural networks, and

  b) embeddings (rich representations) of the vocabulary, thanks to the transformer’s ingenuity in extracting and codifying complex relationships between elements (words).

  • Think of the result of training as a sort of massive static landscape of world knowledge. But make no mistake: both parameters and embeddings are manifestations of complex, dynamic vector spaces – dynamic in that they wait for inference time to unleash their power.
  • During inference, in-context learning (the way the transformer understands the intent of your prompt, which could be very long – millions of tokens) overlays the prompt (a mini-context) on the globally captured static world model. This act of overlaying brings the true dynamism of synthesizing plausible output based on knowledge captured, compressed and stored in the parameters and embeddings during training (a toy sketch of this overlay appears right after this cheat sheet).
  • The process of overlaying long prompt contexts on the LLM during inference involves the machinery of the transformer and neural network layers. This point is very important to internalize, because some criticize LLMs for not learning anything new during inference. Not true: it is the act of overlaying long prompt contexts on the rest of the LLM that effectively mini-trains it for the duration, and in the context, of the prompt request.
  • Because the world is complex, and intents and reasoning are non-trivial and depend on millions of parameters, the probabilistic engine of LLMs brings actionable certainty to the highly non-deterministic world we live in.
  • At later stages LLMs go through fine-tuning, a process that morphs the originally created landscape of world knowledge so that the domain of our interest gets preferential treatment from the LLM. In a sense, it deprioritizes many alternative reasoning paths in favor of our preferences. This is how we get a specialized version of an LLM.
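
To make the “static landscape, dynamic overlay” picture more tangible, here is a deliberately tiny numpy sketch. The weight matrices stand in for the frozen, trained parameters; the prompt embeddings are the mini-context overlaid at inference; attention mixes them into context-dependent representations; and a softmax turns the result into a probability distribution over next tokens. All dimensions, token ids and values are made up for illustration – this is a cartoon of a transformer layer, not a faithful implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    d, vocab = 8, 5                            # toy embedding size and vocabulary

    # "Static landscape": parameters frozen after training (random here, for illustration).
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    W_out = rng.normal(size=(d, vocab))
    embeddings = rng.normal(size=(vocab, d))   # one vector per vocabulary token

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    # "Dynamic overlay": the prompt (mini-context) selects and mixes the static knowledge.
    prompt_tokens = [0, 3, 1]                  # hypothetical token ids
    x = embeddings[prompt_tokens]              # (3, d) prompt representations

    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(d))       # how each prompt token attends to the others
    context = attn @ v                         # context-dependent representations

    next_token_probs = softmax(context[-1] @ W_out)  # probabilistic engine: a distribution, not a lookup
    print(next_token_probs)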

You can find more on LLM inner workings in my bite-sized why series.

An LLM’s job is to generalize concepts and predict outcomes to the best of its capabilities. LLMs don’t operate by pointing to a precise piece of queried information via indexes, as some think – an assumption that leads to wrong associations with databases. The billions of parameters characterizing LLMs, and the related vocabulary embeddings that capture the dynamic nature of associations and causality between elements, can be considered very rich, fluid “indexes”. These “indexes” morph into solution search guides during LLM inference – the process of overlaying prompt intent onto the LLM machinery described above – awakened by a specific prompt or NLP task.
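
To illustrate the contrast with a database index, here is a toy comparison: an index answers only exact key lookups, while a learned embedding space answers “what is closest in meaning”, which is much closer to how an LLM’s representations behave. The phrases and vectors below are invented for illustration; a real model learns such vectors from data.

    import numpy as np

    # Database-style index: exact key lookup, nothing for an unseen key.
    index = {"capital of France": "Paris", "capital of Japan": "Tokyo"}
    print(index.get("France's capital city"))        # None: the wording differs, so the lookup fails

    # Embedding-style retrieval: nearest neighbour in a learned vector space.
    phrases = {
        "capital of France": np.array([0.9, 0.1, 0.0]),
        "capital of Japan":  np.array([0.1, 0.9, 0.0]),
    }
    query = np.array([0.8, 0.2, 0.1])                # "France's capital city", hypothetically embedded

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    best = max(phrases, key=lambda p: cosine(query, phrases[p]))
    print(best)                                       # "capital of France": closest in meaning-space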

To recap, LLMs:

  • don’t memorize data verbatim the way databases do,
  • are not databases, and
  • don’t have database-style indexes.
