World Knowledge: Compression and Representation

A follow-up to the Importance of Language post.

Intelligence, whether artificial or human, relies to a great extent on:

  • storing knowledge effectively,
  • codifying it reliably without losing the essence of the original information,
  • retrieving the relevant pieces of information from this storage as needed,
  • connecting the dots between all kinds of facts and observations as part of the reasoning and thought process, and finally
  • producing useful output from that reasoning.

The depth and impact of our inferences depend on the capacity of the stored knowledge and on the ability to quickly connect the dots between its fragments.

Information Theory's Place in LLMs

Large Language Models rely heavily on many tenets of Information Theory. Of particular interest is how efficiently world knowledge is represented and stored internally so as to guarantee near-instantaneous retrieval and inference. Today’s largest LLMs possess knowledge about the world thanks to being “front-loaded” with billions(!) of articles and books that capture millennia worth of human civilization’s knowledge. Natural language in written form carries lots of redundancy and inefficiency. Inside an LLM, that written knowledge is significantly compressed without losing its substance, all thanks to intricate use of Information Theory. Here is a teaser diagram.

human knowledge - understanding the world

Let’s walk through the details.

Embeddings

One of the truly phenomenal breakthroughs that Deep Neural Networks brought to light is a form of knowledge compression with built-in contextual correlation: embeddings. Embeddings are word encodings – vectors of real numbers – that in essence capture the rich semantic and loose contextual closeness between words (e.g. animal, eating, walking) that the words themselves, as mere strings of characters, don’t carry.
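To make this concrete, here is a toy sketch in Python. The vectors below are hand-made for illustration (real embeddings are learned during training and have hundreds or thousands of dimensions), but they convey the idea: semantically close words get numerically close vectors.

```python
# Toy, hand-made "embeddings" for illustration only. Real LLM embeddings
# are learned, not hand-written, and are much higher-dimensional.
embeddings = {
    "cat":   [0.9, 0.8, 0.1, 0.0],  # animate, pet-like
    "dog":   [0.8, 0.9, 0.2, 0.0],  # numerically close to "cat"
    "pizza": [0.1, 0.0, 0.9, 0.8],  # food-like, far from "cat" and "dog"
}
```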

Embedding explained

Of course, one can use thesauruses and dictionaries to make sense of a word’s meaning in a specific context. But there are practical limitations: performing massive numbers of lookups is very inefficient. When we say “A cat walked in the alley and ate a piece of pizza lying by the tree”, it is semantically similar to “A dog was running along the street and devoured a sausage at the corner of the block”, despite the fact that the words in the two sentences are different.

The word cat, represented in an LLM as an embedding, captures the fact that a cat a) is a pet, b) has four limbs, c) has two eyes, d) is quite independent, and much more. The same is true for dog: its embedding, too, captures many interesting features. All the words in both example sentences have very rich representations in the LLM in the form of embeddings. The diagram below depicts the concept of semantic similarity using embeddings.

concept of semantic similarity using embeddings

Astute readers will notice “comparing” in quotes on the diagram. That’s where the whole essence of LLM magic lies: inference is technically the more correct term for what happens here. Assessing similarity takes a lot of neural network inference chops. Add to this the probability distributions over near-infinite combinations of words in different contexts. We will cover what happens behind the scenes in later posts.

Side note. Comparing is still technically a proper term in the context of RAG.

Comparing is still technically a proper term when considered in the context of RAG (retrieval-augmented generation), which you have probably heard a lot of buzz about these days. The illustration above is the essence of RAG – comparing the embeddings of different sentences to draw conclusions about their similarity or, in other words, a loose match between two sentences. Without going into much detail, similarity is measured using a method called cosine similarity. It is, in a nutshell, a dot product (also known as an inner product) of two length-normalized sentence embeddings, as illustrated in the diagram. The higher the resulting number, the higher the similarity. Geometrically, a high cosine similarity means both embedding vectors point in roughly the same direction. Ok, enough of an excursion into Linear Algebra turf.
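For readers who want to see it in code, here is a minimal sketch of cosine similarity using NumPy. The sentence embedding values are invented for illustration; in a real RAG system they would come from an embedding model. A result near 1 means the vectors point in nearly the same direction; near 0 means unrelated.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the two vectors after normalizing out their lengths;
    # the result is the cosine of the angle between them, in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical sentence embeddings (values made up for illustration).
cat_sentence = np.array([0.8, 0.1, 0.6, 0.2])
dog_sentence = np.array([0.7, 0.2, 0.5, 0.3])

print(cosine_similarity(cat_sentence, dog_sentence))  # close to 1.0 => similar
```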

How embeddings are learned

Embeddings are crucial to LLM magic, but they need to be learned as part of LLM training. Let’s say we are codifying 40,000 English words as embeddings by processing billions of articles. That means each of these 40,000 words will have an associated vector of real numbers (its embedding). Each embedding captures the relative fullness of the word’s role and its meaning across the different contexts present in the documents fed into the LLM during training. Along with the embeddings, billions of Deep Neural Network parameters are learned as well. We will talk about parameters and their significant role in the post to follow.
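Concretely, before training, the embedding table can be pictured as a big matrix with one row per vocabulary word. The sketch below uses the 40,000-word vocabulary from the example above; the 512-dimension size is an assumption chosen purely for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 40_000, 512  # 512 is an illustrative choice

# One row of random numbers per vocabulary word; training gradually
# nudges these rows so that related words end up close together.
embedding_table = np.random.randn(vocab_size, embedding_dim) * 0.02

# Looking up a word's embedding is just indexing into the table.
word_id = 1234                             # hypothetical id of some word
word_embedding = embedding_table[word_id]  # a vector of 512 real numbers
```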

how embeddings are learned

This figure illustrates the process of word embedding learning.

What happens during inference

During inference (that is, when we expect the LLM to produce some result, e.g. translating text or completing a sentence), the LLM is actually fed the embeddings of the prompt’s words one after another at its input. Now, bear with me, the most mind-bending part follows. At intermediate stages of processing (a DNN has many layers), temporary intermediate embeddings are produced that are understood only by the LLM itself. Those intermediate embeddings carry additional codification of the intricacies of how the words interrelate.

Side note. Transformers and attention at the heart of embedding magic.

For instance, in order to construct semantically correct sentences it is important to weigh the words in a sentence based on their significance in a particular context. This is a foundational principle of transformer functionality, manifested as attention. Transformers play a central role in today’s DNNs. We will talk about them in later posts.
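As a small taste ahead of those posts, here is a minimal sketch of scaled dot-product attention, the mechanism the side note refers to. All sizes and values are made up for illustration.

```python
import numpy as np

def attention(Q, K, V):
    # Relevance score of every word pair, scaled by the vector size.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax: turn scores into weights that sum to 1 for each word.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each word's output is a relevance-weighted mix of all the words.
    return weights @ V

x = np.random.randn(5, 8)  # 5 words in a sentence, 8-dim embeddings (toy sizes)
out = attention(x, x, x)   # self-attention: words attend to each other
```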

At the output, the LLM translates the final result of inference, which is in the form of an embedding, back into one of the 40,000 English words in the vocabulary that we as humans understand. The whole process of inference is conceptually depicted below.

how inference works
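That very last step of the diagram, mapping the final embedding back to a word, can be sketched like this. Random numbers stand in for learned values, and the sizes reuse our 40,000-word, 512-dimension assumptions from above.

```python
import numpy as np

vocab_size, embedding_dim = 40_000, 512

# Hypothetical final embedding produced by the last layer of the DNN.
final_embedding = np.random.randn(embedding_dim)

# A learned output matrix maps it to one score per vocabulary word...
output_matrix = np.random.randn(vocab_size, embedding_dim) * 0.02
scores = output_matrix @ final_embedding  # 40,000 scores

# ...and softmax turns the scores into a probability for every word.
probs = np.exp(scores - scores.max())
probs /= probs.sum()
next_word_id = int(np.argmax(probs))      # the most likely word
```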

To produce the embeddings for the words in the vocabulary, as well as the intermediate embeddings needed during both training and inference, LLMs rely heavily on Information Theory for effective compression and codification, and on Probability Theory to capture the probability distributions of words and the likelihood of correlations between them. Learning the embeddings and the model parameters during LLM training is an extremely computationally intensive process. We will cover LLM training in greater detail in the following posts.

I know. Fun, fun, fun 🙂. Keep reading the next post…

