Important! Not all generative AI systems are LLMs.
Generative Artificial Intelligence (Generative AI, GenAI or GAI) is a subfield of AI that uses generative models to produce text, images, videos or any other form of data.
Inspired by my own black cat, who sits beside me as I work on my laptop, I crafted this sentence to explain LLMs and the key concepts that will help you understand them.
The word "cat" does not contain any inherent information about what a cat is, just from its letters.
The word can be represented as a number index, but this does not provide much information
Words can be represented as numerical indices.
For example, a dictionary of words might assign the word "cat" an index of 5 and "dog" an index of 10. However, the problem is that these indices themselves do not carry any meaning.
They are just placeholders for words, and the model has no inherent understanding of the relationships between words, such as "cat" and "dog".
We need Meaningful Representations! Instead of simple indices, words can be represented by a list of feature attributes.
For example, a dictionary entry for a word might contain various feature attributes:
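A minimal, purely illustrative sketch of this idea; the feature names and values are invented for the example, not drawn from any real lexicon:

```python
# Hand-crafted, purely illustrative feature attributes for a few words.
# (Real systems learn such features from data rather than listing them by hand.)
features = {
    "cat":  {"is_animal": 1, "is_pet": 1, "has_fur": 1, "can_fly": 0},
    "dog":  {"is_animal": 1, "is_pet": 1, "has_fur": 1, "can_fly": 0},
    "bird": {"is_animal": 1, "is_pet": 0, "has_fur": 0, "can_fly": 1},
}
# Unlike bare indices, these representations make "cat" and "dog"
# measurably more similar to each other than either is to "bird".
```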
The key question in representing words is: how do we encode a word as numbers in a way that captures its meaning and its relationships to other words?
Tokenization is the process of breaking a sentence down into smaller units, usually words or subwords, called tokens. These tokens are then represented as numbers (indices) or fed into subsequent layers of a model. In short, tokenization converts text into units that models can process.
Example:
Sentence: "The cat is black."
Tokenization (word-level): ["The", "cat", "is", "black", "."]
Now, a good question might be: why did we choose word-level tokenization? There are several types of tokenization, such as word-level, subword-level, character-level, and n-gram tokenization.
For example, subword-level tokenization reduces vocabulary size compared to word-level tokenization.
Character-level tokenization can handle typos or misspellings better.
Word-level tokenization creates a more manageable and intuitive representation for analysis, and it makes the text data easier to visualize.
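As a concrete illustration, here is a minimal word-level tokenizer in Python (a simple regex split; real tokenizers are considerably more sophisticated):

```python
import re

def word_tokenize(sentence: str) -> list[str]:
    # Keep runs of word characters as tokens and punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(word_tokenize("The cat is black."))
# ['The', 'cat', 'is', 'black', '.']
```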
Once tokenization is complete, embedding is the process of converting these tokens (or token indices) into dense, continuous vector representations that capture the semantic meaning of the tokens. These embeddings help models understand relationships between words, their meanings, and context.
While models like Word2Vec, GloVe, and FastText provided high-quality static embeddings, one limitation remained: a word had the same vector representation regardless of context. For example, "bank" would be represented by the same vector in "river bank" and "financial bank".
Example: Tokens: ["the"
, "cat"
, "is"
, "black"
]
The tokenized words are converted into vectors (word embeddings), typically of fixed size (e.g., 300 or 768 dimensions).
The embedding process assigns a dense vector to each token that captures its meaning.
Important! As you can see, the sentence is now in lowercase and the punctuation has been removed. These are two of the common text preprocessing techniques, which also include stopword removal, stemming, and lemmatization.
Example (simplified vectors):"the"
→ [0.25, 0.89, -0.12, ...
] (300 or 768 numbers)"cat"
→ [0.72, -0.34, 0.65, ...
]"is"
→ [0.18, 0.49, -0.75, ...
]"black"
→ [0.93, -0.15, 0.24, ...
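Mechanically, an embedding layer is just a lookup table: each token index selects a row of a matrix of vectors. A toy sketch (the vectors here are random stand-ins; a trained model would use learned values):

```python
import numpy as np

# Toy vocabulary and a random embedding matrix standing in for learned weights.
vocab = {"the": 0, "cat": 1, "is": 2, "black": 3}
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 300))  # one 300-d row per token

tokens = ["the", "cat", "is", "black"]
vectors = embedding_matrix[[vocab[t] for t in tokens]]
print(vectors.shape)  # (4, 300): one dense vector per token
```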
The Transformer architecture is a deep learning model introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017.
See paper: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010. Link: https://arxiv.org/pdf/1706.03762
Transformers do not have recurrence or convolution to track word order.
Positional encoding is added to give the model information about the position of each word in the sequence.
The positional encoding is added to each word's embedding:
[
[0.1 + pos1, 0.2 + pos2, ..., 0.5 + posd],
[0.6 + pos1, 0.7 + pos2, ..., 0.1 + posd],
[0.9 + pos1, 0.3 + pos2, ..., 0.4 + posd],
[0.3 + pos1, 0.8 + pos2, ..., 0.6 + posd]
]
where pos1, pos2, ..., posd are the positional encodings for each dimension, which help the model understand the sequence order.
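For reference, the original paper uses fixed sinusoidal encodings. A short sketch of that scheme, added to random stand-in embeddings for our four-token sentence:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Sinusoidal encoding from Vaswani et al. (2017):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Add the encodings to the four word embeddings of "The cat is black".
embeddings = np.random.default_rng(0).normal(size=(4, 300))  # stand-in embeddings
encoded = embeddings + positional_encoding(4, 300)           # element-wise addition
```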
Self-Attention Mechanism allows the Transformer to look at all the words in the sentence and decide which ones are important to focus on for each position.
Scaled Dot-Product Attention: Each word (or token) is represented by three vectors: Query (Q), Key (K), and Value (V). For each word, attention scores are calculated as the dot product of its query with the keys of the other words in the sentence; the scores are scaled by the square root of the key dimension, passed through a softmax, and used to form a weighted sum of the value vectors.
Example: When processing "cat", it can attend to other words in the sentence to understand the context:
Attention("cat", "The") = 0.1
Attention("cat", "cat") = 0.5
Attention("cat", "is") = 0.3
Attention("cat", "black") = 0.1
Instead of applying one attention mechanism, the Transformer uses multi-head attention, which allows the model to focus on different parts of the sentence from different perspectives (or "heads"). After processing the attention heads, the results are concatenated and passed through a linear transformation to combine the outputs of all heads.
Example: multi-head attention applied to "The cat is black".
Each head essentially provides a different perspective on the sentence, extracting different relationships between the words.
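A sketch of how the heads are split, run, and recombined, reusing scaled_dot_product_attention from the previous sketch (the projection matrices here are random stand-ins for learned weights):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # Project the inputs, split the model dimension across the heads,
    # attend within each head, then concatenate and mix with Wo.
    d_head = X.shape[1] // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        out, _ = scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s])
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ Wo              # combine all heads

d_model = 300
rng = np.random.default_rng(0)
X = rng.normal(size=(4, d_model))                           # "The cat is black"
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=6).shape)  # (4, 300)
```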
Once the attention mechanism has adjusted the word representations based on the context, each word is passed through a feed-forward neural network. This step applies two linear transformations with a ReLU activation in between. Each position is processed independently, and the result updates the representation of each word.
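A sketch of that position-wise feed-forward step (the paper uses a roughly 4× expanded inner dimension, e.g. 512 → 2048; the dimensions here match our running 300-d example, and the weights are random stand-ins):

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied
    # independently to each position's vector.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

d_model, d_ff = 300, 1200
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
X = rng.normal(size=(4, d_model))
print(feed_forward(X, W1, b1, W2, b2).shape)  # (4, 300): shape is preserved
```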
The Transformer repeats the self-attention and feed-forward steps multiple times (usually 6 or 12 layers). This allows the model to refine the word representations iteratively. After multiple layers, the word embeddings of "The cat is black." now contain rich information about the relationships between the words.
The decoder uses the output of the encoder to predict the next word in a sequence. During training, the model learns to predict the next word based on a given input. For example, if the input is "The cat is", the decoder will try to predict "black" as the next word. During inference (generation), the decoder generates the next word step by step. If we input "The cat is", the model might output "black". This predicted word is fed back into the decoder to predict the following word. The decoder uses masked self-attention to ensure that it only attends to the previous words in the sequence (and not future words), ensuring that predictions are made in an autoregressive manner.
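A sketch of the masking idea: adding −∞ above the diagonal of the score matrix zeroes out attention to future positions after the softmax (the query and key vectors are random stand-ins):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # -inf strictly above the diagonal: position i may attend only to j <= i.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_attention_weights(Q, K):
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

# For the prefix "The cat is": every entry above the diagonal is 0,
# so no position ever attends to a future word.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(3, 64)), rng.normal(size=(3, 64))
print(masked_attention_weights(Q, K).round(2))
```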