Miruna Clinciu

Generative AI, focus on LLMs

Important! Not all generative AI is classified as LLMs.

Generative AI, focus on LLMs

Generative Artificial Intelligence (Generative AI, GenAI or GAI) is a subfield of AI that uses generative models to produce text, images, videos or any other form of data.

Large Language Models (LLMs) can be seen as a subset of GenAI, focusing specifically on natural language processing and generation tasks.
To share a bit more, we also have some amazing Multimodal Models that are capable of processing information from different modalities, including images, videos, and text.
These are often referred to as Multimodal Large Language Models (MLLMs).

Words...

Inspired by my own black cat, who sits beside me as I work on my laptop, I crafted this sentence to explain LLMs and the key concepts that will help you understand them.

"The cat is black."

The word "cat" does not contain any inherent information about what a cat is, just from its letters.
The word can be represented as a number index, but this does not provide much information

Words can be represented as numerical indices.

For example, a dictionary of words might assign the word "cat" an index of 5 and "dog" an index of 10. However, the problem is that these indices themselves do not carry any meaning.

They are just placeholders for words, and the model has no inherent understanding of the relationships between words, such as "cat" and "dog".

We need Meaningful Representations! Instead of simple indices, words can be represented by a list of feature attributes.

For example, a dictionary entry for a word that contains various features:

Part of Speech (POS): Whether the word is a noun, verb, adjective, etc.

Description: A brief definition of the word.

Usage: Examples of how the word is used in sentences.

The key question in representing words is: "What features should we use to capture the meaning of a word?"

Should we include semantic features (e.g., synonyms, related concepts)?

Should we account for contextual usage (e.g., the word "bank" can refer to a financial institution or a riverbank)?

Tokenization

Tokenization is the process of breaking down a sentence into smaller units, usually words or subwords, which are called tokens.These tokens are then represented as numbers (indices) or fed into subsequent layers of a model. It converts text into units (tokens) that can be processed by models.

Example:
Sentence: "The cat is black."
Tokenization (word-level): ["The", "cat", "is", "black", "."]

Now, a good question might be: why do we select word-level tokenization? Well, there are different types of tokenization, such as: word-level, subword-level, character-level, n-grams etc.

For example, subword-level tokenizarion reduces vocabulary size compared to word-level tokenization.

Character-level tokenization can handle typos or misspellings better.

Word-level tokenization creates a more manageable and intuitive representation for analysis.It helps us to easier visualise the text data.

Embeddings

Once tokenization is complete, embedding is the process of converting these tokens (or token indices) into dense, continuous vector representations that capture the semantic meaning of the tokens. These embeddings help models understand relationships between words, their meanings, and context.

Embedding models:

Word2Vec (2013): Introduced efficient learning of embeddings via CBOW and Skip-gram. Continuous Bag of Words (CBOW): Predicts a target word from its surrounding context. Continuous Skip-Gram Model: Predicts surrounding words from a target word.

GloVe (2014): Focused on global statistics to capture semantic relationships. Constructs a co-occurrence matrix and factors it to learn vector representations for words.

FastText (2014): An extension of Word2Vec that represents words as bags of character n-grams. FastText can generate word embeddings for rare words or words not seen during training, by leveraging subword information (character-level embeddings).

While models like Word2Vec, GloVe, and FastText provided high-quality static embeddings, one limitation remained: words had the same vector representation regardless of context e.g., the word "bank" would have the same vector representation in "river bank" and "financial bank".

ELMo (2017): introduced contextualized embeddings using bidirectional LSTMs.

BERT (2018): revolutionized embeddings with a transformer-based bidirectional model.

GPT-2 and GPT-3 (2019): pioneered large-scale, generative language models.

T5, RoBERTa, and ALBERT (2020): advanced transformer-based architectures, making them more efficient.

Example: Tokens: ["the", "cat", "is", "black"]

The tokenized words are converted into vectors (word embeddings), typically of fixed size (e.g., 300 or 768 dimensions).

The embedding process assigns a dense vector to each token that captures its meaning.

Important! As you can see, the sentence is now in lowercase and punctuation has been removed. For further details on common techniques for text preprocessing, such as: lowercasing, removing punctuation, stopword removal, stemming and lemmatization etc.

Example (simplified vectors):

"the" → [0.25, 0.89, -0.12, ...] (300 or 768 numbers)
"cat" → [0.72, -0.34, 0.65, ...]
"is" → [0.18, 0.49, -0.75, ...]
"black" → [0.93, -0.15, 0.24, ...]

Transformer

The Transformer architecture is a deep learning model introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017.

See paper: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.
In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010. Link: https://arxiv.org/pdf/1706.03762.

Positional Encoding

Transformers do not have recurrence or convolution to track word order.

Positional encoding is added to give the model information about the position of each word in the sequence.

The positional encoding is added to each word's embedding:

[
  [0.1 + pos1, 0.2 + pos2, ..., 0.5 + posd], 

  [0.6 + pos1, 0.7 + pos2, ..., 0.1 + posd], 

  [0.9 + pos1, 0.3 + pos2, ..., 0.4 + posd], 

  [0.3 + pos1, 0.8 + pos2, ..., 0.6 + posd]
]

Where pos1, pos2, ..., posd are the positional encodings for each dimension, which help the model understand the sequence order.

Self-Attention Mechanism

Self-Attention Mechanism allows the Transformer to look at all the words in the sentence and decide which ones are important to focus on for each position.

Scaled Dot-Product Attention: Each word (or token) is represented by three vectors: Query (Q), Key (K), and Value (V). For each word, attention scores are calculated using the dot product of the query with keys from other words in the sentence.

Example: When processing "cat", it can attend to other words in the sentence to understand the context:

Attention("cat", "The") = 0.1 

Attention("cat", "cat") = 0.5 

Attention("cat", "is") = 0.3 

		Attention("cat", "black") = 0.1

Multi-Head Attention

Instead of applying one attention mechanism, the Transformer uses multi-head attention, which allows the model to focus on different parts of the sentence from different perspectives (or "heads"). After processing the attention heads, the results are concatenated and passed through a linear transformation to combine the outputs of all heads.

Example: Multi-Head Attention

"The cat is black".

Head 1: might focus on the relationship between "The" and "cat" (i.e., focusing on the subject and its determiner).

Head 2: might focus on "cat" and "black" (i.e., focusing on the relationship between the noun and its adjective).

Head 3: might focus on the verb "is" and "black" to better understand the relationship between the verb and the adjective.

Each head essentially provides a different perspective on the sentence, extracting different relationships between the words.

Once the attention mechanism has adjusted the word representations based on the context, each word is passed through a feed-forward neural network. This step applies two linear transformations with a ReLU activation in between. Each position is processed independently, and the result updates the representation of each word.

The Transformer repeats the self-attention and feed-forward steps multiple times (usually 6 or 12 layers). This allows the model to refine the word representations iteratively. After multiple layers, the word embeddings of "The cat is black." now contain rich information about the relationships between the words.

The decoder uses the output of the encoder to predict the next word in a sequence. During training, the model learns to predict the next word based on a given input. For example, if the input is "The cat is", the decoder will try to predict "black" as the next word. During inference (generation), the decoder generates the next word step by step. If we input "The cat is", the model might output "black". This predicted word is fed back into the decoder to predict the following word. The decoder uses masked self-attention to ensure that it only attends to the previous words in the sequence (and not future words), ensuring that predictions are made in an autoregressive manner.