About Transformers
What Transformers are for
A Transformer is a powerful neural network architecture in AI that processes data (such as text or images) in parallel rather than sequentially, using several mechanisms to capture the context of and relationships between data points.
Transformer Architectures
The architecture of a Transformer is characterized by two essential pieces: the Encoder and the Decoder. Each can work independently of the other, but they can also work in conjunction.
An Encoder-based architecture is mainly used for extracting context-aware representations from input data. It encodes data into a dense but rich representation of its meaning. Typical use cases for Encoder-based Transformers are classification and sentiment analysis.
A Decoder, in contrast, is primarily designed to produce output based on instructions given to it. It is the basis of modern LLMs such as GPT. Decoders are known for their ability to predict the next token based on the data they were trained on.
Encoder-Decoder architectures combine the two architectural approaches and are specifically designed for tasks such as translation, summarization, and training multimodal models.
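The next-token behavior of a Decoder comes from causal masking: each position may only attend to itself and earlier positions, never to the future. A minimal sketch of such a mask in plain Python (illustrative only, not taken from any model's source code):

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is True
    when position i is allowed to attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Each row is one token; "x" marks the positions it may attend to.
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
```

An Encoder uses no such mask, so every token can attend to every other token in both directions.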
Embeddings
Embeddings are high-dimensional vector representations of the input tokens. These vectors span a multidimensional space in which the semantic features of the tokens are represented. Embeddings allow models to capture the meaning of a word/token based on its context: a token can have multiple meanings, and which one applies emerges from its relation to the other tokens in the input.
GPT-3, for example, uses embedding vectors with 12,288 dimensions per token.
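The intuition that nearby vectors mean similar things can be shown with cosine similarity. The 3-dimensional vectors below are made up purely for illustration (real models use thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 for
    semantically similar embeddings, lower for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-picked so related words point
# in similar directions.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```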
Positional Encoding
Positional Encoding enhances the vectors with information about where each token appears in the input sequence. The position of a token is relevant for its meaning. Consider a sentence with multiple adjectives and nouns: each adjective describes a specific noun. Without positional encoding, the model wouldn't know which adjective describes which noun in the sentence.
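The OPT model used later in this article learns its positional embeddings during training, but the original Transformer paper ("Attention Is All You Need") used a fixed sinusoidal scheme, which is easy to sketch:

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed positional encoding from the original Transformer paper:
    even indices use sine, odd indices use cosine, each at a
    different frequency. Returns one vector per position."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

# Every position gets a distinct vector; the model adds it to the
# token embedding so the same word at different positions differs.
print(sinusoidal_position(0, 8))
print(sinusoidal_position(1, 8))
```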
Attention Mechanism
The Attention Mechanism gives models a sense of meaning within a sequence by calculating a weighted sum over the embeddings of all words in a phrase.
The Attention Mechanism introduces three types of vectors:
- Query Vector: This is the word or token for which the attention weights are calculated. The Query vector specifies which sections of the input sequence should be prioritized. When you multiply word embeddings by the Query vector, you ask, “What should I pay attention to?”
- Key Vector: The set of words or tokens in the input sequence compared to the Query. The Key vector aids in identifying the important or relevant information in the input sequence. When you multiply word embeddings by the Key vector, you ask, “What is important to consider?”
- Value Vector: It stores the information or features associated with each word or token in the input sequence. The Value vector contains the actual data that will be weighted and mixed in accordance with the attention weights calculated between the Query and Key vectors. The Value vector answers the query, “What information do we have?”
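The interplay of Query, Key, and Value described above can be sketched as scaled dot-product attention. This toy version works on plain Python lists with made-up numbers (real models do the same with large matrices and learned projections):

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector:
    score each key against the query, softmax the scores, then
    return the weighted sum of the value vectors."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Made-up 2-dimensional vectors for three tokens.
keys   = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query  = [1.0, 0.0]   # most similar to the first and third keys

print(attention(query, keys, values))
```

Because the query lines up with the first and third keys, their values dominate the weighted sum.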
A nice explanation of the Attention Mechanism can be found here: https://rahulrajpvr7d.medium.com/what-are-the-query-key-and-value-vectors-5656b8ca5fa0
Self-Attention Mechanism
The Self-Attention Mechanism enforces the relevance of the Attention Mechanism. It allows the model to dynamically weigh the importance of each word in a sentence relative to every other word. This is achieved using query, key, and value vectors derived from the input embeddings. By comparing these vectors, the model identifies and highlights the most significant parts of the text, ensuring that important words receive more attention.
More on the Self-Attention Mechanism and how it works in conjunction with Query, Key, and Value vectors can be found here: https://www.sciencedirect.com/topics/computer-science/self-attention-mechanism
See everything in action
We will now write some code to show how all of this works in action. Instead of only calling model.generate() or model(...), it manually walks through the early stages of an OPT Transformer model.
Components we are using
facebook/opt-1.3b
facebook/opt-1.3b is an open-source, 1.3-billion-parameter Large Language Model (LLM) released by Meta AI in May 2022 as part of the Open Pre-trained Transformer (OPT) series. It is designed as a decoder-only Transformer, sharing a similar architecture with GPT-3, with the goal of democratizing access to LLMs for research and development.
bitsandbytes
The bitsandbytes library provides quantization tools for LLMs through a lightweight Python wrapper around hardware accelerator functions. It enables working with large models using limited computational resources by reducing their memory footprint.
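The core idea behind 8-bit loading can be sketched with simple absmax quantization. This is a deliberate simplification (bitsandbytes actually uses a more sophisticated mixed-precision scheme, LLM.int8()), but it shows where the memory savings come from:

```python
def quantize_absmax(weights):
    """Map float weights into the int8 range [-127, 127] using the
    absolute maximum as the scale factor. Returns the integer
    values plus the scale needed to dequantize them later."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in q_weights]

weights = [0.5, -1.2, 0.03, 2.4, -0.7]
q, scale = quantize_absmax(weights)
print(q)                      # small integers, 1 byte each instead of 4
print(dequantize(q, scale))   # close to the originals, small rounding error
```

Each weight now occupies one byte instead of four, at the cost of a small rounding error, which is why an 8-bit model of this size fits on a modest GPU.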
The flow is:
- Load the OPT-1.3B model in 8-bit format to save memory.
- Load the matching tokenizer.
- Convert a sentence into token IDs.
- Turn those token IDs into learned token embeddings.
- Create positional embeddings so the model knows token order.
- Add token embeddings and positional embeddings together.
- Send the result into the first decoder layer’s self-attention module.
- Print intermediate tensors so you can inspect what happens internally.
Here is the complete code:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# -----------------------------------------------------------------------------
# STEP 1: Define how the model should be loaded
# -----------------------------------------------------------------------------
# BitsAndBytesConfig is used to configure model quantization.
# Quantization reduces memory usage by storing model weights in a lower-precision
# numerical format than the default full precision.
#
# In this case:
# - load_in_8bit=True
# This tells Hugging Face + bitsandbytes to load the model weights in 8-bit
# precision instead of the usual 16-bit or 32-bit floating point precision.
# The main benefit is lower GPU memory consumption.
#
# - llm_int8_enable_fp32_cpu_offload=False
# This option controls whether some weights should be offloaded to the CPU
# in 32-bit floating point when GPU memory is insufficient.
# Since it is False here, the code is asking the framework not to use that
# fallback behavior.
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_enable_fp32_cpu_offload=False
)
# -----------------------------------------------------------------------------
# STEP 2: Load the pretrained causal language model
# -----------------------------------------------------------------------------
# AutoModelForCausalLM automatically selects the correct model class for the
# checkpoint "facebook/opt-1.3b".
#
# "facebook/opt-1.3b" is a pretrained OPT language model with about 1.3 billion
# parameters. It is a causal language model, meaning it predicts the next token
# based on previous tokens.
#
# quantization_config=quantization_config
# Applies the 8-bit loading setup defined above.
#
# device_map="auto"
# Lets Hugging Face automatically decide where to place the model
# (for example, on GPU if available, otherwise CPU, or across multiple devices).
OPT = AutoModelForCausalLM.from_pretrained(
"facebook/opt-1.3b",
quantization_config=quantization_config,
device_map="auto"
)
# -----------------------------------------------------------------------------
# STEP 3: Load the tokenizer for the same model
# -----------------------------------------------------------------------------
# The tokenizer converts raw text into token IDs that the model can understand.
# It must match the model checkpoint, so we use the same model name.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
# -----------------------------------------------------------------------------
# STEP 4: Define the input sentence
# -----------------------------------------------------------------------------
# This is the raw text that will be transformed into tokens and then processed
# through parts of the model.
inp = "The dirty old town is selected is most ugliest city of the region"
# -----------------------------------------------------------------------------
# STEP 5: Tokenize the input text
# -----------------------------------------------------------------------------
# tokenizer(..., return_tensors="pt") converts the input string into PyTorch
# tensors. The result is typically a dictionary containing:
#
# - input_ids:
# Integer token IDs corresponding to the tokenized text.
#
# - attention_mask:
# A mask indicating which positions are real tokens (1) and which are padding (0).
#
# Since the input is a single sentence, the tensor shape will usually be:
# [batch_size, sequence_length]
# where batch_size = 1 here.
inp_tokenized = tokenizer(inp, return_tensors="pt")
# Print the size of the token ID tensor.
# This tells us how many tokens the input sentence became after tokenization.
print(inp_tokenized['input_ids'].size())
# Print the full tokenized output dictionary.
# This helps inspect the actual token IDs and attention mask.
print(inp_tokenized)
# -----------------------------------------------------------------------------
# STEP 6: Print the internal base model structure
# -----------------------------------------------------------------------------
# OPT is a causal LM wrapper. Internally, OPT.model refers to the base transformer
# model, which contains the decoder stack and embeddings.
print(OPT.model)
# -----------------------------------------------------------------------------
# STEP 7: Convert token IDs into token embeddings
# -----------------------------------------------------------------------------
# Neural language models do not work directly on raw integer token IDs.
# Instead, each token ID is mapped to a dense vector representation called
# an embedding.
#
# OPT.model.decoder.embed_tokens is the token embedding layer.
#
# inp_tokenized['input_ids'] is moved to the same device as the model using:
# .to(OPT.device)
#
# The output shape is typically:
# [batch_size, sequence_length, hidden_size]
#
# Each token is now represented by a learned vector of length hidden_size.
embedded_input = OPT.model.decoder.embed_tokens(
inp_tokenized['input_ids'].to(OPT.device)
)
# Print the token embedding layer object itself.
print("Layer:\t", OPT.model.decoder.embed_tokens)
# Print the size of the embedded input tensor.
# This shows how many tokens were embedded and the dimensionality of each embedding.
print("Size:\t", embedded_input.size())
# Print the actual embedding tensor values.
# This is usually a large tensor containing floating-point numbers.
print("Output:\t", embedded_input)
# -----------------------------------------------------------------------------
# STEP 8: Compute positional embeddings
# -----------------------------------------------------------------------------
# Transformers need positional information because self-attention alone does not
# inherently know the order of tokens in a sequence.
#
# OPT.model.decoder.embed_positions provides learned positional embeddings.
#
# Here the code passes the attention_mask into embed_positions. In OPT,
# positional embedding logic uses the mask to infer positions of valid tokens.
#
# The output is another tensor of shape:
# [batch_size, sequence_length, hidden_size]
#
# This tensor tells the model where each token is located in the sequence.
embed_pos_input = OPT.model.decoder.embed_positions(
inp_tokenized['attention_mask'].to(OPT.device)
)
# Print the positional embedding layer object.
print("Layer:\t", OPT.model.decoder.embed_positions)
# Print the shape of the positional embeddings.
print("Size:\t", embed_pos_input.size())
# Print the actual positional embedding values.
print("Output:\t", embed_pos_input)
# -----------------------------------------------------------------------------
# STEP 9: Combine token embeddings and positional embeddings
# -----------------------------------------------------------------------------
# In transformer models, the input representation for each token is usually:
#
# token_embedding + positional_embedding
#
# This gives the model both:
# - what the token is
# - where the token appears in the sequence
#
# The resulting tensor is the actual hidden-state input to the first decoder layer.
embed_position_input = embedded_input + embed_pos_input
# -----------------------------------------------------------------------------
# STEP 10: Pass the combined embeddings into the first self-attention block
# -----------------------------------------------------------------------------
# OPT.model.decoder.layers[0] is the first transformer decoder block.
# .self_attn is the self-attention module inside that block.
#
# Self-attention allows each token to attend to other tokens in the sequence
# and build context-aware representations.
#
# The call returns multiple values. In this code:
# - hidden_states captures the main output tensor from self-attention
# - the two underscores (_) ignore the other returned values
#
# Typically these other values may include attention weights or cached key/value
# tensors, depending on model configuration.
hidden_states, _, _ = OPT.model.decoder.layers[0].self_attn(embed_position_input)
# Print the self-attention module being used.
print("Layer:\t", OPT.model.decoder.layers[0].self_attn)
# Print the shape of the self-attention output.
# This should usually still be:
# [batch_size, sequence_length, hidden_size]
print("Size:\t", hidden_states.size())
# Print the actual output tensor from the first self-attention layer.
print("Output:\t", hidden_states)
Let’s investigate the output
STEP 5: Tokenize the input text
# Print the size of the token ID tensor.
# This tells us how many tokens the input sentence became after tokenization.
# In the given example, all indices of the attention mask vector are set to 1,
# indicating that every token will be processed normally. However, by setting an
# index in the attention mask vector to 0, you can instruct the model to
# overlook specific tokens from the input. Also, notice how the textual input is
# transformed into token IDs using the model’s pre-trained dictionary.
torch.Size([1, 16])
{'input_ids': tensor([[ 2, 133, 11216, 793, 1139, 16, 3919, 16, 144, 1717, 571, 27911, 343, 9, 5, 976]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
STEP 6: Print the internal base model structure
# OPT is a causal LM wrapper. Internally, OPT.model refers to the base transformer
# model, which contains the decoder stack and embeddings.
# This helps us understand the architecture of the model, which in this case is a Decoder-only model
STEP 7: Convert token IDs into token embeddings
The embedding layer is accessed via the decoder object’s .embed_tokens attribute, which receives our tokenized inputs. As you can see, the embedding layer converts a list of IDs of size [1, 16] into a tensor of size [1, 16, 2048]. Here, 2048 is the embedding size of the OPT model. This representation is then passed through the decoder layers.
# Print the token embedding layer object itself.
Layer: Embedding(50272, 2048, padding_idx=1)
# Print the size of the embedded input tensor.
# This shows how many tokens were embedded and the dimensionality of each embedding.
Size: torch.Size([1, 16, 2048])
# Print the actual embedding tensor values.
# This is usually a large tensor containing floating-point numbers.
Output: tensor([[[-4.0680e-02, 5.1910e-02, 5.7434e-02, ..., -2.6291e-02,
-3.5522e-02, -2.6001e-02],
[-3.7140e-02, 2.2034e-02, -9.5673e-03, ..., 2.6489e-02,
-1.6617e-02, -2.9640e-03],
[ 1.2985e-02, -8.7738e-03, -7.7784e-05, ..., -3.5706e-02,
-1.1139e-02, -2.1820e-02],
...,
[ 2.0767e-02, 1.1774e-01, -3.2177e-03, ..., 3.8391e-02,
3.6469e-02, 2.7695e-02],
[ 6.5279e-04, 2.6749e-02, 2.5726e-02, ..., 6.2164e-02,
4.2145e-02, 2.7878e-02],
[ 4.5715e-02, -8.2550e-03, -3.8330e-02, ..., -5.0598e-02,
-5.3955e-02, 6.6589e-02]]], dtype=torch.float16,
grad_fn=<EmbeddingBackward0>)
STEP 8: Compute positional embeddings
Transformers need positional information because self-attention alone does not inherently know the order of tokens in a sequence. OPT.model.decoder.embed_positions provides learned positional embeddings.
Here the code passes the attention_mask into embed_positions. In OPT, positional embedding logic uses the mask to infer positions of valid tokens.
The output is another tensor of shape:
[batch_size, sequence_length, hidden_size]
This tensor tells the model where each token is located in the sequence.
Layer: OPTLearnedPositionalEmbedding(2050, 2048)
Size: torch.Size([1, 16, 2048])
Output: tensor([[[-8.1406e-03, -2.6221e-01, 6.0768e-03, ..., 1.7273e-02,
-5.0621e-03, -1.6220e-02],
[-8.0585e-05, 2.5000e-01, -1.6632e-02, ..., -1.5419e-02,
-1.7838e-02, 2.4948e-02],
[-9.9411e-03, -1.4978e-01, 1.7557e-03, ..., 3.7117e-03,
-1.6434e-02, -9.9087e-04],
...,
[-4.2458e-03, -3.1555e-02, 8.8730e-03, ..., -9.0637e-03,
4.7684e-03, 9.3603e-04],
[ 1.4668e-03, -5.1575e-02, 7.4482e-04, ..., 6.3362e-03,
-7.6065e-03, 1.2688e-02],
[-7.0839e-03, -9.7168e-02, -7.8659e-03, ..., -8.5220e-03,
-1.6375e-03, 1.0361e-02]]], dtype=torch.float16,
grad_fn=<EmbeddingBackward0>)
STEP 9: Combine token embeddings and positional embeddings
In transformer models, the input representation for each token is usually:
token_embedding + positional_embedding
This gives the model both:
- what the token is
- where the token appears in the sequence
The resulting tensor is the actual hidden-state input to the first decoder layer.
embed_position_input = embedded_input + embed_pos_input
STEP 10: Pass the combined embeddings into the first self-attention block
The self-attention module consists of the previously introduced projection layers: query (q_proj), key (k_proj), and value (v_proj), along with a final output projection (out_proj). Its input is formed by combining the embedded token representations with their positional encoding vectors. In practical implementations, an attention mask is also provided so the model can ignore or exclude certain parts of the input sequence. For simplicity, this masking mechanism is omitted in the sample code.
Beyond the attention mechanism, the architecture includes several additional components that enhance the model’s expressive power: nonlinear activation functions, feedforward layers, and normalization techniques.
Nonlinearity, commonly implemented with the ReLU (Rectified Linear Unit) activation function, allows the network to learn complex relationships in the data. By introducing nonlinear transformations, the model can represent patterns that cannot be captured using only linear operations. This property is essential because stacking purely linear layers would be equivalent to a single linear transformation, limiting the model’s capacity.
The architecture also incorporates feedforward networks, which are composed of fully connected layers. These layers transform the embeddings into more abstract representations. Typically, a feedforward block contains two linear transformations separated by a ReLU activation, enabling the network to model more complex functions. Through this process, the model moves from capturing low-level word information toward higher-level semantic meaning, helping it understand both individual words and broader concepts.
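Such a feedforward block, two linear maps with a ReLU in between, can be sketched for a single token vector. The weights below are made up for illustration; real models learn them during training (and OPT-1.3B uses 2048 → 8192 → 2048 rather than the toy 2 → 4 → 2 sizes here):

```python
def relu(xs):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, x) for x in xs]

def linear(x, weights, bias):
    """Fully connected layer: one weight row per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def feedforward(x, w1, b1, w2, b2):
    """Two linear transformations separated by a ReLU, as in a
    standard Transformer feedforward block."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Toy sizes and hand-made weights, purely illustrative.
x  = [1.0, -1.0]
w1 = [[0.5, 0.1], [-0.2, 0.3], [0.7, -0.4], [0.0, 0.6]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
b2 = [0.0, 0.0]

print(feedforward(x, w1, b1, w2, b2))
```

Without the ReLU in the middle, the two linear layers would collapse into a single linear map, which is exactly the limitation described above.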
Finally, normalization methods, particularly layer normalization, are applied to stabilize training. As multiple layers are stacked together, normalization ensures that the inputs to each layer maintain consistent mean and variance. This stabilization improves both training efficiency and model performance.
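Layer normalization itself is a short computation: subtract the mean and divide by the standard deviation across the feature dimension. The learned scale and shift parameters that real implementations add are set to 1 and 0 here for simplicity:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to roughly zero mean and unit
    variance; eps guards against division by zero."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

normed = layer_norm([2.0, 4.0, 6.0, 8.0])
print(normed)
# The result has (approximately) zero mean and unit variance,
# regardless of the scale of the input.
```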
Together, self-attention, nonlinear activations, feedforward layers, and normalization allow the model to effectively learn and represent complex relationships within the data.
OPT.model.decoder.layers[0] is the first transformer decoder block.
.self_attn is the self-attention module inside that block.
Self-attention allows each token to attend to other tokens in the sequence
and build context-aware representations.
The call returns multiple values. In this code:
- hidden_states captures the main output tensor from self-attention
- the two underscores (_) ignore the other returned values
Typically these other values may include attention weights or cached key/value
tensors, depending on model configuration.
Layer: OPTAttention(
(k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
(v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
(q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
(out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
)
Size: torch.Size([1, 16, 2048])
Output: tensor([[[ -5168., -18032., -59648., ..., -inf, -inf, inf],
[ 25904., 4476., -inf, ..., -47136., -15416., 59520.],
[ 24992., -22320., -38720., ..., -57248., -28160., 44864.],
...,
[ -inf, -46080., -26672., ..., 64992., 11928., -21088.],
[-12768., -44992., -inf, ..., -inf, -25872., -1644.],
[ -9784., 2084., 4416., ..., -inf, -inf, -inf]]],
dtype=torch.float16, grad_fn=<MatMul8bitLtBackward>)