The world of robotics is evolving rapidly with the advent of foundation models: large AI systems that enable robots to perform complex manipulation tasks with unprecedented flexibility. These models, which leverage techniques like transformers and imitation learning, allow robots to learn across diverse environments and tasks without task-specific programming.
In this blog post, we’ll break down key foundation models for robotic manipulation, including:
- Architectures of Foundation Models: How modern models like ACT, Octo, OpenVLA and Helix are designed to enable robots to perform generalist tasks.
- Action Representation: Different methods for representing actions, such as continuous space, discretization and diffusion-based generation.
- Finetuning Considerations: Key insights on how to adapt these models for specific tasks and environments to ensure high performance in real-world applications.
Whether you're an AI researcher, roboticist or just curious about the future of autonomous robots, this guide will provide a clear and engaging overview of the exciting innovations shaping the next generation of robotic manipulation.
ACT leverages transformer-based action chunking and end-to-end imitation learning to enable low-cost robotic arms to perform complex tasks with high success rates. Developed as part of the ALOHA project, ACT learns from real-world demonstrations collected via a custom teleoperation setup. The model generates action sequences in a chunked manner, improving stability and reducing compounding errors over time. With only 10 minutes of demonstrations, ACT enables robots to achieve 80-90% success rates on fine manipulation tasks.
ACT uses a dataset collected from real-world bimanual teleoperation experiments. The data consists of human demonstrations gathered by the authors rather than taken from pre-existing datasets. Each demonstration is a trajectory of image observations, joint positions and executed actions:
- 4 RGB images per timestep
- Joint positions of the two robot arms (7 + 7 = 14 DOF)
- Actions: absolute joint positions in chunks (e.g., the next 100 timesteps)
From the demonstration dataset, we sample:
- The RGB images from the four cameras at the current timestep
- Joint positions of the two 7-DOF robot arms (14-dimensional vector)
- A target action sequence over the next $k$ time steps
The encoder is a BERT-style transformer encoder that receives:
- A learned [CLS] token
- The current joint positions, projected to the embedding dimension
- The target action sequence, also linearly embedded
These inputs form a sequence of $k + 2$ tokens. The encoder's output at the [CLS] position is mapped to the mean and variance of the latent style variable $z$, which is sampled with the reparameterization trick during training.
The decoder — the actual policy — takes as input:
- Image features: Each image is processed with a ResNet18 to get a 15×20×512 feature map, flattened into a sequence of 300×512. For 4 cameras, this gives a total of 1200×512.
- 2D sinusoidal position embeddings are added to preserve spatial structure.
- Joint positions and the style variable $z$, both projected to the same embedding dimension.
These inputs are concatenated into a 1202×512 sequence and passed through a transformer encoder. A transformer decoder uses cross-attention to generate a sequence of $k \times 14$ outputs - the target joint positions for the next $k$ time steps.
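To make the tensor shapes concrete, here is a minimal sketch of how the decoder input could be assembled; the module names, layer counts and the use of zero-initialized queries are illustrative assumptions, not the original ACT implementation, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

embed_dim, k = 512, 100  # hidden size and action-chunk length

# Illustrative projections and layer counts - not the original implementation
joint_proj = nn.Linear(14, embed_dim)    # 14-D joint positions
z_proj = nn.Linear(32, embed_dim)        # style variable z
action_head = nn.Linear(embed_dim, 14)   # decoder output -> joint-position targets
transformer = nn.Transformer(d_model=embed_dim, num_encoder_layers=4,
                             num_decoder_layers=7, batch_first=True)

def decode_actions(image_feats, joints, z):
    # image_feats: (B, 1200, 512) = 4 cameras x (15*20) flattened ResNet18 features
    # joints: (B, 14), z: (B, 32)
    tokens = torch.cat([image_feats,
                        joint_proj(joints)[:, None],
                        z_proj(z)[:, None]], dim=1)            # (B, 1202, 512)
    queries = torch.zeros(image_feats.shape[0], k, embed_dim)  # fixed queries in practice
    hidden = transformer(src=tokens, tgt=queries)              # encoder + cross-attention decoder
    return action_head(hidden)                                 # (B, k, 14)
```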
At test time, the model uses only the CVAE Decoder as the policy. The encoder is discarded.
- The robot receives a new observation: RGB images + joint positions
- These are processed exactly as during training (ResNet18 → flattened features → transformer encoder)
- The style variable $z$ is fixed to a zero vector (i.e., the mean of the prior distribution)
- The transformer decoder outputs a deterministic $k \times 14$ tensor, corresponding to the next $k$ joint positions
This deterministic decoding provides stable, repeatable behavior, which is especially valuable for evaluation and deployment.
A central innovation of ACT is its use of action chunking - predicting sequences of joint positions over a fixed horizon (e.g., the next $k$ steps) instead of single-step actions. This chunked prediction strategy reduces the task's effective time horizon and significantly mitigates compounding errors during execution.
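For illustration, a simplified test-time rollout that executes each predicted chunk open-loop could look like the sketch below; `policy`, `get_observation` and `execute_joint_positions` are hypothetical placeholders for the trained CVAE decoder and the robot interface.

```python
import torch

def rollout(policy, max_steps=1000, k=100, z_dim=32):
    z = torch.zeros(1, z_dim)  # style variable fixed to the prior mean at test time
    t = 0
    while t < max_steps:
        images, joints = get_observation()        # 4 RGB images + 14-D joint positions
        action_chunk = policy(images, joints, z)  # (1, k, 14) joint-position targets
        for target in action_chunk[0]:            # execute the whole chunk open-loop
            execute_joint_positions(target)
            t += 1
```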
Octo is a large, transformer-based policy pretrained on 800k demonstrations from the Open X-Embodiment dataset. Designed for flexibility, it supports multiple robots, sensor setups, and task types - including language commands and goal images. Octo can be finetuned quickly on new environments and is fully open-source, making it a powerful foundation for scalable, general-purpose robotic learning.
Octo is trained on a massive dataset of 800,000 robot trajectories collected from the Open X-Embodiment dataset - the largest and most diverse robot manipulation dataset to date. This dataset brings together demonstrations from nine different robotic platforms, spanning a wide variety of manipulation tasks such as pick-and-place, tool use, button pressing and drawer opening or closing. The data is highly heterogeneous, featuring a mix of camera perspectives (e.g., wrist-mounted and third-person views), robots with different degrees of freedom, and task-conditioning signals in the form of either language instructions or goal images.
- RGB images from multiple viewpoints (wrist cam, third-person).
- Proprioceptive states (joint positions, velocities).
- Task conditioning:
- Text commands (e.g., "Pick up the red cup").
- Goal images (e.g., "Make the scene look like this").
- Delta position Cartesian actions in chunks.
The Octo architecture consists of three main components:
- Input tokenizers for processing observations and task specifications
- A transformer backbone that encodes the unified input sequence
- Readout heads that decode the embeddings into actionable commands
Octo supports multiple input modalities including language commands, goal images, and diverse robot observations. Each of these is converted into a unified token representation using modality-specific encoders:
- Language commands are tokenized and encoded using a pretrained T5-base transformer model, producing a sequence of language embeddings.
- Goal images and RGB observations (from wrist or third-person cameras) are passed through a shallow CNN, then divided into flattened patch sequences.
After encoding, each token is assigned a learned positional embedding. These are concatenated into a single token sequence that includes both task tokens (e.g., language or goal images) and observation tokens, forming the complete input to the transformer.
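As a rough sketch of this step (the encoder handles `t5_encoder` and `patch_cnn` and the positional-embedding dictionary are illustrative placeholders, not Octo's actual module names):

```python
import torch

def build_token_sequence(language_ids, goal_image, obs_images,
                         t5_encoder, patch_cnn, pos_embed):
    # Task tokens: T5-encoded instruction and patchified goal image
    lang_tokens = t5_encoder(language_ids)               # (B, L, D)
    goal_tokens = patch_cnn(goal_image)                  # (B, P, D) flattened patches
    # Observation tokens: patchified wrist / third-person camera frames
    obs_tokens = torch.cat([patch_cnn(img) for img in obs_images], dim=1)
    # Add learned positional embeddings per group, then concatenate everything
    return torch.cat([lang_tokens + pos_embed["language"],
                      goal_tokens + pos_embed["goal"],
                      obs_tokens + pos_embed["obs"]], dim=1)  # (B, N_tokens, D)
```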
The token sequence is processed by a transformer model with a block-wise attention mechanism. Observation tokens are allowed to attend causally - meaning only to past or current tokens - while also attending to task tokens. This structure ensures proper temporal consistency in policy outputs.
Importantly, modality-specific blocks can be masked, enabling Octo to seamlessly handle datasets with missing modalities (e.g., no language input) and making it highly modular for downstream finetuning.
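A minimal sketch of such a block-wise mask, assuming the token order [task | obs_t0 | obs_t1 | ...] (the real implementation differs in detail):

```python
import torch

def blockwise_attention_mask(n_task, n_obs_per_step, n_steps):
    """Boolean mask where True means 'may attend'."""
    n = n_task + n_obs_per_step * n_steps
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Every token may attend to the task block (language / goal-image tokens)
    mask[:, :n_task] = True
    for t in range(n_steps):
        rows = slice(n_task + t * n_obs_per_step, n_task + (t + 1) * n_obs_per_step)
        # Observation tokens attend to observations from steps <= t (block-causal)
        mask[rows, n_task:n_task + (t + 1) * n_obs_per_step] = True
    return mask
```

Masking a missing modality then simply amounts to disabling the corresponding columns. Readout tokens, introduced next, would be added as rows that attend to task and observation tokens while their own columns stay False, so no other token attends to them.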
To generate actions, readout tokens are inserted into the input sequence. These tokens attend to task and observation tokens but are not attended to in return. They act as passive readers, similar to the [CLS] token in BERT, summarizing the encoded information.
The output embeddings of the readout tokens are passed through a lightweight action head based on diffusion models, which predicts a chunk of future actions. This formulation allows Octo to model complex, multimodal action distributions and supports chunked action execution similar to ACT.
One of Octo’s key design advantages is its modular and adaptable architecture. During finetuning, new sensors, tasks, or robot morphologies can be integrated by simply attaching new lightweight encoders, positional embeddings, or output heads — all without modifying the pretrained transformer weights. This stands in contrast to prior architectures that often require reinitialization or full retraining when adapting to new settings.
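In a finetuning script, this modularity boils down to freezing the pretrained backbone and optimizing only the newly attached parts; a minimal sketch with hypothetical attribute names:

```python
import torch

# Freeze the pretrained transformer backbone
for p in policy.transformer.parameters():
    p.requires_grad = False

# Train only the newly attached, lightweight components
new_modules = [policy.new_obs_encoder, policy.new_action_head]
params = [p for m in new_modules for p in m.parameters()]
optimizer = torch.optim.AdamW(params, lr=3e-4)
```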
OpenVLA is a 7B-parameter open-source model for generalist robot manipulation, trained on 970k real-world demos from the Open X-Embodiment dataset. It combines a LLaMA 2 language model with visual features from DINOv2 and SigLIP, enabling rich vision-language grounding.
OpenVLA is trained on a curated subset of 970,000 robot demonstrations from the Open X-Embodiment dataset, which contains over 2 million trajectories from 70+ robotic platforms. To ensure consistency, only demonstrations with third-person camera views and single-arm end-effector control were included. For diversity, the team followed Octo’s data mixture strategy, prioritizing datasets with a wide range of tasks and scenes, while down-weighting redundant or narrow-scope data. This balance enables strong generalization across embodiments and environments.
- Observation image(s): One or more RGB frames from third-person cameras, processed by the visual encoder.
- Language instruction: A natural language command describing the desired task (e.g., "stack the blocks" or "put the apple in the bowl").
- Delta position Cartesian actions as discrete tokens
OpenVLA builds on a modular vision-language foundation, with three primary components:
- Visual Encoder:
- Dual-encoder setup: features from DINOv2 and SigLIP are extracted independently and concatenated.
- Enables strong spatial grounding, helpful for manipulation tasks involving complex scenes.
- Projector:
- A small 2-layer MLP that maps visual features into the language model's token embedding space.
- Ensures compatibility with the Llama 2 tokenizer and architecture.
- Language Model Backbone (Prismatic-7B):
- Based on Llama 2 (7B), pretrained on large-scale Internet text.
- Fine-tuned with a next-token prediction objective on mixed vision-language-action data.
- Predicts tokenized robot actions in an autoregressive fashion, conditioned on the task context.
This combination allows OpenVLA to act as a generalist visuomotor controller, understanding high-level language commands and grounding them into low-level action sequences.
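According to the paper, each of the (typically 7) action dimensions is discretized into 256 bins, and each bin is mapped onto one of the least-used tokens of the Llama tokenizer, so actions can be predicted as ordinary text tokens. A hedged sketch of the decoding side (function and bound names are illustrative):

```python
import torch

N_BINS = 256  # per-dimension bins, as described in the OpenVLA paper

def detokenize_actions(bin_ids, action_low, action_high):
    """Map predicted per-dimension bin indices back to continuous delta actions.

    bin_ids:          (D,) integer bin index for each action dimension
    action_low/high:  (D,) per-dimension bounds estimated from the training data
    """
    centers = (bin_ids.float() + 0.5) / N_BINS           # bin centers in [0, 1]
    return action_low + centers * (action_high - action_low)

# Hypothetical 7-D delta action (xyz translation, rotation, gripper)
ids = torch.tensor([128, 40, 200, 127, 127, 127, 255])
low, high = -0.05 * torch.ones(7), 0.05 * torch.ones(7)
print(detokenize_actions(ids, low, high))
```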
A key innovation of OpenVLA is its ability to ground natural language instructions in visual observations by leveraging a large pretrained language model (LLaMA 2) within a unified vision-language-action architecture. This enables OpenVLA to understand and execute complex task instructions - such as “place the blue mug on the top shelf next to the red bowl” - without requiring handcrafted reward functions or rigid scripting.
Helix (Figure AI) is a Vision-Language-Action (VLA) model capable of controlling the entire upper body of a humanoid robot from raw pixels and natural language. It introduces a novel dual-system design - System 1 for fast, reactive control and System 2 for semantic understanding - enabling real-time dexterous manipulation grounded in language.
Helix is trained on a high-quality, diverse dataset consisting of approximately 500 hours of teleoperated demonstrations, collected across multiple robots and human operators. These demonstrations cover a broad spectrum of upper-body behaviors, including precise finger movements, coordinated arm motions and full-body pose adjustments. To generate language-conditioned training pairs at scale, an auto-labeling vision-language model (VLM) is used to create hindsight instructions. This model analyzes segmented video clips from onboard cameras and answers the prompt: “What instruction would you have given the robot to get the action seen in this video?”
- Monocular RGB image from the robot’s onboard camera
- Robot state information (e.g., wrist pose, finger joint positions)
- Natural language command specifying the desired behavior
- Continuous 35-DoF action vector at 200Hz, including:
- Wrist pose targets
- Finger movements
- Head and torso orientation
Helix consists of two main components that operate at different frequencies: System 2 (S2) for high-level perception and planning, and System 1 (S1) for low-level real-time control.
S2 is a 7B-parameter vision-language model (VLM), pretrained on large-scale internet data. It processes:
- Monocular RGB images from the robot’s onboard camera
- Proprioceptive robot state (e.g., wrist pose, finger joint positions)
- A natural language command
These inputs are encoded into a shared embedding space and distilled into a single latent semantic vector, which summarizes the high-level task intent. This vector is passed to S1 to guide motor control.
S1 is an 80M-parameter cross-attention encoder-decoder transformer optimized for reactive control at 200 Hz. It uses:
- A multi-scale convolutional vision backbone pretrained in simulation
- The same image and state inputs as S2
- The latent vector from S2 as task-conditioning input
These inputs are combined and processed to produce continuous control outputs for:
- End-effector poses (wrist and arm)
- Finger flexion and abduction
- Head and torso orientation
- A scalar representing task progress (used for predicting completion)
Helix introduces a novel dual-system architecture inspired by "System 1 / System 2" reasoning. System 2 (S2) handles slow, semantic understanding using a large vision-language model, while System 1 (S1) performs fast, reactive control at 200 Hz. This separation allows Helix to combine internet-scale language grounding with high-frequency, whole upper-body humanoid control.
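Conceptually, the two systems can be pictured as two loops running at different rates, with S1 always consuming the most recent latent produced by S2. The sketch below is purely illustrative (Figure AI has not released an implementation); the S2 rate, function names and robot interface are assumptions, while the 200 Hz S1 rate and 35-DoF output follow the description above.

```python
import threading, time

latest_latent = None  # task-intent vector written by S2, read by S1

def s2_loop(s2_model, get_observation, command, hz=8):
    """Slow loop: semantic reasoning with the 7B VLM (~7-9 Hz per Figure's description)."""
    global latest_latent
    while True:
        image, state = get_observation()
        latest_latent = s2_model(image, state, command)
        time.sleep(1.0 / hz)

def s1_loop(s1_model, get_observation, send_action, hz=200):
    """Fast loop: 200 Hz visuomotor control conditioned on the latest S2 latent."""
    while True:
        image, state = get_observation()
        if latest_latent is not None:
            send_action(s1_model(image, state, latest_latent))  # 35-DoF action vector
        time.sleep(1.0 / hz)

# Usage (with trained s2/s1 models and a robot interface):
# threading.Thread(target=s2_loop, args=(s2, get_obs, "pick up the bag"), daemon=True).start()
# s1_loop(s1, get_obs, send_action)
```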
One of the most crucial components in any robot policy is how actions are represented and generated. Different approaches make different trade-offs in terms of generalization, expressivity, and training stability. This chapter outlines and compares three prominent action representation strategies: MSE regression, discretization, and diffusion-based generation.
The most straightforward method is to directly regress the next action (e.g., joint positions or torques) from the observation $o$ using a Mean Squared Error (MSE) loss: $\mathcal{L}_{\text{MSE}} = \lVert \hat{a}_\theta(o) - a \rVert^2$, where $\hat{a}_\theta(o)$ is the predicted action and $a$ the demonstrated one.
This method assumes a unimodal distribution, producing the "average" best action. It works well when demonstrations are consistent, but struggles in multimodal settings, where multiple distinct strategies exist (e.g., grasping an object from different angles).
Instead of predicting continuous values, one can discretize each action dimension:
- Each action dimension $a_i$ is split into $K$ bins.
- The model outputs a probability distribution over the bins.
Training is done using a cross-entropy loss between the predicted bin distribution and the bin containing the demonstrated action: $\mathcal{L}_{\text{CE}} = -\sum_i \log p_\theta\big(\mathrm{bin}(a_i) \mid o\big)$.
This approach enables multi-modal prediction by selecting among multiple possible bins. However, it introduces quantization errors and can lead to coarse, jittery behavior in fine-grained tasks.
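A minimal sketch of uniform per-dimension binning and decoding (bin count and action ranges are placeholders), which also makes the source of the quantization error visible:

```python
import torch

K = 256  # bins per action dimension

def discretize(action, low, high):
    # Map each continuous dimension to a bin index in [0, K-1]
    norm = (action - low) / (high - low)
    return (norm * K).long().clamp(0, K - 1)  # cross-entropy targets during training

def undiscretize(bins, low, high):
    # Decode to the bin center; anything finer than one bin width is lost
    return low + (bins.float() + 0.5) / K * (high - low)

a = torch.tensor([0.1234, -0.0567])
low, high = -torch.ones(2), torch.ones(2)
print(undiscretize(discretize(a, low, high), low, high))  # close to a, but quantized
```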
Diffusion models provide a powerful way to represent multi-modal, continuous action distributions, especially when predicting chunks of actions rather than single steps. These models consist of two phases: a forward process (adding noise) and a reverse process (iterative denoising).
In the forward process, we gradually add Gaussian noise to a ground-truth action chunk $a_0$, generating a noisy version $x_k$ at timestep $k$: $x_k = \sqrt{\alpha_k}\, a_0 + \sqrt{1 - \alpha_k}\, \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$.
The model is trained to predict the noise that was added: $\mathcal{L} = \lVert \epsilon - \epsilon_\theta(x_k, e, k) \rVert^2$
Where:
- $x_k$ is the noisy action chunk at timestep $k$
- $\epsilon_\theta$ is the denoising network (diffusion head)
- $e$ is the context embedding from the transformer
Training Pseudocode:
import torch
import torch.nn.functional as F

def diffusion_training_step(a0, context_embedding, denoise_net, noise_schedule, T=50):
    # Sample a random diffusion timestep
    k = torch.randint(0, T, (1,)).item()
    # Gaussian noise with the same shape as the ground-truth action chunk
    eps = torch.randn_like(a0)
    alpha_k = noise_schedule.alpha(k)  # scalar tensor from the noise schedule
    # Forward diffusion (noisy input)
    x_k = torch.sqrt(alpha_k) * a0 + torch.sqrt(1 - alpha_k) * eps
    # Predict the noise from the noisy chunk, context embedding and timestep
    eps_pred = denoise_net(x_k, context_embedding, k)
    # Loss: predict the added noise
    loss = F.mse_loss(eps_pred, eps)
    return loss
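In context, `context_embedding` would be the transformer's readout embedding (e.g., Octo's action-readout output) and `denoise_net` the lightweight diffusion head; a hypothetical training call, with all names below standing in for the policy's own modules and data, might look like:

```python
readout_embedding = policy_transformer(token_sequence)
loss = diffusion_training_step(action_chunk, readout_embedding,
                               denoise_net, noise_schedule)
loss.backward()
optimizer.step()
```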
Once a diffusion model is trained, generating actions is done via a reverse denoising process, starting from random noise and progressively refining it into a meaningful action or action chunk.
Given a noisy action sample $x_k$, a single reverse step computes: $x_{k-1} = \alpha_k \big( x_k - \gamma_k\, \epsilon_\theta(x_k, e, k) \big) + \sigma_k\, \mathcal{N}(0, I)$
Where:
- $x_k$ is the noisy action chunk at step $k$
- $\epsilon_\theta(x_k, e, k)$ is the predicted noise from the denoising network
- $e$ is the transformer-derived context embedding
- $\alpha_k, \gamma_k, \sigma_k$ are parameters from a cosine or linear noise schedule
- The added Gaussian noise $\mathcal{N}(0, I)$ ensures sample diversity
This step is applied iteratively, starting from a pure Gaussian noise sample and refining it over $T$ denoising steps into the final action chunk:
import torch

def generate_action_chunk(denoise_net, noise_schedule, context_embedding,
                          chunk_shape, T=50):
    # Start from pure Gaussian noise with the shape of an action chunk
    x = torch.randn(chunk_shape)
    # Iteratively denoise from step T-1 down to 0
    for k in reversed(range(T)):
        eps_pred = denoise_net(x, context_embedding, k)
        alpha_k, gamma_k, sigma_k = noise_schedule.get(k)
        # Reverse denoising step
        x = alpha_k * (x - gamma_k * eps_pred)
        # Add noise on all but the final step to preserve sample diversity
        if k > 0:
            x += sigma_k * torch.randn_like(x)
    return x  # Final predicted action chunk
Fine-tuning is a critical process for adapting pre-trained VLA models to specific tasks and robot setups. While these models are powerful and generalizable out of the box, fine-tuning allows for improved performance in diverse real-world scenarios. This chapter explores general considerations and the different strategies available for fine-tuning VLA models to achieve task-specific optimization.
Full fine-tuning involves updating all model parameters, including the vision encoder, LLM and transformer layers. This approach provides the most flexibility and potential for performance improvement but requires substantial computational resources. It is suitable for situations where the robot setup and task domain are significantly different from the pre-trained data.
- Pros: High performance with full adaptation.
- Cons: High computational cost and memory usage.
In this strategy, only the last layer of the model's transformer backbone is fine-tuned. This method significantly reduces the number of trainable parameters and the computational requirements, but it may limit the model's ability to adapt to new tasks that demand deeper adjustments across the network.
- Pros: Low computational cost and memory usage.
- Cons: Likely to yield poorer performance on complex tasks.
Sandwich fine-tuning unfreezes the vision encoder and last layer while keeping the rest of the model frozen. This technique is a compromise between full fine-tuning and parameter-efficient approaches, providing better adaptation to new visual features while saving on GPU memory by not fine-tuning the entire model backbone.
- Pros: Balanced approach with good performance and reduced memory usage.
- Cons: Still requires significant resources, though less than full fine-tuning.
LoRA is a low-rank adaptation technique that modifies only a small fraction of the model parameters while achieving performance close to that of full fine-tuning. By applying LoRA to all linear layers of the model, we can drastically reduce the number of trainable parameters (often to just 1.4% of the full model) and achieve significant computational savings without sacrificing performance.
- Pros: Best performance-compute trade-off, requiring only a fraction of the model parameters to be updated.
- Cons: May not fully capture all potential domain-specific nuances compared to full fine-tuning.
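As a hedged sketch of what this looks like with the Hugging Face peft library: the target module names below are typical for Llama-style backbones (the text above applies LoRA to all linear layers), and `base_vla_model` is a placeholder for the pretrained policy.

```python
import torch
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,                 # low-rank dimension of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(base_vla_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-4
)
```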
The selection of a fine-tuning strategy should depend on the available computational resources, the complexity of the task, and the extent of the domain shift between the pre-trained model and the target environment. For most use cases, LoRA presents a highly effective solution, offering an excellent trade-off between computational efficiency and task performance.
- T. Zhao, V. Kumar: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
- D. Ghosh, H. Walke: Octo: An Open-Source Generalist Robot Policy
- A. Brohan, N. Brown: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- M. Kim, K. Pertsch: OpenVLA: An Open-Source Vision-Language-Action Model
- K. Black, N. Brown: π0: A Vision-Language-Action Flow Model for General Robot Control
- Figure AI: Helix: A Vision-Language-Action Model for Generalist Humanoid Control
- C. Chi, Z. Xu, S. Feng: Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
- K. Pertsch, K. Stachowicz: FAST: Efficient Action Tokenization for Vision-Language-Action Models
- G. Berseth: Coding Generalist Robot Policies