Summarizing FAST: Efficient Robot Action Tokenization (Physical Intelligence)

March 1, 2025

Introduction and Background

FAST (Frequency-space Action Sequence Tokenization) is a method addressing the challenge of discretizing continuous robot actions for use in large sequence models. Transformer-based vision-language-action (VLA) policies have shown promise in learning complex robotic behaviors, but they require actions to be represented as sequences of discrete tokens. Prior works typically used a naïve tokenization: discretizing each action dimension at each time step into fixed bins (often 256 levels). This simple per-dimension, per-timestep binning produces very long token sequences for high-frequency or dexterous tasks and leads to highly correlated tokens, degrading the training signal for autoregressive models.
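For concreteness, the naive scheme can be written in a few lines. This is an illustrative sketch: the 256-bin count comes from the text above, while the uniform [-1, 1] range and the 50 Hz, 7-dimensional chunk shape are assumptions chosen for the example:

```python
import numpy as np

def naive_tokenize(actions, n_bins=256):
    """Per-dimension, per-timestep binning of actions already scaled to [-1, 1]."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    tokens = np.clip(np.digitize(actions, edges) - 1, 0, n_bins - 1)
    return tokens.ravel()                 # one token per dimension per timestep

chunk = np.random.uniform(-1, 1, size=(50, 7))   # 1 s at 50 Hz, 7-DoF arm
print(naive_tokenize(chunk).shape)               # (350,) tokens for one second
```

Note how the token count scales linearly with control frequency; that linear blow-up is exactly the redundancy FAST targets.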

As a result, previous VLA policies struggled with high-frequency control tasks (e.g. OpenVLA failed to fit the high-frequency DROID dataset). In other domains like NLP and speech, effective tokenization/compression (e.g. byte-pair encoding for text, spectrograms for audio) greatly improves modeling. The FAST paper builds on this insight and proposes a compression-based action tokenization scheme to better encode robot action sequences. The authors introduce FAST, which uses a discrete cosine transform (DCT) based pipeline, and FAST+, a universal pre-trained tokenizer for broad robotic use. Their aim is to enable training generalizable, high-frequency robotic policies with transformers, where previous discretization methods failed.

Methodology and Approach

FAST Tokenization Pipeline: FAST compresses continuous action trajectories into discrete tokens via frequency-domain transformation and encoding. The key steps, sketched in code after the list, are:

  1. Quantile Normalization: Scale each action dimension so that its values (in the training data) fall in a standard range (e.g. 1st–99th percentile mapped to [-1,1]). This makes the tokenization consistent across different robots and robust to outliers.
  2. Discrete Cosine Transform (DCT): Apply DCT to each action dimension's time-series, converting the sequence into a set of frequency coefficients. Low-frequency coefficients capture the coarse motion, while high-frequency coefficients capture fine, rapid changes.
  3. Coefficient Quantization: Scale and round the DCT coefficients to zero-out small-magnitude values, effectively dropping insignificant high-frequency components. This yields a sparse matrix of mostly zeros (representing compressed action signal).
  4. Flattening (Low-Frequency First): Flatten the sparse DCT coefficient matrix into a 1D sequence of integers, interleaving coefficients such that all low-frequency components (across all action dimensions) come first. This ordering means the most informative tokens (overall trajectory shape) appear earliest in the sequence, aiding the autoregressive model's predictions.
  5. Byte-Pair Encoding (BPE) Compression: Feed the flattened sequence into a BPE compressor to merge frequent patterns and runs of zeros into single tokens. BPE produces a compact sequence of discrete tokens and a fixed-size vocabulary that easily integrates into transformer models.
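A minimal NumPy/SciPy sketch of steps 1–4 is below. The function and hyperparameter names are illustrative (not the authors' code), and the BPE stage of step 5 is only stubbed out:

```python
import numpy as np
from scipy.fft import dct

def fast_encode(actions, q_low, q_high, scale=10.0):
    """actions: (T, D) chunk of continuous actions, e.g. T=50 steps, D=7 dims.
    q_low / q_high: per-dimension 1st / 99th training-set percentiles."""
    # 1. Quantile normalization: map [q_low, q_high] -> [-1, 1], robust to outliers.
    norm = 2.0 * (actions - q_low) / (q_high - q_low) - 1.0
    # 2. DCT along the time axis of each action dimension (orthonormal DCT-II).
    coeffs = dct(norm, axis=0, norm="ortho")             # (T, D) frequency matrix
    # 3. Scale and round; small high-frequency coefficients collapse to zero.
    quantized = np.round(coeffs * scale).astype(int)     # sparse integer matrix
    # 4. Flatten low-frequency first: row k holds frequency k for every dimension,
    #    so row-major raveling puts the coarse-motion tokens at the front.
    flat = quantized.ravel(order="C")
    # 5. (Not shown) a trained BPE model merges frequent patterns and zero runs
    #    in `flat` into the final compact token sequence.
    return flat
```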

All steps are invertible up to the rounding introduced by quantization, so the original continuous actions can be reconstructed from the tokens with only a small, controllable error. Notably, FAST's tokenization has only two hyperparameters (the quantization scale and the BPE vocabulary size), which proved insensitive and were fixed across all experiments. This simplicity contrasts with prior learned tokenizers (e.g. vector-quantized autoencoders) that require complex training and tuning.
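For reference, here is the matching decoder under the same illustrative assumptions as the encoder sketch above; the only reconstruction error comes from the rounding in step 3:

```python
import numpy as np
from scipy.fft import idct

def fast_decode(flat, T, D, q_low, q_high, scale=10.0):
    """Inverse of fast_encode, applied after BPE decoding back to `flat`."""
    quantized = np.asarray(flat, dtype=float).reshape(T, D)  # undo flattening
    coeffs = quantized / scale                               # undo coefficient scaling
    norm = idct(coeffs, axis=0, norm="ortho")                # inverse DCT over time
    return (norm + 1.0) / 2.0 * (q_high - q_low) + q_low     # undo normalization
```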

FAST+ Universal Tokenizer: To avoid training a new BPE for each robot or dataset, the authors introduce FAST+, a universal action tokenizer. FAST+ is created by applying the above pipeline to a large corpus of roughly 1 million 1-second action sequences collected from diverse robots (single-arm, bi-manual, mobile platforms) with various action dimensions and control frequencies. The resulting BPE vocabulary (of size 1024) is thus "universal," allowing FAST+ to serve as a plug-and-play tokenizer for any new robot's actions without retraining. Experiments confirm that FAST+ generalizes well: when tested on completely new robot datasets, it consistently compresses action sequences by about 2× (or more) relative to naive binning, and policies trained with FAST+ perform on par with those using a tokenizer customized to that dataset.
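The released tokenizer can reportedly be loaded directly from the Hugging Face Hub. The snippet below is a hedged sketch: it assumes the repo id physical-intelligence/fast and an AutoProcessor-style interface, so verify both against the actual model card before relying on them.

```python
import numpy as np
from transformers import AutoProcessor

# Assumed repo id and interface; check the official release for the exact API.
tokenizer = AutoProcessor.from_pretrained(
    "physical-intelligence/fast", trust_remote_code=True
)

# A batch of normalized action chunks: (batch, timesteps, action_dim).
actions = np.random.uniform(-1, 1, size=(8, 50, 7)).astype(np.float32)

tokens = tokenizer(actions)               # encode to discrete FAST+ tokens
recovered = tokenizer.decode(tokens)      # reconstruct continuous actions
```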

Key Findings and Results

Efficient Compression: FAST dramatically reduces the token sequence length needed to represent actions compared to naive discretization. Across various robot datasets, FAST produces far fewer tokens per action chunk – roughly 30 tokens per second per robot arm (so about 60 for bimanual tasks), regardless of control frequency. This is a significant compression, especially in high-frequency domains (e.g. 50 Hz T-shirt folding), where naive binning yields hundreds of tokens for the same 1-second trajectory. The compression is achieved with minimal loss: FAST's quantization can be tuned to balance fidelity against token count, and the default settings reached reconstruction accuracy comparable to naive methods while using an order of magnitude fewer tokens.
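A quick back-of-the-envelope check makes the ratio concrete. Assuming a single 7-dimensional arm (a typical 6-DoF pose plus gripper, an assumption for this example) controlled at 50 Hz:

```python
freq_hz, action_dims = 50, 7          # assumed arm setup for illustration
naive_tokens = freq_hz * action_dims  # naive binning: 350 tokens per second
fast_tokens = 30                      # FAST: ~30 tokens/s per arm (from the paper)
print(naive_tokens / fast_tokens)     # ≈ 11.7, i.e. an order of magnitude fewer
```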

Improved Learning Performance: By compressing out redundant information, FAST greatly speeds up and improves policy learning. In a didactic experiment, an autoregressive model trained to predict a smooth spline failed at high sampling rates using naive per-step tokens (error skyrocketed as frequency increased), whereas with DCT-based tokens, error remained low across all frequencies. In real robot tasks, policies using naive tokenization struggled or outright failed on the highest-frequency, most dexterous tasks in the evaluation suite (e.g. Table Bussing at 20 Hz and T-Shirt Folding at 50 Hz saw near-zero task success). In contrast, policies trained with FAST (or similar compressed tokens) learned effectively on all tasks, achieving significantly higher success rates in those challenging domains. Overall, FAST-based tokenization led to more data-efficient training – the same model could reach competent performance where the baseline could not, due to the richer information content per token.
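The intuition behind the didactic experiment is easy to reproduce: as the sampling rate of a smooth trajectory grows, each naive next-token target changes less and less (so each prediction carries almost no new information), while the DCT keeps the signal's energy in a few leading coefficients at every rate. A toy check, with the signal and rates chosen purely for illustration:

```python
import numpy as np
from scipy.fft import dct

for freq in (5, 25, 100):                        # samples per second
    t = np.linspace(0.0, 1.0, freq)
    x = np.sin(2 * np.pi * t)                    # a smooth 1 Hz trajectory
    step_delta = np.abs(np.diff(x)).mean()       # info in each naive next token
    c = dct(x, norm="ortho")
    low_energy = (c[:8] ** 2).sum() / (c ** 2).sum()  # share in 8 lowest freqs
    print(f"{freq:3d} Hz: mean step delta {step_delta:.3f}, "
          f"low-frequency DCT energy {low_energy:.3f}")
```

The mean per-step delta shrinks toward zero as the rate increases, while the low-frequency DCT energy stays near one at every rate.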

Comparison of Tokenization Methods: The authors compared FAST to alternative discretization schemes. One baseline, FSQ (a frequency-domain quantization method introduced by the authors as an ablation), also improved learning over naive binning by compressing the action targets, but FAST generally performed as well or better, especially on complex manipulation tasks. Another comparison was with learned vector-quantized (VQ) tokenizers (trained autoencoders that produce discrete codes for actions): those worked for coarse, low-frequency motions but failed to capture fine-grained high-frequency details, leading to poor policy performance on dexterous tasks. FAST outperformed such VQ methods while being much simpler and having far fewer tuning parameters. Crucially, FAST's performance was model-agnostic – swapping naive tokens with FAST tokens boosted OpenVLA (a different transformer backbone) on tasks it originally couldn't learn, proving the tokenization benefits hold across architectures.

State-of-the-Art and Scaling: When combined with a powerful VLA model (the authors use a 3B-parameter transformer dubbed "π-FAST"), FAST enabled scaling up to 10,000+ hours of robot data for training a single policy. The resulting generalist policy (trained on the large DROID multi-task dataset) matched the performance of state-of-the-art diffusion-based policies on a wide range of manipulation tasks, while reducing training time by up to 5×. Notably, this π-FAST policy is the first language-conditioned generalist robot policy that can be evaluated in a zero-shot fashion on novel environments simply by giving it a new instruction. The paper demonstrates this by deploying the policy (with FAST tokenization) in unseen tabletop settings across three different university campuses. The policy, without any fine-tuning, could successfully execute various open-ended tasks (picking and placing objects, opening cabinets, turning on faucets, etc.) purely from natural language prompts, showing robust generalization. Even failures were sensible (e.g. reaching for a door handle but not opening it), indicating the model learned a reasonable understanding of the tasks. This level of zero-shot generality was not achieved by prior work – earlier DROID-based policies and OpenVLA required environment-specific training or adaptation and did not report zero-shot results. In summary, FAST's effective tokenization unlocks both better task performance and broader generalization in transformer policies, at substantially lower computational cost than existing approaches.

Applications and Implications

The FAST approach has significant implications for robot learning and deployment of large-scale policies:

  • Enabling High-Frequency Skill Learning: By mitigating token redundancy, FAST makes it feasible to train sequence models on fine-grained control tasks (such as dexterous bi-manual manipulation or high-speed motions) that were previously impractical with discrete tokens. This broadens the range of skills and tasks that can be learned from demonstration data using transformer policies, including handling deformable objects (cloth, bags) and dynamic interactions that demand rapid control updates.
  • Generalist Robots via Scalable Training: FAST allows merging diverse robot datasets (potentially tens of thousands of hours) into one model, since the tokenization can handle varying action dimensions and frequencies uniformly. The success of the π-FAST policy on multitask data suggests that we can train general-purpose robotic agents that follow language instructions across many tasks and environments. In practice, this moves closer to the vision of foundation models in robotics – large pre-trained policies that can be instructed to perform a range of behaviors.
  • Ease of Integration: An attractive feature of FAST is that it does not require any modification to the model architecture or special training tricks – it simply preprocesses the action representation. This means any pre-trained transformer (e.g. a vision-language model) can be turned into a VLA policy by adding FAST tokens to its vocabulary. The authors indeed integrate FAST by replacing some unused text tokens with action tokens in the model's vocabulary, then fine-tuning it for action prediction (a sketch of such a mapping follows this list). Such plug-and-play integration lowers the barrier to using advanced language/vision models for robotic control.
  • Unified Tokenization Standard: With the release of FAST+ (the universal tokenizer), the community gains a standardized tool for encoding robot actions. Different labs and robots can use the same tokenization scheme, facilitating transfer learning and sharing of models. A single policy could potentially be deployed across different robot hardware by simply feeding it actions via FAST+ encoding, since the token space is unified across many embodiments. This could accelerate research in multi-robot and cross-embodiment learning, as suggested by FAST+'s success on unseen robot data.
  • Computational Efficiency: The compression significantly reduces sequence length, which directly saves memory and compute during training and inference. The fact that FAST achieved 5× faster training to reach SOTA performance means researchers can iterate on larger models or datasets more quickly. It also implies lower latency in online decision-making if the robot uses the model in real time (fewer tokens to process per control cycle). Although the current π-FAST policy is autoregressive (slower per-step inference than diffusion), the reduced token count and the ability to decode a whole action chunk per control cycle still help; further optimizations (discussed below) can narrow the gap.
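To expand on the integration bullet above: one simple way to reuse a text model's vocabulary is to map the 1024 FAST token ids onto a fixed block of existing token ids. The offset scheme below is an illustrative assumption (the paper only says unused text tokens are repurposed), not the authors' exact mapping:

```python
ACTION_VOCAB_SIZE = 1024  # size of the FAST+ BPE vocabulary

def action_to_text_ids(action_tokens, text_vocab_size):
    """Map FAST tokens onto the last 1024 ids of the text vocabulary,
    so the transformer needs no architectural change."""
    offset = text_vocab_size - ACTION_VOCAB_SIZE
    return [offset + tok for tok in action_tokens]

def text_to_action_ids(generated_ids, text_vocab_size):
    """Inverse mapping, applied when decoding the policy's output tokens."""
    offset = text_vocab_size - ACTION_VOCAB_SIZE
    return [tid - offset for tid in generated_ids]
```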

Comparisons to Prior Research

This work builds upon and significantly improves previous approaches to representing and learning robotic actions:

Versus Naïve Discretization: Earlier transformer-based robot policies (e.g. RT-1, Bridge, OpenVLA) simply discretized each joint command at each time step into bins. While straightforward, this approach worked acceptably only for low-frequency or short-horizon tasks. The FAST paper identifies that naive tokenization fails for smooth, high-frequency trajectories because consecutive tokens carry almost no new information, crippling the next-token prediction objective. FAST directly addresses this by compressing out redundancy (using DCT), yielding tokens with much higher information content per token. Empirically, FAST-enabled models outperform those using naive tokens on all challenging tasks. In areas like language modeling, it's well-known that a good tokenization improves model efficiency and accuracy, and FAST demonstrates the same principle in robotics.

Versus Continuous or Diffusion Policies: Some recent works avoid discrete tokens altogether by using regression outputs or diffusion models to generate continuous actions. For example, diffusion-based VLAs treat action generation as a denoising process instead of classification. These methods can capture fine details but often demand architectural changes (e.g. additional diffusion decoder networks) and are computationally heavy to train. The authors show that with FAST, a plain autoregressive transformer can achieve competitive performance to state-of-the-art diffusion models on many tasks, while being much more training-efficient. In fact, π-FAST matched diffusion policy performance at 5× less training time. FAST's advantage is that it leverages the simplicity and maturity of standard sequence models (and their optimizations) by making the discrete representation workable, rather than developing a new generation method from scratch.

Versus Learned VQ-VAE Tokenizers: Prior research also explored learning a discrete representation of actions via neural autoencoders (vector quantization). While flexible, those approaches introduce a complex training stage for the tokenizer itself and tend to be brittle: reconstruction quality depends on carefully tuned hyperparameters and often degrades on high-frequency nuances. FAST, by using an analytic transform (DCT) and a proven text-compression algorithm (BPE), avoids training a tokenizer network altogether. It robustly preserves fine details (e.g. the precise motions in cloth folding) where learned VQ approaches failed. The results showed FAST yields higher-fidelity control and task success than a baseline VQ method, with far less complexity.

Inspiration from Other Domains: The concept of compressing data before sequence modeling aligns with practices in other domains: NLP uses subword merges (BPE) to reduce sequence length, vision uses tokenizers like VQ-GAN or image patches, and audio uses frequency transforms or learned codecs. FAST is novel in applying frequency-domain compression to robot actions, demonstrating that classical signal processing (DCT) combined with modern tokenization can significantly improve robot policy learning. By doing so, it connects robotics with the successes of foundation models in NLP/vision, whereas previous robotics works either stuck with naive discretization or attempted end-to-end learning of tokens. FAST shows that a thoughtfully engineered tokenization can be a drop-in improvement for many existing systems (since it doesn't require changing the model), enabling those systems to scale to harder tasks.

Conclusion and Future Directions

Conclusion: The paper presents FAST as an effective solution for tokenizing high-frequency robotic actions, leading to better compression and learning outcomes than prior methods. Through extensive real-world and simulated experiments, the authors demonstrated that FAST dramatically improves policy performance over naive discretization and even outperforms more complex learned tokenizers. Moreover, FAST's integration into large-scale training produced a generalist policy (π-FAST) that matches the state-of-the-art in capability while being far more efficient to train. FAST+ further generalizes this approach, offering a ready-to-use tokenizer for any robot and establishing a strong default for future VLA models. These contributions advance the field by removing a key bottleneck in applying powerful sequence models to robotics – the representation of continuous actions – and by showing that broad generalization and zero-shot execution in robotics are attainable with the right training representation.

Future Work: There are several promising avenues following this work:

  • Broader Robot Domains: While FAST was validated on fixed manipulators, initial tests showed that it also compresses action data from other morphologies (mobile robots, dexterous hands, humanoids) well. A natural next step is to train and evaluate VLA policies with FAST on these platforms to confirm the performance gains in more dynamic settings.
  • Alternative Compression Methods: FAST's DCT+BPE pipeline is one design; exploring different compression transforms or algorithms (e.g. wavelets, learned codecs, Huffman coding) could yield even better token efficiency. Combining frequency-based tokenization with diffusion or other non-autoregressive decoding strategies is also suggested, potentially marrying the benefits of both approaches.
  • VLA Architecture Trade-offs: The paper opens the discussion on autoregressive vs diffusion-based VLA models, but more work is needed to determine the best paradigm. Future research might compare these on factors like training speed, ability to follow language instructions accurately, and range of behaviors they can express. Such studies will inform when a simpler AR model with FAST suffices and when more complex decoders are justified.
  • Inference Speed Optimization: One noted drawback is that an autoregressive policy like π-FAST, while faster to train, can be slower at runtime because it generates actions step-by-step. The authors point out that techniques from accelerating large language model inference (e.g. parallel decoding, token pruning, caching strategies) could be applied to VLA models to speed them up. Future work can focus on reducing latency so that FAST-based models can tackle highly dynamic real-time tasks (where decisions must be made in milliseconds).

In conclusion, FAST significantly contributes to robot learning by introducing a powerful yet simple way to discretize actions for sequence models. It enables training robots on tasks and scales that were previously unattainable with discrete policies, and sets the stage for more unified and generalist robotic intelligence. The research community can build on these findings to refine action tokenization further and integrate it with evolving model architectures, ultimately moving closer to fluent, versatile robot behavior driven by advanced AI models.