Summarizing Hi Robot (Physical Intelligence)

February 27, 2025

Read the research paper →

Problem & contribution

Open-world robots must interpret rich, multi-step natural-language instructions and adapt to real-time human feedback far beyond atomic commands like "pick up the cup." The paper introduces Hi Robot, a hierarchical system that marries high-level deliberative reasoning with low-level motor control, enabling multi-stage instruction following, on-the-fly corrections ("that's not trash"), and constraint handling (e.g., "make a vegetarian sandwich, no pickles").

System overview (System-2 + System-1)

Hi Robot is a two-layer policy:

A high-level VLM (System-2) ingests the open-ended prompt plus images (base + wrist cameras) and outputs a short low-level language command describing the next atomic skill.

A low-level VLA (System-1) consumes that command with images and robot state to produce continuous actions. The two run at different frequencies: low-level acts fast; high-level is re-invoked either every ~1s or immediately when user feedback arrives, yielding reactive but stable behavior.
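To make the two-frequency scheduling concrete, here is a minimal Python sketch of that loop. All names (high_level.infer, low_level.act, robot.observe/execute, the feedback queue) are hypothetical stand-ins, not the paper's API; the ~1 s replanning interval follows the description above.

```python
import time

REPLAN_PERIOD_S = 1.0  # high-level (System-2) is re-invoked roughly every second

def run_episode(high_level, low_level, robot, user_prompt, feedback_queue, horizon_s=60.0):
    """One episode: slow deliberation (System-2) drives fast control (System-1)."""
    command = high_level.infer(user_prompt, robot.observe())   # initial atomic command
    last_replan = start = time.monotonic()
    while time.monotonic() - start < horizon_s:
        obs = robot.observe()                                  # images + robot state
        # Re-plan on a timer, or immediately when the user interjects.
        interjection = feedback_queue.pop() if feedback_queue else None
        if interjection or time.monotonic() - last_replan > REPLAN_PERIOD_S:
            command = high_level.infer(user_prompt, obs, interjection=interjection)
            last_replan = time.monotonic()
        action_chunk = low_level.act(command, obs)             # continuous action chunk
        robot.execute(action_chunk)                            # fast inner loop
```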

Interaction loop

Users can interject at any time, either as typed text or as speech transcribed by ASR. The high-level policy immediately recomputes the next command grounded in the current images, and can optionally attach a verbal response (e.g., a confirmation or clarification) that is played to the user via TTS and stripped before the command is handed to the low-level controller. This grounding is crucial for context-dependent feedback like "leave it alone" or "that's not trash."
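One way to picture the stripping step is a small parser that splits the high-level output into an utterance (sent to TTS) and a command (sent to the controller). The tag format below is purely illustrative; the paper's actual output format may differ.

```python
import re

def split_high_level_output(text: str):
    """Return (utterance_or_None, low_level_command) from a high-level response."""
    says = re.search(r"<says>(.*?)</says>", text, re.DOTALL)
    does = re.search(r"<does>(.*?)</does>", text, re.DOTALL)
    utterance = says.group(1).strip() if says else None
    command = does.group(1).strip() if does else text.strip()
    return utterance, command

utterance, command = split_high_level_output(
    "<says>Got it, that's not trash, I'll leave it.</says><does>pick up the napkin</does>"
)
# `utterance` is spoken via TTS; only `command` reaches the low-level controller.
```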

Data & training pipeline

Human teleop demos are collected and segmented into short skills (≈1–3 s) plus simple movement primitives (e.g., small corrective motions).

A large VLM is then prompted with visual context + skill label to generate synthetic user prompts/interjections and paired robot utterances, producing situated high-level supervision that covers negative tasks, corrections, and constraints.
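A sketched prompt template for this synthetic-labeling step is shown below: given a scene description and skill label, a large VLM is asked to invent a plausible user prompt or correction plus a paired robot reply. The template wording and the query_vlm wrapper are assumptions for illustration, not the authors' prompt.

```python
PROMPT_TEMPLATE = """You are annotating a robot teleoperation segment.
Scene: {scene_description}
Skill being executed: {skill_label}

Invent (1) a realistic user instruction, correction, or constraint that this
skill would satisfy, and (2) a short robot reply acknowledging it.
Return JSON with keys "user_prompt" and "robot_utterance"."""

def make_synthetic_example(query_vlm, scene_description, skill_label):
    """query_vlm is a hypothetical wrapper around whatever VLM API is available."""
    prompt = PROMPT_TEMPLATE.format(
        scene_description=scene_description, skill_label=skill_label
    )
    return query_vlm(prompt)  # -> {"user_prompt": ..., "robot_utterance": ...}
```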

The high-level VLM is fine-tuned with next-token cross-entropy on image-language tuples; the low-level VLA is trained with a flow-matching objective to output continuous action chunks. Both policies share the PaliGemma-3B backbone; the low-level adds an "action expert" head for control. The framework is modular to swap in other language-conditioned policies.
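For intuition, here is a minimal PyTorch sketch of a flow-matching objective over continuous action chunks, using a linear interpolation path and a velocity target. It illustrates the general technique only; the paper's exact schedule, conditioning, and architecture (PaliGemma-3B backbone plus action expert) are not reproduced.

```python
import torch

def flow_matching_loss(model, obs_embedding, actions):
    """actions: (batch, chunk_len, action_dim) continuous action chunks."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # per-sample time in [0, 1)
    x_t = t * actions + (1.0 - t) * noise                          # point on the linear path
    target_velocity = actions - noise                              # d x_t / d t along that path
    pred_velocity = model(obs_embedding, x_t, t.squeeze(-1).squeeze(-1))
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)
```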

System & hardware specifics

Speech is transcribed locally with Whisper large-v2; TTS uses Cartesia. Real-time inference runs on 1–2 NVIDIA RTX 4090 GPUs. Platforms include: UR5e single-arm (2 cameras), Bimanual ARX (two 6-DoF arms, 3 cameras), and Mobile ARX (ALOHA-based mobile base + two arms; 16-D action space).
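A minimal sketch of the local ASR step, assuming the open-source openai-whisper package (the paper reports Whisper large-v2; this wrapper and the file path are illustrative, not the authors' code):

```python
import whisper

asr_model = whisper.load_model("large-v2")  # runs locally on GPU if available

def transcribe_user_speech(wav_path: str) -> str:
    """Transcribe a recorded user interjection to text for the high-level policy."""
    result = asr_model.transcribe(wav_path)
    return result["text"].strip()
```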

Evaluation protocol & baselines

Three real-world domains probe long-horizon reasoning and interaction: table bussing, sandwich making, and grocery shopping. Metrics are Instruction Accuracy (IA), the agreement of the high-level command with user intent and current observations, and Task Progress (TP), the proportion of objects placed correctly. A blind human evaluator scores 20 trials per task per method. Baselines include: (i) GPT-4o as the high-level policy (same low-level controller), (ii) a flat VLA with no high-level layer, and ablations (flat policy with synthetic data; hierarchy without synthetic data).
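As I read the two metrics, both reduce to simple fractions over evaluator judgments; the data structures below are illustrative, not the paper's evaluation code.

```python
def instruction_accuracy(command_judgments):
    """IA: fraction of high-level commands the blind evaluator judged consistent
    with the user's intent and the current observation."""
    return sum(command_judgments) / len(command_judgments)

def task_progress(objects_placed_correctly, total_objects):
    """TP: fraction of objects that ended up in the correct place."""
    return objects_placed_correctly / total_objects
```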

Ablations

(1) Synthetic interactions are critical: removing them harms language flexibility, with models ignoring clarifications or violating constraints. (2) Hierarchy > flat: with identical data, the hierarchical policy better integrates mid-task updates and partial instructions (e.g., "bus only yellowish things").

Limitations & outlook

The current system relies on prompt engineering for synthetic data and decouples high- and low-level policies (the high-level isn't explicitly aware of low-level success/failure). Future work could couple the layers, unify them into a single model with hierarchical inference at run-time, and adaptively schedule multi-level processing.

Takeaway

By structuring VLMs into a hierarchical VLA stack and augmenting them with situated synthetic interactions, Hi Robot turns open-ended user language and live feedback into grounded, sequenced actions, demonstrating a practical recipe for steerable, long-horizon robotic behavior in the real world.