Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train–test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time.
We propose Vision–Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation–language inputs without modifying policy parameters. By leveraging vision–language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements.
VLS grounds out-of-distribution observation–language inputs into a compact geometric scaffold of task-relevant 3D keypoints using SAM and DINOv2 features. A VLM then decomposes the task into sequential stages and synthesizes differentiable, programmatic reward functions expressed as PyTorch operations.
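To make this concrete, below is a minimal sketch of what such a synthesized stage reward could look like. The task, keypoint names, clearance margin, and function signature are illustrative assumptions on our part, not code emitted by VLS.

```python
import torch

# Hypothetical stage reward a VLM might synthesize for "place the red cube
# on the shelf while avoiding the mug". Keypoint names, the 5 cm clearance
# margin, and the (H, 3) trajectory layout are illustrative assumptions.
def stage_reward(traj: torch.Tensor, kp: dict) -> torch.Tensor:
    """traj: (H, 3) end-effector waypoints; kp: keypoint name -> (3,) tensor."""
    # Attract the final waypoint to the placement keypoint.
    reach = -torch.norm(traj[-1] - kp["shelf_surface"])
    # Penalize any waypoint that enters a 5 cm ball around the obstacle.
    clearance = torch.norm(traj - kp["mug_center"], dim=-1)
    avoid = -torch.relu(0.05 - clearance).sum()
    return reach + avoid  # differentiable with respect to traj
```

Because the reward is ordinary PyTorch, its gradient with respect to the action trajectory is available to the steering mechanisms described next.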
VLS steers sampling through three mechanisms: RBF-based diversification, which applies repulsive forces to keep candidate trajectories spread apart; gradient-based refinement, which injects reward gradients into the denoising updates; and Feynman–Kac resampling, which tilts the sample distribution toward high-reward regions.
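The sketch below illustrates all three mechanisms under assumed shapes: `X` holds K candidate trajectories of shape (K, H, D) being denoised in parallel, and `reward_fn` maps a single trajectory to a scalar. The kernel bandwidth, temperature, and function names are our choices for illustration, not the paper's API.

```python
import torch

def rbf_repulsion(X: torch.Tensor, bandwidth: float = 0.5) -> torch.Tensor:
    """Pairwise RBF-weighted forces that push candidates apart (diversity)."""
    flat = X.reshape(X.shape[0], -1)                    # (K, H*D)
    diff = flat[:, None, :] - flat[None, :, :]          # (K, K, H*D), x_i - x_j
    w = torch.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))  # (K, K) kernel
    return (w[..., None] * diff).sum(1).reshape_as(X)   # net repulsive force

def reward_grad(X: torch.Tensor, reward_fn) -> torch.Tensor:
    """Reward gradient w.r.t. the partially denoised actions (refinement)."""
    X = X.detach().requires_grad_(True)
    total = torch.stack([reward_fn(x) for x in X]).sum()
    return torch.autograd.grad(total, X)[0]

def fk_resample(X: torch.Tensor, reward_fn, temperature: float = 1.0) -> torch.Tensor:
    """Feynman–Kac step: resample candidates with reward-tilted weights."""
    with torch.no_grad():
        r = torch.stack([reward_fn(x) for x in X])
        w = torch.softmax(r / temperature, dim=0)
        idx = torch.multinomial(w, num_samples=X.shape[0], replacement=True)
    return X[idx]
```

In a denoising loop, the repulsion and reward-gradient terms would be added to each update step, with resampling applied periodically; the exact schedule follows the paper and is not reproduced here.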
To handle physical uncertainty, VLS employs adaptive guidance strength driven by reward feedback and a Schmitt-trigger stage-switching mechanism; the hysteresis prevents oscillation between stages and enables stable multi-stage task execution.
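Here is a minimal sketch of both pieces, assuming a scalar per-stage completion signal in [0, 1] and a scalar reward; the thresholds and the linear schedule are illustrative values, not the paper's.

```python
class SchmittStageSwitcher:
    """Hysteresis-based stage switching: advance only when the completion
    signal clears `hi`, fall back only when it drops below `lo`. The gap
    between the thresholds suppresses chattering near a single boundary."""

    def __init__(self, num_stages: int, hi: float = 0.9, lo: float = 0.6):
        self.num_stages, self.hi, self.lo = num_stages, hi, lo
        self.stage = 0

    def update(self, completion: float) -> int:
        if completion >= self.hi and self.stage < self.num_stages - 1:
            self.stage += 1   # current stage satisfied: move on
        elif completion <= self.lo and self.stage > 0:
            self.stage -= 1   # stage clearly undone: revisit it
        return self.stage

def adaptive_guidance_scale(reward: float, target: float = 0.0,
                            floor: float = -1.0, max_scale: float = 2.0) -> float:
    """Strengthen guidance when the reward lags its target; relax it once
    the constraint is satisfied (illustrative linear schedule)."""
    deficit = (target - reward) / (target - floor)
    return max_scale * min(max(deficit, 0.0), 1.0)
```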
VLS enables robust manipulation across diverse tasks in the CALVIN benchmark. Below are example rollouts showing successful task completion for movable objects (colored cubes) and articulated parts (drawer, switch, button, door).
VLS improves the frozen π-0.5 policy under both task perturbations (changed language instructions) and position perturbations (relocated objects) across LIBERO's four task suites: Goal, Spatial, Long (10), and Object.
VLS enables robust real-world deployment on a Franka Emika robot in a kitchen environment with a DROID-style camera setup. We demonstrate both in-distribution tasks and out-of-distribution scenarios including appearance, position, and object shifts.
Side-by-side comparison of π-0.5 vs. VLS on an open-world steering task, demonstrating VLS's ability to adapt to out-of-distribution scenarios.
We evaluate four leading VLA models under position and task perturbations. Despite strong in-distribution performance, pretrained VLAs fail under OOD conditions, demonstrating that inference-time steering is essential.
| Method | Task: Goal | Task: Spatial | Task: 10 | Task: Object | Task: Avg. | Pos.: Goal | Pos.: Spatial | Pos.: 10 | Pos.: Object | Pos.: Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| π-0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| π-0.5 | 0.00 | 1.00 | 1.00 | 1.00 | 0.75 | 38.00 | 20.00 | 8.00 | 17.00 | 20.75 | 10.75 |
| π-0.5 (LeRobot) | 12.00 | 48.50 | 21.50 | 10.50 | 23.13 | 29.00 | 41.00 | 11.00 | 16.00 | 24.25 | 23.69 |
| π-0.5 (LeRobot) + VLS | 33.50 | 54.00 | 25.50 | 41.00 | 38.50 | 38.00 | 42.00 | 15.50 | 45.00 | 35.13 | 36.81 |
We evaluate VLS on a Franka Emika robot with a Robotiq gripper in a kitchen environment. The setup uses a DROID-style camera configuration with a ZED Mini (wrist) camera and a ZED 2 (side) camera.
We propose VLS, a training-free framework that guides pretrained robotic policies using differentiable rewards generated by Vision–Language Models, addressing the challenge of policy deployment in out-of-distribution scenarios. By treating adaptation as an inference-time control problem rather than a retraining problem, VLS enables frozen diffusion and flow-matching policies to execute reliably under spatial and semantic shifts that would otherwise cause failure.
Experiments demonstrate that VLS significantly outperforms existing methods both in simulation (a 31% improvement on CALVIN and 13% on LIBERO-PRO) and in real-world tasks on a Franka robot.
@article{liu2026vls,
title = {VLS: Steering Pretrained Robot Policies via Vision-Language Models},
author = {Shuo Liu and Ishneet Sukhvinder Singh and Yiqing Xu and Jiafei Duan and Ranjay Krishna},
journal = {arXiv preprint arXiv:2602.03973},
year = {2026}
}