VLS: Steering Pretrained Robot Policies
via Vision–Language Models

Shuo Liu1,4, Ishneet Sukhvinder Singh2, Yiqing Xu3,4, Jiafei Duan1,4*, Ranjay Krishna1,4*

University of Washington1   |   University of Oxford2   |   National University of Singapore3   |   Allen Institute for AI4

*Co-advised

This video provides a high-level overview of our VLS framework and its capabilities for steering pretrained robot policies in out-of-distribution scenarios.

VLS Overview

Vision–Language Steering (VLS) is a training-free framework for inference-time steering of frozen generative robot policies. By leveraging VLMs to generate differentiable reward functions for partially denoised action proposals, VLS enables pretrained diffusion or flow-matching policies to adapt to out-of-distribution scenarios—such as object changes, scene changes, or instruction changes—without any fine-tuning.

Abstract

Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train–test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time.

We propose Vision–Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation–language inputs without modifying policy parameters. By leveraging vision–language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements.

+31% improvement on CALVIN
+13% gain on LIBERO-PRO
Zero training required

Method Overview

VLS Pipeline: Given RGB-D observation and language instruction, VLS grounds the input into task-relevant 3D keypoints using SAM and DINOv2. A VLM generates stage-aware differentiable reward functions that guide the frozen base policy's denoising process through gradient-based refinement, RBF diversity, and Feynman–Kac resampling.
1. OOD Input Grounding & Reward Generation

VLS grounds out-of-distribution observation–language inputs into a compact geometric scaffold of task-relevant 3D keypoints using SAM and DINOv2 features. A VLM decomposes the task into sequential stages and synthesizes differentiable, programmatic reward functions as PyTorch operations.

2. Inference-Time Denoising Guidance

VLS steers sampling through three mechanisms: RBF-based diversity using repulsive forces, gradient-based refinement injecting reward gradients into denoising updates, and Feynman–Kac resampling that tilts the distribution toward high-reward regions.

3. Closed-Loop Execution Control

To handle physical uncertainty, VLS employs adaptive guidance strength based on reward feedback and a Schmitt-trigger-based stage switching mechanism. This hysteresis-based approach avoids oscillatory behavior and enables stable multi-stage task execution.

CALVIN Rollouts

VLS enables robust manipulation across diverse tasks in the CALVIN benchmark. Below are example rollouts showing successful task completion for movable objects (colored cubes) and articulated parts (drawer, switch, button, door).

Movable Objects (Cubes)

Move Red Cube
Move Blue Cube
Move Pink Cube

Articulated Parts

Open Drawer
Close Drawer
Open Door Left
Open Door Right
Turn On Switch
Turn Off Switch
Press Button On
Press Button Off

LIBERO-PRO Rollouts

VLS improves the frozen π-0.5 policy under both task perturbations (changed language instructions) and position perturbations (relocated objects) across LIBERO's four task suites: Goal, Spatial, Long (10), and Object.

Task Perturbation (Language OOD)

Goal - Original
Goal - Base Policy (Fail)
Goal - VLS (Success)
Spatial - Original
Spatial - Base Policy (Fail)
Spatial - VLS (Success)
Long10 - Original
Long10 - Base Policy (Fail)
Long10 - VLS (Success)
Object - Original
Object - Base Policy (Fail)
Object - VLS (Success)

Position Perturbation (Observation OOD)

Goal - Original
Goal - Base Policy (Fail)
Goal - VLS (Success)
Spatial - Original
Spatial - Base Policy (Fail)
Spatial - VLS (Success)
Long10 - Original
Long10 - Base Policy (Fail)
Long10 - VLS (Success)
Object - Original
Object - Base Policy (Fail)
Object - VLS (Success)

Real-World Rollouts (Franka Robot)

VLS enables robust real-world deployment on a Franka Emika robot in a kitchen environment with a DROID-style camera setup. We demonstrate both in-distribution tasks and out-of-distribution scenarios including appearance, position, and object shifts.

In-Distribution Tasks

L1: Place orange on red plate
L1: Place orange on green plate
L2: Place banana on green plate
L2: Place orange on red plate

Out-of-Distribution Tasks

Appearance Shift: Yellow plate
Position Shift: Swapped plates
Object Shift: Place mug on plate

Open-World Steering Comparison

Side-by-side comparison of π-0.5 vs. VLS on an open-world steering task, demonstrating VLS's ability to adapt to out-of-distribution scenarios.

π-0.5
VLS (Ours)

Quantitative Results

LIBERO-PRO: Inference-Time Steering Is Necessary

We evaluate four leading VLA models under position and task perturbations. Despite strong in-distribution performance, pretrained VLAs fail under OOD conditions—demonstrating that inference-time steering is essential.

Success rate (%) under task and position perturbations across LIBERO's four suites.

Method                 | Task Perturbation                       | Position Perturbation                   | Overall
                       | Goal   Spatial  Long-10  Object  Avg.   | Goal   Spatial  Long-10  Object  Avg.   |
OpenVLA                |  0.00    0.00     0.00     0.00    0.00 |  0.00    0.00     0.00     0.00    0.00 |   0.00
π-0                    |  0.00    0.00     0.00     0.00    0.00 |  0.00    0.00     0.00     0.00    0.00 |   0.00
π-0.5                  |  0.00    1.00     1.00     1.00    0.75 | 38.00   20.00     8.00    17.00   20.75 |  10.75
π-0.5 (LeRobot)        | 12.00   48.50    21.50    10.50   23.13 | 29.00   41.00    11.00    16.00   24.25 |  23.69
π-0.5 (LeRobot) + VLS  | 33.50   54.00    25.50    41.00   38.50 | 38.00   42.00    15.50    45.00   35.13 |  36.81

CALVIN: VLS Outperforms Existing Steering Methods

CALVIN Results
Steering methods comparison on CALVIN. Success rates for VLS (ours), DynaGuide, ITPS, and the base diffusion policy across movable objects (cubes) and articulated parts (drawer, switch, button, door). VLS achieves 94% average on movable objects (7.4× over base policy) and 87% on articulated parts (9.6× boost), outperforming prior steering methods by 15–25 percentage points. Error bars show standard deviation over 600 episodes per task.

Ablation Study

(Left) Ablation of VLS components (50 episodes per task). We compare Full VLS (gradient guidance + FK steering + RBF diversity) against variants that remove each component. Removing gradient guidance causes severe performance collapse, confirming it as the primary driver of VLS's effectiveness. (Right) Scaling with sample batch size on door_left task. Larger batch sizes improve performance at the cost of inference time, illustrating a practical compute–performance tradeoff.

Real-World Deployment

We evaluate VLS on a Franka Emika robot with a Robotiq gripper in a kitchen environment. The setup uses a DROID-style camera configuration with ZED Mini (wrist) and ZED 2 (side) cameras.

Real-World Deployment Results
(Left) In-Distribution Tasks: Level 1 requires placing an orange onto a specified plate (red or green). Level 2 introduces a banana, requiring sequential object and plate selection. (Right) Out-of-Distribution Tasks: (1) Appearance shift: replacing red/green plate with unseen yellow plate; (2) Position shift: swapping plate locations; (3) Object shift: replacing banana with a never-before-seen mug. Each task evaluated over 20 trials.

Key Findings

Conclusion

We propose VLS, a training-free framework that guides pretrained robotic policies using differentiable rewards generated by Vision–Language Models, addressing the challenge of policy deployment in out-of-distribution scenarios. By treating adaptation as an inference-time control problem rather than a retraining problem, VLS enables frozen diffusion and flow-matching policies to execute reliably under spatial and semantic shifts that would otherwise cause failure.

Experiments demonstrate that VLS significantly outperforms existing methods both in simulation (a 31% improvement on CALVIN and 13% on LIBERO-PRO) and in real-world tasks on a Franka robot.

Limitations and Future Work

BibTeX

@article{liu2026vls,
  title     = {VLS: Steering Pretrained Robot Policies via Vision-Language Models},
  author    = {Shuo Liu and Ishneet Sukhvinder Singh and Yiqing Xu and Jiafei Duan and Ranjay Krishna},
  journal   = {arXiv preprint arXiv:2602.03973},
  year      = {2026}
}