Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train–test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time.
We propose Vision–Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation–language inputs without modifying policy parameters. By leveraging vision–language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements.
VLS grounds out-of-distribution observation–language inputs into a compact geometric scaffold of task-relevant 3D keypoints using SAM and DINOv2 features. A VLM then decomposes the task into sequential stages and synthesizes differentiable, programmatic reward functions expressed as PyTorch operations.
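To make this concrete, below is a minimal sketch of what such a synthesized stage reward could look like. The task, keypoint names, clearance margin, and function signature are illustrative assumptions on our part, not code emitted by VLS.

```python
import torch

# Hypothetical stage reward a VLM might synthesize for "place the red cube
# on the shelf while avoiding the mug". Keypoint names, the 5 cm clearance
# margin, and the (H, 3) trajectory layout are illustrative assumptions.
def stage_reward(traj: torch.Tensor, kp: dict) -> torch.Tensor:
    """traj: (H, 3) end-effector waypoints; kp: keypoint name -> (3,) tensor."""
    # Attract the final waypoint to the placement keypoint.
    reach = -torch.norm(traj[-1] - kp["shelf_surface"])
    # Penalize any waypoint that enters a 5 cm ball around the obstacle.
    clearance = torch.norm(traj - kp["mug_center"], dim=-1)
    avoid = -torch.relu(0.05 - clearance).sum()
    return reach + avoid  # differentiable with respect to traj
```

Because the reward is ordinary PyTorch, its gradient with respect to the action trajectory is available to the steering mechanisms described next.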
VLS steers sampling through three mechanisms: RBF-based diversification, which applies repulsive forces to keep candidate trajectories spread apart; gradient-based refinement, which injects reward gradients into the denoising updates; and Feynman–Kac resampling, which tilts the sample distribution toward high-reward regions.
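The sketch below illustrates all three mechanisms under assumed shapes: `X` holds K candidate trajectories of shape (K, H, D) being denoised in parallel, and `reward_fn` maps a single trajectory to a scalar. The kernel bandwidth, temperature, and function names are our choices for illustration, not the paper's API.

```python
import torch

def rbf_repulsion(X: torch.Tensor, bandwidth: float = 0.5) -> torch.Tensor:
    """Pairwise RBF-weighted forces that push candidates apart (diversity)."""
    flat = X.reshape(X.shape[0], -1)                    # (K, H*D)
    diff = flat[:, None, :] - flat[None, :, :]          # (K, K, H*D), x_i - x_j
    w = torch.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))  # (K, K) kernel
    return (w[..., None] * diff).sum(1).reshape_as(X)   # net repulsive force

def reward_grad(X: torch.Tensor, reward_fn) -> torch.Tensor:
    """Reward gradient w.r.t. the partially denoised actions (refinement)."""
    X = X.detach().requires_grad_(True)
    total = torch.stack([reward_fn(x) for x in X]).sum()
    return torch.autograd.grad(total, X)[0]

def fk_resample(X: torch.Tensor, reward_fn, temperature: float = 1.0) -> torch.Tensor:
    """Feynman–Kac step: resample candidates with reward-tilted weights."""
    with torch.no_grad():
        r = torch.stack([reward_fn(x) for x in X])
        w = torch.softmax(r / temperature, dim=0)
        idx = torch.multinomial(w, num_samples=X.shape[0], replacement=True)
    return X[idx]
```

In a denoising loop, the repulsion and reward-gradient terms would be added to each update step, with resampling applied periodically; the exact schedule follows the paper and is not reproduced here.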
To handle physical uncertainty, VLS employs adaptive guidance strength driven by reward feedback and a Schmitt-trigger stage-switching mechanism; the hysteresis prevents oscillation between stages and enables stable multi-stage task execution.
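Here is a minimal sketch of both pieces, assuming a scalar per-stage completion signal in [0, 1] and a scalar reward; the thresholds and the linear schedule are illustrative values, not the paper's.

```python
class SchmittStageSwitcher:
    """Hysteresis-based stage switching: advance only when the completion
    signal clears `hi`, fall back only when it drops below `lo`. The gap
    between the thresholds suppresses chattering near a single boundary."""

    def __init__(self, num_stages: int, hi: float = 0.9, lo: float = 0.6):
        self.num_stages, self.hi, self.lo = num_stages, hi, lo
        self.stage = 0

    def update(self, completion: float) -> int:
        if completion >= self.hi and self.stage < self.num_stages - 1:
            self.stage += 1   # current stage satisfied: move on
        elif completion <= self.lo and self.stage > 0:
            self.stage -= 1   # stage clearly undone: revisit it
        return self.stage

def adaptive_guidance_scale(reward: float, target: float = 0.0,
                            floor: float = -1.0, max_scale: float = 2.0) -> float:
    """Strengthen guidance when the reward lags its target; relax it once
    the constraint is satisfied (illustrative linear schedule)."""
    deficit = (target - reward) / (target - floor)
    return max_scale * min(max(deficit, 0.0), 1.0)
```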
VLS enables robust manipulation across diverse tasks in the CALVIN benchmark. Below are example rollouts showing successful task completion for movable objects (colored cubes) and articulated parts (drawer, switch, button, door).
VLS improves the frozen π-0.5 policy under both task perturbations (changed language instructions) and position perturbations (relocated objects) across LIBERO's four task suites: Goal, Spatial, Long (10), and Object.
VLS enables robust real-world deployment on a Franka Emika robot in a kitchen environment with a DROID-style camera setup. We demonstrate both in-distribution tasks and out-of-distribution scenarios including appearance, position, and object shifts.
Side-by-side comparison of π-0.5 vs. VLS on an open-world steering task, demonstrating VLS's ability to adapt to out-of-distribution scenarios.
We evaluate four leading VLA models under position and task perturbations. Despite strong in-distribution performance, pretrained VLAs fail under OOD conditions, demonstrating that inference-time steering is essential.
| Method | Task: Goal | Task: Spatial | Task: 10 | Task: Object | Task: Avg. | Pos.: Goal | Pos.: Spatial | Pos.: 10 | Pos.: Object | Pos.: Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| π-0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| π-0.5 | 0.00 | 1.00 | 1.00 | 1.00 | 0.75 | 38.00 | 20.00 | 8.00 | 17.00 | 20.75 | 10.75 |
| π-0.5 (LeRobot) | 12.00 | 48.50 | 21.50 | 10.50 | 23.13 | 29.00 | 41.00 | 11.00 | 16.00 | 24.25 | 23.69 |
| π-0.5 (LeRobot) + VLS | 33.50 | 54.00 | 25.50 | 41.00 | 38.50 | 38.00 | 42.00 | 15.50 | 45.00 | 35.13 | 36.81 |
We evaluate VLS on a Franka Emika robot with a Robotiq gripper in a kitchen environment. The setup uses a DROID-style camera configuration with a ZED Mini (wrist) camera and a ZED 2 (side) camera.
We propose VLS, a training-free framework that guides pretrained robotic policies using differentiable rewards generated by Vision–Language Models, addressing the challenge of policy deployment in out-of-distribution scenarios. By treating adaptation as an inference-time control problem rather than a retraining problem, VLS enables frozen diffusion and flow-matching policies to execute reliably under spatial and semantic shifts that would otherwise cause failure.
Experiments demonstrate that VLS significantly outperforms existing methods both in simulation (a 31% improvement on CALVIN and 13% on LIBERO-PRO) and in real-world tasks on a Franka robot.
@article{liu2026vls,
title = {VLS: Steering Pretrained Robot Policies via Vision-Language Models},
author = {Shuo Liu and Ishneet Sukhvinder Singh and Yiqing Xu and Jiafei Duan and Ranjay Krishna},
journal = {arXiv preprint arXiv:2602.03973},
year = {2026}
}