The Reward Engineering Bottleneck
In reinforcement learning, the reward function defines what the policy should optimize. A simple instruction like "walk forward naturally" seems straightforward, but translating it into a mathematical objective that a learning algorithm can follow is anything but:
- Reward forward velocity? The policy maximizes speed at the expense of gait quality.
- Penalize uneven motion? The policy finds that not moving at all is the smoothest solution.
- Balance both terms? The optimal trade-off varies across morphologies and tasks, and even a well-shaped reward can fail under the wrong hyperparameters.
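The trade-off can be made concrete with a minimal sketch. The function name, weights, and 12-joint dimensionality below are illustrative assumptions, not P2P's actual reward:

```python
import numpy as np

def walk_reward(forward_velocity, joint_velocities, w_speed=1.0, w_smooth=0.05):
    """Illustrative reward: forward progress minus a smoothness penalty.

    With w_smooth too low, the policy sprints with a ragged gait; too high,
    and standing still (zero penalty) becomes the optimal behavior.
    """
    smoothness_penalty = np.sum(np.square(joint_velocities))
    return w_speed * forward_velocity - w_smooth * smoothness_penalty

# A motionless policy earns 0.0, which can beat a fast-but-jerky one:
still = walk_reward(0.0, np.zeros(12))       # 0.0
jerky = walk_reward(2.0, np.full(12, 2.0))   # 2.0 - 0.05 * 48 ≈ -0.4
```

Every choice of weights encodes an implicit bet about which failure mode matters more, and that bet has to be revised each time training reveals a new loophole.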
This iterative process of writing, training, observing, and revising reward functions is called reward engineering. In practice, it consumes weeks of researcher time per task, requires deep domain expertise, and remains the primary bottleneck preventing RL from scaling to new tasks and environments.
Manual Reward Engineering
- Expert writes reward function
- Trains for hours, watches video
- Manually diagnoses failure mode
- Tweaks coefficients, retrains
- Repeats for days to weeks
Prompt-to-Policy (Automated)
- LLM writes reward function
- PPO trains the policy
- LLM and VLM judge behavior
- LLM diagnoses & revises
- Automatically repeats until success
Our Approach
Prompt-to-Policy (P2P) automates this entire cycle. The user provides a natural language description of the desired behavior. An LLM-based agent then iterates autonomously: it writes a reward function, trains a policy via RL, and judges the result through a multi-modal evaluation (combining code-based trajectory analysis with VLM video evaluation). If the behavior does not match the intent, the agent revises and retrains. The only input is a prompt; the output is a trained policy.
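The loop can be sketched as follows. This is a schematic, not the real P2P API: the injected callables stand in for the LLM reward writer, the PPO trainer, the multi-modal judge, and the revise agent.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    score: float
    success: bool

def prompt_to_policy(intent, write, train, judge, revise, max_iters=10):
    """Sketch of the P2P loop; callables stand in for the LLM/RL components."""
    lineage = []
    reward_code, hparams = write(intent)
    for _ in range(max_iters):
        policy = train(reward_code, hparams)
        verdict = judge(intent, policy)
        lineage.append((reward_code, hparams, policy, verdict))
        if verdict.success:
            break
        # the revise agent sees the FULL lineage, not just the last attempt
        reward_code, hparams = revise(intent, lineage)
    best = max(lineage, key=lambda entry: entry[3].score)
    return best[2], lineage

# Toy run with stand-in components: each revision nudges the score upward.
write = lambda intent: ("reward_v0", {"lr": 3e-4})
train = lambda code, hp: len(code)                       # "policy" is just a number here
judge = lambda intent, p: Verdict(p / 12, p / 12 >= 0.8)
revise = lambda intent, lin: (lin[-1][0] + "+", lin[-1][1])
policy, lineage = prompt_to_policy("walk forward", write, train, judge, revise)
```

The key structural point is that the judge's verdict, not a human-defined fitness metric, decides when the loop terminates.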
Where P2P Sits
Prior reward-engineering systems (Eureka, Text2Reward) use LLMs to iteratively refine reward functions, but each task requires a human-defined fitness metric, and the search modifies only the reward from the single previous iteration. General-purpose research agents (AutoResearch, AI Scientist v2) offer broader flexibility, but operate without RL-specific scaffolding: no structured behavior evaluation, no experiment lineage, no reward-aware guardrails.
P2P combines iterative reward writing with agentic flexibility, while adding structure specific to RL: multi-modal evaluation that generates scoring criteria automatically from the intent, joint reward and hyperparameter search with branching across the full experiment history, and guardrails (tiered lessons, code review, multi-judge synthesis) that keep the search grounded.
Key Design Decisions
1. Dual Judgment with Cross-Verification
Physics metrics alone miss qualitative aspects ("Is the gait natural?"); a VLM alone risks hallucination ("runs forward smoothly" when actually vibrating in place). P2P runs two independent judges in parallel: a Code Judge (LLM-authored evaluation of trajectory data) and a VLM Judge (vision-language model watching the rollout video). An LLM Synthesizer merges both scores, cross-checking disagreements by re-examining specific video moments or verifying physics data directly.
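A toy synthesis rule illustrates the shape of the decision (the threshold and averaging are assumptions for illustration; P2P's actual synthesizer is an LLM, not a fixed formula):

```python
def synthesize(code_score, vlm_score, tol=0.2):
    """Toy synthesis: merge when the two judges agree, flag for
    cross-examination when they diverge beyond `tol`."""
    if abs(code_score - vlm_score) <= tol:
        return {"final": (code_score + vlm_score) / 2, "cross_examine": False}
    return {"final": None, "cross_examine": True}

agree = synthesize(0.8, 0.7)   # merged directly
clash = synthesize(0.9, 0.3)   # e.g. VLM hallucinated smooth running
```

The disagreement path is where the system catches both judges' blind spots: a high VLM score with a low physics score is exactly the "vibrating in place" signature.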
VLM Judge: Two-Turn Protocol
Naively asking a VLM "did the agent succeed?" leads it to agree with the prompt rather than critically evaluate. To counter this, P2P uses a two-turn protocol (Andrade et al., 2025). In the first turn the VLM receives the task description and a still frame of the starting pose (no rollout video) and must commit to 3–5 concrete visual criteria for success. In the second turn it receives the rollout video and evaluates each criterion individually, then assigns a holistic overall score. Because the criteria are locked upfront, the VLM must check each one against the video rather than defaulting to a blanket approval.
At low frame rates (e.g. 5 fps), motion continuity (such as rotation and velocity) is hard to judge from individual frames. P2P addresses this by optionally providing a motion trail video, where preceding sub-frames appear as translucent ghosts, alongside the standard recording. The VLM uses both videos together when scoring.
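The two-turn structure can be sketched as a pair of chat calls. The `vlm.chat` interface below is a hypothetical stand-in for any chat-style multimodal client, not a specific vendor API:

```python
def two_turn_vlm_judgment(vlm, task_description, start_frame, rollout_video):
    """Sketch of the two-turn protocol; `vlm` is any chat-style multimodal
    client (hypothetical interface, not a specific API)."""
    # Turn 1: no rollout video yet, so the criteria cannot be tailored to
    # what actually happened in the episode.
    turn1 = vlm.chat([{"role": "user", "content": [
        f"Task: {task_description}. Commit to 3-5 concrete visual criteria "
        "that a successful rollout must satisfy.",
        start_frame,  # still image of the starting pose only
    ]}])
    criteria = turn1.text
    # Turn 2: the locked criteria are replayed alongside the video; each
    # criterion is scored individually before the holistic overall score.
    turn2 = vlm.chat([{"role": "user", "content": [
        "Criteria you committed to:\n" + criteria + "\n"
        "Evaluate each criterion against this video, then give an overall score.",
        rollout_video,
    ]}])
    return criteria, turn2.text
```

Because turn 1 never sees the outcome, the criteria act as a pre-registered checklist rather than a post-hoc rationalization.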
Code Judge: LLM-Authored Trajectory Analysis
The Code Judge evaluates behavior by analyzing raw physics data (joint positions, velocities, contact forces, and body coordinates). Several recent systems also use LLMs to generate evaluation or success-detection code (LEARN-Opt, OMNI-EPIC, Agentic Skill Discovery), but typically in a single shot or with only execution-level error feedback. P2P's Code Judge adds two quality gates absent in prior work. First, a structured hierarchical decomposition: the LLM organizes sub-tasks into a dependency graph, marking them as sequential (1 → 2, gated), independent ((1, 2)), or nested (1.1, 1.2), and the graph directly governs scoring. Second, a dedicated review: a separate LLM pass searches the generated decomposition and code for loopholes, exploitable edge cases, and other issues.
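How a dependency graph can gate scoring is easiest to see in miniature. The node schema and the 0.5 pass threshold below are illustrative assumptions, not P2P's actual representation:

```python
def score_node(node, results):
    """node: {"id", "mode": "sequential" | "independent", "children": [...]}.
    Leaves carry their score in `results`; internal nodes aggregate children.
    In sequential mode, a failed sub-task gates out everything downstream."""
    children = node.get("children")
    if not children:
        return results[node["id"]]
    scores, gate_open = [], True
    for child in children:
        s = score_node(child, results) if gate_open else 0.0
        scores.append(s)
        if node["mode"] == "sequential" and s < 0.5:
            gate_open = False  # downstream sub-tasks earn nothing
    return sum(scores) / len(scores)

task = {"id": "backflip", "mode": "sequential", "children": [
    {"id": "crouch"}, {"id": "jump"}, {"id": "rotate"}, {"id": "land"}]}
# rotation fails, so "land" is gated out even if it would have scored well
score = score_node(task, {"crouch": 1.0, "jump": 1.0, "rotate": 0.2, "land": 1.0})
```

Gating prevents a classic loophole: a policy that never rotates but sticks a perfect "landing" (it never left the ground) cannot collect the landing credit.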
Agentic Synthesis
When the two judges agree, the synthesizer merges them directly into a final verdict. When they disagree, an agentic tool loop kicks in: the synthesizer can re-query the VLM on a specific time window of the video (e.g., "did the policy lift off between seconds 1–3?") or run a trajectory check (a Python snippet against the physics data to verify a specific claim). This targeted cross-examination resolves ambiguity without re-running the full judgment pipeline.
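A trajectory check of this kind might look like the following. The function name, 0.1 m clearance threshold, and 50 Hz data are illustrative assumptions:

```python
import numpy as np

def lifted_off(base_height, dt, t_start=1.0, t_end=3.0, ground_clearance=0.1):
    """Hypothetical check the synthesizer might generate to verify a claim
    like "the policy lifted off between seconds 1-3" against physics data.
    True if the base ever exceeds `ground_clearance` inside the window."""
    i0, i1 = int(t_start / dt), int(t_end / dt)
    return bool(np.any(base_height[i0:i1] > ground_clearance))

dt = 0.02                 # 50 Hz simulation
height = np.zeros(250)    # 5-second episode
height[60:90] = 0.4       # airborne from t = 1.2 s to t = 1.8 s
airborne = lifted_off(height, dt)   # True
```

Because the check runs against ground-truth state rather than pixels, it settles disputes the video alone cannot, such as small vertical motions below the VLM's perceptual resolution.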
2. Agentic Revision: Reward, Hyperparameters, and Experiment Lineage
Reward design and hyperparameter tuning are deeply coupled: a well-shaped reward can fail under the wrong learning rate, and vice versa. P2P's Revise Agent tunes both simultaneously, training each reward revision across multiple HP configurations in parallel, and tracks every outcome in an experiment lineage.
The Revise Agent is not a fixed optimization loop. It is a tool-using LLM that receives training diagnostics, investigates the lineage (reading past reward functions, comparing scores, checking for recurring failure patterns), and then decides what to do next:
- Where to branch from. The agent can continue from the current iteration or revert to an earlier one whose approach scored higher. In the diagram above, v2 regressed from v1, so the agent branched back to v1 and tried a different reward design (v3).
- What to change. Based on the diagnostics and its own investigation, the agent proposes reward and hyperparameter changes together. In normal mode, it must propose 1–2 changes only. Structural rewrites are allowed only during plateau (3+ iterations since the best score was achieved).
- What to avoid. The agent retrieves accumulated lessons from the experiment lineage. Lessons have graduated trust tiers; the agent must respect high-tier lessons but can demote or retire them with justification when the evidence warrants it.
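The branch-point decision can be reduced to a toy heuristic (the agent's actual reasoning is an open-ended LLM investigation, not this fixed rule; field names are assumptions):

```python
def choose_branch_point(lineage):
    """Toy heuristic: continue from the latest iteration unless it regressed,
    in which case branch from the best-scoring ancestor instead.
    lineage: list of {"version", "score"} dicts, in training order."""
    latest = lineage[-1]
    best = max(lineage, key=lambda entry: entry["score"])
    if latest["score"] < best["score"]:
        return best["version"]   # e.g. v2 regressed, so branch back from v1
    return latest["version"]

history = [{"version": "v1", "score": 0.7}, {"version": "v2", "score": 0.4}]
branch = choose_branch_point(history)   # "v1"
```

Branching over the full history is what separates this from prior reward-refinement systems, which only ever mutate the single previous iteration.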
Each lesson enters the knowledge base as STRONG and can be promoted or demoted by the agent over time:
- HARD: Rules learned from past catastrophic failures that must never be violated.
- STRONG: Confirmed principle. Follow unless specific evidence says otherwise.
- SOFT: Context-specific. Challenge freely with a good reason.
- RETIRED: No longer active, shown for reference to prevent rediscovering the same mistake.
When the active lesson count grows large, an LLM consolidation pass merges duplicates (the higher tier wins), removes superseded entries, and caps the list; RETIRED lessons stay visible throughout. The result is a Revise Agent that makes informed revisions grounded in the full experiment history, rather than blind guesses based only on the latest result.
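A minimal sketch of the tier mechanics (field names, the tier ordering as a list, and the cap value are assumptions for illustration):

```python
TIERS = ["RETIRED", "SOFT", "STRONG", "HARD"]  # ascending trust

def consolidate(lessons, cap=20):
    """Merge duplicate lesson texts (the higher tier wins), keep at most
    `cap` active lessons ordered by trust, and append RETIRED entries so
    they remain visible for reference."""
    merged = {}
    for lesson in lessons:
        key = lesson["text"]
        if key not in merged or TIERS.index(lesson["tier"]) > TIERS.index(merged[key]["tier"]):
            merged[key] = lesson
    active = sorted((l for l in merged.values() if l["tier"] != "RETIRED"),
                    key=lambda l: TIERS.index(l["tier"]), reverse=True)[:cap]
    retired = [l for l in merged.values() if l["tier"] == "RETIRED"]
    return active + retired

lessons = [
    {"text": "never reward raw velocity alone", "tier": "STRONG"},
    {"text": "never reward raw velocity alone", "tier": "HARD"},  # duplicate, higher tier
    {"text": "penalize torque spikes", "tier": "RETIRED"},
]
kept = consolidate(lessons)
```

Keeping RETIRED entries in the output is the cheap insurance against the agent re-learning a lesson it already paid for.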
Walkthrough: Natural Human Walking
Prompt: "Walk forward with a natural human-like gait, alternating left and right steps in a steady rhythm, while gently swinging the opposite arm forward with each step."
Dashboard Tutorial
The web dashboard lets you watch the full loop in real time. Visit the project page to get started.
Limitations and Future Directions
Current Limitations
- Inconsistent VLM scoring. The VLM judge provides directionally useful scores but is not perfectly calibrated. Scores can vary across runs for the same behavior, and the model sometimes misinterprets body orientation or rotation direction.
- Subjective tasks are hard to judge with code. The Code Judge works well for clear physical criteria (e.g., "do a backflip"), but struggles with subjective goals like "walk naturally": what is the appropriate arm swing amplitude, 0.3 m or 0.33 m?
- Over-correction on near-successes. The revise agent sometimes restructures a near-successful reward function entirely instead of fine-tuning it. A partial backflip at iteration 1 may get replaced with a completely different reward design at iteration 2, causing regression.
- Early-stage IsaacLab support. IsaacLab integration is at an early stage. Preliminary results on ANYmal locomotion are promising, but systematic evaluation across the full environment suite is ongoing.
Future Directions
- Human-in-the-loop steering. Allow users to intervene between iterations with natural-language corrections ("good jump but land softer", "rotate the other way"), combining autonomous iteration with interactive course-correction.
- Domain-specific VLM. Fine-tune a VLM on human-annotated rollout videos with consistent scoring criteria, replacing the current zero-shot judge with a calibrated model that understands simulated robot behavior.
- Automated harness optimization. The P2P pipeline is itself a harness: a hand-engineered program that orchestrates LLMs to achieve natural-language intents through reinforcement learning. Recent work (Meta-Harness) shows that such harnesses can be discovered autonomously by a coding agent; applying that approach to P2P is an intriguing direction for future work.
Citation
Wooseong Chung, Taegwan Ha, Yunhyeok Kwak, Taehwan Kwon, Jeong-Gwan Lee, Kangwook Lee, Suyoung Lee
KRAFTON AI, Ludo Robotics
*Equal contribution. Authors listed in alphabetical order by last name.
BibTeX citation
@misc{prompt2policy2026,
  title  = {Prompt-to-Policy: Agentic Engineering for Reinforcement Learning},
  author = {{KRAFTON AI} and {Ludo Robotics} and Wooseong Chung and Taegwan Ha and Yunhyeok Kwak and Taehwan Kwon and Jeong-Gwan Lee and Kangwook Lee and Suyoung Lee},
  year   = {2026},
  url    = {https://github.com/krafton-ai/Prompt2Policy}
}