KRAFTON AI · LUDO ROBOTICS / Research

Prompt-to-Policy: Agentic Engineering for Reinforcement Learning

Describe a behavior in words, and an agent writes the reward, trains a policy, judges the result with LLM-written code metrics and a VLM, and revises until the policy matches your intent. No human intervention required.

A sample of behaviors learned by P2P from natural language prompts. No manual reward engineering.

The Reward Engineering Bottleneck

In reinforcement learning, the reward function defines what the policy should optimize. A simple instruction like "walk forward naturally" seems straightforward, but translating it into a mathematical objective that a learning algorithm can follow is anything but.

This iterative process of writing, training, observing, and revising reward functions is called reward engineering. In practice, it consumes weeks of researcher time per task, requires deep domain expertise, and remains the primary bottleneck preventing RL from scaling to new tasks and environments.

Manual Reward Engineering

  • Expert writes reward function
  • Trains for hours, watches video
  • Manually diagnoses failure mode
  • Tweaks coefficients, retrains
  • Repeats for days to weeks

Prompt-to-Policy (Automated)

  • LLM writes reward function
  • PPO trains the policy
  • LLM and VLM judge behavior
  • LLM diagnoses & revises
  • Automatically repeats until success

Our Approach

Prompt-to-Policy (P2P) automates this entire cycle. The user provides a natural language description of the desired behavior. An LLM-based agent then iterates autonomously: it writes a reward function, trains a policy via RL, and judges the result through a multi-modal evaluation (combining code-based trajectory analysis with VLM video evaluation). If the behavior does not match the intent, the agent revises and retrains. The only input is a prompt; the output is a trained policy.

Pipeline overview. A natural-language prompt (e.g. "Shadow box while walking") enters the Prompt-to-Policy loop, which iterates autonomously (write → train → judge → revise) until the behavior matches the intent, and outputs a trained policy with no human intervention.
  1. Write. The LLM writes the initial reward function and code-based judge from the intent and environment specification.
  2. Train. PPO trains a policy using the generated reward, across multiple seeds and hyperparameter configs in parallel.
  3. Judge. A Code Judge analyzes trajectory metrics while a VLM watches the rollout video; an LLM synthesizes both into a pass/fail judgment.
  4. Revise. If the judgment is a fail, an LLM agent diagnoses the failure, revises the reward function, and tunes hyperparameters for the next iteration.
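The four-step loop can be sketched as a minimal driver. Every function below is a hypothetical stand-in for the LLM, PPO, and judging components (the scores are synthetic so the loop terminates):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Iteration:
    version: int
    reward_code: str
    score: float

def write_reward(intent: str, lineage: List[Iteration]) -> str:
    """Stand-in for the LLM call that writes (or revises) the reward function."""
    return f"# reward v{len(lineage) + 1} for: {intent}"

def train_policy(version: int, hp_configs: List[dict]) -> float:
    """Stand-in for parallel PPO runs; returns the best config's intent score.
    Here each revision simply scores a bit higher, purely for illustration."""
    return min(1.0, 0.4 + 0.2 * version)

def judge(score: float, threshold: float = 0.9) -> bool:
    """Stand-in for the Code Judge + VLM Judge + LLM synthesizer."""
    return score >= threshold

def prompt_to_policy(intent: str, max_iters: int = 10) -> List[Iteration]:
    lineage: List[Iteration] = []
    hp_configs = [{"lr": 3e-4}, {"lr": 1e-3}]      # shared reward, varied HPs
    for _ in range(max_iters):
        reward = write_reward(intent, lineage)               # 1. Write
        score = train_policy(len(lineage) + 1, hp_configs)   # 2. Train
        lineage.append(Iteration(len(lineage) + 1, reward, score))
        if judge(score):                                     # 3. Judge → done
            break                                            # 4. else Revise
    return lineage

runs = prompt_to_policy("shadow box while walking")
```

The only input is the prompt; the lineage of reward versions and scores is what the Revise Agent later reasons over.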

Where P2P Sits

Positioning, from focused to general:

  • Reward engineering (Eureka, Text2Reward): reward-only search; requires human involvement.
  • Prompt-to-Policy (this work): joint reward + hyperparameter search, auto-generated evaluation, RL-specific guardrails.
  • Research automation (AutoResearch, AI Scientist v2): full-cycle automation; no RL-specific judgment.

Prior reward-engineering systems (Eureka, Text2Reward) use LLMs to iteratively refine reward functions, but each task requires a human-defined fitness metric, and the search modifies only the reward from the single previous iteration. General-purpose research agents (AutoResearch, AI Scientist v2) offer broader flexibility, but operate without RL-specific scaffolding: no structured behavior evaluation, no experiment lineage, no reward-aware guardrails.

P2P combines iterative reward writing with agentic flexibility, while adding structure specific to RL: multi-modal evaluation that generates scoring criteria automatically from the intent, joint reward and hyperparameter search with branching across the full experiment history, and guardrails (tiered lessons, code review, multi-judge synthesis) that keep the search grounded.

Key Design Decisions

1. Dual Judgment with Cross-Verification

Physics metrics alone miss qualitative aspects ("Is the gait natural?"); a VLM alone risks hallucination ("runs forward smoothly" when actually vibrating in place). P2P runs two independent judges in parallel: a Code Judge (LLM-authored evaluation of trajectory data) and a VLM Judge (vision-language model watching the rollout video). An LLM Synthesizer merges both scores, cross-checking disagreements by re-examining specific video moments or verifying physics data directly.
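A toy version of the merge rule might look like the following; the gap threshold and score averaging are assumptions for illustration, since the real synthesizer is an LLM reasoning in natural language:

```python
def synthesize(code_score: float, vlm_score: float, gap_tol: float = 0.3) -> dict:
    """Toy merge rule for the LLM synthesizer (threshold and averaging assumed)."""
    gap = abs(code_score - vlm_score)
    if gap <= gap_tol:
        # Judges agree: merge directly into a final verdict.
        return {"score": (code_score + vlm_score) / 2, "cross_check": False}
    # Judges disagree: flag for targeted cross-examination (re-query the VLM
    # on a time window, or run a trajectory check) before trusting either one.
    return {"score": min(code_score, vlm_score), "cross_check": True}
```

Taking the minimum on disagreement is a conservative placeholder; the point is that a large gap triggers cross-examination rather than a blind average.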

VLM Judge: Two-Turn Protocol

Naively asking a VLM "did the agent succeed?" leads it to agree with the prompt rather than critically evaluate. To counter this, P2P uses a two-turn protocol (Andrade et al., 2025). In the first turn the VLM receives the task description and a still frame of the starting pose (no rollout video) and must commit to 3–5 concrete visual criteria for success. In the second turn it receives the rollout video and evaluates each criterion individually, then assigns a holistic overall score. Because the criteria are locked upfront, the VLM must check each one against the video rather than defaulting to a blanket approval.

Turn 1 (define criteria; starting frame only). Input: task intent, camera info, and a starting frame. Output: 3–5 concrete visual success criteria (e.g. "both feet leave the ground between strides", "arms pump in an alternating rhythm", "forward speed exceeds walking pace"), locked before any evidence is seen. Turn 2 (score with video). Input: rollout video, motion trail, and the locked criteria. Output: a per-criterion verdict (met / partial / not_met) plus an overall intent score, a diagnosis (e.g. "no flight phase"), and failure tags.
VLM judge two-turn protocol. Turn 1 locks success criteria before any evidence is shown; Turn 2 scores each criterion against the rollout.
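The protocol reduces to two chat turns. A minimal sketch, assuming a generic chat-style VLM callable; the message schema and prompt wording below are illustrative, not the system's actual ones:

```python
def two_turn_vlm_judge(vlm, intent: str, start_frame, rollout_video):
    """Two-turn protocol sketch; `vlm` maps a message list to a text reply."""
    # Turn 1: criteria are locked before the model sees any rollout evidence.
    criteria = vlm([{"role": "user", "content": [
        {"type": "text", "text": (
            f"Task: {intent}. From the starting frame alone, list 3-5 "
            "concrete visual success criteria. Do not assume success.")},
        {"type": "image", "data": start_frame},
    ]}])
    # Turn 2: score each locked criterion against the actual rollout.
    verdict = vlm([{"role": "user", "content": [
        {"type": "text", "text": (
            "For each criterion, answer met / partial / not_met, then give "
            f"an overall intent score in [0, 1]:\n{criteria}")},
        {"type": "video", "data": rollout_video},
    ]}])
    return criteria, verdict

# Tiny fake VLM for illustration: answers differently per turn.
def fake_vlm(messages):
    text = messages[0]["content"][0]["text"]
    if "success criteria" in text:
        return "1. flight phase  2. alternating arm pump  3. forward speed"
    return "1. not_met  2. partial  3. met  intent_score: 0.40"

criteria, verdict = two_turn_vlm_judge(fake_vlm, "jog forward", b"frame", b"video")
```

Because the turn-2 prompt quotes the locked criteria verbatim, the model must address each one rather than issue a blanket approval.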

At low frame rates (e.g. 5 fps), motion continuity (such as rotation and velocity) is hard to judge from individual frames. P2P addresses this by optionally providing a motion trail video, where preceding sub-frames appear as translucent ghosts, alongside the standard recording. The VLM uses both videos together when scoring.
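The ghosting itself is simple image compositing. A sketch in NumPy, assuming lighten-blending of faded preceding frames (the project's actual rendering may differ):

```python
import numpy as np

def motion_trail(frames: np.ndarray, n_ghosts: int = 3, decay: float = 0.5) -> np.ndarray:
    """Overlay each frame with translucent copies of the preceding sub-frames.

    `frames` is (T, H, W) or (T, H, W, C) with values in [0, 1]. Older ghosts
    are fainter; lighten-blending (elementwise max) keeps them visible.
    """
    out = frames.astype(float).copy()
    for k in range(1, n_ghosts + 1):
        alpha = decay ** k                        # ghost k frames back
        out[k:] = np.maximum(out[k:], alpha * frames[:-k])
    return np.clip(out, 0.0, 1.0)
```

The resulting video makes per-frame displacement directly visible, so rotation and velocity can be judged even at 5 fps.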

Code Judge: LLM-Authored Trajectory Analysis

The Code Judge evaluates behavior by analyzing raw physics data (joint positions, velocities, contact forces, and body coordinates). Several recent systems also use LLMs to generate evaluation or success-detection code (LEARN-Opt, OMNI-EPIC, Agentic Skill Discovery), but typically in a single shot or with only execution-level error feedback. P2P's Code Judge adds two quality gates absent in prior work: a structured hierarchical decomposition, in which the LLM organizes sub-tasks into a dependency graph of sequential (1 → 2, gated), independent ((1, 2)), and nested (1.1, 1.2) relations that directly governs scoring; and a dedicated review, in which a separate LLM pass searches the generated decomposition and code for loopholes, exploitable edge cases, and other issues.

Phase 1 (decompose & review): the LLM breaks the intent into a graph of measurable criteria, and a separate LLM pass reviews the decomposition for gaps and loopholes, retrying if needed. Phase 2 (implement & validate): the LLM writes judge_fn(trajectory, summary) → dict, compiles and executes it on a real trajectory (fixing any errors), and a final LLM code review again checks for loopholes. Execution is sandboxed: only numpy and math are allowed, with a 30 s timeout and no I/O. Example structure for "jog forward with arm swing": 1. Upright Posture → 2. Jogging Quality (gated by 1), where 2 decomposes into (2.1 Arm Swing, 2.2 Stride Cycle) → 2.3 Contralateral Rhythm (gated by 2.1 and 2.2).
Code Judge generation pipeline. The LLM decomposes the intent, implements a Python evaluation function, and validates it against real trajectory data before use.
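A judge_fn for the example decomposition above might look like this sketch. All field names (torso_pitch, arm_phase, leg_phase, stride_count) and thresholds are illustrative, not the environment's real keys; only numpy is used, matching the sandbox:

```python
import numpy as np

def judge_fn(trajectory: dict, summary: dict) -> dict:
    """Illustrative LLM-authored judge for "jog forward with arm swing"."""
    # 1. Upright posture: fraction of timesteps with small torso pitch.
    upright = float(np.mean(np.abs(trajectory["torso_pitch"]) < 0.3))
    # 2.1 / 2.2: independent sub-criteria, scored separately.
    arm_swing = float(np.std(trajectory["arm_phase"]) > 0.5)
    stride = float(summary["stride_count"] >= 4)
    # 2.3: gated by 2.1 and 2.2 — scored only if both parents pass.
    contralateral = 0.0
    if arm_swing and stride:
        corr = np.corrcoef(trajectory["arm_phase"], -trajectory["leg_phase"])[0, 1]
        contralateral = float(corr > 0.5)
    jogging = (arm_swing + stride + contralateral) / 3.0
    # Top-level gate (1 → 2): jogging quality counts only if posture holds.
    return {"score": upright * jogging, "upright": upright, "jogging": jogging}
```

Note how the dependency graph governs scoring directly: a policy that jogs while face-down scores zero, because the gated terms are multiplied through the parent.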

Agentic Synthesis

When the two judges agree, the synthesizer merges them directly into a final verdict. When they disagree, an agentic tool loop kicks in: the synthesizer can re-query the VLM on a specific time window of the video (e.g., "did the policy lift off between seconds 1–3?") or run a trajectory check (a Python snippet against the physics data to verify a specific claim). This targeted cross-examination resolves ambiguity without re-running the full judgment pipeline.

2. Agentic Revision: Reward, Hyperparameters, and Experiment Lineage

Reward design and hyperparameter tuning are deeply coupled: a well-shaped reward can fail under the wrong learning rate, and vice versa. P2P's Revise Agent tunes both simultaneously, training each reward revision across multiple HP configurations in parallel, and tracks every outcome in an experiment lineage.

v1 reward (baseline 0.62, config_1 0.43, config_2 0.72) → v2 reward (baseline 0.55, config_1 0.48, config_2 0.42); v3 reward branches from v1 (baseline 0.71, config_1 0.80, config_2 0.88 — best config on the active path).
Experiment lineage. All configs within an iteration share the same reward function but differ in hyperparameters. The best config becomes the parent for the next iteration. The Revise Agent can branch from any past iteration; here, v3 branches from v1 instead of continuing from v2.

The Revise Agent is not a fixed optimization loop. It is a tool-using LLM that receives training diagnostics, investigates the lineage (reading past reward functions, comparing scores, checking for recurring failure patterns), and then decides what to do next:

  1. Where to branch from. The agent can continue from the current iteration or revert to an earlier one whose approach scored higher. In the diagram above, v2 regressed from v1, so the agent branched back to v1 and tried a different reward design (v3).
  2. What to change. Based on the diagnostics and its own investigation, the agent proposes reward and hyperparameter changes together. In normal mode, it must propose 1–2 changes only. Structural rewrites are allowed only during plateau (3+ iterations since the best score was achieved).
  3. What to avoid. The agent retrieves accumulated lessons from the experiment lineage. Lessons have graduated trust tiers; the agent must respect high-tier lessons but can demote or retire them with justification when the evidence warrants it.
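The branch-point and plateau rules above can be sketched as simple heuristics over the lineage. This is an assumed simplification: the actual Revise Agent is a tool-using LLM, not a fixed rule:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    version: int
    parent: Optional[int]   # version branched from (None for v1)
    best_score: float       # best HP config's score for this reward

def choose_parent(lineage: List[Node]) -> Node:
    """Branch-point heuristic: continue from the latest iteration unless an
    earlier one scored strictly higher, in which case revert to it."""
    latest = lineage[-1]
    best = max(lineage, key=lambda n: n.best_score)
    return latest if latest.best_score >= best.best_score else best

def in_plateau(lineage: List[Node], window: int = 3) -> bool:
    """Structural rewrites unlock only after `window`+ iterations without
    beating the best score."""
    best_version = max(lineage, key=lambda n: n.best_score).version
    return lineage[-1].version - best_version >= window

lineage = [Node(1, None, 0.72), Node(2, 1, 0.55)]
parent = choose_parent(lineage)   # v2 regressed, so branch back from v1
```

The diagram above follows the same logic: v2 regressed from v1, so v3 branches from v1 rather than continuing the active path.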

Each lesson enters the knowledge base as STRONG and can be promoted or demoted by the agent over time.

When the active lesson count grows large, an LLM consolidation pass merges duplicates (the higher tier wins), removes superseded entries, and caps the list. RETIRED lessons remain visible so the agent does not rediscover the same mistake. This lets the Revise Agent make informed revisions grounded in the full experiment history, rather than blind guesses based only on the latest result.
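The consolidation pass can be sketched as follows. Only the STRONG and RETIRED tier names appear in this post, so the tier ordering here is an assumption standing in for the full graduated scale:

```python
TIERS = ["RETIRED", "STRONG"]   # assumed ordering; the real system has more
                                # graduated trust tiers than these two names

def consolidate(lessons: list, cap: int = 10) -> list:
    """Sketch of the consolidation pass: merge duplicate lessons (higher tier
    wins), cap the active list, and keep RETIRED entries visible so the agent
    does not rediscover the same mistake."""
    merged = {}
    for lesson in lessons:
        prev = merged.get(lesson["text"])
        if prev is None or TIERS.index(lesson["tier"]) > TIERS.index(prev["tier"]):
            merged[lesson["text"]] = lesson      # higher tier wins the merge
    active = [l for l in merged.values() if l["tier"] != "RETIRED"]
    retired = [l for l in merged.values() if l["tier"] == "RETIRED"]
    return active[:cap] + retired                # cap applies to active only
```

In the real system the merge is itself an LLM pass (so near-duplicate wordings also collapse); exact-text matching here is just the minimal version of the idea.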

Walkthrough: Natural Human Walking

Prompt: "Walk forward with a natural human-like gait, alternating left and right steps in a steady rhythm, while gently swinging the opposite arm forward with each step."

Experiment lineage. Revisions v3–v8 regressed from v2. The Revise Agent branched back to v2 with a different approach, converging at v9→v10.

Dashboard Tutorial

The web dashboard lets you watch the full loop in real time. Visit the project page to get started.

Dashboard walkthrough: launching a run, monitoring training progress, and reviewing VLM judgments in real time.

Limitations and Future Directions

Current Limitations

Future Directions

Citation

Wooseong Chung, Taegwan Ha, Yunhyeok Kwak, Taehwan Kwon, Jeong-Gwan Lee, Kangwook Lee, Suyoung Lee

KRAFTON AI, Ludo Robotics

*Equal contribution. Authors listed in alphabetical order by last name.

BibTeX citation

@misc{prompt2policy2026,
  title   = {Prompt-to-Policy: Agentic Engineering for Reinforcement Learning},
  author  = {{KRAFTON AI} and {Ludo Robotics} and Wooseong Chung and Taegwan Ha and Yunhyeok Kwak and Taehwan Kwon and Jeong-Gwan Lee and Kangwook Lee and Suyoung Lee},
  year    = {2026},
  url     = {https://github.com/krafton-ai/Prompt2Policy}
}