KRAFTON AI · LUDO ROBOTICS / Research

Prompt-to-Policy: Agentic Engineering for Reinforcement Learning

Describe a behavior in words, and an agent writes the reward, trains a policy, judges the result with LLM-written code metrics and a VLM, and revises until the policy matches your intent. No human intervention required.

A sample of behaviors learned by P2P from natural language prompts. No manual reward engineering.

The Reward Engineering Bottleneck

In reinforcement learning, the reward function defines what the policy should optimize. A simple instruction like "walk forward naturally" seems straightforward, but translating it into a mathematical objective that a learning algorithm can follow is anything but.

This iterative process of writing, training, observing, and revising reward functions is called reward engineering. In practice, it consumes weeks of researcher time per task, requires deep domain expertise, and remains the primary bottleneck preventing RL from scaling to new tasks and environments.

Manual Reward Engineering

  • Expert writes reward function
  • Trains for hours, watches video
  • Manually diagnoses failure mode
  • Tweaks coefficients, retrains
  • Repeats for days to weeks

Prompt-to-Policy (Automated)

  • LLM writes reward function
  • PPO trains the policy
  • LLM and VLM judge behavior
  • LLM diagnoses & revises
  • Automatically repeats until success

Our Approach

Prompt-to-Policy (P2P) automates this entire cycle. The user provides a natural language description of the desired behavior. An LLM-based agent then iterates autonomously: it writes a reward function, trains a policy via RL, and judges the result through a multi-modal evaluation (combining code-based trajectory analysis with VLM video evaluation). If the behavior does not match the intent, the agent revises and retrains. The only input is a prompt; the output is a trained policy.

Pipeline overview. A natural-language prompt (e.g. "Shadow box while walking") enters the Prompt-to-Policy loop, which iterates autonomously (write → train → judge → revise) until the behavior matches the intent, and outputs a trained policy with no human intervention.
  1. Write. The LLM writes the initial reward function and code-based judge from the intent and environment specification.
  2. Train. PPO trains a policy using the generated reward, across multiple seeds and hyperparameter configs in parallel.
  3. Judge. A Code Judge analyzes trajectory metrics while a VLM watches the rollout video; an LLM synthesizes both into a pass/fail judgment.
  4. Revise. If the judgment is a fail, an LLM agent diagnoses the failure, revises the reward function, and tunes hyperparameters for the next iteration.
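The four-step loop can be sketched as a minimal driver. Every function below is a hypothetical stand-in for the LLM, PPO, and judging components (the scores are synthetic so the loop terminates):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Iteration:
    version: int
    reward_code: str
    score: float

def write_reward(intent: str, lineage: List[Iteration]) -> str:
    """Stand-in for the LLM call that writes (or revises) the reward function."""
    return f"# reward v{len(lineage) + 1} for: {intent}"

def train_policy(version: int, hp_configs: List[dict]) -> float:
    """Stand-in for parallel PPO runs; returns the best config's intent score.
    Here each revision simply scores a bit higher, purely for illustration."""
    return min(1.0, 0.4 + 0.2 * version)

def judge(score: float, threshold: float = 0.9) -> bool:
    """Stand-in for the Code Judge + VLM Judge + LLM synthesizer."""
    return score >= threshold

def prompt_to_policy(intent: str, max_iters: int = 10) -> List[Iteration]:
    lineage: List[Iteration] = []
    hp_configs = [{"lr": 3e-4}, {"lr": 1e-3}]      # shared reward, varied HPs
    for _ in range(max_iters):
        reward = write_reward(intent, lineage)               # 1. Write
        score = train_policy(len(lineage) + 1, hp_configs)   # 2. Train
        lineage.append(Iteration(len(lineage) + 1, reward, score))
        if judge(score):                                     # 3. Judge → done
            break                                            # 4. else Revise
    return lineage

runs = prompt_to_policy("shadow box while walking")
```

The only input is the prompt; the lineage of reward versions and scores is what the Revise Agent later reasons over.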

Where P2P Sits

Positioning, from focused to general:

  • Reward engineering (Eureka, Text2Reward): reward-only search; requires human involvement.
  • Prompt-to-Policy (this work): joint reward + hyperparameter search, auto-generated evaluation, RL-specific guardrails.
  • Research automation (AutoResearch, AI Scientist v2): full-cycle automation; no RL-specific judgment.

Prior reward-engineering systems (Eureka, Text2Reward) use LLMs to iteratively refine reward functions, but each task requires a human-defined fitness metric, and the search modifies only the reward from the single previous iteration. General-purpose research agents (AutoResearch, AI Scientist v2) offer broader flexibility, but operate without RL-specific scaffolding: no structured behavior evaluation, no experiment lineage, no reward-aware guardrails.

P2P combines iterative reward writing with agentic flexibility, while adding structure specific to RL: multi-modal evaluation that generates scoring criteria automatically from the intent, joint reward and hyperparameter search with branching across the full experiment history, and guardrails (tiered lessons, code review, multi-judge synthesis) that keep the search grounded.

Key Design Decisions

1. Dual Judgment with Cross-Verification

Physics metrics alone miss qualitative aspects ("Is the gait natural?"); a VLM alone risks hallucination ("runs forward smoothly" when actually vibrating in place). P2P runs two independent judges in parallel: a Code Judge (LLM-authored evaluation of trajectory data) and a VLM Judge (vision-language model watching the rollout video). An LLM Synthesizer merges both scores, cross-checking disagreements by re-examining specific video moments or verifying physics data directly.
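A toy version of the merge rule might look like the following; the gap threshold and score averaging are assumptions for illustration, since the real synthesizer is an LLM reasoning in natural language:

```python
def synthesize(code_score: float, vlm_score: float, gap_tol: float = 0.3) -> dict:
    """Toy merge rule for the LLM synthesizer (threshold and averaging assumed)."""
    gap = abs(code_score - vlm_score)
    if gap <= gap_tol:
        # Judges agree: merge directly into a final verdict.
        return {"score": (code_score + vlm_score) / 2, "cross_check": False}
    # Judges disagree: flag for targeted cross-examination (re-query the VLM
    # on a time window, or run a trajectory check) before trusting either one.
    return {"score": min(code_score, vlm_score), "cross_check": True}
```

Taking the minimum on disagreement is a conservative placeholder; the point is that a large gap triggers cross-examination rather than a blind average.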

VLM Judge: Two-Turn Protocol

Naively asking a VLM "did the agent succeed?" leads it to agree with the prompt rather than critically evaluate. To counter this, P2P uses a two-turn protocol (Andrade et al., 2025). In the first turn the VLM receives the task description and a still frame of the starting pose (no rollout video) and must commit to 3–5 concrete visual criteria for success. In the second turn it receives the rollout video and evaluates each criterion individually, then assigns a holistic overall score. Because the criteria are locked upfront, the VLM must check each one against the video rather than defaulting to a blanket approval.

Turn 1 (define criteria; starting frame only). Input: task intent, camera info, and a starting frame. Output: 3–5 concrete visual success criteria (e.g. "both feet leave the ground between strides", "arms pump in an alternating rhythm", "forward speed exceeds walking pace"), locked before any evidence is seen. Turn 2 (score with video). Input: rollout video, motion trail, and the locked criteria. Output: a per-criterion verdict (met / partial / not_met) plus an overall intent score, a diagnosis (e.g. "no flight phase"), and failure tags.
VLM judge two-turn protocol. Turn 1 locks success criteria before any evidence is shown; Turn 2 scores each criterion against the rollout.
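The protocol reduces to two chat turns. A minimal sketch, assuming a generic chat-style VLM callable; the message schema and prompt wording below are illustrative, not the system's actual ones:

```python
def two_turn_vlm_judge(vlm, intent: str, start_frame, rollout_video):
    """Two-turn protocol sketch; `vlm` maps a message list to a text reply."""
    # Turn 1: criteria are locked before the model sees any rollout evidence.
    criteria = vlm([{"role": "user", "content": [
        {"type": "text", "text": (
            f"Task: {intent}. From the starting frame alone, list 3-5 "
            "concrete visual success criteria. Do not assume success.")},
        {"type": "image", "data": start_frame},
    ]}])
    # Turn 2: score each locked criterion against the actual rollout.
    verdict = vlm([{"role": "user", "content": [
        {"type": "text", "text": (
            "For each criterion, answer met / partial / not_met, then give "
            f"an overall intent score in [0, 1]:\n{criteria}")},
        {"type": "video", "data": rollout_video},
    ]}])
    return criteria, verdict

# Tiny fake VLM for illustration: answers differently per turn.
def fake_vlm(messages):
    text = messages[0]["content"][0]["text"]
    if "success criteria" in text:
        return "1. flight phase  2. alternating arm pump  3. forward speed"
    return "1. not_met  2. partial  3. met  intent_score: 0.40"

criteria, verdict = two_turn_vlm_judge(fake_vlm, "jog forward", b"frame", b"video")
```

Because the turn-2 prompt quotes the locked criteria verbatim, the model must address each one rather than issue a blanket approval.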

At low frame rates (e.g. 5 fps), motion continuity (such as rotation and velocity) is hard to judge from individual frames. P2P addresses this by optionally providing a motion trail video, where preceding sub-frames appear as translucent ghosts, alongside the standard recording. The VLM uses both videos together when scoring.
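The ghosting itself is simple image compositing. A sketch in NumPy, assuming lighten-blending of faded preceding frames (the project's actual rendering may differ):

```python
import numpy as np

def motion_trail(frames: np.ndarray, n_ghosts: int = 3, decay: float = 0.5) -> np.ndarray:
    """Overlay each frame with translucent copies of the preceding sub-frames.

    `frames` is (T, H, W) or (T, H, W, C) with values in [0, 1]. Older ghosts
    are fainter; lighten-blending (elementwise max) keeps them visible.
    """
    out = frames.astype(float).copy()
    for k in range(1, n_ghosts + 1):
        alpha = decay ** k                        # ghost k frames back
        out[k:] = np.maximum(out[k:], alpha * frames[:-k])
    return np.clip(out, 0.0, 1.0)
```

The resulting video makes per-frame displacement directly visible, so rotation and velocity can be judged even at 5 fps.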

Code Judge: LLM-Authored Trajectory Analysis

The Code Judge evaluates behavior by analyzing raw physics data (joint positions, velocities, contact forces, and body coordinates). Several recent systems also use LLMs to generate evaluation or success-detection code (LEARN-Opt, OMNI-EPIC, Agentic Skill Discovery), but typically in a single shot or with only execution-level error feedback. P2P's Code Judge adds two quality gates absent in prior work: a structured hierarchical decomposition, in which the LLM organizes sub-tasks into a dependency graph of sequential (1 → 2, gated), independent ((1, 2)), and nested (1.1, 1.2) relations that directly governs scoring; and a dedicated review, in which a separate LLM pass searches the generated decomposition and code for loopholes, exploitable edge cases, and other issues.

Phase 1 (decompose & review): the LLM breaks the intent into a graph of measurable criteria, and a separate LLM pass reviews the decomposition for gaps and loopholes, retrying if needed. Phase 2 (implement & validate): the LLM writes judge_fn(trajectory, summary) → dict, compiles and executes it on a real trajectory (fixing any errors), and a final LLM code review again checks for loopholes. Execution is sandboxed: only numpy and math are allowed, with a 30 s timeout and no I/O. Example structure for "jog forward with arm swing": 1. Upright Posture → 2. Jogging Quality (gated by 1), where 2 decomposes into (2.1 Arm Swing, 2.2 Stride Cycle) → 2.3 Contralateral Rhythm (gated by 2.1 and 2.2).
Code Judge generation pipeline. The LLM decomposes the intent, implements a Python evaluation function, and validates it against real trajectory data before use.
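A judge_fn for the example decomposition above might look like this sketch. All field names (torso_pitch, arm_phase, leg_phase, stride_count) and thresholds are illustrative, not the environment's real keys; only numpy is used, matching the sandbox:

```python
import numpy as np

def judge_fn(trajectory: dict, summary: dict) -> dict:
    """Illustrative LLM-authored judge for "jog forward with arm swing"."""
    # 1. Upright posture: fraction of timesteps with small torso pitch.
    upright = float(np.mean(np.abs(trajectory["torso_pitch"]) < 0.3))
    # 2.1 / 2.2: independent sub-criteria, scored separately.
    arm_swing = float(np.std(trajectory["arm_phase"]) > 0.5)
    stride = float(summary["stride_count"] >= 4)
    # 2.3: gated by 2.1 and 2.2 — scored only if both parents pass.
    contralateral = 0.0
    if arm_swing and stride:
        corr = np.corrcoef(trajectory["arm_phase"], -trajectory["leg_phase"])[0, 1]
        contralateral = float(corr > 0.5)
    jogging = (arm_swing + stride + contralateral) / 3.0
    # Top-level gate (1 → 2): jogging quality counts only if posture holds.
    return {"score": upright * jogging, "upright": upright, "jogging": jogging}
```

Note how the dependency graph governs scoring directly: a policy that jogs while face-down scores zero, because the gated terms are multiplied through the parent.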

Agentic Synthesis

When the two judges agree, the synthesizer merges them directly into a final verdict. When they disagree, an agentic tool loop kicks in: the synthesizer can re-query the VLM on a specific time window of the video (e.g., "did the policy lift off between seconds 1–3?") or run a trajectory check (a Python snippet against the physics data to verify a specific claim). This targeted cross-examination resolves ambiguity without re-running the full judgment pipeline.

2. Agentic Revision: Reward, Hyperparameters, and Experiment Lineage

Reward design and hyperparameter tuning are deeply coupled: a well-shaped reward can fail under the wrong learning rate, and vice versa. P2P's Revise Agent tunes both simultaneously, training each reward revision across multiple HP configurations in parallel, and tracks every outcome in an experiment lineage.

v1 reward (baseline 0.62, config_1 0.43, config_2 0.72) → v2 reward (baseline 0.55, config_1 0.48, config_2 0.42); v3 reward branches from v1 (baseline 0.71, config_1 0.80, config_2 0.88 — best config on the active path).
Experiment lineage. All configs within an iteration share the same reward function but differ in hyperparameters. The best config becomes the parent for the next iteration. The Revise Agent can branch from any past iteration; here, v3 branches from v1 instead of continuing from v2.

The Revise Agent is not a fixed optimization loop. It is a tool-using LLM that receives training diagnostics, investigates the lineage (reading past reward functions, comparing scores, checking for recurring failure patterns), and then decides what to do next:

  1. Where to branch from. The agent can continue from the current iteration or revert to an earlier one whose approach scored higher. In the diagram above, v2 regressed from v1, so the agent branched back to v1 and tried a different reward design (v3).
  2. What to change. Based on the diagnostics and its own investigation, the agent proposes reward and hyperparameter changes together. In normal mode, it must propose 1–2 changes only. Structural rewrites are allowed only during plateau (3+ iterations since the best score was achieved).
  3. What to avoid. The agent retrieves accumulated lessons from the experiment lineage. Lessons have graduated trust tiers; the agent must respect high-tier lessons but can demote or retire them with justification when the evidence warrants it.
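The branch-point and plateau rules above can be sketched as simple heuristics over the lineage. This is an assumed simplification: the actual Revise Agent is a tool-using LLM, not a fixed rule:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    version: int
    parent: Optional[int]   # version branched from (None for v1)
    best_score: float       # best HP config's score for this reward

def choose_parent(lineage: List[Node]) -> Node:
    """Branch-point heuristic: continue from the latest iteration unless an
    earlier one scored strictly higher, in which case revert to it."""
    latest = lineage[-1]
    best = max(lineage, key=lambda n: n.best_score)
    return latest if latest.best_score >= best.best_score else best

def in_plateau(lineage: List[Node], window: int = 3) -> bool:
    """Structural rewrites unlock only after `window`+ iterations without
    beating the best score."""
    best_version = max(lineage, key=lambda n: n.best_score).version
    return lineage[-1].version - best_version >= window

lineage = [Node(1, None, 0.72), Node(2, 1, 0.55)]
parent = choose_parent(lineage)   # v2 regressed, so branch back from v1
```

The diagram above follows the same logic: v2 regressed from v1, so v3 branches from v1 rather than continuing the active path.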

Each lesson enters the knowledge base as STRONG and can be promoted or demoted by the agent over time.

When the active lesson count grows large, an LLM consolidation pass merges duplicates (the higher tier wins), removes superseded entries, and caps the list. RETIRED lessons remain visible so the agent does not rediscover the same mistake. This lets the Revise Agent make informed revisions grounded in the full experiment history, rather than blind guesses based only on the latest result.
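The consolidation pass can be sketched as follows. Only the STRONG and RETIRED tier names appear in this post, so the tier ordering here is an assumption standing in for the full graduated scale:

```python
TIERS = ["RETIRED", "STRONG"]   # assumed ordering; the real system has more
                                # graduated trust tiers than these two names

def consolidate(lessons: list, cap: int = 10) -> list:
    """Sketch of the consolidation pass: merge duplicate lessons (higher tier
    wins), cap the active list, and keep RETIRED entries visible so the agent
    does not rediscover the same mistake."""
    merged = {}
    for lesson in lessons:
        prev = merged.get(lesson["text"])
        if prev is None or TIERS.index(lesson["tier"]) > TIERS.index(prev["tier"]):
            merged[lesson["text"]] = lesson      # higher tier wins the merge
    active = [l for l in merged.values() if l["tier"] != "RETIRED"]
    retired = [l for l in merged.values() if l["tier"] == "RETIRED"]
    return active[:cap] + retired                # cap applies to active only
```

In the real system the merge is itself an LLM pass (so near-duplicate wordings also collapse); exact-text matching here is just the minimal version of the idea.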

Walkthrough: Natural Human Walking

Prompt: "Walk forward with a natural human-like gait, alternating left and right steps in a steady rhythm, while gently swinging the opposite arm forward with each step."

Experiment lineage. Revisions v3–v8 regressed from v2. The Revise Agent branched back to v2 with a different approach, converging at v9→v10.

Dashboard Tutorial

The web dashboard lets you watch the full loop in real time. Visit the project page to get started.

Dashboard walkthrough: launching a run, monitoring training progress, and reviewing VLM judgments in real time.

Limitations and Future Directions

Current Limitations

Future Directions

Citation

Wooseong Chung, Taegwan Ha, Yunhyeok Kwak, Taehwan Kwon, Jeong-Gwan Lee, Kangwook Lee, Suyoung Lee

KRAFTON AI, Ludo Robotics

*Equal contribution. Authors listed in alphabetical order by last name.

BibTeX citation

@misc{prompt2policy2026,
  title   = {Prompt-to-Policy: Agentic Engineering for Reinforcement Learning},
  author  = {{KRAFTON AI} and {Ludo Robotics} and Wooseong Chung and Taegwan Ha and Yunhyeok Kwak and Taehwan Kwon and Jeong-Gwan Lee and Kangwook Lee and Suyoung Lee},
  year    = {2026},
  url     = {https://github.com/krafton-ai/Prompt2Policy}
}