1. Overview
At GDC 2026, KRAFTON and NVIDIA co-presented "Building a Co-Playable Character: PUBG Ally, an AI Teammate Powered by NVIDIA ACE." The version shown at GDC used a workflow-based SLM design. Since then, Ally has evolved into an autonomous agent architecture, as the language model's role expanded from response generation to tool-using decision-making.
This post covers the background of that architectural transition, the design of the agent loop, and the criteria behind choosing Nemotron for on-device deployment.
2. How the SLM Workflow Expanded
A brief look at how PUBG Ally's SLM workflow evolved.
In that workflow design, fast per-frame control was handled by the Behavior Tree, while the language model was responsible for speech and high-level steerable commands — a separation often described as a System 1 / System 2 structure. It worked well, but becoming a true teammate meant expanding what System 2 could do.
As Ally expanded to support a true teammate experience, three capabilities were needed:
- Proactivity — Speaking up without being asked
- Memory — Remembering past conversations and preferences
- Strategic Suggestion — Reading the situation and proposing strategy
As these capabilities were added, the workflow became more structured. Rather than handling everything in a single prompt, the system separated them into dedicated components with different roles and invocation patterns.
The most practical reason for splitting roles was prompt space and latency. A single prompt that handles everything quickly runs into context limits. Separate components mean separate token budgets and call frequencies to optimize independently.
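As a concrete sketch of this split, each capability can be modeled as a dedicated component with its own token budget and invocation trigger. The component names, budgets, and triggers below are illustrative assumptions, not Ally's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    token_budget: int   # prompt budget optimized independently per component
    trigger: str        # invocation pattern, instead of one monolithic prompt

# Hypothetical pipeline mirroring the three capabilities above.
PIPELINE = [
    Component("proactive_speech", token_budget=512,  trigger="on_game_event"),
    Component("memory_recall",    token_budget=1024, trigger="on_user_turn"),
    Component("strategy",         token_budget=2048, trigger="on_phase_change"),
]

def budget_for(name: str) -> int:
    """Look up a component's independent token budget."""
    return next(c.token_budget for c in PIPELINE if c.name == name)
```

Because each component owns its budget and trigger, latency and context pressure can be tuned per component rather than fought over inside one prompt.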
This was where the GDC talk left off. The workflow neatly defined who handles what. But how each component reasons and acts — that part still relied on fixed prompts and predefined flows.
3. When the Workflow-Based SLM Became an Autonomous Agent
The key shift was in the language model's role. In the workflow design, the system called each component in a predefined order, and the language model generated text for a given prompt — orchestration remained under system control.
As the SLM's role expanded, it began to choose which tool to call, read the result, and decide the next step on its own. Rather than receiving instructions from the system, the model became the decision-maker. That is the core transition.
Core Capabilities and Agent Loop
Ally's tools span speech, perception, memory, and action. A few examples of what a tool call looks like in practice:
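For illustration, here are a few hypothetical tool-call payloads spanning those four categories. The tool names and argument schemas are assumptions for this sketch, not Ally's actual tool set; the point is the shape: the model emits a structured call, the runtime executes it, and the result comes back for the model to read.

```python
# Illustrative tool calls — one each for speech, perception, memory, action.
example_calls = [
    {"tool": "say",           "args": {"text": "Enemy spotted, north ridge."}},
    {"tool": "query_vision",  "args": {"radius_m": 150}},
    {"tool": "recall_memory", "args": {"topic": "player_playstyle"}},
    {"tool": "move_to",       "args": {"x": 1024.0, "y": 2048.0}},
]

def validate(call: dict) -> bool:
    """Minimal schema check: a tool name plus a dict of arguments."""
    return isinstance(call.get("tool"), str) and isinstance(call.get("args"), dict)
```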
For each situation, the model selects only the tools it needs, reads the results, and decides the next step — repeating this loop until the response is complete. That requires high tool-calling accuracy, strong result interpretation, and the ability to maintain consistent decision-making across multiple steps.
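The loop described above can be sketched in a few lines. The model (stubbed here) chooses a tool, the runtime executes it, and the result is appended to the context until the model signals it is done. All names are illustrative assumptions, not Ally's implementation:

```python
def run_agent_loop(model, tools, context, max_steps=5):
    """Repeat: model decides -> tool runs -> result feeds back."""
    for _ in range(max_steps):
        decision = model(context)              # model picks the next action
        if decision["tool"] == "finish":
            return decision["args"]["response"]
        result = tools[decision["tool"]](**decision["args"])
        context.append({"tool": decision["tool"], "result": result})
    return None  # step budget exhausted without a final response

# Stub model: check vision once, then respond based on the result.
def stub_model(context):
    if not context:
        return {"tool": "query_vision", "args": {"radius_m": 100}}
    return {"tool": "finish", "args": {"response": "Clear, let's push."}}

tools = {"query_vision": lambda radius_m: {"enemies": 0, "radius_m": radius_m}}
```

The `max_steps` cap matters in practice: a model with weak multi-step consistency will burn the budget without converging, which is exactly why tool-calling accuracy across steps became a selection criterion.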
4. Model Selection for On-Device Deployment: Nemotron
The criteria for choosing an on-device SLM were clear:
- Multi-turn tool-calling accuracy — Calling multiple tools per cycle reliably and without hallucination.
- Knowledge Distillation within the same architecture family — Efficiently transferring knowledge from an LLM down to an on-device model.
- Game engine integration — LLM inference and graphics rendering must coexist on the same GPU without conflict.
The NVIDIA Nemotron (Minitron) family and the NVIDIA stack satisfied all three.
Knowledge Distillation within the Family
We compared two training approaches:
- (A) Direct SFT — The 2B model learns directly from the LLM's outputs (hard labels).
- (B) Knowledge Distillation — An 8B model first learns from the LLM, then the 2B model distills from the 8B's probability distribution (soft labels).
When teacher and student share the same architecture family, logit distribution transfer is far more natural. Minitron 8B and Minitron 2B share the same tokenizer, attention structure, and training corpus, so the meaning of soft labels is preserved across model boundaries. This was the decisive advantage over other 2B candidates in the KD pipeline.
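The soft-label objective in (B) can be written as a KL divergence between the teacher's and student's softened output distributions over the shared vocabulary. This is a generic distillation sketch, not KRAFTON's training code; the temperature value is an illustrative assumption:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions (soft labels).

    A shared tokenizer means both logit vectors index the same vocabulary,
    so the divergence is meaningful term by term — the advantage of
    distilling within one architecture family.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical logits give zero loss; the further the student's distribution drifts from the teacher's, the larger the penalty — unlike hard-label SFT, which only rewards the single top token.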
Practical Advantages of the NVIDIA Stack
IGI (In-Game Inferencing) SDK. A single plugin API allows seamless switching between local execution (GGML/CUDA-based) and cloud execution (NIM). During early development, we validated the architecture on an LLM, then switched to on-device Minitron with a single API change. Without this seamless switching, the entire strategy of "design on LLM, deploy on SLM" would have been impractical.
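The "design on LLM, deploy on SLM" workflow hinges on the caller being unaware of which backend serves inference. A minimal sketch of that pattern, with class and method names invented for illustration (this is not the actual IGI SDK surface):

```python
class CloudBackend:
    """Stand-in for a cloud endpoint used during early development."""
    def infer(self, prompt: str) -> str:
        return f"[cloud LLM] {prompt}"

class LocalBackend:
    """Stand-in for an on-device GGML/CUDA model used at ship time."""
    def infer(self, prompt: str) -> str:
        return f"[on-device SLM] {prompt}"

class InferenceClient:
    """Game code talks only to this; the backend is the single switch point."""
    def __init__(self, backend):
        self.backend = backend

    def infer(self, prompt: str) -> str:
        return self.backend.infer(prompt)

client = InferenceClient(CloudBackend())   # validate the architecture on an LLM
client.backend = LocalBackend()            # then ship on-device: one change
```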
CUDA in Graphics (CiG). Runs CUDA computation and graphics rendering in the same context. Since Ally repeats tool calls 3–5 times per cycle, the benefit of reduced GPU context switching compounds over repeated calls.
Two On-Device SLM Deployment Patterns
This isn't the first time KRAFTON has deployed an on-device SLM. inZOI's "Smart Zoi" used Minitron 0.5B for character reasoning tasks such as action selection and daily reflection. The two deployments solve different problems, but the contrast illustrates how much the demands on an on-device model can vary.
| | inZOI (Smart Zoi) | PUBG Ally |
|---|---|---|
| Model | Minitron 0.5B | Minitron 2B (Q4_K_M) |
| Task | Single inference (one-shot) | Iterative multi-step reasoning |
| Tool Use | None | Enabled |
| Frequency | Once per event | Iterative per event |
Both run on the Nemotron family and the NVIDIA IGI stack — powering in-game AI with an on-device SLM.
5. Summary
The key architectural shift in PUBG Ally was not the replacement of one system with another, but the expansion of the language model's role — from a workflow component responsible for response generation to an agent capable of selecting tools, interpreting results, and deciding what to do next. As that role expanded, so did the requirements placed on the model.
To support this agent loop on player hardware, we chose Nemotron. Multi-turn tool-calling accuracy, Knowledge Distillation within the same architecture family, and game engine integration via IGI SDK and CiG — Nemotron and the NVIDIA runtime stack met all three conditions.
PUBG Ally is set to launch in summer 2026 through PUBG's new Arcade Mode beta.