From Static Templates to Dynamic Runtime Graphs

Paper: From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents
Authors: Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, and Shaowu Pan
Submitted: March 23, 2026
arXiv: 2603.22386

One-sentence summary

The paper treats LLM-agent workflows as executable computation graphs and organizes workflow-optimization research by when graph structure is determined, what part of the system is optimized, and what evidence authorizes structural changes.

What kind of paper this is

This is a taxonomy and survey, not an empirical method paper. It inventories 77 works:

  • 39 core workflow-optimization papers
  • 7 adjacent routing, selection, or pruning papers
  • 31 background frameworks and resources

It separately catalogs 27 evaluation assets. Its contribution should therefore be judged by conceptual clarity, literature coverage, coding consistency, and the usefulness of its proposed evaluation protocol, not by benchmark gains.

Core abstraction

The authors call an executable LLM-centered workflow an agentic computation graph (ACG). Nodes can represent LLM calls, retrieval, tools, code execution, memory updates, validation, or messages. Edges represent control, data, or communication dependencies.

The paper’s most useful distinction is between three artifacts:

  1. ACG template: The reusable executable specification, including nodes, edges, node parameters, scheduling or routing policy, and allowed activation or edit actions.
  2. Realized graph: The run-specific workflow structure actually used for one input.
  3. Execution trace: The states, actions, observations, failures, retries, and costs produced while executing that graph.

This separation prevents several common category errors. A reusable workflow may contain optional branches, the run may activate only some of them, and the trace may include retries or failures not apparent from the static diagram. Those are three related but different objects.
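The three artifacts can be kept apart in code. The following is a minimal sketch, not the paper's formalism: the class names, the `optional_nodes` field, the `realize` helper, and the status vocabulary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ACGTemplate:
    """Reusable specification: every node and edge a run MAY use."""
    nodes: frozenset[str]
    edges: frozenset[tuple[str, str]]
    optional_nodes: frozenset[str] = frozenset()  # hypothetical field


@dataclass(frozen=True)
class RealizedGraph:
    """Run-specific structure actually used for one input."""
    nodes: frozenset[str]
    edges: frozenset[tuple[str, str]]


@dataclass
class TraceEvent:
    node: str
    status: str      # "ok", "retry", or "failed" (hypothetical vocabulary)
    tokens: int = 0


@dataclass
class ExecutionTrace:
    """What happened while executing a realized graph."""
    events: list[TraceEvent] = field(default_factory=list)

    def retries(self) -> int:
        return sum(1 for e in self.events if e.status == "retry")


def realize(template: ACGTemplate, skip: frozenset[str]) -> RealizedGraph:
    """Drop skipped optional nodes and every edge that touches them."""
    if not skip <= template.optional_nodes:
        raise ValueError("only optional nodes may be skipped")
    nodes = template.nodes - skip
    edges = frozenset((a, b) for a, b in template.edges if a in nodes and b in nodes)
    return RealizedGraph(nodes, edges)


template = ACGTemplate(
    nodes=frozenset({"plan", "retrieve", "answer", "verify"}),
    edges=frozenset({
        ("plan", "retrieve"), ("retrieve", "answer"),
        ("plan", "answer"), ("answer", "verify"),
    }),
    optional_nodes=frozenset({"retrieve"}),
)
run_graph = realize(template, skip=frozenset({"retrieve"}))
trace = ExecutionTrace([
    TraceEvent("plan", "ok", 120),
    TraceEvent("answer", "retry", 300),   # visible only in the trace
    TraceEvent("answer", "ok", 280),
    TraceEvent("verify", "ok", 90),
])
```

Here the template has four nodes, the realized graph has three, and the retry on `answer` appears in neither graph. It is recorded only in the trace.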

Main taxonomy

The primary organizing question is: when is workflow structure determined?

Static optimization

The deployed structural degrees of freedom are fixed before inference. The template may still contain loops, conditionals, or stochastic routing, provided those policies are already encoded in the reusable scaffold.
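A toy illustration of why conditionals do not by themselves make a workflow dynamic. The routing rule and node names below are hypothetical. The point is only that the policy is authored before inference, so different inputs take different paths through the same fixed scaffold.

```python
def route(query: str) -> str:
    """Fixed routing policy, written offline; it is part of the template."""
    return "calculator" if any(ch.isdigit() for ch in query) else "retriever"


def realized_path(query: str) -> list[str]:
    """Nodes activated for one query under the fixed policy."""
    return ["classify", route(query), "answer"]


arithmetic_path = realized_path("What is 12 * 7?")
lookup_path = realized_path("Who wrote Hamlet?")
```

The two paths differ, yet nothing about the structural degrees of freedom was decided at inference time.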

The paper divides static work into:

  • offline template search;
  • node-level optimization inside a fixed scaffold;
  • joint optimization of topology and local configuration; and
  • verifier-aware static design.

Representative mechanisms include MCTS, evolutionary search, prompt compilation, continuous graph relaxations, and alternating prompt/topology optimization.

Dynamic optimization

Some part of the realized graph is selected, generated, or edited at inference time.

The paper presents a useful spectrum:

  • Selection or pruning: Activate a run-specific subgraph of a fixed supergraph.
  • Pre-execution generation: Construct a query-specific graph before running it.
  • Hybrid drafting and refinement: Draft before execution, then revise from early evidence.
  • In-execution editing: Add, remove, reconfigure, or reroute components while the task is running.
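The selection end of this spectrum can be sketched as budget-aware activation over a fixed supergraph. This is an assumed greedy rule with made-up relative costs and a made-up priority order, not a method from any surveyed paper.

```python
# Hypothetical supergraph: node name -> relative cost of activating it.
SUPERGRAPH_COSTS = {"draft": 1, "retrieve": 2, "critique": 3, "verify": 2}
REQUIRED = ["draft"]                          # always activated, even over budget
PRIORITY = ["verify", "retrieve", "critique"]  # assumed order for optional nodes


def select_subgraph(budget: int) -> list[str]:
    """Greedily activate optional nodes, in priority order, while they fit."""
    active = list(REQUIRED)
    spent = sum(SUPERGRAPH_COSTS[n] for n in active)
    for node in PRIORITY:
        cost = SUPERGRAPH_COSTS[node]
        if spent + cost <= budget:
            active.append(node)
            spent += cost
    return active


tight = select_subgraph(budget=3)
generous = select_subgraph(budget=8)
```

No node is created or rewired at inference time. The run only chooses which parts of an already validated structure to switch on.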

Two labels refine the static/dynamic distinction:

  • Graph determination time (GDT): offline, pre-execution, or in-execution.
  • Graph plasticity mode (GPM): none, select, generate, or edit.

The practical conclusion is to use the minimum plasticity required by the workload. Selection is appropriate when tasks mainly differ in difficulty or budget. Generation is justified when tasks require materially different decompositions. Runtime editing is justified when observations or failures reveal information unavailable at planning time.
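The two labels and the minimum-plasticity heuristic can be written down compactly. The enum values follow the labels listed above. The function and its three boolean inputs are an assumed encoding of this review's reading, not a procedure given in the paper.

```python
from enum import Enum


class GDT(Enum):
    """Graph determination time."""
    OFFLINE = "offline"
    PRE_EXECUTION = "pre-execution"
    IN_EXECUTION = "in-execution"


class GPM(Enum):
    """Graph plasticity mode."""
    NONE = "none"
    SELECT = "select"
    GENERATE = "generate"
    EDIT = "edit"


def minimum_plasticity(
    varies_in_difficulty_or_budget: bool,
    needs_different_decompositions: bool,
    reveals_information_at_runtime: bool,
) -> GPM:
    """Return the least plastic mode that still covers the stated workload needs."""
    if reveals_information_at_runtime:
        return GPM.EDIT
    if needs_different_decompositions:
        return GPM.GENERATE
    if varies_in_difficulty_or_budget:
        return GPM.SELECT
    return GPM.NONE
```

For example, `minimum_plasticity(True, False, False)` returns `GPM.SELECT`: a workload whose tasks differ only in difficulty or budget does not, on this heuristic, justify graph generation.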

Cross-cutting comparison axes

The survey compares methods across:

  • Optimization target: node, graph, or joint.
  • Representation: code, text, DSL, explicit graph, typed operator graph, or constrained intermediate representation.
  • Feedback: scalar metrics, verifiers, supervision, preferences, proxies, or trace-derived text.
  • Update mechanism: search, generation, routing, supervised learning, reinforcement learning, preference optimization, repair, or hybrid methods.
  • Cost handling: absent, evaluation-only, soft objective, or hard constraint.

The paper argues that the trusted feedback signal limits the safe granularity of graph changes. Strong validators permit more aggressive structural mutation because an invalid change can be rejected before it is adopted. Textual reflection is useful for proposing changes but should normally be checked by metrics or verifiers. Reinforcement learning is most defensible when workflow construction is genuinely sequential rather than merely serialized.
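The "propose, then check" pattern can be sketched as a gate on structural changes. Everything here is illustrative: the `accept_if_verified` helper, the edge-set representation, and the validation scores (made-up numbers) are assumptions, and the score function stands in for whatever trusted metric or verifier a system actually has.

```python
from typing import Callable, FrozenSet, Tuple

Edges = FrozenSet[Tuple[str, str]]


def accept_if_verified(
    current: Edges,
    proposed: Edges,
    score: Callable[[Edges], float],
    min_gain: float = 0.0,
) -> Edges:
    """Keep a proposed structure only if the trusted score improves by more than min_gain."""
    return proposed if score(proposed) - score(current) > min_gain else current


base = frozenset({("plan", "answer")})
with_verifier = frozenset({("plan", "answer"), ("answer", "verify")})
with_loop = frozenset({("plan", "answer"), ("answer", "plan")})

# Made-up validation scores, for illustration only.
validation_scores = {base: 0.62, with_verifier: 0.71, with_loop: 0.58}

kept_after_good_proposal = accept_if_verified(base, with_verifier, validation_scores.__getitem__)
kept_after_bad_proposal = accept_if_verified(base, with_loop, validation_scores.__getitem__)
```

The proposal source (a reflection step, a search operator, a human) is deliberately absent from the gate. Only the trusted score decides what is kept.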

Evaluation proposal

The evaluation section is the paper’s strongest practical contribution. It argues that final task accuracy is insufficient because gains may come from better structure, more compute, hidden retries, or stronger underlying models.

The proposed minimum reporting protocol includes:

  • workflow representation and schema constraints;
  • static/dynamic classification, allowed edits, routing, and stopping rules;
  • models, decoding settings, tools, memory, and verifier placement;
  • offline search or training cost;
  • online tokens, model calls, tool calls, latency, dollars, and cost per success;
  • retries, failures, fallbacks, edits, and termination causes;
  • node count, depth, width, critical path, communication volume, and structural variance;
  • perturbation tests for paraphrases, retrieval noise, tool failures, API drift, unseen tools, and strict budgets;
  • seeds, repeated runs, temperatures, splits, and canonicalization rules; and
  • trace-based failure analysis and structural ablations.

This is a materially better standard than reporting only aggregate task success.
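Two of the checklist items, cost per success and termination causes, are easy to compute once runs are logged. The sketch below assumes one plausible definition of cost per success (total spend, including failed runs, divided by the number of successes). The paper's exact definition may differ, and the record fields and sample values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RunRecord:
    success: bool
    dollars: float
    termination: str   # e.g. "answered", "budget_exhausted" (hypothetical labels)


def cost_per_success(runs: list[RunRecord]) -> float:
    """Total spend over all runs divided by successful runs (assumed definition)."""
    successes = sum(1 for r in runs if r.success)
    if successes == 0:
        return float("inf")
    return sum(r.dollars for r in runs) / successes


def termination_causes(runs: list[RunRecord]) -> Counter:
    """Count how each run ended."""
    return Counter(r.termination for r in runs)


runs = [
    RunRecord(True, 0.25, "answered"),
    RunRecord(False, 0.5, "budget_exhausted"),
    RunRecord(True, 0.25, "answered"),
    RunRecord(False, 1.0, "tool_failure"),
]
cps = cost_per_success(runs)
causes = termination_causes(runs)
```

Charging failed runs to the successes is what makes this metric sensitive to hidden retries and wasted compute, which a bare success rate would not show.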

Strengths

  1. The template/realized-graph/trace distinction is precise and useful. It gives researchers and system builders a shared vocabulary for artifacts that are often conflated.
  2. The static-to-dynamic spectrum is operational. GDT and GPM resolve the ambiguous case of a generator trained offline but used to emit a new graph for each input.
  3. The comparison card is consistent. Representation, feedback, update mechanism, and cost are applied across both static and dynamic methods.
  4. The paper treats cost and control policy as part of the method. Stopping rules, retries, edit caps, and fallback behavior are correctly recognized as workflow semantics rather than incidental engineering.
  5. The evaluation checklist is immediately reusable. It is suitable for design reviews, benchmark reports, and production observability.
  6. It resists “more agents is better” reasoning. The inclusion of pruning methods and strong single-agent baselines makes the synthesis more credible.

Limitations and concerns

  1. The literature-review method is not reproducible. The paper reports inventory counts and inclusion criteria but does not specify searched databases, query strings, date cutoffs, screening stages, excluded-paper counts, reviewer agreement, or a formal quality-assessment procedure.
  2. Coverage is broad but evidentiary synthesis is shallow. The paper classifies many recent systems, but it does not compare effect sizes, normalize benchmark conditions, or quantify how often the stated design rules hold.
  3. Several recommendations are plausible inferences, not established causal results. Claims about when selection, generation, or editing is preferable are synthesized from heterogeneous papers rather than controlled head-to-head comparisons.
  4. The graph abstraction hides important semantics. Persistent state, concurrency, side effects, event triggers, authorization boundaries, and nondeterministic external systems cannot always be represented adequately by node/edge topology alone.
  5. GPM categories are not perfectly orthogonal. Generation can include selection, and editing can generate new subgraphs. A method may need multiple labels or a hierarchical description.
  6. The static/dynamic boundary is policy-centric and potentially counterintuitive. A fixed conditional router can produce different realized graphs per input yet remains “static” if its branching policy was authored offline. The definition is defensible, but readers must internalize that “dynamic” refers to structural degrees of freedom, not merely different runtime paths.
  7. Structure-aware scoring remains underspecified. Reference-graph similarity can penalize semantically equivalent workflows. The paper acknowledges canonicalization but does not solve semantic equivalence for graphs with loops, tools, and side effects.
  8. Safety and governance are secondary. The quality-cost objective does not explicitly model permissions, data exposure, tool risk, or irreversible actions, even though these often constrain production agent workflows more strongly than token cost.
  9. The evidence base is unusually provisional. Many included works are recent arXiv papers, so taxonomy entries and reported results may change as systems are revised or independently evaluated.
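Concern 7 can be made concrete with a toy case. The similarity measure (Jaccard overlap of edge sets), the alias table, and the two graphs are assumptions chosen for illustration, not the paper's scoring method.

```python
def edge_jaccard(a: set[tuple[str, str]], b: set[tuple[str, str]]) -> float:
    """Jaccard similarity between two edge sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def canonicalize(edges: set[tuple[str, str]], aliases: dict[str, str]) -> set[tuple[str, str]]:
    """Map node names to canonical names before comparison."""
    return {(aliases.get(s, s), aliases.get(d, d)) for s, d in edges}


reference = {("plan", "retrieve"), ("retrieve", "answer")}
candidate = {("plan", "search"), ("search", "answer")}   # same workflow, different label
ALIASES = {"search": "retrieve"}                         # hypothetical alias table

raw_score = edge_jaccard(reference, candidate)
canonical_score = edge_jaccard(reference, canonicalize(candidate, ALIASES))
```

Label aliasing repairs this particular mismatch, but it does nothing for workflows that are equivalent in effect while differing in shape, which is the harder case the paper leaves open.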

Practical engineering takeaways

For an agent system, model the following as separate inspectable artifacts:

Reusable policy/template
        ↓ instantiate/select
Run-specific graph
        ↓ execute
Trace + outputs + costs
        ↓ evaluate
Validated structural or local update

Use trace evidence to decide where to intervene:

  • Tune prompts or node policies when the correct component runs with the right information but behaves poorly.
  • Change structure when the wrong component runs, a required component is missing, dependencies are wrong, or verification happens at the wrong point.
  • Prefer selection from a validated supergraph before unconstrained graph generation.
  • Add runtime graph editing only when execution reveals information that could not reasonably be known beforehand.
  • Couple free-form critique to executable validation; reflection should propose, not certify.
  • Treat token limits, tool-call limits, deadlines, repeated-state detection, and edit caps as explicit workflow policy.
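The last takeaway can be sketched as a policy object checked on every step. Field names, stop-reason strings, and the check order are assumptions. Deadline checks are omitted to keep the example deterministic, but would fit the same shape.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class WorkflowPolicy:
    max_tokens: int
    max_tool_calls: int
    max_edits: int


@dataclass
class RunState:
    tokens: int = 0
    tool_calls: int = 0
    edits: int = 0
    seen_states: set[str] = field(default_factory=set)


def stop_reason(policy: WorkflowPolicy, state: RunState, state_key: str) -> Optional[str]:
    """Return why the run must stop, or None if it may continue."""
    if state.tokens > policy.max_tokens:
        return "token_limit"
    if state.tool_calls > policy.max_tool_calls:
        return "tool_call_limit"
    if state.edits > policy.max_edits:
        return "edit_cap"
    if state_key in state.seen_states:
        return "repeated_state"
    state.seen_states.add(state_key)   # remember this state for loop detection
    return None


policy = WorkflowPolicy(max_tokens=1000, max_tool_calls=5, max_edits=2)
state = RunState()
first_visit = stop_reason(policy, state, "s1")
second_visit = stop_reason(policy, state, "s1")
```

Returning a named reason rather than a boolean means the termination cause lands in the trace, where it can feed the reporting protocol described earlier.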

Bottom line

This is a useful field map and vocabulary paper, with the template/realized-graph/trace distinction and the minimum reporting protocol as its durable contributions. It is less convincing as a systematic survey because its collection and coding process is not sufficiently documented, and its practical design rules are not backed by normalized cross-paper evidence. Use it as a framework for reasoning and reporting, not as proof that one workflow-optimization regime consistently outperforms another.