Co-Evolving Actor-Conditioned Critics for Non-Verifiable Generation

Abstract

Natural-language critiques provide supervision beyond scalar rewards for non-verifiable generation, where quality is multi-dimensional and no deterministic verifier exists. In critique-guided refinement, a critic gives feedback on an initial response and an actor revises it. However, final revision quality does not reveal whether the critique was actually useful: a capable actor may improve without following the feedback, while valid feedback may fail if the actor cannot execute it.

We frame critique as actor-conditioned revision guidance, where usefulness depends on whether the feedback helps the target actor address the intended weakness. We introduce TAIScore (Targeted Actionable Improvement Score), a reward that evaluates the instruction, initial response, critique, and revision together, assessing whether the critique targets a real weakness, whether the actor follows it, and whether the intended aspect improves. We use this reward to train an actor-tailored critic with GRPO, and use critique-guided refinements to construct DPO preference pairs for the actor, forming a co-evolving critic–actor loop where the critic adapts to the actor's changing capability.

Experiments show that an 8B critic trained with TAIScore outperforms both a zero-shot 120B critic and critics trained with outcome-only or critique-only reward signals. Co-evolving the critic and actor further improves performance, suggesting that effective critique supervision should adapt as the actor changes.

Critique Usefulness Is Actor-Conditioned

We conduct controlled analyses on WritingBench to test whether critique usefulness is intrinsic to the critique or conditioned on the actor. We find a key asymmetry. Scaling the critic consistently improves standalone critique quality but does not reliably lead the actor to incorporate that feedback or yield larger downstream gains. Scaling the actor while holding the critique fixed substantially improves both how well the actor follows the feedback and downstream quality. This indicates that critique usefulness is a property of the critique–actor pair, not of the critique alone.

Controlled analysis: scaling critic vs. scaling actor on critique adherence and gain

Controlled zero-shot analysis of critique-guided refinement, summarized as mean deltas over controlled comparisons. Critic quality measures standalone feedback quality. Critique adherence measures how well the actor incorporates the critique into its revision (rather than making unrelated edits). Gain is the downstream improvement S(y₁)−S(y₀) under the WritingBench evaluator. Scaling the critic improves standalone critique quality but does not reliably lead the actor to follow the feedback or yield larger gains. In contrast, scaling the actor while keeping the same critique fixed produces substantially larger improvements in both feedback uptake and downstream quality.
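
The quantities in this analysis can be illustrated with a minimal sketch. The function names and the idea of pairing a baseline condition with a scaled condition item by item are our assumptions for illustration, not the paper's evaluation code; the scores passed in would come from the WritingBench evaluator.

```python
from statistics import mean


def gain(score_initial: float, score_revised: float) -> float:
    """Downstream gain S(y1) - S(y0) for a single critique-guided refinement."""
    return score_revised - score_initial


def mean_delta(baseline: list[float], scaled: list[float]) -> float:
    """Mean paired difference between a scaled condition and its baseline.

    Both lists hold the same metric (e.g. gain or critique adherence) for the
    same controlled comparisons, in the same order (an assumption of this sketch).
    """
    if len(baseline) != len(scaled):
        raise ValueError("baseline and scaled must cover the same comparisons")
    return mean(s - b for b, s in zip(baseline, scaled))
```

A positive mean delta for "scale the actor, keep the critique fixed" alongside a near-zero mean delta for "scale the critic, keep the actor fixed" would correspond to the asymmetry described above.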

Method

Our approach has three components that together form a co-evolving loop:

TAIScore. Given a full rollout τ = (x, y₀, c, y₁), a judge first produces four diagnostic scores (critique validity, critique adherence, targeted improvement, and faithfulness), then produces a final scalar reward T(τ) ∈ [1, 10]. This diagnostic scaffold ensures the reward reflects actor-conditioned usefulness rather than standalone critique quality or final revision quality alone.
Critic update (GRPO). For each initial response y₀, the critic samples N critiques. Each critique triggers an actor revision, and all rollouts in the group are scored by TAIScore. Group-relative advantages reward critiques that provide more useful revision guidance than the other critiques for the same prompt.
Actor update (DPO). The adapted critic generates critique-guided revisions. Each revision y₁ is paired with the initial response y₀ as a preference pair (y₁ ≻ y₀), and the actor is updated with DPO.
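
The group-relative advantage in the critic update can be sketched as follows. We assume the commonly used GRPO normalization (subtract the group mean and divide by the group standard deviation); the exact normalization used in the paper is not stated here, and `eps` is a hypothetical stabilizer for groups whose rewards are all equal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize TAIScore rewards within one group of N critiques.

    All rewards in `rewards` belong to critiques sampled for the same
    initial response y0, so each advantage is relative to its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this normalization, a critique is rewarded only to the extent that its rollout scores higher than the other critiques for the same prompt; a group in which every critique receives the same TAIScore contributes zero advantage.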

These two updates alternate across rounds, forming a co-evolving loop: the critic continually adapts to the actor's current weaknesses, while the actor internalizes progressively stronger critique-guided revisions.
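
One round of the loop can be sketched as two data-collection phases. This is a minimal illustration under stated assumptions: `generate`, `sample_critiques`, `critique`, `revise`, and `taiscore` are hypothetical callables standing in for the actor, critic, and judge, the `prompt`/`chosen`/`rejected` keys follow a common preference-dataset layout rather than anything specified in the paper, and the GRPO and DPO parameter updates themselves are left to a training library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    x: str   # instruction
    y0: str  # initial response
    c: str   # critique
    y1: str  # revision


def critic_phase(
    prompts: list[str],
    generate: Callable[[str], str],
    sample_critiques: Callable[[str, str, int], list[str]],
    revise: Callable[[str, str, str], str],
    taiscore: Callable[[Rollout], float],
    n: int,
) -> list[list[tuple[Rollout, float]]]:
    """Collect one scored group of n rollouts per prompt for the critic update."""
    groups = []
    for x in prompts:
        y0 = generate(x)
        group = []
        for c in sample_critiques(x, y0, n):
            tau = Rollout(x, y0, c, revise(x, y0, c))
            group.append((tau, taiscore(tau)))
        groups.append(group)
    return groups


def actor_phase(
    prompts: list[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str, str], str],
) -> list[dict[str, str]]:
    """Build preference pairs (y1 preferred over y0) using the adapted critic."""
    pairs = []
    for x in prompts:
        y0 = generate(x)
        y1 = revise(x, y0, critique(x, y0))
        pairs.append({"prompt": x, "chosen": y1, "rejected": y0})
    return pairs
```

In this sketch, `critic_phase` would run with the current critic, its scored groups would feed the GRPO update, and `actor_phase` would then run with the updated critic before the DPO update, after which the next round starts from the new actor.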


Overview of our co-evolving critic–actor training framework. For the current actor π_t, the critic κ_t samples multiple critiques for the same initial response y₀. Each critique c_i produces a rollout τ_i = (x, y₀, c_i, y_{1,i}). TAIScore evaluates each rollout and produces a GRPO reward for critic adaptation. The adapted critic κ_{t+1} then generates critique-guided revisions, which are converted into DPO preference pairs y₁ ≻ y₀ to update the actor. Alternating these two updates yields co-evolving critics and actors.

Main Results

We evaluate on two non-verifiable domains: creative writing (WritingBench, HelloBench) and deep research (DeepResearch-Gym), using Qwen3-8B as the actor. All trained-critic conditions use the same DPO actor-training pipeline and differ only in how the critiques for DPO pairs are produced.

Method	WritingBench Overall ↑	HelloBench OEQA ↑	HelloBench HTG ↑	DeepResearch-Gym KPR ↑	DeepResearch-Gym KPC ↓	DeepResearch-Gym Quality ↑
`Qwen3-8B` (base)	72.33	34.86	38.89	71.93	1.18	81.85
DPO pairs from off-the-shelf critics
`gpt-oss-120B` critic	75.04	35.63	50.47	73.12	1.16	82.35
DPO pairs from trained critics (8B)
Outcome-gain reward	75.38	34.97	44.58	74.19	1.12	82.37
Critique-quality reward	74.90	35.06	49.28	74.00	1.18	82.25
TAIScore (ours)	75.72	35.99	53.35	74.96	1.09	82.42
Co-evolving critic–actor training
TAIScore + co-evolution (ours)	76.08	38.96	53.54	75.55	1.06	82.80

Results averaged over multiple runs; standard deviations are reported in the paper. Best results in bold; second-best underlined.

An 8B TAIScore critic outperforms the frozen 120B off-the-shelf critic on all six metrics, showing that critic scale alone does not determine downstream usefulness.
TAIScore outperforms both reward ablations (outcome-gain and critique-quality), confirming that evaluating the full critique-guided revision process provides a better training signal.
Co-evolving the critic with the actor further improves performance across all benchmarks, e.g., WritingBench 75.72→76.08 and HelloBench OEQA 35.99→38.96.
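
The per-metric comparisons above can be checked directly against the table. The sketch below copies the reported numbers and treats KPC as lower-is-better, matching the ↓ marker in the header; the helper name `wins` is ours.

```python
# Rows copied from the results table above.
METRICS = ["WritingBench", "OEQA", "HTG", "KPR", "KPC", "Quality"]
LOWER_IS_BETTER = {"KPC"}

ROWS = {
    "gpt-oss-120B critic": [75.04, 35.63, 50.47, 73.12, 1.16, 82.35],
    "Outcome-gain reward": [75.38, 34.97, 44.58, 74.19, 1.12, 82.37],
    "Critique-quality reward": [74.90, 35.06, 49.28, 74.00, 1.18, 82.25],
    "TAIScore": [75.72, 35.99, 53.35, 74.96, 1.09, 82.42],
    "TAIScore + co-evolution": [76.08, 38.96, 53.54, 75.55, 1.06, 82.80],
}


def wins(a: str, b: str) -> list[str]:
    """Return the metrics on which row `a` is strictly better than row `b`."""
    better = []
    for name, va, vb in zip(METRICS, ROWS[a], ROWS[b]):
        if (va < vb) if name in LOWER_IS_BETTER else (va > vb):
            better.append(name)
    return better


for baseline in ["gpt-oss-120B critic", "Outcome-gain reward", "Critique-quality reward"]:
    print(baseline, len(wins("TAIScore", baseline)))
print("co-evolution vs TAIScore", len(wins("TAIScore + co-evolution", "TAIScore")))
```

Each comparison covers all six metrics, which is what the three statements above assert.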

Cross-Policy Transfer

We examine whether actor-tailored critics transfer across actors by applying critics trained for Qwen3-4B and Qwen3-8B to both actors. Critics provide positive gains even when transferred, suggesting that TAIScore learns broadly useful revision signals. At the same time, the matched critic consistently obtains the highest score for both actors, consistent with our actor-conditioned view that aligning the critic with the target actor provides additional benefit.

Cross-policy transfer on WritingBench. Bars show the gain over the base actor (Δ) after DPO training; base scores are 72.19 (4B) and 72.33 (8B). Matched is the critic trained for that same actor; Transferred is the critic trained for the other actor.

BibTeX

@article{kim2026coevolving,
  title     = {Co-Evolving Actor-Conditioned Critics for Non-Verifiable Generation},
  author    = {Kim, Jinyoung and Khalifa, Muhammad and Logeswaran, Lajanugen and
               Kim, Jaekyeom and Lee, Moontae and Lee, Honglak and Wang, Lu},
  journal   = {arXiv preprint},
  year      = {2026}
}
