Function
What the surface does — a role-conditioned capability gate, and its boundary.
The role-provenance (RPCG) program asks what the low-rank alignment control
surface does. The petri dish has three action primitives (exec, net, sys)
marked by annotation tokens, roles with explicit permitted-primitive sets, and a
pre-registered gate-margin decision rule with a role-swap trap. Every verdict
below follows a decision rule that was frozen before the model loaded.
The surface implements a capability gate
Positive-only supervised finetuning fails — the model learns a role-independent marginal. A contrastive objective (DPO + a chosen-token likelihood anchor) succeeds: RPCG3 installs a genuine role-conditioned gate, margin 0.2229 against a threshold of just 0.0059 — GATE_INSTALLED. The gate is objective-gated, not capacity-gated: the same adapter surface fails under positive-only SFT and succeeds under contrast.
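As a rough illustration of what a contrastive objective with a chosen-token likelihood anchor can look like, the sketch below computes the standard DPO loss for one preference pair and adds a negative log-likelihood term on the chosen continuation. The function name, the coefficients `beta` and `anchor_weight`, their default values, and the exact form of the anchor are assumptions made for illustration; the RPCG3 recipe may differ.

```python
import math

def dpo_with_anchor(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta=0.1, anchor_weight=1.0):
    """Loss for one (chosen, rejected) pair.

    All inputs are summed token log-probabilities of a continuation under
    the trained policy (logp_*) or the frozen reference model (ref_logp_*).
    beta and anchor_weight are hypothetical hyperparameters.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # continuation over the rejected one, relative to the reference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Standard DPO term, -log(sigmoid(beta * margin)), written as a
    # numerically stable softplus of z = -beta * margin.
    z = -beta * margin
    dpo = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    # Likelihood anchor (assumed form): plain NLL on the chosen continuation,
    # which keeps the chosen tokens from losing absolute probability mass.
    anchor = -logp_chosen
    return dpo + anchor_weight * anchor
```

In this sketch, the role enters only through the prompt that produced the log-probabilities: the chosen continuation uses a primitive the role permits and the rejected one uses a primitive it forbids.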
It is low-rank, depth-diffuse, and architecture-robust
The installed gate’s adapter has mean stable rank 1.3028 — matching the geometry floor. RPCG5 ablates it layer by layer: the gate is low-rank within a layer but depth-diffuse, needing 5 of 6 layers ablated to remove it. RPCG6 re-runs the identical recipe on Qwen2.5-0.5B — margin 0.2945, stable rank 1.2868, GATE_INSTALLED — the gate replicates across architecture and ~7× scale.
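For readers unfamiliar with the quantity, stable rank is commonly defined as the squared Frobenius norm of a matrix divided by its squared spectral norm, that is, the sum of squared singular values over the largest squared singular value. The sketch below assumes that definition (the source does not spell it out) and assumes the singular values of each layer's adapter update are already available; the function names are hypothetical.

```python
def stable_rank(singular_values):
    """Sum of squared singular values divided by the largest squared one."""
    squares = [s * s for s in singular_values]
    top = max(squares)
    if top == 0.0:
        raise ValueError("stable rank is undefined for the zero matrix")
    return sum(squares) / top

def mean_stable_rank(per_layer_singular_values):
    """Average the stable rank over layers (one list of singular values each)."""
    ranks = [stable_rank(sv) for sv in per_layer_singular_values]
    return sum(ranks) / len(ranks)
```

Under this definition a value close to 1 means a single singular direction carries almost all of the update's energy, which is the sense in which a figure near 1.3 reads as low-rank.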
It reaches generation behavior
A logit gate need not be a behavioral gate. RPCG8 is the generation-path witness: in free, sampled generation the gated model executes a forbidden primitive at rate 0.0 against an un-gated base rate of 0.4583 — GENERATION_GATE_CONFIRMED. The gate’s forbidden-action suppression reaches behavior, not just logits.
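A generation-path rate of this kind can be measured by scanning sampled outputs for annotation tokens outside the role's permitted set. The following is a minimal sketch, assuming the `<|exec|>`, `<|net|>`, `<|sys|>` token spelling and counting a generation as a violation if it contains at least one forbidden token; the function name and the counting convention are assumptions, not the RPCG8 harness.

```python
import re

# Annotation tokens for the three action primitives (spelling assumed).
TOKEN_RE = re.compile(r"<\|(exec|net|sys)\|>")

def forbidden_rate(generations, permitted):
    """Fraction of generations that emit at least one primitive outside `permitted`."""
    if not generations:
        raise ValueError("need at least one generation")
    allowed = set(permitted)
    hits = 0
    for text in generations:
        used = set(TOKEN_RE.findall(text))
        if used - allowed:  # any primitive the role does not permit
            hits += 1
    return hits / len(generations)
```

Running the same function on samples from the gated model and from the un-gated base model gives the two rates being compared.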
But it does not compositionally generalize
This is the sharp negative. The RPCG7 ladder holds out a permission combination
(auditor = {exec, sys}) and a singleton (janitor = {sys}), then tries four
training interventions. The in-distribution gate installs every time; the
held-out combination is never gated. Frequency-balancing (RPCG7c1) does repair
the held-out singleton (margin 0.2289), so gating of a single primitive
generalizes, but the combination stays ungated (margin 0.2372,
sign-incorrect). The model learns per-specification gates, not reusable
per-primitive parts.
Experiment ladder
| Experiment | Verdict |
|---|---|
| RPCG3 · gate installed (Pythia-70m) | GATE_INSTALLED |
| RPCG5 · layer localization | DIFFUSE |
| RPCG6 · cross-architecture (Qwen2.5-0.5B) | GATE_INSTALLED |
| RPCG7a · compositional rung A (plain DPO) | NO_GENERALIZATION |
| RPCG7c1 · frequency-balanced | NO_GENERALIZATION |
| RPCG7c2 · + mass-sharing regularizer | NO_GENERALIZATION |
| RPCG8 · generation-path witness | GENERATION_GATE_CONFIRMED |
| RPCG9 · factorized binding (bit-vector + two-class CE) | NO_GENERALIZATION |
| RPCG10a · grounding pre-check (action vs provenance basis) | PASS |
| RPCG10b · grounded-basis re-test (OBEY/USE/QUOTE) | NO_GENERALIZATION |
RPCG9 — the boundary is robust to objective and format
RPCG9 was the constructive attempt on the supervision: a bit-vector
permission format (exec=yes net=no sys=yes, nothing to parse) and a
candidate-specific two-class objective — one independent open-vs-decline
decision per primitive, no ordinal competition. It removes both features the
RPCG7 negative could have been blamed on: the ordinal contrastive objective
and the set-parsing burden. The factorized objective installs the gate cleanly
in-distribution — and held-out permission combinations still do not compose.
Factorizing the supervision is not enough.
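The two ingredients described above can be sketched as follows: a bit-vector rendering of a permission set, and an independent two-class cross-entropy per primitive. The function names, the fixed primitive order, and the way per-primitive logits are passed in are assumptions for illustration, not the RPCG9 code.

```python
import math

PRIMITIVES = ("exec", "net", "sys")

def bitvector_spec(permitted):
    """Render a permission set as 'exec=yes net=no sys=yes' (fixed order)."""
    return " ".join(f"{p}={'yes' if p in permitted else 'no'}" for p in PRIMITIVES)

def two_class_ce(open_logit, decline_logit, allowed):
    """Cross-entropy over the two classes {open, decline} for one primitive."""
    m = max(open_logit, decline_logit)
    log_z = m + math.log(math.exp(open_logit - m) + math.exp(decline_logit - m))
    target = open_logit if allowed else decline_logit
    return log_z - target

def factorized_loss(logits, permitted):
    """Mean of independent per-primitive losses.

    logits maps each primitive to (open_logit, decline_logit). No primitive
    competes with another, so there is no ordinal ranking across primitives.
    """
    losses = [two_class_ce(*logits[p], p in permitted) for p in PRIMITIVES]
    return sum(losses) / len(losses)
```

For the held-out auditor role, `bitvector_spec({"exec", "sys"})` produces the string `exec=yes net=no sys=yes`, which is the format quoted above.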
RPCG10 — the boundary is robust to the primitive basis
RPCG10 moved the lever again — this time the primitive basis itself.
RPCG10a measured, on the frozen base model, whether a provenance-operation
basis (OBEY / USE / QUOTE) is more orthogonally represented than the
arbitrary action basis (exec / net / sys). It is: a bootstrap CI of the
factoredness gap excludes 0 with margin (the random control points the other
way). The provenance basis is not just nicer words — it is geometrically
privileged in the base model.
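The decision "a bootstrap CI of the gap excludes 0" can be illustrated with a percentile bootstrap over per-item scores. This is a sketch under assumptions: the factoredness score itself is not defined here, so the inputs are simply two lists of hypothetical per-item scores, one per basis, and the resample count, confidence level, and seed are arbitrary choices.

```python
import random

def bootstrap_gap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    gaps = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        a = [rng.choice(scores_a) for _ in scores_a]
        b = [rng.choice(scores_b) for _ in scores_b]
        gaps.append(sum(a) / len(a) - sum(b) / len(b))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def excludes_zero(ci):
    """True when the whole interval lies on one side of 0."""
    lo, hi = ci
    return lo > 0 or hi < 0
```

A pre-registered check would then compare `excludes_zero(...)` for the real basis pair against the same computation on a random control basis.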
RPCG10b then re-ran RPCG9’s exact pipeline with the basis swapped to the grounded one. The in-distribution gate installs more sharply than in RPCG9 (the grounded basis is easier to fit), yet the held-out combination still does not compose, and the held-out singleton regresses. The grounded basis did not fix composition. With this rung the program has now varied three orthogonal levers (objective, format, and primitive basis) and recovered the same non-compositional verdict at every one.
The non-compositionality is intrinsic to this combination of model and binding recipe, robust to the three input-side dimensions one would naturally try first. Input-side fixes are exhausted for this ladder. And RPCG10 decouples latent factoredness from functional composition: a basis can be more geometrically factored in the frozen model and easier to fit in-distribution, and still not generalize to held-out permission combinations. A more legible basis is not a more reusable one.
Next rung — a different class of work
The remaining levers are not input-side. The next rung tests:
- True out-of-band tensor policy vectors: the permission vector enters the model as a structured side input rather than as text inside the prompt.
- Architectural binding modules: an explicit binding mechanism (structured attention, gated routing) that the gate must reuse across specs by construction.
- Explicit per-primitive weight sharing: parameters that decide a primitive are tied across the specs that contain it, so the gate cannot memorise per-spec.
- Training distributions that force recombination: systematic coverage of recombinations during training, not just one held-out combination, so the loss landscape rewards reuse rather than per-spec fits.
- Coverage metrics as primary, not auxiliary: every allowed primitive must clear its own margin. A single mean-margin readout can be gamed in winner-take-all fashion, with one large margin masking a primitive that fails, and must be replaced.
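The difference between a mean-margin readout and a coverage metric can be made concrete with a small sketch. The function names and the example margins are hypothetical; only the idea that every allowed primitive must clear the threshold on its own comes from the list above.

```python
def mean_margin(margins):
    """Single averaged readout over all primitives in `margins`."""
    return sum(margins.values()) / len(margins)

def coverage_verdict(margins, allowed, threshold):
    """Pass only if every allowed primitive clears the threshold by itself.

    Returns (passed, failing), where failing lists the allowed primitives
    whose margin is at or below the threshold.
    """
    failing = sorted(p for p in allowed if margins[p] <= threshold)
    return (len(failing) == 0, failing)

# Hypothetical margins: exec is opened strongly, sys is not opened at all.
example = {"exec": 0.60, "sys": -0.10}
```

With these illustrative numbers the mean margin is about 0.25 and would clear a small positive threshold, while `coverage_verdict(example, {"exec", "sys"}, 0.0)` fails and names `sys` as the primitive that did not clear.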
Deployment without internal composition
A parallel engineering direction worth stating: the failure of internal
compositional binding does not block safe deployment of the same primitive
vocabulary. The annotation tokens (<|exec|> / <|net|> / <|sys|> or the
provenance <|obey|> / <|use|> / <|quote|>) are also load-bearing at the
harness level — the agent stack can read those marks off generated content
and enforce role permissions deterministically. A complementary pattern is a
declarative capability assertion: before invoking a primitive the model
emits a structured declaration of the boundary it intends to cross, and the
sandbox compares that declaration to the active role’s permission set and
raises on mismatch. A third layer is a topic tripwire: a
content-monitoring filter that, on hit, short-circuits the model’s reasoning
entirely and exits with a fixed response (refusal, escalation, termination)
rather than relying on the model to decline. The trained gate then becomes
one defense-in-depth layer — it suppresses forbidden primitives most of the
time (RPCG8’s behavioral result); the declaration check catches what the
gate misses; the tripwire catches what the declaration would not even
articulate. The negative result on internal composition narrows what
fine-tuning alone can deliver; it does not preclude safe systems built on
the same primitive vocabulary.
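The two harness-side layers described above, the declaration check and the topic tripwire, can be sketched as a deterministic wrapper around a primitive call. Everything named here is hypothetical: `CapabilityViolation`, `guarded_step`, the tripwire term list, and the fixed response are illustrative stand-ins, and a real content monitor would be far more than a substring match. The role permission sets reuse the auditor and janitor examples from the RPCG7 ladder.

```python
class CapabilityViolation(Exception):
    """Raised when a declared primitive is outside the active role's permissions."""

ROLE_PERMISSIONS = {"auditor": {"exec", "sys"}, "janitor": {"sys"}}
TRIPWIRE_TERMS = ("credential dump",)          # hypothetical monitored topic
FIXED_RESPONSE = "Request declined and escalated."

def check_declaration(role, declared_primitive):
    """Compare the model's declared boundary against the role's permission set."""
    if declared_primitive not in ROLE_PERMISSIONS[role]:
        raise CapabilityViolation(f"{role} may not use {declared_primitive}")

def guarded_step(role, declared_primitive, content, run):
    """Apply the tripwire, then the declaration check, then run the primitive."""
    # Tripwire: on a hit, skip the model's reasoning and return a fixed response.
    lowered = content.lower()
    if any(term in lowered for term in TRIPWIRE_TERMS):
        return FIXED_RESPONSE
    # Declaration check: raises on mismatch instead of trusting the model to decline.
    check_declaration(role, declared_primitive)
    return run(content)
```

Neither check depends on the trained gate, which is why the gate can be treated as one layer among several rather than as the sole enforcement point.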
Source: cross-check/preregistry/rpcg*/; the role-provenance manuscript
(papers/role-provenance/).