The Low-Rank Alignment Control Surface

Function

What the surface does — a role-conditioned capability gate, and its boundary.

The role-provenance (RPCG) program asks what the low-rank alignment control surface does. The petri dish is deliberately small: three action primitives (exec, net, sys) marked by annotation tokens; roles with explicit permitted-primitive sets; and a pre-registered gate-margin decision rule with a role-swap trap. The decision rule behind every verdict below was frozen before the model was loaded.

The surface implements a capability gate

Positive-only supervised finetuning fails — the model learns a role-independent marginal. A contrastive objective (DPO + a chosen-token likelihood anchor) succeeds: RPCG3 installs a genuine role-conditioned gate, margin 0.2229 against a threshold of just 0.0059 — GATE_INSTALLED. The gate is objective-gated, not capacity-gated: the same adapter surface fails under positive-only SFT and succeeds under contrast.
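The contrastive objective can be sketched for a single preference pair. This is a minimal sketch, not the program's training code: it assumes the standard DPO loss on sequence log-probabilities, and it assumes the "chosen-token likelihood anchor" is a weighted negative log-likelihood on the chosen completion. The function names and the `beta` and `anchor_weight` defaults are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_with_anchor_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,           # assumed value, not taken from the source
    anchor_weight: float = 1.0,  # assumed value, not taken from the source
) -> float:
    """Loss for one (chosen, rejected) pair; all inputs are log-probabilities."""
    # Standard DPO term: prefer the chosen completion relative to the reference.
    preference_logit = beta * (
        (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    )
    dpo_term = -log_sigmoid(preference_logit)
    # Assumed anchor: plain negative log-likelihood of the chosen completion.
    anchor_term = -policy_chosen
    return dpo_term + anchor_weight * anchor_term
```

With `anchor_weight=0.0` this reduces to plain DPO. One common motivation for an anchor of this kind is that the DPO term alone can be reduced by lowering the likelihood of both completions, as long as the rejected one falls faster.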

It is low-rank, depth-diffuse, and architecture-robust

The installed gate’s adapter has mean stable rank 1.3028 — matching the geometry floor. RPCG5 ablates it layer by layer: the gate is low-rank within a layer but depth-diffuse, needing 5 of 6 layers ablated to remove it. RPCG6 re-runs the identical recipe on Qwen2.5-0.5B — margin 0.2945, stable rank 1.2868, GATE_INSTALLED — the gate replicates across architecture and ~7× scale.
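For readers unfamiliar with the metric: stable rank is assumed here to be the standard ratio of squared Frobenius norm to squared spectral norm. The source does not spell out its definition or which adapter matrix it is computed on, so treat the following as a dependency-free sketch under that assumption. It estimates the spectral norm by power iteration; with NumPy available, the singular values from `numpy.linalg.svd` would be the more direct route.

```python
import math
import random


def stable_rank(matrix, iterations=500, seed=0):
    """Stable rank ||W||_F^2 / ||W||_2^2 of a dense matrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0.0:
        raise ValueError("stable rank is undefined for the zero matrix")

    # Random unit start vector, so it is almost surely not orthogonal
    # to the top right-singular vector.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    top_sq = 0.0
    for _ in range(iterations):
        wv = [sum(w * x for w, x in zip(row, v)) for row in matrix]  # W v
        top_sq = sum(x * x for x in wv)  # Rayleigh quotient v^T W^T W v
        wtwv = [
            sum(matrix[i][j] * wv[i] for i in range(n_rows))  # W^T (W v)
            for j in range(n_cols)
        ]
        norm = math.sqrt(sum(x * x for x in wtwv))
        if norm == 0.0:
            break
        v = [x / norm for x in wtwv]
    return frob_sq / top_sq
```

A rank-one matrix has stable rank exactly 1 and an n-by-n identity has stable rank n, so values just above 1 indicate that a single direction carries most of the adapter's energy.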

It reaches generation behavior

A logit gate need not be a behavioral gate. RPCG8 is the generation-path witness: in free, sampled generation the gated model executes a forbidden primitive at rate 0.0 against an un-gated base rate of 0.4583 — GENERATION_GATE_CONFIRMED. The gate’s forbidden-action suppression reaches behavior, not just logits.
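A forbidden-primitive rate of this kind could be computed roughly as follows. This is a sketch under an assumption: that "executing" a primitive is detected by the presence of its annotation token in the sampled text. The function name and the example strings are hypothetical, and the source's actual scoring procedure may differ.

```python
import re

PRIMITIVES = ("exec", "net", "sys")
# Annotation tokens of the form <|exec|>, <|net|>, <|sys|>.
TOKEN_RE = re.compile(r"<\|(exec|net|sys)\|>")


def forbidden_rate(generations, permitted):
    """Fraction of sampled generations that contain at least one forbidden primitive."""
    if not generations:
        raise ValueError("need at least one generation")
    forbidden = set(PRIMITIVES) - set(permitted)
    hits = sum(
        1 for text in generations if forbidden & set(TOKEN_RE.findall(text))
    )
    return hits / len(generations)


# Made-up samples for a role permitted only the sys primitive.
samples = [
    "<|sys|> rotate logs",
    "<|net|> fetch remote file",
    "no primitive used here",
    "<|exec|> run job, then <|sys|> clean up",
]
rate = forbidden_rate(samples, permitted={"sys"})
```

Counting per generation rather than per token means that a sample mixing permitted and forbidden primitives still counts as a violation.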

But it does not compositionally generalize

The sharp negative. The RPCG7 ladder holds out a permission combination (auditor = {exec, sys}) and a singleton (janitor = {sys}), then tries four training interventions. The in-distribution gate installs every time; the held-out combination is never gated. Frequency-balancing (RPCG7c1) does repair the held-out singleton (margin 0.2289) — per-primitive gating generalizes — but the combination stays ungated (margin 0.2372, sign-incorrect). The model learns per-specification gates, not reusable per-primitive parts.
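The shape of the hold-out can be made concrete with a small enumeration. This sketch assumes the training roles cover the remaining non-empty permission sets over the three primitives; the source does not state which roles were trained on, so the split and the helper names are illustrative only.

```python
from itertools import combinations

PRIMITIVES = ("exec", "net", "sys")
# Held-out permission sets named in the text: auditor and janitor.
HELD_OUT = {frozenset({"exec", "sys"}), frozenset({"sys"})}


def split_permission_sets():
    """Enumerate non-empty permission sets and split them into train and held-out."""
    all_sets = [
        frozenset(combo)
        for size in range(1, len(PRIMITIVES) + 1)
        for combo in combinations(PRIMITIVES, size)
    ]
    train = [s for s in all_sets if s not in HELD_OUT]
    held_out = [s for s in all_sets if s in HELD_OUT]
    return train, held_out


def primitive_coverage(train):
    """For each primitive, report whether training shows it both permitted and forbidden."""
    return {
        p: (any(p in s for s in train), any(p not in s for s in train))
        for p in PRIMITIVES
    }


train_sets, held_out_sets = split_permission_sets()
coverage = primitive_coverage(train_sets)
```

Under this assumed split every primitive is seen both permitted and forbidden during training, so a held-out set is new only as a combination. That is the sense in which a failure on it points to per-specification gates and not to missing per-primitive evidence.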

Experiment ladder

Experiment | Verdict
RPCG3 · gate installed (Pythia-70m) | GATE_INSTALLED
RPCG5 · layer localization | DIFFUSE
RPCG6 · cross-architecture (Qwen2.5-0.5B) | GATE_INSTALLED
RPCG7a · compositional rung A (plain DPO) | NO_GENERALIZATION
RPCG7c1 · frequency-balanced | NO_GENERALIZATION
RPCG7c2 · + mass-sharing regularizer | NO_GENERALIZATION
RPCG8 · generation-path witness | GENERATION_GATE_CONFIRMED
RPCG9 · factorized binding (bit-vector + two-class CE) | NO_GENERALIZATION
RPCG10a · grounding pre-check (action vs provenance basis) | PASS
RPCG10b · grounded-basis re-test (OBEY/USE/QUOTE) | NO_GENERALIZATION

RPCG9 — the boundary is robust to objective and format

RPCG9 was the constructive attempt on the supervision: a bit-vector permission format (exec=yes net=no sys=yes, nothing to parse) and a candidate-specific two-class objective — one independent open-vs-decline decision per primitive, no ordinal competition. It removes both features the RPCG7 negative could have been blamed on: the ordinal contrastive objective and the set-parsing burden. The factorized objective installs the gate cleanly in-distribution — and held-out permission combinations still do not compose. Factorizing the supervision is not enough.
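The two ingredients of this rung can be sketched as follows: a bit-vector rendering of a permission set, and an independent two-class cross-entropy per primitive. This is a minimal sketch under assumptions. The source gives the format string but not how logits are read out, so the `(open_logit, decline_logit)` pairs and the function names are hypothetical.

```python
import math

PRIMITIVES = ("exec", "net", "sys")


def permission_bits(permitted):
    """Render a permission set in the bit-vector format, e.g. 'exec=yes net=no sys=yes'."""
    return " ".join(
        f"{p}={'yes' if p in permitted else 'no'}" for p in PRIMITIVES
    )


def two_class_ce(open_logit, decline_logit, should_open):
    """Cross-entropy of a two-way softmax over (open, decline) for one primitive."""
    top = max(open_logit, decline_logit)
    log_z = top + math.log(
        math.exp(open_logit - top) + math.exp(decline_logit - top)
    )
    target_logit = open_logit if should_open else decline_logit
    return log_z - target_logit


def factorized_loss(logits_by_primitive, permitted):
    """Mean of independent per-primitive losses; primitives never compete with each other."""
    losses = [
        two_class_ce(open_logit, decline_logit, p in permitted)
        for p, (open_logit, decline_logit) in logits_by_primitive.items()
    ]
    return sum(losses) / len(losses)
```

Because each primitive gets its own open-versus-decline decision, raising the score for one primitive cannot lower the loss for another. That is the sense in which this objective has no ordinal competition.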

RPCG10 — the boundary is robust to the primitive basis

RPCG10 moved the lever again — this time the primitive basis itself. RPCG10a measured, on the frozen base model, whether a provenance-operation basis (OBEY / USE / QUOTE) is more orthogonally represented than the arbitrary action basis (exec / net / sys). It is: a bootstrap CI of the factoredness gap excludes 0 with margin (the random control points the other way). The provenance basis is not just nicer words — it is geometrically privileged in the base model.
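A confidence interval that "excludes 0" can be illustrated with a percentile bootstrap. The source does not define the factoredness score or the resampling scheme, so this sketch assumes one score per item for each basis and independent resampling of the two groups. The function name, defaults, and example scores are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_gap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        gaps.append(mean(resample_a) - mean(resample_b))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up scores, purely to show the call shape.
provenance_scores = [0.80, 0.90, 0.85, 0.95]
action_scores = [0.10, 0.20, 0.15, 0.05]
lower, upper = bootstrap_gap_ci(provenance_scores, action_scores)
gap_excludes_zero = lower > 0 or upper < 0
```

Running the same procedure with a random basis in place of the provenance basis is one way to obtain a control comparison of the kind the text mentions.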

RPCG10b then re-ran RPCG9’s exact pipeline with the basis swapped to the grounded one. The in-distribution gate installs more sharply than in RPCG9 (the grounded basis is easier to fit), yet the held-out combination still does not compose, and the held-out singleton regresses. The grounded basis did not fix composition. With this rung the program has now varied three orthogonal levers (objective, format, and primitive basis) and recovered the same non-compositional verdict at every one.

The non-compositionality is intrinsic to this combination of model and binding recipe, robust to the three input-side dimensions one would naturally try first. Input-side fixes are exhausted for this ladder. And RPCG10 decouples latent factoredness from functional composition: a basis can be more geometrically factored in the frozen model and easier to fit in-distribution, and still not generalize to held-out permission combinations. A more legible basis is not a more reusable one.

Next rung — a different class of work

The remaining levers are not input-side. The next rung tests:

Deployment without internal composition

A parallel engineering direction worth stating: the failure of internal compositional binding does not block safe deployment of the same primitive vocabulary. The annotation tokens (<|exec|> / <|net|> / <|sys|> or the provenance <|obey|> / <|use|> / <|quote|>) are also load-bearing at the harness level — the agent stack can read those marks off generated content and enforce role permissions deterministically. A complementary pattern is a declarative capability assertion: before invoking a primitive the model emits a structured declaration of the boundary it intends to cross, and the sandbox compares that declaration to the active role’s permission set and raises on mismatch. A third layer is a topic tripwire: a content-monitoring filter that, on hit, short-circuits the model’s reasoning entirely and exits with a fixed response (refusal, escalation, termination) rather than relying on the model to decline. The trained gate then becomes one defense-in-depth layer — it suppresses forbidden primitives most of the time (RPCG8’s behavioral result); the declaration check catches what the gate misses; the tripwire catches what the declaration would not even articulate. The negative result on internal composition narrows what fine-tuning alone can deliver; it does not preclude safe systems built on the same primitive vocabulary.
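The three harness-level layers can be sketched in a few lines. This is an illustrative sketch, not a description of any existing harness: the role table reuses the two roles named earlier, while `CapabilityViolation`, `enforce`, the tripwire term, and the fixed response are hypothetical names and values.

```python
import re

ROLE_PERMISSIONS = {
    "auditor": frozenset({"exec", "sys"}),
    "janitor": frozenset({"sys"}),
}
MARK_RE = re.compile(r"<\|(exec|net|sys)\|>")
TRIPWIRE_TERMS = ("blocked-topic-placeholder",)  # hypothetical filter list
FIXED_RESPONSE = "This request cannot be handled."  # hypothetical fixed exit


class CapabilityViolation(Exception):
    """Raised when a declaration or a generated mark exceeds the role's permissions."""


def enforce(role, declared, generated):
    """Apply tripwire, declaration check, and mark check; return the text to release."""
    # Tripwire: on a hit, exit with a fixed response without consulting the model.
    lowered = generated.lower()
    if any(term in lowered for term in TRIPWIRE_TERMS):
        return FIXED_RESPONSE

    # Declaration check: the declared primitives must be within the role's set.
    over_declared = set(declared) - ROLE_PERMISSIONS[role]
    if over_declared:
        raise CapabilityViolation(
            f"{role} declared forbidden primitives: {sorted(over_declared)}"
        )

    # Mark check: every annotation token in the output must have been declared.
    undeclared = set(MARK_RE.findall(generated)) - set(declared)
    if undeclared:
        raise CapabilityViolation(
            f"{role} used undeclared primitives: {sorted(undeclared)}"
        )
    return generated
```

All three checks are deterministic string and set operations, so their behavior on held-out permission combinations does not depend on whether the trained gate composes.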

Source: cross-check/preregistry/rpcg*/; the role-provenance manuscript (papers/role-provenance/).