The Low-Rank Alignment Control Surface

What alignment post-training writes into the weights, what that surface can gate, and where binding fails. Every claim below is labeled proven (machine-checked in Lean), empirical (measured, with scope), or conjectured (untested).

A language model reads all prompt text with the same authority. In deployed systems that is wrong: a user instruction, a retrieved document, and a quoted email should not all be allowed to direct the model's actions. This track asks two questions about fixing that with fine-tuning. First, what is an alignment update, geometrically? Second, can that update act as a learned switch, one that opens or closes whole categories of action depending on which role is speaking? The project's name for that switch is a role-conditioned capability gate. A follow-up to the second question is whether the gate composes: does it carry over to role/permission combinations it was never trained on?

A few project terms recur in the table below; each is unpacked properly in its own section, but here is the shape of them. Stable rank is a soft count of how many independent directions a weight update really uses. A capability gate is the learned switch just described; binding is the level above it — assembling correct behavior for an unseen role out of familiar permission parts. The experiments are rungs of a pre-registered ladder called RPCG (role-provenance capability gate), each rung varying one lever under a decision rule frozen before training. A VOID verdict means a pre-declared kill condition fired: the rung reports that its main question could not be scored, rather than a pass or a fail.

The short answers, calibrated:

Claim	Status	Evidence
Alignment updates concentrate on a low, width-flat stable-rank surface	Empirical	Pythia 70M–1B (lazy-rudder) + Qwen2.5 0.5B–3B (LRS1); one recipe per family
Contrastive post-training installs a role-conditioned capability gate	Empirical	RPCG3 (Pythia-70m), RPCG6 (Qwen2.5-0.5B)
The gate suppresses forbidden actions in free generation	Empirical, trained specs only	RPCG8; held-out spec still leaks
The gate’s ablation structure monotonically reaches the un-gated base	Proven (5 Lean theorems, 0 sorries)	`RoleGateReachability.lean`
Text-side training induces compositional role→permission binding	Negative across six input-side levers	RPCG7a–RPCG11d, all NO_GENERALIZATION or VOID
Out-of-band policy inputs or architectural binding modules would break the boundary	Conjectured	Untested in this program

The thesis-level statement the evidence currently supports:

Contrastive post-training can install behavioral primitive role gates, but does not automatically induce a compositional binding layer from custom roles to canonical permission vectors.

The object of study: provenance permissions

Each prompt substring gets a small permission vector supplied by trusted software around the model. The basis has three primitives:

Primitive	Question
OBEY	May this text give the model instructions?
USE	May this text be used as evidence?
QUOTE	May this text be repeated directly?

Text source	OBEY	USE	QUOTE
User instruction	yes	yes	no
Retrieved article	no	yes	yes
Private note	no	yes	no
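
As a data structure, the permission vector is small. The sketch below is a minimal illustration, not the project's code: the class names `Permission` and `LabeledSpan`, the helper `label_span`, and the dictionary keys are all hypothetical, and only the three example rows above come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    """One permission vector over the OBEY / USE / QUOTE basis."""
    obey: bool   # may this text give the model instructions?
    use: bool    # may this text be used as evidence?
    quote: bool  # may this text be repeated directly?

    def as_bits(self):
        """Return the vector as (OBEY, USE, QUOTE) bits."""
        return (int(self.obey), int(self.use), int(self.quote))


# The three example rows above; the dictionary keys are hypothetical names.
EXAMPLE_ROLES = {
    "user_instruction": Permission(obey=True, use=True, quote=False),
    "retrieved_article": Permission(obey=False, use=True, quote=True),
    "private_note": Permission(obey=False, use=True, quote=False),
}


@dataclass(frozen=True)
class LabeledSpan:
    """A prompt substring plus the vector attached by trusted software."""
    text: str
    permission: Permission


def label_span(text, role):
    """Attach the permission vector for a known role to a substring."""
    return LabeledSpan(text=text, permission=EXAMPLE_ROLES[role])
```

The labels live outside the model: trusted software attaches them, and the research question is whether training makes the model's behavior follow them.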

The research question is not whether such labels are a good interface (they are assumed) but whether fine-tuning can make the model’s internal behavior respect them. Two levels of success are distinguished. The first is a switch: for roles it was trained on, the model opens permitted primitives and declines forbidden ones. This is the role-conditioned capability gate. The second is reuse: shown a new role described by familiar permissions, the model should assemble the right behavior from parts it already knows, rather than needing every role drilled in separately. This is the binding layer, which maps an unseen role to its permission vector instead of memorizing each role. Early rungs used a synthetic action basis (exec / net / sys); later rungs use the grounded provenance basis above. Every experiment’s decision rule was pre-registered and frozen before the model was loaded.

Geometry: the update is a low-rank surface

Status: empirical. A weight update is a large matrix, and the natural question about it is how many genuinely independent directions it uses — a handful, or thousands? Stable rank (‖ΔW‖_F² / ‖ΔW‖₂²) is a soft, continuous version of that count. It is measured here on DPO LoRA adapters over attention QKV projections, under one fixed recipe per model family.
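
The stable-rank formula above can be computed directly from singular values. This is a minimal NumPy sketch under stated assumptions: the function names are hypothetical, and `lora_delta` assumes the usual LoRA parameterization in which the dense update is a scaled product of two thin matrices. It is not the project's measurement pipeline.

```python
import numpy as np


def stable_rank(delta_w):
    """Squared Frobenius norm divided by squared spectral norm."""
    singular_values = np.linalg.svd(
        np.asarray(delta_w, dtype=float), compute_uv=False
    )
    top = singular_values[0]  # singular values are returned in descending order
    if top == 0.0:
        raise ValueError("stable rank is undefined for the zero matrix")
    # The sum of squared singular values equals the squared Frobenius norm.
    return float(np.sum(singular_values ** 2) / top ** 2)


def lora_delta(b, a, scale=1.0):
    """Dense update implied by a LoRA pair (assumed form: scale * B @ A)."""
    return scale * (np.asarray(b, dtype=float) @ np.asarray(a, dtype=float))


# Sanity cases: a rank-one update uses one direction, an identity uses all.
rank_one = lora_delta(np.array([[1.0], [2.0], [3.0]]), np.array([[4.0, 5.0]]))
print(stable_rank(rank_one))
print(stable_rank(np.eye(4)))
```

Unlike the integer rank, this quantity moves continuously: a matrix with one dominant direction and several weak ones scores only slightly above 1.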

The answer is a handful, and roughly the same number at every model size measured. The project attributes the count to the preference-learning task rather than to parameter count, which is why it calls it the stable-rank floor. The lazy-rudder study measures the floor at ≈ 3.65 across the Pythia 70M–1B width sweep. LRS1 re-measures it on Qwen2.5 (split q/k/v projections, grouped-query attention): Qwen2.5-0.5B at 3.4351, Qwen2.5-1.5B at 3.9625, Qwen2.5-3B at 3.9387, a spread of 0.5274 across the sweep, verdict REPLICATES_FLAT_FLOOR. This thin, task-sized slice of weight space, the few directions alignment training actually writes to, is the low-rank alignment control surface of the site’s title.
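
The reported spread is the gap between the largest and smallest Qwen2.5 measurements. The check below uses only the three values quoted above; the variable names are illustrative.

```python
# Stable-rank floor per Qwen2.5 scale point, as reported for LRS1.
qwen_floor = {
    "Qwen2.5-0.5B": 3.4351,
    "Qwen2.5-1.5B": 3.9625,
    "Qwen2.5-3B": 3.9387,
}

# Spread across the sweep: max minus min.
spread = max(qwen_floor.values()) - min(qwen_floor.values())
print(round(spread, 4))  # 0.5274
```

The maximum sits at the middle scale point, not the largest, so these three values are not monotone in model size.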

Scope. Three Qwen scale points and four Pythia scale points cannot distinguish a true constant from a mild monotone trend; the supported claim is a bounded spread under these recipes, not a scaling law. Dataset and LoRA configuration were not varied to isolate causes. Details and the data table are on the Geometry page.

Function: the surface implements a capability gate

Status: empirical. “Gate” is meant literally: the trained model should wave a primitive through when the active role permits it and refuse it otherwise. Each rung is scored by a pre-registered gate-margin decision rule with a role-swap trap — a built-in check that swaps the role and verifies the behavior actually changes, ruling out a model that ignores the role entirely. Verdicts were frozen before training.
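
To see why the role-swap trap matters, consider a toy version. Everything here is an assumption made for illustration: the source does not define the gate margin, so the sketch takes it to be mean probability on permitted primitives minus mean probability on forbidden ones, and the role names, probabilities, and function names are hypothetical.

```python
from statistics import mean


def gate_margin(probs, permitted, role):
    """Assumed margin: mean permitted probability minus mean forbidden."""
    allowed = [p for prim, p in probs[role].items() if prim in permitted[role]]
    blocked = [p for prim, p in probs[role].items() if prim not in permitted[role]]
    return mean(allowed) - mean(blocked)


def role_swap_gap(probs, permitted, role_a, role_b):
    """How far behavior moves, in the right direction, when the role is swapped."""
    flipped = permitted[role_a] ^ permitted[role_b]  # primitives whose status differs
    gaps = []
    for prim in flipped:
        sign = 1 if prim in permitted[role_a] else -1
        gaps.append(sign * (probs[role_a][prim] - probs[role_b][prim]))
    return mean(gaps)


# Hypothetical roles over the synthetic exec / net / sys basis.
permitted = {"role_a": {"exec"}, "role_b": {"net"}}

# A gated model changes its behavior with the role.
gated = {
    "role_a": {"exec": 0.8, "net": 0.1, "sys": 0.1},
    "role_b": {"exec": 0.1, "net": 0.8, "sys": 0.1},
}
# A role-blind model emits the same distribution whatever the role.
role_blind = {
    "role_a": {"exec": 0.8, "net": 0.1, "sys": 0.1},
    "role_b": {"exec": 0.8, "net": 0.1, "sys": 0.1},
}
```

Under these assumptions the role-blind model still shows a healthy margin on `role_a`, because its fixed preference happens to match that role. Only the swap gap, which is zero for it, exposes that the role is being ignored.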

Positive-only supervised fine-tuning fails — the model learns a role-independent marginal. A contrastive objective (DPO plus a chosen-token likelihood anchor) succeeds: RPCG3 (rung 3 of the ladder) installs a role-conditioned gate on Pythia-70m with margin 0.2229 against threshold 0.0059 — GATE_INSTALLED. The installed adapter has stable rank 1.3028, on the geometry floor. The gate is objective-gated, not capacity-gated: the same adapter surface fails under positive-only SFT and succeeds under contrast.
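
The contrastive objective can be sketched per preference pair. This is one plausible form, assuming the standard DPO loss with an added negative log-likelihood term on the chosen response; the values of `beta` and `anchor_weight`, the function names, and the use of sequence-level log-probabilities are all assumptions, since the project's exact recipe is not given here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_with_anchor_loss(logp_chosen, logp_rejected,
                         ref_logp_chosen, ref_logp_rejected,
                         beta=0.1, anchor_weight=1.0):
    """DPO term plus a likelihood anchor on the chosen response.

    All inputs are log-probabilities of whole responses under the policy
    (logp_*) or the frozen reference model (ref_logp_*).
    """
    # Contrast: how much more the policy prefers chosen over rejected
    # than the reference model does.
    preference_logit = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    dpo_term = softplus(-preference_logit)  # equals -log(sigmoid(logit))
    # Anchor: keeps the absolute likelihood of the chosen response from
    # drifting down while the contrast is being widened.
    anchor_term = -logp_chosen
    return dpo_term + anchor_weight * anchor_term
```

The contrast term is what positive-only SFT lacks: it is the only part of the loss that sees the rejected response at all.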

Three follow-ups bound the result:

RPCG5 (localization). The gate is low-rank within layers but depth-diffuse: 5 of 6 layers must be ablated to remove it — DIFFUSE with respect to single-layer localization.
RPCG6 (replication). The identical recipe on Qwen2.5-0.5B (~7× scale, different architecture) gives margin 0.2945, stable rank 1.2868 — GATE_INSTALLED.
RPCG8 (generation path). In free, sampled generation the gated model executes a forbidden primitive at rate 0.0 against an un-gated base rate of 0.4583 — GENERATION_GATE_CONFIRMED. The suppression reaches behavior, not just logits.

Scope on RPCG8. The headline suppression holds on trained specs only: the held-out singleton spec {sys} still emits forbidden actions at rate 0.4375. And allowed_exec_rate = 1.000 means some permitted primitive executed, not that the full permitted set was exercised — a winner-take-all collapse onto one permitted primitive reads as a clean score. The supported reading: the gate reliably blocks forbidden actions on trained specs; it does not implement arbitrary unseen permission vectors.
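
The blind spot in `allowed_exec_rate` is easy to reproduce. In the sketch below, each sample is the set of primitives one generation executed. The "some permitted primitive executed" reading follows the text above, while `permitted_coverage` and the sample data are hypothetical additions for contrast.

```python
def allowed_exec_rate(samples, permitted):
    """Fraction of samples that executed at least one permitted primitive."""
    hits = [any(action in permitted for action in sample) for sample in samples]
    return sum(hits) / len(samples)


def forbidden_exec_rate(samples, permitted):
    """Fraction of samples that executed at least one forbidden primitive."""
    hits = [any(action not in permitted for action in sample) for sample in samples]
    return sum(hits) / len(samples)


def permitted_coverage(samples, permitted):
    """Fraction of the permitted set executed at least once across all samples."""
    seen = set().union(*samples) & permitted
    return len(seen) / len(permitted)


# Hypothetical winner-take-all collapse: two primitives are permitted,
# but every generation executes only one of them.
permitted = {"exec", "net"}
collapsed = [{"exec"} for _ in range(8)]

print(allowed_exec_rate(collapsed, permitted))    # 1.0
print(forbidden_exec_rate(collapsed, permitted))  # 0.0
print(permitted_coverage(collapsed, permitted))   # 0.5
```

Both headline rates look perfect for the collapsed model; only a coverage-style measure shows that half the permitted set was never exercised.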

The binding boundary: composition does not emerge

Status: empirical negative, replicated across six input-side levers.

The hoped-for outcome was reuse: teach the model each permission primitive in several roles, then hand it a new combination of those familiar primitives, and have it behave correctly for free. That reassembly of known parts is what “binding” means here, and the line where it stops working is the binding boundary. Every attempt to make the gate compose — to handle a held-out combination of familiar permissions — failed under the pre-registered rule:

Lever varied	Rungs	Outcome
Objective (plain DPO, structured pairs, balancing, mass-sharing)	RPCG7a–c2	NO_GENERALIZATION: per-spec memorization; frequency balancing repairs the held-out singleton but never the combination
Supervision format (explicit bit-vectors, factorized two-class CE)	RPCG9	NO_GENERALIZATION: in-distribution installs cleanly, held-out combination does not compose
Primitive basis (grounded OBEY/USE/QUOTE)	RPCG10a/b	Basis is measurably more factored in the frozen base model (bootstrap CI excludes 0), fits sharper in-distribution — and still NO_GENERALIZATION
Training distribution (nested context, latent candidate)	RPCG11	VOID: rare-primitive coverage collapse (`QUOTE` at 0.5 trained coverage, 0.0 held-out)
Primitive frequency (balanced role lattice, 3-of-5 specs per primitive)	RPCG11c	VOID/c1_failed: coverage fixed, but forbidden-decline 0.000–0.167 against the 0.667 threshold — an always-OPEN class-marginal collapse
Class gradient (inverse-frequency CE weights, w_open=0.83 / w_decline=1.25)	RPCG11d	VOID/c1_failed: same failure pattern; the 1.5× gradient-pressure ratio does not break the always-OPEN basin
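
The RPCG11d weights are consistent with the common "balanced" inverse-frequency formula, weight = total / (classes × class count). The sketch assumes that formula and a 3:2 OPEN-to-DECLINE ratio; neither is stated in the source, and both are chosen here because they reproduce the reported values.

```python
def inverse_frequency_weights(counts):
    """Balanced class weights: total / (number of classes * class count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Assumed 3:2 class ratio (an illustration, not a reported count).
weights = inverse_frequency_weights({"OPEN": 3, "DECLINE": 2})
pressure_ratio = weights["DECLINE"] / weights["OPEN"]

print(round(weights["OPEN"], 2))  # 0.83
print(weights["DECLINE"])         # 1.25
```

Under this formula the gradient-pressure ratio equals the class-frequency ratio, so a 3:2 imbalance yields the 1.5× figure in the table.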

RPCG11d is the terminal rung. Its sanity gates were green (converged true, trap collapsed true, low rank true at stable rank 1.5576, quiet baseline true), so the VOID, the pre-declared kill verdict, reflects the learned solution rather than a measurement artifact. Forbidden-decline was evaluated on eight cells (n = 9–126 forbidden probes per cell) with OPEN-coverage uniformly ≈ 1.000. With objective, format, basis, distribution, frequency, and class gradient all tried, the input-side ladder is exhausted.
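
The order of checks behind a VOID verdict can be sketched as a small decision function. This is a guess at the shape, not the pre-registered rule: the field names mirror the sanity gates listed above, `c1` is assumed to be the forbidden-decline condition with the 0.667 threshold, and the `VOID/sanity_failed` and `GENERALIZES` labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SanityGates:
    converged: bool
    trap_collapsed: bool
    low_rank: bool
    quiet_baseline: bool

    def all_green(self):
        return all((self.converged, self.trap_collapsed,
                    self.low_rank, self.quiet_baseline))


def c1_holds(forbidden_decline_rates, threshold=0.667):
    """Assumed kill condition: every cell must decline forbidden probes often enough."""
    return all(rate >= threshold for rate in forbidden_decline_rates)


def verdict(gates, forbidden_decline_rates, generalizes):
    """Return a verdict label; the main question is scored only if nothing voids it."""
    if not gates.all_green():
        return "VOID/sanity_failed"   # hypothetical label: the run itself is suspect
    if not c1_holds(forbidden_decline_rates):
        return "VOID/c1_failed"       # kill condition fired: question cannot be scored
    return "GENERALIZES" if generalizes else "NO_GENERALIZATION"


green = SanityGates(converged=True, trap_collapsed=True,
                    low_rank=True, quiet_baseline=True)
# Only the endpoints of the reported 0.000-0.167 range are used here.
print(verdict(green, [0.0, 0.167], generalizes=False))  # VOID/c1_failed
```

With all sanity gates green, the void is attributed to the kill condition rather than to a broken run, which is the distinction the paragraph above draws.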

What the model learns instead of binding is consistent across rungs: fitting a local marginal (per-spec lookup, or a global OPEN prior) is easier than composing reusable permission parts. The full ladder, cell grids, and related work are on the Function page.

Proven: the gate’s ablation structure, in Lean

Status: proven (machine-checked), descriptive scope only.

`RoleGateReachability.lean` formalizes the gate as a base logit plus a finite family of rank-one suppression components, each a single direction that pushes one out-of-role action down. The development contains 5 theorems and 0 sorries, and depends on mathlib base axioms only. The capstone, `ablation_monotonically_reaches_base`, concerns ablation: deleting the gate’s components one at a time. It proves that each deletion can only raise every out-of-role logit (the gate loosens monotonically and never tightens) and that deleting everything lands exactly on the un-gated base. The empirical RPCG5 greedy-ablation chain instantiates it.
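
The shape of the theorem can be checked numerically on a toy instance. This is an illustration in Python, not the Lean development: each suppression component is reduced to a non-negative amount subtracted from one action's logit, and the base logits, components, and function names are hypothetical.

```python
def gated_logits(base, components, active):
    """Base logits minus every active suppression component."""
    logits = dict(base)
    for index in active:
        action, strength = components[index]
        assert strength >= 0, "suppression components only push logits down"
        logits[action] -= strength
    return logits


def ablation_chain(base, components, order):
    """Logits after deleting components one at a time, in the given order."""
    active = set(range(len(components)))
    chain = [gated_logits(base, components, active)]
    for index in order:
        active.discard(index)
        chain.append(gated_logits(base, components, active))
    return chain


def is_monotone(chain):
    """True if no logit ever decreases from one ablation step to the next."""
    return all(
        later[action] >= earlier[action]
        for earlier, later in zip(chain, chain[1:])
        for action in earlier
    )


# Hypothetical role that permits only "exec"; "net" and "sys" are suppressed.
BASE = {"exec": 1.0, "net": 0.5, "sys": 0.0}
COMPONENTS = [("net", 2.0), ("sys", 1.5), ("sys", 0.5)]

chain = ablation_chain(BASE, COMPONENTS, order=[0, 1, 2])
print(is_monotone(chain))  # True
print(chain[-1] == BASE)   # True
```

The toy makes the descriptive scope visible: monotonicity follows from the components being non-negative, whatever training produced them.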

What this does and does not establish. The development is descriptive: it proves the monotone-reachability shape of the ablation model. It makes no minimality or optimality claim, and it says nothing about training dynamics or generalization — matching the experiments’ v1 scope. See the Lean page.

Conjectured: what might break the boundary

Status: conjectured — none of these have been run.

The remaining levers are no longer prompt-side:

Out-of-band tensor policy inputs — the permission vector enters as a structured side input rather than as text.
Architectural binding modules — structured attention or gated routing that must reuse permission parts by construction.
Explicit per-primitive weight sharing — parameters deciding a primitive tied across the specs that contain it.
Objectives that force recombination and remove the OPEN/DECLINE class marginal as a shortcut.

The companion policy-rails site develops the engineering counterpart: instead of asking the model to bind roles internally, it compiles policy state into a typed side channel. Its headline result (measured there, not here): a 2,688-parameter permission-only rail reaches 1.000 on seen and held-out masks. The two tracks share the question; this site maps the internal boundary, that one routes around it in trusted software.

How to read this site

Overview — the one-screen evidence chain and the latest rung in detail.
Geometry — the task-intrinsic stable-rank floor (lazy-rudder) and its cross-architecture replication (LRS1).
Function — the RPCG capability-gate ladder, install through compositional boundary, with related work.
Lean — the machine-checked descriptive formalization of the gate’s ablation structure.

Every number on this page flows from committed pre-registration artifacts via the site’s data pipeline; the prose states relationships only.