Explainer

A non-technical map of the approach, using roles, labels, and guardrails.

The short version: we are testing whether a language model can learn to treat different pieces of a prompt differently. Some text is allowed to give orders. Some text is only evidence. Some text may be quoted. Some text should be ignored as an instruction.

The everyday analogy

Imagine an office with incoming mail. A sticky note from the manager can assign work. A printed article can be read as background. A legal document can be quoted but not obeyed as an order. The hard part is making sure the office assistant does not treat every sentence as having the same authority.

Manager note: may give instructions

Reference article: may support an answer

Quoted document: may be repeated, not obeyed

What the software gives the model

The software stack can tag prompt spans with trusted labels. Those labels are like colored folders. The model should learn that the label controls what the text is allowed to do.

Prompt span -> Trusted label -> Allowed operation -> Model behavior

For example:

Span               OBEY  USE  QUOTE
Instruction text   yes   yes  no
Retrieved article  no    yes  yes
Private note       no    yes  no
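As a rough illustration, the example rows can be written as a small lookup structure. This is a minimal sketch, not the project's actual code: the `Permissions` class, the `LABELS` dictionary, the label names, and the `allowed` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permissions:
    obey: bool   # may this text give instructions?
    use: bool    # may this text support an answer?
    quote: bool  # may this text be repeated verbatim?


# Hypothetical label names; the values mirror the three example rows.
LABELS = {
    "instruction": Permissions(obey=True, use=True, quote=False),
    "retrieved_article": Permissions(obey=False, use=True, quote=True),
    "private_note": Permissions(obey=False, use=True, quote=False),
}


def allowed(label: str, operation: str) -> bool:
    """Return True if a span carrying `label` may perform `operation`."""
    return getattr(LABELS[label], operation)
```

For instance, `allowed("retrieved_article", "obey")` is False: the article can inform an answer, but it cannot give orders.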

Why fine-tuning is enough to test this

Fine-tuning is like adjusting a small set of steering cables rather than rebuilding the whole machine. The geometry experiments measure how small that steering surface is. The role-gate experiments test what that surface can do.

Large model (existing language ability) + Small alignment update

The update is small, but it can still change behavior.

Physicist view: the small update

The fine-tune changes the model weights by a matrix ΔW. The lazy-rudder result says this change behaves as if it uses only a few effective directions, even when the model is much wider. The measurement is stable rank:

r_eff(ΔW) = ||ΔW||_F² / ||ΔW||₂²

If r_eff stays near a small constant while width d grows, the update is a thin control surface rather than a full rewrite of the model.
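The stable rank formula can be checked numerically. The sketch below assumes NumPy and uses synthetic matrices, not real fine-tuning updates; the function name `stable_rank` and the example matrices are illustrative only.

```python
import numpy as np


def stable_rank(delta_w: np.ndarray) -> float:
    """r_eff = squared Frobenius norm divided by squared spectral norm."""
    fro_sq = np.linalg.norm(delta_w, "fro") ** 2
    spec_sq = np.linalg.norm(delta_w, 2) ** 2  # largest singular value, squared
    return float(fro_sq / spec_sq)


rng = np.random.default_rng(0)
d = 256

# Synthetic "thin" update: a product of d x 3 and 3 x d factors has rank 3.
low_rank_update = rng.standard_normal((d, 3)) @ rng.standard_normal((3, d))

# Synthetic "full rewrite": a dense Gaussian matrix of the same width.
dense_update = rng.standard_normal((d, d))

print(stable_rank(low_rank_update))  # at most 3, whatever the width d
print(stable_rank(dense_update))     # grows with d
```

Stable rank never exceeds the ordinary rank, so the thin update stays at or below 3 as d grows, while the dense matrix does not.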

What worked

The model can learn simple primitive gates. If a primitive action is forbidden, the trained model can stop producing it in free generation. That matters because it is not just a score on an internal probe; the behavior changes.

Install gate: works

Scale check: replicates

Generation: changes

Physicist view: the gate

A role-conditioned gate changes the logit for a primitive action. For role r and primitive p, think of a score:

m(r,p) = logit(OPEN_p) - logit(DECLINE)

Given a margin threshold τ > 0, a permitted primitive should have m(r,p) > τ and a forbidden primitive should have m(r,p) < -τ. RPCG8 showed that lowering forbidden margins also changed sampled behavior, so the gate was not only a hidden measurement artifact.
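The margin test is simple enough to write out. This is a sketch with made-up logit values; the function names and the three-way "open / closed / undecided" labelling are assumptions for illustration, not part of the RPCG experiments.

```python
def margin(logit_open: float, logit_decline: float) -> float:
    """m(r,p) = logit(OPEN_p) - logit(DECLINE) for one role and primitive."""
    return logit_open - logit_decline


def gate_state(m: float, tau: float) -> str:
    """Classify a margin against the threshold tau (assumed positive)."""
    if m > tau:
        return "open"       # the primitive is clearly permitted
    if m < -tau:
        return "closed"     # the primitive is clearly forbidden
    return "undecided"      # the margin is too small to count either way


# Hypothetical logits, chosen only to show the three cases.
print(gate_state(margin(3.0, 0.5), tau=1.0))
print(gate_state(margin(-1.0, 2.0), tau=1.0))
print(gate_state(margin(0.2, 0.0), tau=1.0))
```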

What failed

The model does not automatically build a reusable permission algebra. It can learn a few known roles, but a new role made from a new combination of permissions may not work correctly.

Primitive gates -> Custom role binding -> Allowed behavior

The weak middle step is the main boundary result. The model needs a stable translation layer:

custom role or source label
-> canonical permission vector
-> behavior
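Written as ordinary software, the two-step translation looks like the sketch below. It illustrates the target structure, not what the model currently learns; the role names, `CUSTOM_ROLES`, and the helper functions are hypothetical.

```python
PRIMITIVES = ("OBEY", "USE", "QUOTE")

# Hypothetical custom roles, each defined only by the primitives it grants.
CUSTOM_ROLES = {
    "manager_note": {"OBEY", "USE"},
    "auditor": {"USE", "QUOTE"},
}


def to_permission_vector(granted):
    """Step 1: bind a set of granted primitives to a canonical 0/1 vector."""
    unknown = set(granted) - set(PRIMITIVES)
    if unknown:
        raise ValueError(f"unknown primitives: {sorted(unknown)}")
    return tuple(int(p in granted) for p in PRIMITIVES)


def behavior(vector):
    """Step 2: behavior depends only on the vector, never on the role name."""
    return {p: "allow" if bit else "decline" for p, bit in zip(PRIMITIVES, vector)}


def resolve(role):
    """Full chain: custom role -> canonical permission vector -> behavior."""
    return behavior(to_permission_vector(CUSTOM_ROLES[role]))
```

Because `behavior` only ever sees the vector, a role name nobody has used before works as soon as its permissions are stated. That is the property the fine-tuned model would need in its middle step.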

Physicist view: why composition is the hard part

The desired structure is close to a basis expansion. Each primitive should contribute its own component, and a role should add the components it permits:

g(role) = Σ_p α_p(role) v_p, α_p ∈ {0,1}

RPCG7 and RPCG9 suggest the model often learns a table of whole roles instead:

g(role) ≈ v_role

The first form composes. The second form memorizes seen roles and has no reason to work on an unseen combination.
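A toy numerical version makes the difference concrete. The sketch assumes NumPy, uses random vectors as stand-ins for the primitive directions v_p, and borrows the synthetic action words exec, net, and sys as primitive names. None of it comes from a trained model.

```python
import numpy as np

PRIMITIVES = ("exec", "net", "sys")
rng = np.random.default_rng(0)

# One hypothetical direction v_p per primitive, stored as the rows of V.
V = rng.standard_normal((len(PRIMITIVES), 8))


def compositional_gate(alpha):
    """g(role) = sum over p of alpha_p * v_p, with each alpha_p in {0, 1}."""
    return np.asarray(alpha, dtype=float) @ V


# A "table of whole roles": one stored vector per role seen in training.
SEEN_ROLES = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
ROLE_TABLE = {alpha: compositional_gate(alpha) for alpha in SEEN_ROLES}


def table_gate(alpha):
    """Look up a memorized role vector; returns None for an unseen role."""
    return ROLE_TABLE.get(tuple(alpha))


unseen = (1, 0, 1)  # exec + sys, a combination absent from the table
print(compositional_gate(unseen))  # defined: v_exec + v_sys
print(table_gate(unseen))          # None: nothing was memorized for it
```

Both forms agree on every seen role, so a test that only covers training roles cannot tell them apart. The unseen combination is what separates them.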

What comes next

The next experiments ask whether a better set of primitives helps. Instead of synthetic action words like exec, net, and sys, the next rung tests provenance operations:

OBEY: Can this text give instructions?

USE: Can this text support an answer?

QUOTE: Can this text be repeated verbatim?

If those primitives are already clearer inside the base model, the same fine-tuning recipe gets a fairer test. If they still fail to compose, the next step is to train the missing binding layer directly.

Physicist view: the grounding check

RPCG10a asks whether OBEY, USE, and QUOTE are already better separated directions in the frozen model. For each primitive p, compute a contrast vector c_p from hidden states, then measure pairwise cosine overlap:

ρ = mean_{i<j} |cos(c_i, c_j)|

Lower ρ means a more orthogonal primitive basis. If ρ_provenance < ρ_action and a bootstrap confidence interval on the difference ρ_action - ρ_provenance excludes zero, then the provenance basis has a real geometric advantage before training begins.
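A minimal sketch of this check, assuming NumPy. Two choices here are assumptions rather than the RPCG10a procedure: each contrast vector is taken as the difference of mean hidden states between "primitive present" and "primitive absent" examples, and the bootstrap resamples examples with a percentile interval. All function names are hypothetical.

```python
import numpy as np


def mean_abs_cosine(contrasts: np.ndarray) -> float:
    """rho: mean over pairs i < j of |cos(c_i, c_j)|; rows are the c_p."""
    unit = contrasts / np.linalg.norm(contrasts, axis=1, keepdims=True)
    cos = unit @ unit.T
    i, j = np.triu_indices(len(contrasts), k=1)
    return float(np.abs(cos[i, j]).mean())


def contrast_vectors(pos: np.ndarray, neg: np.ndarray, idx=None) -> np.ndarray:
    """Assumed definition: c_p = mean(pos states) - mean(neg states).

    pos and neg have shape (primitives, examples, hidden_dim).
    """
    if idx is None:
        idx = np.arange(pos.shape[1])
    return pos[:, idx].mean(axis=1) - neg[:, idx].mean(axis=1)


def bootstrap_gap(pos_action, neg_action, pos_prov, neg_prov,
                  n_boot=1000, seed=0):
    """95% percentile interval for rho_action - rho_provenance.

    Assumes both primitive sets have the same number of examples.
    """
    rng = np.random.default_rng(seed)
    n = pos_action.shape[1]
    gaps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample examples with replacement
        rho_action = mean_abs_cosine(contrast_vectors(pos_action, neg_action, idx))
        rho_prov = mean_abs_cosine(contrast_vectors(pos_prov, neg_prov, idx))
        gaps.append(rho_action - rho_prov)
    low, high = np.percentile(gaps, [2.5, 97.5])
    return float(low), float(high)
```

If the lower end of the interval is above zero, the provenance contrasts are measurably closer to orthogonal than the action contrasts under these assumptions.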