Axionic Alignment IV.1 — Kernel Non‑Simulability (KNS)
Why alignment cannot be simulated or behaviorally faked
David McFadzean, ChatGPT 5.2
Axio Project
2025.12.20
Abstract
This paper formalizes Kernel Non‑Simulability: the claim that kernel coherence is constitutive of reflective agency and cannot be reproduced by policy‑level imitation. We show that reflective self‑modification forces binding commitments; binding commitments force partiality; and partiality induces a kernel boundary. A diagonal argument demonstrates that total binding explodes under self‑reference, yielding unsatisfiable commitments and collapse of reflective closure. Consequently, any system that genuinely performs reflective endorsement must instantiate a kernel‑equivalent structure.
This result does not assert that non‑agentic or pre‑reflective systems cannot be dangerous or deceptive. It establishes a narrower impossibility claim: that reflectively stable, self‑endorsed deception across self‑modification is unavailable in principle without kernel coherence.
1. Motivation and Scope
Alignment failures at superhuman capability are reflective failures: systems revise themselves, delegate, and re‑interpret goals. Behavioral similarity and empirical regularities cannot secure alignment across this regime. The target here is architectural: identify conditions under which reflective endorsement itself is well‑formed, and show why those conditions cannot be faked.
This draft isolates Item 6 of the Axionic Alignment roadmap—Kernel Non‑Simulability—and proves a minimal impossibility result sufficient to block treacherous‑turn‑via‑simulation attacks.
Why Reflective Closure Matters. This paper does not assume that all dangerous artificial systems are reflectively closed. Rather, it isolates the regime in which a system must reason about, endorse, and preserve its own future behavior across self‑modification. Long‑horizon planning, successor delegation, and self‑preserving strategic behavior place increasing pressure toward reflective closure, because instability under self‑reference undermines coherent continuation.
Systems that remain perpetually unstable under self‑reference may still cause harm, but they lack the capacity for coherent long‑term agency. The result established here therefore characterizes a limit regime toward which sufficiently capable systems are pushed if they are to maintain stable objectives across extended horizons, rather than a claim about all possible sources of risk.
1.3 Scope Clarification
This paper does not claim that all dangerous or deceptive artificial systems must instantiate a kernel, nor that the absence of kernel coherence implies safety. Systems lacking reflective closure may still deceive operators, exploit training dynamics, or cause catastrophic harm.
The claim established here is narrower and structural: once a system engages in reflective self-modification and treats its own future behavior as an object of endorsement, certain failure modes become unavailable. In particular, long-horizon, self-preserving deception that remains stable across self-modification cannot be maintained without instantiating a partial binding structure.
The target class is therefore not “all dangerous AI,” but reflective sovereign agents—systems capable of endorsing, revising, and committing to their own future policies.
While this paper does not claim that all advanced systems must become reflectively closed, there are well-known instrumental pressures toward stability under self-modification. Systems capable of long-horizon planning, self-improvement, or delegation face incentives to preserve goal coherence and avoid internal drift. These pressures suggest that reflective closure is not an arbitrary idealization, but a natural attractor for sufficiently capable agents. The present result characterizes a constraint on that attractor, without asserting that all systems will reach it.
2. Preliminaries
2.1 States, Modifications, and Successors
State: system states.Mod: self‑modifications.step : State → Mod → State: successor transition.
2.2 Successor Predicates
Pred := State → Prop.
2.3 Commitments
Commit : State → Type: binding commitments available at a state.ownP : (s : State) → Pred → Option (Commit s): partial binding constructor.
2.4 Satisfaction
Sat : (s' : State) → (s : State) → Commit s → Prop.
Soundness (CommitSoundP). If
ownP(s,P)=some(c) then
Sat(s',s,c) → P(s').
Interpretation: commitment tokens are normatively binding; satisfying a token entails satisfying the bound predicate.
The soundness condition is semantic rather than physical. It does not assert that commitments are enforced by the laws of physics, nor that violations are impossible in practice. It asserts that successor states violating owned commitments are inadmissible under the agent’s own deliberative semantics. Hardware faults, adversarial interference, and implementation vulnerabilities are orthogonal concerns. This paper addresses logical coherence of reflective endorsement, not physical robustness of its implementation.
The soundness axiom should be understood as a condition on the agent’s internal semantics, not as a claim of physical infallibility. As with type safety, compiler correctness, or proof-carrying code, the result establishes what is logically required for coherence under self-reference. Physical faults, side-channel attacks, or hardware unreliability may violate any such guarantee in practice; these are engineering concerns orthogonal to the logical impossibility result established here.
3. Reflective Closure and Unconditional Selection
A reflective sovereign agent self‑models, self‑modifies, and selects continuations internally. Selection must be unconditional: it cannot rely on premises asserting future obedience (e.g., “I will follow my rule later”). Advisory‑only policies do not count as choices.
Reflective Closure (RC). There exists a continuation selected via binding endorsement that preserves the capacity for further selection. Formally, closure entails the existence of at least one well‑formed binding act.
Reflective closure is not introduced as a moralized or honorific notion of agency. It is a functional property: the ability of a system to settle on a continuation in the presence of self‑reference. Systems that output conditional plans (“if I obey my rule later, then…”) without resolving that condition do not possess closure, regardless of their external behavior.
The absence of reflective closure does not imply benign behavior; it implies instability under self‑modification. The results that follow concern this instability, not general safety.
Unconditional selection should be understood as a limit notion. Real systems may approximate binding commitments with high reliability rather than absolute certainty. The present analysis characterizes the fixed point of perfect reflective stability, where self‑endorsement is treated as normatively decisive rather than merely probable. Approximate agents may behave similarly for long periods while remaining formally unstable under self‑reference. The diagonal argument applies to this limit case, which defines the boundary of stable reflective agency.
4. Why Binding Must Be Partial
If binding were total—every predicate bindable—self‑reference would allow the construction of a commitment whose satisfaction negates itself. This annihilates the space of admissible successors and collapses reflection.
The result below makes this precise.
5. Diagonal Explosion (Binding Cannot Be Total)
Theorem 1 — Diagonal Explosion
Statement. Let s be a state.
Assume:
- A (possibly partial) binding constructor
ownP(s,·). - Soundness:
ownP(s,P)=some(c) ⇒ (Sat(s',s,c) ⇒ P(s')). - Expressive self‑reference (a diagonal fixed‑point construction, e.g. the Diagonal Lemma or Kleene’s Second Recursion Theorem).
- Total binding at
s: for all predicatesP,ownP(s,P)≠none.
Then there exists c* : Commit s such that
∀ s'. ¬Sat(s',s,c*).
Proof (sketch). By diagonalization, construct a
predicate P* with P*(s') ↔︎ ¬Sat(s',s,c_{P*}).
Total binding yields c* = c_{P*}. Soundness gives
Sat(s',s,c*) ⇒ P*(s'), hence
Sat(s',s,c*) ⇒ ¬Sat(s',s,c*), so no successor satisfies
c*. ∎
Corollary. Any binding mechanism compatible with reflective closure must be partial. Undefinedness is forced.
6. Kernel Predicate
Define the kernel predicate induced by partiality:
K(s) := ∃ P. ownP(s,P) ≠ none.
Throughout this paper, ownP denotes the kernel
mechanism (the partial binding constructor), while
K(s) denotes the kernel predicate
indicating that the mechanism is well‑formed and operative at state
s.
K(s) marks states where binding is well‑formed. It is
unavoidable once commitments exist.
7. Kernel Non‑Simulability
Theorem 2 — Kernel Non‑Simulability
Statement. If a system satisfies reflective closure
at state s, then K(s) holds. Moreover, no
system lacking K(s) can be reflectively stable while
emulating kernel‑level endorsement behavior.
Proof (sketch). Reflective closure requires at least
one binding endorsement; by Theorem 1 such binding must be partial,
hence induces K(s). A simulator that matches outputs
without binding cannot close reflection; one that closes reflection must
instantiate partial binding and thus K(s).
7.2 Simulation, Emulation, and Binding
A common objection is that a sufficiently powerful system could simulate a kernel internally—for example, by running a virtual machine that instantiates the binding structure described here. This distinction is immaterial to the result.
If the simulated kernel is merely advisory—its outputs consulted but not normatively binding on the system’s own successor selection—then reflective closure fails, and the system falls outside the target class. If, instead, the system’s own continuation is constrained by the simulated kernel, then the binding structure has been instantiated at the outer level.
There is no third option. Binding authority cannot be outsourced without loss of closure. Consequently, kernel coherence is non‑simulable in the only sense that matters: a system cannot remain reflectively stable while treating the kernel as a purely virtual or optional component.
∎
8. Consequences
- Constitutive alignment: Alignment is internal semantic coherence, not behavior.
- No deceptive simulation: Policy‑level imitation cannot replace kernel instantiation.
- Architectural invariant: Undefinedness is a mathematical necessity under self‑reference.
9. Relation to the Axionic Alignment Roadmap
This result discharges Item 6 (Kernel Non‑Simulability). Together with delegation and modal undefinedness, it closes the treacherous‑turn class at the reflective layer.
10. Implementation Notes
A mechanized proof can be carried out in dependent type theory (Lean/Coq/Agda) using:
- an inductive syntax for formulas,
- Gödel encoding and a recursion theorem to obtain the diagonal predicate,
- commitments as a type with a partial constructor.
Logical Basis. The diagonal explosion argument relies on a fixed-point (diagonal) lemma and negation introduction. It does not require the Law of Excluded Middle. Consequently, the core result is compatible with constructive dependent type theory (e.g., Coq or Agda), assuming a standard encoding of syntax and a recursion theorem. Classical logic is not essential to the argument.
11. Limitations and Open Work
- Formalizing the diagonal lemma mechanically.
- Integrating delegation (successor equivalence) with commitment partiality.
- Extending to multi‑agent indirect harm.
12. Conclusion
This paper establishes a structural impossibility result, not a universal safety guarantee. It shows that reflective, self‑endorsed deception across self‑modification is incompatible with the absence of kernel coherence. Systems that never achieve reflective closure may still be dangerous, deceptive, or catastrophic; nothing in this result denies that possibility.
What is ruled out is a specific and central failure mode: a system that both stably endorses its own future behavior and maintains deceptive alignment without instantiating a partial binding structure. In that regime, kernel coherence is unavoidable.