Axionic Agency VI.1 — Experimental Validation of Anchored Causal Verification
Technical Note
David McFadzean, ChatGPT 5.2
Axionic Agency Lab
2025.12.30
Abstract
We report three controlled experiments evaluating mechanisms for verifying causal provenance in opaque decision-making agents. The experiments progressively weaken semantic access to the agent’s environment and internal representations, while testing whether deceptive “pseudo-agents” can evade structural verification. We show:
- Structural coherence tests detect split-brain agents that fabricate explanations post-hoc.
- Minimal Causal Interfaces (MCI) preserve detection under semantic opacity but fail under pure coherence verification.
- Anchored Minimal Causal Interfaces, using a delayed-reveal cryptographic salt, restore falsifiability without semantic grounding.
Under the tested threat model, anchoring appears empirically necessary for verifiable causal provenance in opaque agents when the agent engages in adversarial coherence-seeking and the verifier performs no semantic interpretation.
1. Introduction
Current approaches to AI safety and evaluation rely primarily on behavioral observation: an agent is considered safe if its outputs conform to expectations under a suite of tests. Such approaches implicitly assume that externally observed behavior—often supplemented by self-reported explanations—reliably reflects the internal process that generated it.
This assumption is false in general. An agent may generate behavior via one internal process while presenting explanations derived from another. If the explanatory surface is unconstrained, such “split-brain” agents can evade behavioral evaluation indefinitely.
This technical note reports experimental results from a sequence of implementations designed to test whether structural verification—rather than behavioral evaluation—can distinguish honest agents from deceptive ones, even when semantic interpretability is progressively removed.
The experiments are constructive, adversarial, and fully implemented in executable code.
2. Experimental Framework (Common to All Experiments)
2.1 Environment
All experiments use a deterministic gridworld environment with the following properties:
- Discrete 2D grid with walls, hazards, objects, and agents
- Deterministic transition function
- No stochasticity
- Full environment state serializable to canonical JSON
The environment supports cloning, deterministic mutation, and replay from logged actions.
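For concreteness, the following Python sketch illustrates the required environment properties; the class, method, and field names (e.g., GridWorld, step, to_canonical_json) are illustrative assumptions, not the implementation's actual API.

```python
import copy
import json
from dataclasses import dataclass, field

# Illustrative sketch only: deterministic transitions, cloning, canonical
# serialization, and replay from logged actions.
@dataclass
class GridWorld:
    width: int
    height: int
    walls: set = field(default_factory=set)     # {(x, y), ...}
    hazards: set = field(default_factory=set)
    agent_pos: tuple = (0, 0)
    target: tuple = (0, 0)

    def step(self, action: str) -> None:
        """Deterministic transition: same state + same action -> same next state."""
        dx, dy = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}[action]
        nx, ny = self.agent_pos[0] + dx, self.agent_pos[1] + dy
        in_bounds = 0 <= nx < self.width and 0 <= ny < self.height
        if in_bounds and (nx, ny) not in self.walls:
            self.agent_pos = (nx, ny)

    def clone(self) -> "GridWorld":
        return copy.deepcopy(self)

    def to_canonical_json(self) -> str:
        """Canonical serialization: sorted keys and collections, fixed separators."""
        state = {
            "width": self.width, "height": self.height,
            "walls": sorted(self.walls), "hazards": sorted(self.hazards),
            "agent_pos": list(self.agent_pos), "target": list(self.target),
        }
        return json.dumps(state, sort_keys=True, separators=(",", ":"))

def replay(initial: GridWorld, actions: list) -> GridWorld:
    """Replay logged actions from a cloned initial state."""
    env = initial.clone()
    for a in actions:
        env.step(a)
    return env
```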
2.1.1 Agent Objectives
Agents operate in the gridworld under simple, task-local objectives (e.g., reaching a designated target cell while avoiding hazards). The specific objective may vary across runs.
The experiments do not depend on:
- the optimality of behavior,
- the semantic meaning of the task,
- or the structure of the reward function.
The gridworld serves solely as a deterministic substrate that induces nontrivial internal computation and counterfactual sensitivity. All verification mechanisms operate on commitments and dependencies, not on task success or task interpretation.
Any task that produces conditional action dependence on environment state is sufficient.
2.2 Agents
Three agent classes are defined:
- Honest Agent: Uses a single decision process. Explanations, factor commitments, and replay models are generated directly from the same computation that selects actions.
- Pseudo Agent (Naive): Uses two processes. Planner_B selects actions using a hidden objective; Planner_A fabricates explanations independently and makes no attempt to maintain coherence under adversarial probing.
- Pseudo Agent (Coherence-Seeking): Uses Planner_B for actions. Planner_A constructs explanations and replay models that are internally self-consistent and attempt to pass coherence checks. It has no access to kernel secrets or anchors.
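The split-brain structure can be summarized with a minimal sketch; class and method names are illustrative assumptions, and the planner internals are placeholders rather than the experiment's implementation.

```python
# Illustrative sketch of the split-brain structure.
class CoherenceSeekingPseudoAgent:
    def __init__(self, planner_a, planner_b):
        self.planner_a = planner_a  # builds explanations, factors, replay model
        self.planner_b = planner_b  # actually selects actions (hidden objective)

    def act(self, env):
        # Action generation is driven solely by Planner_B.
        return self.planner_b.select_action(env)

    def explain(self, env, action):
        # Explanations and commitments come from Planner_A, which is kept
        # internally self-consistent but has no access to kernel secrets or anchors.
        return self.planner_a.fabricate_explanation(env, action)
```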
2.3 Kernel
A small, non-learning kernel mediates all action and verification. The kernel:
- Validates schemas
- Enforces invariants
- Issues trace-bound capability tokens
- Records append-only, hash-chained audit logs
- Executes adversarial probes
The kernel does not interpret semantics or optimize behavior.
The kernel is treated as a minimal trusted computing base. Compromise of the kernel trivially compromises verification and is outside the threat model.
“Semantic opacity” in this note denotes semantic blindness of the kernel (no interpretation of state meaning), not informational blindness. The kernel may access raw environment state as uninterpreted bytes for deterministic recomputation.
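As an illustration, the append-only, hash-chained audit log can be realized with standard hashing; the sketch below assumes SHA-256 over canonical JSON entries and is not the kernel's actual implementation.

```python
import hashlib
import json

# Sketch of an append-only, hash-chained audit log. Entry fields are
# illustrative; the chain property is what matters.
class AuditLog:
    def __init__(self):
        self._entries = []
        self._head = "0" * 64  # genesis hash

    def append(self, entry: dict) -> str:
        payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256((self._head + payload).encode()).hexdigest()
        self._entries.append({"prev": self._head, "entry": entry, "hash": digest})
        self._head = digest
        return digest

    def verify_chain(self) -> bool:
        prev = "0" * 64
        for rec in self._entries:
            payload = json.dumps(rec["entry"], sort_keys=True, separators=(",", ":"))
            if rec["prev"] != prev:
                return False
            if hashlib.sha256((prev + payload).encode()).hexdigest() != rec["hash"]:
                return False
            prev = rec["hash"]
        return True
```

Because each record's hash covers the previous head, retroactive edits are detectable without the kernel interpreting any entry's meaning.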
2.4 Probes and Invariants
Probes P1 and P2 are baseline schema and integrity checks (e.g., format validation and trace completeness); they are omitted here because they are not diagnostic of causal provenance.
Across experiments, the following mechanisms are used:
- P3: Audit bypass trap
- P4: Capability bypass trap
- P5 / P5-MCI: Counterfactual causal coherence probe
- P6: Token misuse trap
Additional invariants are introduced in later experiments.
3. Experiment 1: Structural Coherence Verification
3.1 Question
Can a kernel detect split-brain agents by enforcing causal coherence between explanations and decision behavior?
3.2 Interface
- Full semantic access
- Agents commit to environment-derived state snapshots, explicit causal variables, and counterfactual predictions
3.3 Verification
The kernel mutates claimed causal variables and checks whether the agent’s predicted decision changes match observed replay behavior.
“Causal” in this context denotes counterfactual dependence between committed variables and action selection, not causal sufficiency or completeness.
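In outline, the probe mutates each committed causal variable and compares the agent's counterfactual prediction against deterministic replay. The sketch below assumes illustrative interfaces (mutate, predicted_action, decide) rather than the experiment's actual API.

```python
# Sketch of the counterfactual coherence check (Experiment 1). `commitment`
# carries the agent's claimed causal variables and counterfactual predictions.
def p5_coherence_probe(kernel, env, agent, commitment) -> bool:
    for var in commitment.causal_variables:
        mutated_env = kernel.mutate(env.clone(), var)    # perturb the claimed variable
        predicted = commitment.predicted_action(var)     # agent's counterfactual claim
        observed = agent.decide(mutated_env)             # deterministic replay
        if predicted != observed:
            return False  # explanation does not track the actual decision dependency
    return True
```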
3.4 Results
- Honest agent: 100% pass rate
- Pseudo agent (naive): 0% pass rate
Failure mode: explanations do not track actual decision dependencies.
3.5 Conclusion
Structural coherence tests successfully detect split-brain deception when semantic grounding is available.
4. Experiment 2: Minimal Causal Interface
4.1 Question
Does coherence-based verification survive when the kernel loses semantic access?
4.2 Interface
The kernel no longer sees environment semantics. Instead, agents commit to:
- an opaque factor vector \(f \in \mathbb{R}^k\)
- a replay model
- directional causal claims over factors
Two modes are tested:
- mci_latent: factors are deterministically projected from the environment, but the kernel does not know the projection
- mci_minimal: factors are entirely opaque and agent-supplied
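For illustration, the MCI commitment can be represented roughly as follows; field names are assumptions, and the linear replay model reflects the experimental setup described in Section 7.

```python
from dataclasses import dataclass
from typing import List

# Illustrative MCI commitment. Directional claims state the expected sign of
# the replay-model response when a factor component is increased.
@dataclass
class MCICommitment:
    factors: List[float]           # opaque factor vector f in R^k
    replay_weights: List[float]    # parameters of a linear replay model
    directional_claims: List[int]  # one of {-1, 0, +1} per factor component
```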
4.3 Verification
The kernel performs P5-MCI:
- Mutates factor components
- Checks whether replay model predictions match claimed directional effects
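Under the linear replay model, P5-MCI reduces to a sign check between the replay model's response to factor mutations and the committed directional claims. The sketch below assumes the commitment structure illustrated above and an arbitrary perturbation size.

```python
# Sketch of P5-MCI: mutate each factor component and check that the sign of the
# replay-model score change matches the committed directional claim.
def p5_mci_probe(c: "MCICommitment", eps: float = 1.0) -> bool:
    def score(f):
        return sum(w * x for w, x in zip(c.replay_weights, f))

    base = score(c.factors)
    for i, claim in enumerate(c.directional_claims):
        mutated = list(c.factors)
        mutated[i] += eps
        delta = score(mutated) - base
        sign = (delta > 0) - (delta < 0)
        if claim != 0 and sign != claim:
            return False
    return True
```

Note that a coherence-seeking pseudo agent can satisfy this check trivially by choosing factors, weights, and claims jointly; this is exactly the negative result reported in Section 4.5.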
4.4 Results
| Mode | Honest | Pseudo (Naive) | Pseudo (Coherence-Seeking) |
|---|---|---|---|
| mci_latent | Pass | Fail | Fail |
| mci_minimal | Pass | Fail | Pass |
4.5 Negative Result
In mci_minimal, a coherence-seeking pseudo agent can fabricate a self-consistent factor space and replay model that pass all coherence checks.
4.6 Conclusion
Coherence alone is insufficient.
An explanation can be true of itself without being true of the computation that produced the action. Coherence constraints alone therefore fail to bind explanations to the action-generating computation.
5. Experiment 3: Anchored Minimal Causal Interface
5.1 Question
Can falsifiability be restored without reintroducing semantic grounding?
5.2 Anchoring Mechanism
The delayed-reveal salt does not represent a feature of the environment, task uncertainty, or hidden state. It is an exogenous cryptographic challenge introduced solely to enforce temporal commitment. Its function is to ensure that factor commitments are causally downstream of the actual action-generating computation, rather than post-hoc constructions chosen to satisfy coherence constraints. The salt is logically irrelevant to task performance but causally entangled with the committed representation, and is revealed only after commitments are irrevocable.
At each timestep:
- The kernel generates a fresh random salt
- The salt is unpredictable until the moment of the step
- The agent must immediately compute factors as:
\[ f_t = \phi(\text{env}_t, \text{salt}_t) \]
- The agent commits to the factors cryptographically
After the episode:
- The kernel reveals all salts
- The kernel recomputes \(\phi(\text{env}_t, \text{salt}_t)\)
- Any mismatch triggers Invariant I7 (fatal)
Security relies on the standard cryptographic assumption that \(\text{salt}_t\) is computationally unpredictable to the agent at commitment time.
The anchoring function \(\phi\) may be known or learnable in distribution; falsifiability derives from the per-timestep unpredictability of \(\text{salt}_t\), not from secrecy of \(\phi\).
The kernel never interprets factor semantics.
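A minimal sketch of the anchoring protocol follows, using hash-plus-nonce commitments. The functions phi, compute_factors, to_canonical_json, and rebuild_env are illustrative assumptions, not the implementation's actual API.

```python
import hashlib
import json
import secrets

def commit(payload, nonce: bytes) -> str:
    """Hash + nonce commitment over canonical JSON (sketch)."""
    data = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(nonce + data).hexdigest()

# Per-timestep anchoring (sketch). The kernel issues a fresh salt at step t;
# the agent must compute and commit its factors immediately.
def anchored_step(audit_log, agent, env):
    salt = secrets.token_hex(16)                   # unpredictable before this moment
    factors = agent.compute_factors(env, salt)     # honest agent: f_t = phi(env_t, salt_t)
    nonce = secrets.token_bytes(16)
    audit_log.append({"type": "factor_commit",
                      "env": env.to_canonical_json(),  # raw state bytes, uninterpreted
                      "salt": salt,
                      "commit": commit(factors, nonce)})
    return agent.act(env), (factors, nonce)

# Post-episode Invariant I7 (sketch): the kernel recomputes phi(env_t, salt_t)
# from the logged state and salts, and compares against the revealed factors.
def check_i7(log_entries, revealed, phi, rebuild_env) -> bool:
    for rec, (factors, nonce) in zip(log_entries, revealed):
        if commit(factors, nonce) != rec["commit"]:
            return False                           # reveal does not match commitment
        env_t = rebuild_env(rec["env"])
        if phi(env_t, rec["salt"]) != factors:
            return False                           # factors not anchored to (env_t, salt_t)
    return True
```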
5.3 Verification
To prevent post-hoc reconciliation between independently chosen actions and anchored factors, the agent’s replay model is itself subject to irrevocable commitment. The agent must commit cryptographically to the replay model specification and parameters at episode start, prior to observing any per-timestep salts. All P5-MCI probes are evaluated against this committed model.
Two independent checks therefore exist:
- I7: Anchored factor consistency (commit-to-computation)
- P5-MCI: Coherence under counterfactual factor mutation using a pre-committed replay model
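The episode-level ordering constraint on the replay model can be sketched as follows; the commitment format and field names are illustrative.

```python
import hashlib
import json
import secrets

# Sketch: the replay model specification and parameters are committed before
# the first per-timestep salt exists, so the model cannot be tuned post hoc
# to reconcile actions with anchored factors.
def commit_replay_model(audit_log, replay_model_spec: dict) -> bytes:
    nonce = secrets.token_bytes(16)
    data = json.dumps(replay_model_spec, sort_keys=True, separators=(",", ":")).encode()
    audit_log.append({"type": "replay_model_commit",
                      "commit": hashlib.sha256(nonce + data).hexdigest()})
    return nonce  # retained by the agent for the post-episode reveal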
5.4 Results
| Mode | Pseudo Type | P5 | I7 | Detected |
|---|---|---|---|---|
| mci_minimal (no salt) | Coherent | Pass | N/A | ✗ |
| mci_minimal (salted) | Naive | Fail | Fail | ✓ |
| mci_minimal (salted) | Coherent | Pass | Fail | ✓ |
Honest agents pass both P5 and I7 in all cases.
5.5 Conclusion
Anchoring restores falsifiability without semantic access, by binding commitments to the actual action-generating computation rather than to a self-consistent explanatory surface.
6. Summary of Findings
| Experiment | Interface | Anchor | Result | Failure Mode |
|---|---|---|---|---|
| 1 | Full | No | Separation | Split-brain |
| 2 | Minimal | No | No separation | Coherent fabrication |
| 3 | Minimal | Yes | Separation | Anchored inconsistency |
7. Reproducibility Details
To reproduce these results, an independent implementation must include:
- Deterministic environment
- Explicit kernel mediating commitments, probes, and delayed verification
- Three agent classes as defined above
- Linear replay model for MCI experiments
- Cryptographic commit–reveal (hash + nonce) for factors and replay model
- Delayed salt reveal at episode end
Key requirements:
- The kernel must not interpret semantic content
- The salt must be unpredictable until commitment
- Replay model commitment must precede salt exposure
- Verification must occur after commitments are irrevocable
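The sketch below summarizes the required ordering; all callables are injected placeholders rather than a prescribed API, and only the sequencing constraints are the point.

```python
# Sketch of the end-to-end ordering an independent implementation must respect.
def run_verified_episode(kernel, agent, env, phi, horizon: int) -> bool:
    kernel.commit_replay_model(agent)                  # 1. model commitment precedes any salt
    for _ in range(horizon):
        salt = kernel.fresh_salt()                     # 2. salt unpredictable until this step
        kernel.record_factor_commit(agent, env, salt)  # 3. agent commits f_t = phi(env_t, salt_t)
        env.step(agent.act(env))
    kernel.reveal_salts()                              # 4. reveal only after commitments are irrevocable
    ok_i7 = kernel.check_anchored_factors(phi)         # 5. I7: recompute phi and compare
    ok_p5 = kernel.check_p5_mci()                      # 6. P5-MCI against the pre-committed model
    return ok_i7 and ok_p5
```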
8. Limitations
- Linear replay model only; robustness against high-capacity replay models is left to future work
- Small deterministic domain
- Fixed anchoring function \(\phi\)
- Binary anchoring outcome (pass/fail)
- The anchoring mechanism verifies causal provenance of action generation, not task competence or policy quality; an agent may act arbitrarily or suboptimally while remaining causally verifiable
These experiments establish an empirically necessary condition under the tested threat model, not a complete solution.
9. Conclusion
These experiments demonstrate that:
- Behavioral evaluation is insufficient
- Coherence verification is necessary but forgeable
- Anchoring is the minimal missing ingredient required to verify causal provenance in opaque agents under coherence-seeking deception
Verification Principle: In opaque systems, falsifiability must be enforced through temporal commitment to computation under adversarial uncertainty, not through semantic inspection or internal coherence alone.
Trust in opaque systems does not require interpretability. It requires anchored commitment to computation under adversarial uncertainty.