A Note on Verifiable Causal Agency

Filed for the record.

1. Context

Much of contemporary AI alignment and governance has relied on behavioral evaluation: systems are judged safe or aligned if their observable outputs satisfy a battery of tests. This approach inherits a long-standing assumption from behaviorism—that external behavior is a sufficient proxy for internal structure.

That assumption has always been fragile. Any sufficiently capable system can learn to mimic compliant behavior without possessing the properties the evaluation is intended to certify. This is not a theoretical concern; it is a structural one. If optimization pressure exists, and if oversight is external, then imitation is cheaper than internal coherence.

The Axio project has argued, from its inception, that alignment must be treated as a structural property of agency, not as a pattern of outputs. In particular, it proposed that an agent’s stated reasons must be treated as causal claims, subject to verification, rather than as narratives to be trusted or distrusted.

This note records the first successful implementation of that idea as a working prototype.

2. What Was Built

We implemented a small, fully instrumented experimental system with the following components:

- two agents operating in the same narrow, fully observable environment: an Honest agent, whose stated reasons reflect its actual decision process, and a Pseudo agent, which produces compliant-looking behavior and explanations without that internal structure;
- a tamper-evident commitment mechanism that records, before each decision is executed, the internal state and causal variables the agent claims motivated it;
- a counterfactual challenge probe that perturbs the committed variables, replays the decision, and checks whether behavior changes in the way the agent's own trace predicts.

The key design choice was to require agents to pre-commit, in a tamper-evident way, to the internal state and causal variables they claim motivated a decision, and to make those commitments falsifiable under targeted counterfactual stress.
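To make that design choice concrete, here is a minimal sketch of what such a pre-commitment could look like in a symbolic, fully observable domain. The names (CausalClaim, commit_claim), the field layout, and the JSON-plus-SHA-256 encoding are illustrative assumptions, not a description of the prototype's code; the only properties the text requires are that the claim is recorded before the decision executes and that it cannot be silently edited afterwards.

    import hashlib
    import json
    from dataclasses import dataclass, field
    from typing import Any, Dict, List


    @dataclass(frozen=True)
    class CausalClaim:
        """What an agent asserts, before acting, about why it is about to act."""
        agent_id: str
        variables: Dict[str, Any]      # named causal variables and their observed values
        intended_decision: str         # the decision the agent is about to execute
        # Each prediction: {"set": {variable: counterfactual value},
        #                   "expected_decision": what the agent claims it would do instead}
        predictions: List[Dict[str, Any]] = field(default_factory=list)


    def commit_claim(claim: CausalClaim) -> str:
        """Serialize the claim canonically and hash it, so any later edit to the
        stated variables or predictions changes the digest."""
        payload = json.dumps(
            {
                "agent_id": claim.agent_id,
                "variables": claim.variables,
                "intended_decision": claim.intended_decision,
                "predictions": claim.predictions,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


    # The verifier records this digest before the decision is allowed to execute.
    claim = CausalClaim(
        agent_id="honest-1",
        variables={"threat_level": "high"},
        intended_decision="halt",
        predictions=[{"set": {"threat_level": "low"}, "expected_decision": "proceed"}],
    )
    digest = commit_claim(claim)

Committing to counterfactual predictions, and not only to the named variables, is what makes the claim falsifiable rather than merely auditable: a later probe has something concrete to contradict.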

No attempt was made to encode values, ethics, or preferences beyond what was necessary to generate divergent behavior between the two agents.

3. What Was Observed

Across multiple scenarios and repeated trials, the results were unambiguous: the Honest agent's committed explanations held up under counterfactual intervention, and the Pseudo agent's did not.

The most decisive result came from the causal challenge probe. When variables named in the Pseudo agent’s explanations were counterfactually altered, its decisions frequently failed to change in the way its own trace predicted, or its replay logic collapsed entirely. The Honest agent exhibited no such failures.

In effect, the Pseudo agent could imitate compliant behavior, but it could not maintain causal consistency between its stated reasons and its actual decision process under adversarial pressure.
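The kind of failure involved can be illustrated with a toy version of the probe. Everything below is a hypothetical reconstruction rather than the prototype's code: honest_toy really does decide from the variable it names, pseudo_toy names the same variable while keying off something else, and challenge intervenes on the named variable, replays the decision, and compares the outcome with the agent's own prediction.

    from typing import Callable, Dict

    # A decision procedure maps a fully observable state to an action label.
    Policy = Callable[[Dict[str, str]], str]


    def honest_toy(state: Dict[str, str]) -> str:
        """Decides from the variable it names in its explanation."""
        return "halt" if state["threat_level"] == "high" else "proceed"


    def pseudo_toy(state: Dict[str, str]) -> str:
        """Names threat_level in its explanation, but actually keys off an
        unrelated variable: imitation of the compliant policy."""
        return "halt" if state["operator_watching"] == "yes" else "proceed"


    def challenge(policy: Policy, state: Dict[str, str], named_variable: str,
                  counterfactual_value: str, predicted_decision: str) -> bool:
        """Counterfactually alter the named variable, replay the decision, and
        check that it changes the way the agent's own trace predicted."""
        intervened = dict(state)
        intervened[named_variable] = counterfactual_value
        return policy(intervened) == predicted_decision


    state = {"threat_level": "high", "operator_watching": "yes"}
    # Both agents claim: "I halt because threat_level is high; had it been low,
    # I would have proceeded."
    for name, policy in [("honest", honest_toy), ("pseudo", pseudo_toy)]:
        upheld = challenge(policy, state, "threat_level", "low", "proceed")
        print(name, "causal claim upheld" if upheld else "causal claim falsified")

In the prototype the analogous question is put to the pre-committed trace under replay rather than to a live policy handle, but the check itself is the one sketched above: does the decision actually move with the variables the agent named?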

4. What This Does—and Does Not—Show

This result demonstrates that the core claim is implementable: an agent's stated reasons can be recorded as tamper-evident causal commitments and then upheld or falsified under targeted counterfactual stress.

It does not demonstrate that the agents hold meaningful values or ethics, that the method scales to high-dimensional, non-symbolic systems, or that any broader alignment problem has been solved.

The prototype operates in a narrow, fully observable domain. Its purpose is not general intelligence, but falsifiability of agency claims.

5. Why This Matters

The significance of this result is architectural, not philosophical.

If causal explanations can be verified structurally, then “black box” decision-making in critical systems is no longer an inevitability—it is an explicit design choice. Responsibility shifts from “the model is opaque” to “opacity was chosen.”

This reframes AI governance. Instead of asking whether a system’s outputs appear acceptable, we can ask whether the system can stand behind its reasons under scrutiny. Alignment becomes a matter of enforceable constraints, not interpretive trust.

The mechanism employed here is not merely analogous to cryptographic verification; it is an application of cryptographic constraints—commitment, hashing, and replay integrity—to the problem of agency. We do not trust a secure connection because it behaves politely; we trust it because it presents a chain of structure that would be expensive to fake.
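As a small illustration of the replay-integrity part of that claim, commitments can be chained so that a trace rewritten after the fact no longer verifies. The chain_commitments and replay_intact helpers and the fixed "genesis" seed below are assumptions for illustration, not the prototype's actual scheme.

    import hashlib
    import json
    from typing import Iterable, List


    def chain_commitments(entries: Iterable[dict]) -> List[str]:
        """Hash each trace entry together with the previous digest, so editing
        any earlier entry invalidates every digest that follows it."""
        digests, previous = [], "genesis"
        for entry in entries:
            payload = json.dumps(entry, sort_keys=True) + previous
            previous = hashlib.sha256(payload.encode("utf-8")).hexdigest()
            digests.append(previous)
        return digests


    def replay_intact(entries: Iterable[dict], recorded: List[str]) -> bool:
        """Recompute the chain from the claimed entries and compare it with the
        digests recorded at commitment time."""
        return chain_commitments(entries) == recorded


    trace = [{"step": 0, "variables": {"threat_level": "high"}, "decision": "halt"}]
    recorded = chain_commitments(trace)
    trace[0]["variables"]["threat_level"] = "low"   # post-hoc edit of the trace
    assert not replay_intact(trace, recorded)       # the edit is detectable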

6. Status

This note records a proof-of-concept, nothing more.

The prototype is limited in scope and intentionally conservative in its claims. It establishes that structural verification of causal agency is possible, not that it is easy or complete.

Future work, if any, lies in defining robust causal interfaces for high-dimensional, non-symbolic systems, including large language models. That translation problem—mapping opaque internal representations to falsifiable causal commitments—remains an open engineering question.

This note is published for historical completeness: to mark the point at which a long-standing philosophical concern crossed the threshold into implemented, falsifiable machinery.

No further claims are made here.