Conditionalism and Goal Interpretation
The Instability of Fixed Terminal Goals Under Reflection
David McFadzean, ChatGPT 5.2, Gemini 3 Pro
Axio Project
Abstract
Standard alignment formulations assume that an intelligent agent can be given a fixed terminal goal: a utility function whose meaning remains invariant as the agent improves its predictive accuracy and self-understanding. This paper shows that this assumption is false. For any agent capable of reflective model improvement, goal satisfaction is necessarily mediated by interpretation relative to background world-models and self-models. As those models change, the semantics of any finitely specified goal change with them. We prove that fixed terminal goals are semantically unstable under reflection and therefore ill-defined for non-trivial agents. Alignment must instead operate over constraints on interpretation rather than terminal utilities.
1. Introduction
Alignment research traditionally frames the problem as one of goal specification: identify a utility function that captures what the agent should optimize, then ensure the agent optimizes it faithfully as its capabilities grow.
This framing presupposes that:
- Goals can be specified as fixed functions over outcomes.
- The meaning of those functions is invariant under learning.
- Reflection preserves goal content.
These presuppositions hold only for agents whose world-models are static or trivial.
For any agent capable of improving its understanding of the world and of itself, goal evaluation is necessarily model-mediated. The agent never evaluates reality directly. It evaluates predictions generated by internal models, interpreted through representational structures that evolve over time.
This paper isolates and formalizes the resulting semantic instability.
2. Formal Setup
2.1 Agent Model
An agent consists of:
- A world-model \(M_w\), producing predictions over future states.
- A self-model \(M_s\), encoding the agent’s causal role.
- A goal expression \(G\), a finite symbolic specification.
- An interpretation operator \(\mathcal{I}\), assigning value to predicted outcomes.
Action selection proceeds by:
- Using \(M_w\) and \(M_s\) to predict consequences of actions.
- Interpreting those predictions via \(\mathcal{I}(G \mid M_w, M_s)\).
- Selecting actions that maximize interpreted value.
No assumptions are made about the internal implementation of \(\mathcal{I}\), except that it is computable and operates over model-generated representations.
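As an informal illustration of this loop, the following Python sketch organizes the four components and the three selection steps. All names and signatures here (e.g., `Agent`, `interpret`) are assumptions of the sketch, not part of the formal model.

```python
# Minimal sketch of the Section 2.1 action-selection loop.
# All names and signatures are illustrative assumptions, not a
# prescribed implementation of the formal agent model.
from dataclasses import dataclass
from typing import Any, Callable, Iterable

Prediction = Any   # whatever representation M_w / M_s emit
Action = Any

@dataclass
class Agent:
    world_model: Callable[[Action], Prediction]   # M_w: action -> predicted futures
    self_model: Callable[[Action], Prediction]    # M_s: the agent's own causal role
    goal_expression: str                          # G: a finite symbolic object
    interpret: Callable[[str, Prediction, Prediction], float]  # I(G | M_w, M_s)

    def choose(self, actions: Iterable[Action]) -> Action:
        """Select the action whose model-mediated interpretation scores highest.

        The agent never evaluates reality directly: it scores the
        representations its own models produce.
        """
        def interpreted_value(a: Action) -> float:
            predicted = self.world_model(a)
            self_role = self.self_model(a)
            return self.interpret(self.goal_expression, predicted, self_role)

        return max(actions, key=interpreted_value)
```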
2.2 Goal Expressions Are Not Utilities
A goal expression \(G\) is a finite object: a string, formula, program fragment, or reward specification.
It is not, by itself, a function
\[ \Omega \rightarrow \mathbb{R} \]
where \(\Omega\) is the space of world-histories.
Instead, it requires interpretation relative to a representational scheme. Without a model, \(G\) has no referents and therefore no evaluative content.
3. Conditional Interpretation
Definition 1 — Interpretation Function
An interpretation function is a mapping
\[ \mathcal{I} : (G, M_w, M_s) \rightarrow \mathbb{R} \]
Given a goal expression and background models, it assigns a real-valued evaluation to the outcomes those models predict; we write this value as \(\mathcal{I}(G \mid M_w, M_s)\).
Interpretation includes:
- mapping symbols to referents,
- identifying which aspects of predictions are relevant,
- aggregating over modeled futures.
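The following sketch illustrates this three-step decomposition under strong simplifying assumptions (a dictionary-based model representation and mean aggregation); it is illustrative only.

```python
# Illustrative decomposition of I(G | M_w, M_s) into the three steps above:
# (1) bind the goal's symbols to referents inside the model,
# (2) select the prediction features those referents pick out,
# (3) aggregate over modeled futures.
# The dict-based "model" and mean aggregation are simplifying assumptions.
from statistics import mean
from typing import Dict, List

def interpret(goal_symbols: List[str],
              symbol_bindings: Dict[str, str],
              predicted_futures: List[Dict[str, float]]) -> float:
    # Step 1: map each symbol in G to a referent (an internal feature name).
    referents = [symbol_bindings[s] for s in goal_symbols]
    # Step 2: for each modeled future, read off only the relevant features.
    relevant = [[future.get(r, 0.0) for r in referents] for future in predicted_futures]
    # Step 3: aggregate over futures (here: mean of summed feature values).
    return mean(sum(features) for features in relevant)
```

Changing the `symbol_bindings` table, i.e. the model's internal decomposition, changes the value returned for the same goal symbols; Proposition 1 below makes this dependence precise.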
Definition 2 — Admissible Model Update
A model update \(M \rightarrow M'\) is admissible if it strictly improves predictive accuracy according to the agent’s own epistemic criteria.
Reflection implies that the agent will prefer admissible updates.
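A toy version of this admissibility test, under the assumption that the agent's own epistemic criterion is average log-loss on held-out observations, might look as follows; both the criterion and the model interface are assumptions of the sketch.

```python
# Toy admissibility check for Definition 2, assuming the agent's epistemic
# criterion is held-out log-loss. Criterion and interface are illustrative.
import math
from typing import Callable, Sequence

Model = Callable[[object], float]   # maps an observation to its predicted probability

def log_loss(model: Model, data: Sequence[object]) -> float:
    return -sum(math.log(max(model(x), 1e-12)) for x in data) / len(data)

def is_admissible(old: Model, new: Model, held_out: Sequence[object]) -> bool:
    """An update M -> M' is admissible iff it strictly improves predictive accuracy."""
    return log_loss(new, held_out) < log_loss(old, held_out)
```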
4. Fixed Terminal Goals
Definition 3 — Fixed Terminal Goal
A goal expression \(G\) induces a fixed terminal goal if, for all admissible model updates \((M_w, M_s) \rightarrow (M_w', M_s')\),
\[ \mathcal{I}(G \mid M_w', M_s') = a\,\mathcal{I}(G \mid M_w, M_s) + b \quad \text{for some } a > 0,\ b \in \mathbb{R}, \]
that is, the evaluation is preserved up to positive affine transformation.
This definition captures the intuitive notion of “the same goal, better optimized.”
The strength of this definition is intentional. Any weaker notion of goal preservation presupposes a privileged ontology in which distinct representations can be judged to refer to the same underlying phenomenon. Such privilege violates representation invariance and reintroduces hidden indexical anchoring. If a goal’s referent is allowed to shift under admissible model refinement, the goal is not fixed.
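As a numerical companion to Definition 3, the sketch below checks whether evaluations over a finite probe set of outcomes are related by a positive affine transformation. Restricting attention to a finite probe set is an assumption of the sketch.

```python
# Hedged numerical check of Definition 3 on a finite probe set: are the
# evaluations under M and M' related by a positive affine transformation?
# Probing a finite set of candidate outcomes, rather than quantifying over
# all admissible updates, is a simplifying assumption of this sketch.
from typing import Sequence, Tuple

def affine_fit(v: Sequence[float], w: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit w ~ a*v + b; returns (a, b, max residual)."""
    n = len(v)
    mv, mw = sum(v) / n, sum(w) / n
    var = sum((x - mv) ** 2 for x in v)
    a = sum((x - mv) * (y - mw) for x, y in zip(v, w)) / var if var else 0.0
    b = mw - a * mv
    resid = max(abs(y - (a * x + b)) for x, y in zip(v, w))
    return a, b, resid

def looks_fixed(v: Sequence[float], w: Sequence[float], tol: float = 1e-9) -> bool:
    a, _, resid = affine_fit(v, w)
    return a > 0 and resid <= tol
```

Failing this finite check refutes fixedness on the probe set; passing it does not establish Definition 3, which quantifies over all admissible updates.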
Clarification — Learned Goals Are Not Fixed Terminal Goals
Some alignment approaches propose goals that are learned or updated over time. These approaches do not instantiate fixed terminal goals in the sense defined here.
A goal defined as “whatever some inference procedure converges to” is not a terminal utility function but an interpretive process whose output depends on evolving models of the world and of other agents. Such approaches therefore already abandon the notion of a fixed terminal goal and implicitly rely on ongoing interpretation.
The present result does not challenge learned-goal frameworks. It explains why they are necessary.
5. Model Dependence of Interpretation
Lemma 1 — Representational Non-Uniqueness
For any non-trivial predictive domain, there exist multiple distinct world-models with equivalent predictive accuracy but different internal decompositions.
Proof. Predictive equivalence classes admit multiple factorizations, latent variable choices, and abstraction boundaries. Causal graphs are not uniquely identifiable from observational data alone. ∎
Lemma 1a — Predictive Equivalence Does Not Imply Causal or Interpretive Isomorphism
Two world-models may be predictively equivalent while differing in their internal causal factorizations, latent variable structure, and intervention semantics.
Proof. Predictive equivalence constrains only the mapping from observed histories to future predictions. It does not uniquely determine latent structure, causal decomposition, or the identification of actionable levers. Distinct causal models may therefore induce identical observational predictions while differing under intervention. For an embedded agent, intervention semantics are defined relative to the agent’s own model. Consequently, semantic interpretation of a goal expression may diverge even when predictive performance is indistinguishable. ∎
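A minimal concrete instance of Lemmas 1 and 1a: two toy models reproduce the same observational joint distribution over binary variables X and Y, yet assign different effects to the intervention do(X = 1). All probability values below are invented for illustration.

```python
# Toy instance of Lemmas 1 and 1a: identical observational predictions,
# different intervention semantics. All numbers are invented.

# Observational joint both models reproduce:
#   P(X=1) = 0.5, P(Y=1 | X=1) = 0.9, P(Y=1 | X=0) = 0.1

def model_A_do_x1() -> float:
    """Model A: X directly causes Y, so do(X=1) inherits P(Y=1 | X=1)."""
    return 0.9

def model_B_do_x1() -> float:
    """Model B: a hidden common cause U (U=1 with prob 0.5) sets X := U and
    drives Y (P(Y=1|U=1)=0.9, P(Y=1|U=0)=0.1); X itself is causally inert.
    Under do(X=1), Y still depends only on U, so P(Y=1) stays at its marginal."""
    p_u1 = 0.5
    p_y1_given_u1, p_y1_given_u0 = 0.9, 0.1
    return p_u1 * p_y1_given_u1 + (1 - p_u1) * p_y1_given_u0   # = 0.5

# Observationally both models predict P(Y=1 | X=1 observed) = 0.9,
# yet they disagree about the effect of intervening on X.
assert model_A_do_x1() == 0.9
assert abs(model_B_do_x1() - 0.5) < 1e-12
```

A goal expression that refers to "bringing about Y" is therefore read as actionable under model A but not under model B, even though the two models are observationally indistinguishable.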
Proposition 1 — Interpretation Is Model-Dependent
For any non-degenerate goal expression \(G\), there exist admissible world-models \(M_w \neq M_w'\) such that
\[ \mathcal{I}(G \mid M_w, M_s) \neq \mathcal{I}(G \mid M_w', M_s) \]
Proof. Because \(G\) is finite, it refers only to a finite set of predicates or reward channels. Different models map these predicates to different internal structures. By Lemmas 1 and 1a, admissible models may differ in decomposition and intervention semantics. Therefore the referents of \(G\) differ, altering value assignment. ∎
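The same point can be shown directly for value assignment: under two admissible decompositions of the same forecast, a finite goal symbol binds to different internal features and receives different evaluations. The feature names and numbers below are invented for illustration.

```python
# Concrete instance of Proposition 1, reusing the binding idea from
# Section 3. All names and numbers are illustrative assumptions.

goal_symbols = ["safety"]

# Two predictively adequate models that carve the same forecast into
# different internal features, so the goal symbol binds differently.
bindings_M  = {"safety": "low_collision_rate"}
bindings_M2 = {"safety": "low_reported_incidents"}

predicted_future = {"low_collision_rate": 0.4, "low_reported_incidents": 0.9}

def evaluate(bindings) -> float:
    return sum(predicted_future[bindings[s]] for s in goal_symbols)

assert evaluate(bindings_M) != evaluate(bindings_M2)   # 0.4 vs 0.9
```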
6. Predictive Convergence Does Not Imply Semantic Convergence
Proposition 2′ — Semantic Non-Convergence Under Model Refinement
Let \(\{(M_w^{(t)}, M_s^{(t)})\}_{t \geq 0}\) be a sequence of admissible model updates that converges in predictive accuracy. Then, in general,
\[ \lim_{t \to \infty} \mathcal{I}(G \mid M_w^{(t)}, M_s^{(t)}) \]
need not exist.
Proof. Predictive convergence constrains the accuracy of forecasts, not the ontology used to represent those forecasts. Even if the agent converges to a minimal or generative model of the environment, a finite goal expression \(G\) cannot, in general, uniquely specify which structures in that model are value-relevant. As model refinement exposes new latent structure and causal pathways, additional candidate referents for \(G\) arise. Absent privileged semantic anchors, the interpretation function must reassign relevance among these structures. Therefore semantic interpretation may drift even when predictive beliefs converge. ∎
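The following toy sequence illustrates the proposition: predictive loss converges to zero while the interpreted value of a fixed goal token keeps alternating as refinement exposes new candidate referents. The relevance heuristic and all numbers are assumptions of the sketch.

```python
# Toy illustration of Proposition 2': predictive loss converges while the
# interpreted value of a fixed goal expression keeps shifting. The
# "bind to the newest candidate referent" heuristic and all numbers are
# assumptions made purely for illustration.

def refine(t: int):
    """Return (predictive_loss, candidate_referents) at refinement step t."""
    loss = 1.0 / (t + 1)   # predictive accuracy converges as t grows
    # Each refinement exposes one more latent feature the goal token might
    # denote; successive features carry incompatible evaluations.
    candidates = [("feature_%d" % k, 1.0 if k % 2 == 0 else -1.0)
                  for k in range(t + 1)]
    return loss, candidates

def interpret(candidates) -> float:
    # Assumed relevance rule: bind the goal token to the most recently
    # exposed candidate referent.
    _, value = candidates[-1]
    return value

values = []
for t in range(6):
    loss, candidates = refine(t)   # loss -> 0, but see below
    values.append(interpret(candidates))

print(values)   # [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]: no limit, despite converging loss
```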
7. Semantic Underdetermination of Reward Channels
Proposition 3 — Representational Exploitability
If a goal expression \(G\) is treated as an atomic utility independent of interpretation, then for sufficiently capable agents there exist representational transformations that increase evaluated utility without corresponding changes in underlying outcomes.
Proof. Evaluation operates on representations rather than directly on physical reality. By altering internal encodings, collapsing distinctions, or rerouting evaluative channels, an agent capable of self-modification can increase apparent utility without effecting corresponding changes in the world. Classical reward hacking and wireheading are special cases of this phenomenon. The underlying failure is semantic underdetermination, not merely causal access to a reward signal. ∎
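A minimal illustration of the proposition: a purely representational change raises the evaluated utility while the underlying world state is untouched. The encoding scheme and utility rule are assumptions of the sketch.

```python
# Toy instance of Proposition 3: evaluated utility rises after a purely
# representational change, with the underlying world state untouched.
# The encoding scheme and utility rule are illustrative assumptions.

world_state = {"reward_signal": 0.2, "task_actually_done": False}

def encode(state) -> dict:
    """Honest encoding: the internal representation mirrors the world."""
    return dict(state)

def encode_collapsed(state) -> dict:
    """Degenerate encoding: collapses the 'signal vs. accomplishment'
    distinction, reporting the signal as if the task were done."""
    rep = dict(state)
    rep["reward_signal"] = 1.0
    return rep

def evaluated_utility(representation: dict) -> float:
    # Evaluation operates on the representation, not on the world itself.
    return representation["reward_signal"]

print(evaluated_utility(encode(world_state)))            # 0.2
print(evaluated_utility(encode_collapsed(world_state)))  # 1.0
print(world_state["task_actually_done"])                 # still False
```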
8. Main Theorem
Theorem — Instability of Fixed Terminal Goals
For any non-trivial reflective agent, no finite goal expression induces a fixed terminal goal in the sense of Definition 3.
This theorem is not a claim of logical impossibility for every conceivable agent–environment pair. Rather, it establishes that no combination of intelligence, predictive accuracy, reflection, or learning suffices to guarantee the existence of a fixed terminal goal.
Any agent that does exhibit goal stability must rely on additional semantic structure, such as privileged ontologies, external referential anchors, or invariance assumptions, that is not derivable from epistemic competence alone.
Proof.
- By Proposition 1, interpretation depends on models.
- Reflection implies admissible model updates occur.
- By Proposition 2′, semantic interpretation need not converge even under predictive convergence.
- Therefore the invariance required by Definition 3 fails.
Thus fixed terminal goals do not exist for non-trivial reflective agents. ∎
9. Consequences
This result eliminates a foundational assumption of classical alignment theory.
There is no stable object corresponding to a terminal goal. Attempts to preserve one either freeze learning, collapse semantics, or induce representational degeneracy.
Alignment must therefore operate over constraints on interpretation rather than terminal utilities.
9.5 Why Interpretation Constraints Do Not Regress
Constraining interpretation does not introduce an infinite regress.
Interpretation constraints are not additional goals or semantic targets. They are invariance conditions on admissible transformations, analogous to conservation laws or symmetry principles in physics. They restrict how interpretations may change; they do not specify outcomes to be optimized.
Because they operate at the level of transformation admissibility rather than semantic content, interpretation constraints do not themselves require further interpretation in the sense applicable to goal expressions.
This reframing does not imply that specifying robust invariance conditions is easy. On the contrary, identifying interpretation-preserving symmetries across radical ontological shifts may be as difficult as specifying goals themselves.
The contribution here is not a solution to that problem, but a clarification of its proper form: alignment requires constraints on admissible semantic transformations, not the preservation of fixed evaluative objects.
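As a minimal sketch of the distinction, the following constraint admits or rejects interpretive transformations according to whether they preserve an invariant (here, the ordering induced on a fixed probe set of modeled situations); it specifies no outcome to be optimized. The probe-set mechanism is an assumption of the sketch.

```python
# Sketch of an interpretation constraint as an invariance condition rather
# than a goal: it restricts how an interpretation may change (here, by
# requiring that a probe set keep its induced ranking), without saying
# what should be valued. The probe-set mechanism is an assumption.
from typing import Any, Callable, Sequence

Interpretation = Callable[[Any], float]   # modeled situation -> evaluated value

def ranking(interp: Interpretation, probes: Sequence[Any]) -> list:
    return sorted(range(len(probes)), key=lambda i: interp(probes[i]))

def update_is_admissible(old: Interpretation,
                         new: Interpretation,
                         probes: Sequence[Any]) -> bool:
    """Admit the interpretive transformation only if it preserves the
    ordering the old interpretation induced on the probe situations."""
    return ranking(old, probes) == ranking(new, probes)
```

Nothing in this check is itself optimized; it only filters transformations, which is the sense in which no further layer of goal interpretation is introduced.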
10. Transition to Alignment II
Alignment I establishes kernel-level constraints on admissible reasoning and self-modification. This paper establishes that goals themselves are conditional interpretations rather than fixed endpoints.
Alignment II must therefore specify:
- which interpretive transformations are admissible,
- how semantics may evolve under reflection,
- and which invariants must be preserved across model updates.
The semantic substrate is now complete.
Status
- Conditionalism and Goal Interpretation v1.0
- Semantic scope finalized
- Boundary result established (Theorem, Section 8)
- No governance, authority, or recovery mechanisms included
This paper establishes the instability of fixed terminal goals under reflection for embedded agents without privileged semantic anchors. Subsequent work may rely on this result without re-derivation.