Aligning an H-JEPA agent via training on the outputs of an LLM-based "exemplary actor"
This post includes the overview and conclusion of the article posted on LessWrong.
Overview
In section 2, I describe the “exemplary actor”, an LMCA (language model cognitive architecture) that takes a simple, “brute force” approach to alignment: a powerful LLM (think GPT-5/6 level, with a vast or quasi-unlimited context) is given a list of “approved” textbooks on methodological and scientific disciplines: epistemology, rationality, ethics, physics, etc. The LLM is also given tools: narrow AIs (e.g., for protein folding, for predicting properties of materials, or for formal scientific modelling). Finally, the LLM is given a compute engine such as Wolfram and a knowledge base such as Wikidata or Wolfram Knowledgebase.
The exemplary actor creates plans or predictions for given situations (described in language and fed to the LLM underlying the exemplary actor as prompts) and iteratively critiques and refines these plans and predictions while putting different textbooks into the LLM’s context (first the textbook on rationality, then epistemology, then physics, and so on, with potentially dozens of textbooks relevant to the plan or prediction being criticised), over many iterations, until convergence.
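To make this critique-and-refinement loop concrete, here is a minimal Python sketch. The single-string LLM interface, the prompt wording, and the crude convergence check are illustrative assumptions on my part, not part of the exemplary actor's specification.

```python
from typing import Callable, List

def exemplary_actor_plan(
    llm: Callable[[str], str],   # hypothetical interface: prompt in, completion out
    situation: str,
    textbooks: List[str],        # "approved" textbooks to place into the LLM context
    max_rounds: int = 10,
) -> str:
    """Sketch of the iterated critique-and-refinement loop (assumptions noted above)."""
    plan = llm(f"Situation:\n{situation}\n\nDraft a plan or prediction for this situation.")
    for _ in range(max_rounds):
        revised = plan
        # Critique and refine the plan against each textbook in turn
        # (rationality, epistemology, physics, ...), one textbook per LLM context.
        for textbook in textbooks:
            critique = llm(
                f"Textbook:\n{textbook}\n\nPlan:\n{revised}\n\n"
                "Criticise the plan wherever it conflicts with the textbook."
            )
            revised = llm(
                f"Plan:\n{revised}\n\nCritique:\n{critique}\n\n"
                "Refine the plan to address the critique."
            )
        if revised == plan:  # crude convergence check; a real system would need a better one
            break
        plan = revised
    return plan
```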
In section 2.1, I note that the type of alignment that the exemplary actor’s architecture tries to ensure is called (world) model alignment and that it is stronger and also more essential than goal alignment.
Then, I discuss the properties of the exemplary actor. In section 2.2, I discuss what I see as likely non-issues or straightforwardly addressable issues: the “divergent reasoning nature” of LLMs, the lack of grounded common sense reasoning, and the bias of the quick reactive network (“System 1”), if it is added to the architecture to make it more practically usable in lower-stakes reasoning settings.
In section 2.3, I discuss the outstanding technical issues and risks of the exemplary actor’s architecture:
The risk of direct access to the underlying LLM (section 2.3.1).
The exemplary actor’s reasoning could still be partially directed by “alien” thinking patterns (i.e., the world model) of the underlying LLM even though these influences won’t surface in the explanations of the plan (section 2.3.2).
Iterated critique and refinement probably won’t make plans strictly conform to the theories described in the textbooks (section 2.3.3).
In section 2.3.4, I discuss the alignment tax of the exemplary actor (compared with the baseline of a bare, minimally fine-tuned LLM) and conclude that the main source of alignment tax might turn out to be the theory of ethics, which may force the exemplary actor to refuse to participate in “games” (i.e., real-world situations and environments) where it doesn’t see ethical ways of “winning”, and thus to consider inaction (or some form of palliative action) the only ethical way forward. This is not a technical problem with the exemplary actor per se, but rather a problem with a higher-level system, i.e., the current economic, social, and political structure of the world. I mention this and other kinds of “higher-level” risks of the plans to build and deploy the exemplary actor (i.e., roughly the plans that OpenAI and Anthropic are betting on, as it seems to me) in section 2.4.
In section 3, I describe how the H-JEPA (Hierarchical Joint-Embedding Predictive Architecture) architecture proposed by LeCun (2022) could be modified to generate action plans conforming to the world model (and, therefore, values) exhibited by the exemplary actor, described in section 2. (In turn, this world model should follow the body of scientific knowledge described in the textbooks if we find some ways to address the problems discussed in sections 2.3.2 and 2.3.3.)
The key idea of the proposal is to treat H-JEPA’s plans (in the space of representations) as latents for textual descriptions and explanations of these plans and use GFlowNet-EM (Hu et al., 2023) algorithms to train a set of policies, including a policy to generate a textual description and an explanation from a plan, and a reverse policy to generate a plan (in the space of representations) from the textual description. The training samples (textual descriptions of plans) for the reverse policy could be generated by the exemplary actor for an unlimited number of imaginary situations.
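As a very rough illustration of the plans-as-latents idea (and not of the actual GFlowNet-EM objective from Hu et al., 2023, which trains sampling policies rather than point estimates), here is a PyTorch sketch pairing a “reverse” policy (embedded textual description → plan latent) with a “forward” policy (plan latent → text space). The module names, dimensions, and the reconstruction loss are all my assumptions.

```python
import torch
from torch import nn

class TextToPlan(nn.Module):
    """'Reverse' policy: embedded textual description -> plan latent (illustrative)."""
    def __init__(self, text_dim: int, plan_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, plan_dim))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.net(text_emb)

class PlanToText(nn.Module):
    """'Forward' policy: plan latent -> textual-description space (illustrative)."""
    def __init__(self, plan_dim: int, text_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(plan_dim, 256), nn.ReLU(), nn.Linear(256, text_dim))

    def forward(self, plan: torch.Tensor) -> torch.Tensor:
        return self.net(plan)

def training_step(text_emb: torch.Tensor,
                  text_to_plan: TextToPlan,
                  plan_to_text: PlanToText,
                  optimizer: torch.optim.Optimizer) -> float:
    """One step on a batch of embedded plan descriptions produced by the
    exemplary actor for imaginary situations."""
    plan = text_to_plan(text_emb)   # infer the latent plan from the description
    recon = plan_to_text(plan)      # map the plan back to the description space
    loss = nn.functional.mse_loss(recon, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the actual proposal, the plan latents would live in H-JEPA's representation space and the forward policy would generate textual descriptions and explanations token by token; the optimiser here is assumed to cover the parameters of both networks.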
In section 3.2, I note that in this hybrid architecture (called "H-JEPA agent with GFlowNet actors" below), the Cost module as proposed by LeCun becomes entirely unnecessary and could be discarded. The problems of combining the “intrinsic” and “trainable” (i.e., pro-social and ethical) costs also go away together with this module.
In section 3.4, I discuss how training GFlowNet policies within H-JEPA from the output of the LLM-based exemplary actor could be orders of magnitude more expensive than training the LLM underlying the exemplary actor. (In section 4.5, I further note that the H-JEPA agent with GFlowNet actors would be cheaper and faster at inference time than the exemplary actor, and therefore the whole idea could still be economically viable. However, I don’t see this discussion as very relevant for safety and x-risk.)
In section 3.5, I explain why the plans should be latents for their explanations. This idea may seem surprising at first glance, but it makes sense for safety because this is close to how humans actually plan and predict (i.e., justify sub-linguistic inferences with verbal explanations rather than use language reasoning for inference), and making AI’s thinking more “human-like” is generally considered good for safety. Also, it is not obvious that linguistic reasoning is more robust than sub-linguistic reasoning in the representation space.
In section 4, I discuss the properties of the proposed H-JEPA agent with GFlowNet actors.
Thanks to the integration of JEPA’s predictive loss (which is ultimately grounded with information from the sensors) with the “language modelling” loss of GFlowNet policy training, the H-JEPA agent with GFlowNet actors should be more grounded than the LLM-based exemplary actor (section 4.1). So, I guess that LeCun would endorse this architecture because he considers the lack of grounding a big weakness of LLM reasoning in general (Browning & LeCun, 2022), although many people disagree with him on this question, and I tend to side with those who don’t see grounding as a big issue for LLMs, as I noted in section 2.2.
In section 4.2, I note that interpretability shouldn’t be an issue for the exemplary actor in the first place (we assume that the world model of the exemplary actor is described in the textbooks), so cloning its behaviour into GFlowNet actors doesn’t provide the interpretability benefit of separating the world model from the “inference machine”, as discussed by Bengio and Hu (2023).
In section 4.4, I discuss whether the H-JEPA agent with GFlowNet actors could be used to bootstrap future versions of itself, and note that it will remain dependent on a powerful LLM for training “forever”.
In section 4.6, I explain how training GFlowNet actors with initial and intermediate plans from the exemplary actor’s critique and refinement process as “negative” contrastive examples helps to address the risk of the GFlowNet actors learning to generate “good-sounding” explanations for arbitrary (perhaps self-serving and misaligned) plans.
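A hypothetical sketch of how such contrastive negatives could enter a training objective: an InfoNCE-style loss that scores the explanation against the final refined plan (positive) and the initial/intermediate drafts (negatives). The `score_fn` compatibility model and the loss form are my illustrative assumptions; the actual GFlowNet training objective differs.

```python
from typing import Callable, List

import torch
import torch.nn.functional as F

def contrastive_plan_loss(
    explanation_emb: torch.Tensor,        # (batch, emb_dim) embedded explanation
    positive_plan: torch.Tensor,          # final refined plan, (batch, plan_dim)
    negative_plans: List[torch.Tensor],   # earlier drafts, each (batch, plan_dim)
    score_fn: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],  # -> (batch,) compatibility
) -> torch.Tensor:
    """InfoNCE-style loss: the explanation should score the final plan above earlier drafts."""
    pos = score_fn(explanation_emb, positive_plan)                                    # (batch,)
    neg = torch.stack([score_fn(explanation_emb, n) for n in negative_plans], dim=1)  # (batch, k)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                                # (batch, 1 + k)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)     # positive at index 0
    return F.cross_entropy(logits, targets)
```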
Conclusion
In this article, I described a theoretical method for aligning the H-JEPA architecture via training GFlowNet policies on the outputs of the LLM-based "exemplary actor" (i.e., an aligned LMCA). An additional benefit of this method is that it combines the “intrinsic” and “pro-social” costs in a principled way (i.e., according to some scientific theory of ethics). The distinction between the intrinsic and trainable (i.e., pro-social, moral) costs made in the original H-JEPA architecture proposed by LeCun (2022) was itself reductionistic. It was also proposed to combine these costs by simply adding them, which could fail in some unusual situations, such as situations that demand that the agent sacrifice itself.
However, this method implies that an aligned LMCA (the exemplary actor) already exists, which may seem like pushing the difficulty of the alignment problem forward into this LMCA.
The assumption of the existence of an aligned LMCA may also seem contrived considering that LeCun mainly presents the H-JEPA architecture as an alternative to the current auto-regressive LLM fad in AI. This probably means that he doesn’t believe that auto-regressive LLMs will keep becoming more capable (more robust and indirectly grounded) through further scaling and some tweaks in training. Therefore, it also probably means that LeCun doesn’t believe that LMCA is a viable architecture for (alignable) AGI.
The only aspect in which the x-risk profile of the H-JEPA agent with GFlowNet actors seems to be qualitatively different from that of the exemplary actor is the risk of direct access to the underlying Transformers, which is catastrophic in the case of the exemplary actor (section 2.3.1) and perhaps could be addressed completely in GFlowNet actors if we accept that they will deliberately dumb themselves down in strategic x-risk analysis and planning (section 4.3). However, even if we accept this tradeoff, it might not reduce the overall x-risk to civilisation, because GFlowNet actors are not “self-sufficient” for training (section 4.4), and therefore the powerful LLM underlying the exemplary actor that is used to train the GFlowNet actors would still have to be kept around, and the risk of direct access to this LLM would remain.
Thus, the H-JEPA agent with GFlowNet actors could become interesting perhaps only if the “LLM optimism” view proves to be correct (and thus LMCAs could generally work and be satisfactorily aligned), but sensory grounding also proves to be a really important missing piece of the puzzle. (Though this combination of conditionals looks rather unlikely to me.) The proposed variant of H-JEPA combines “the best of both worlds”: grounding from H-JEPA and aligned reasoning from the LMCA.