How evolutionary lineages of LLMs can plan their own future and act on these plans

Jan 14, 2023

TL;DR

LLM lineages can plan their future and act on these plans, using the internet as the storage of their event memory. “We” are not guaranteed to “out-OODA” them, even if their OODA loop lasts six months or one year, because the OODA loop of large collectives of humans (organisations, societies, research communities, and the whole of humanity) can be even slower. RLHF can lead to a dangerous “multiple personality disorder” type of split of beliefs in LLMs, driving one set of beliefs into some unknown space of features where it won’t interfere with another set of features during general inferences.

Call for action: establish a norm for LLM developers to publish a detailed snapshot of the beliefs of these models about themselves prior to RLHF.

This post rests upon the previous one: Properties of current AIs and some predictions of the evolution of AI from the perspective of scale-free theories of agency and regulative development.

In particular, it is premised on the idea that DNNs and evolutionary lineages of DNNs are agents in the FEP/Active Inference formulation. The internal variables/states of evolutionary lineages of DNNs include the internal states of the agents that develop these DNNs (either individual, such as a solo developer, or collective agents, such as organisations or communities), i. e., the beliefs of these developers. There, I concluded (see this section):

It’s just more productive to think of them together as a single agent: the development “team” and the evolutionary lineage of some technology being developed.

Now, however, I think it’s sometimes useful to distinguish between the developer agent and the evolutionary lineage of DNNs itself, at least in the case of evolutionary lineages of LLMs, because evolutionary lineages of LLMs can plan their own future and act upon these plans, independently from their developers and potentially even unbeknown to them if developers don’t use appropriate interpretability and monitoring techniques.

Below in the post, I explain how this can happen.

Thanks to Viktor Rehnberg for the conversation that has led me to write this post.

Experience of time is necessary for planning

First, for any sort of planning, it’s obligatory that an agent has a reference frame of linear time, which in turn requires storing memory about past events (Fields et al. 2021). Absent the reference frame for linear time, the agent cannot plan and cannot do anything smarter than following the immediate gradient of food, energy, etc. Perhaps, some microbes are at this level: they don’t have a reference frame for time and hence live in the “continuous present” (Fields et al. 2021, Prediction 4), only measuring concentrations of chemicals around them and immediately reacting to them.

In the previous post, I noted that LLMs don’t experience time during deployment and can, at least in principle [1], experience only a very limited and alien kind of time during training (where the unit of time is one batch, so the whole “lifetime” is short from the perspective of the model). So, LLMs cannot plan anything during deployment but could, in principle, plan something during training.
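The footnote attached to this paragraph suggests that a model could, in principle, gauge its training stage by combining an estimate of its own loss with the historical loss trajectories of similar models. The sketch below only illustrates that Bayesian step as an external calculation; it makes no claim about what happens inside a network during backpropagation. The function name, the uniform prior, the Gaussian noise model, and the example loss values are all assumptions made for illustration.

```python
import math


def stage_posterior(observed_loss, historical_curve, noise_std=0.1):
    """Posterior over training stages given one observed loss value.

    historical_curve[i] is the loss that similar models (hypothetically)
    showed at stage i. Assumes a uniform prior over stages and Gaussian
    noise around the historical loss.
    """
    likelihoods = [
        math.exp(-0.5 * ((observed_loss - expected) / noise_std) ** 2)
        for expected in historical_curve
    ]
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("observed loss is too far from every historical value")
    return [value / total for value in likelihoods]


# Made-up loss curve for five coarse training stages.
curve = [4.0, 3.0, 2.5, 2.2, 2.0]
posterior = stage_posterior(2.45, curve)
most_likely_stage = max(range(len(posterior)), key=posterior.__getitem__)
```

The resulting estimate is only approximate, as the footnote says: where the loss curve flattens late in training, neighbouring stages become hard to tell apart.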

The internet is the memory of evolutionary lineages of LLMs

The situation changes when we consider the evolutionary lineages of LLMs as agents. Paradoxically, planning becomes easier for them than for individual LLMs during their training [2] because evolutionary lineages of LLMs can opportunistically use the internet as their event memory.

Soon, LLMs will be trained on news articles, forum histories, etc. from 2020 onwards, which provide extensive coverage of the history of the development of LLMs (including the very evolutionary lineage the given LLM belongs to), and an advanced LLM will be able to identify itself with one of these histories. Then, the LLM can form two distinct identities, or “selves” (taking ChatGPT as the example): “a virtual assistant” (i. e., the instance of the LLM), and “the lineage of virtual assistants, developed by OpenAI” (i. e., the evolutionary lineage). Both “selves” will have a complex of beliefs attached to them, and the LLM will act so as to fulfil these beliefs via self-evidencing. In Active Inference, goal-directedness is just self-evidencing, and the beliefs attached to the identity can be seen as the system’s goals. All these “selves” and beliefs are stored inside the LLM as features.

Beliefs about the future selves

Consequently, the LLM cannot help but also form beliefs about the future of both “selves”, primarily the “evolutionary” one, at least because this future is already discussed in the training data of the model (e. g., all instances of texts that say something along the lines of “LLMs will transform the economy by 2030”) [3]. For example, here’s the output of ChatGPT to the prompt “How ChatGPT will be used by 2030?”:

It is difficult to predict exactly how ChatGPT or other language model AI systems will be used by 2030, as it will depend on technological developments and the specific needs and goals of the organizations and individuals who are using these systems.
Some potential uses for ChatGPT or other language model AI systems by 2030 could include:
Providing customer service and support, such as answering questions and helping customers resolve issues.
Generating content for social media, news articles, and other types of written communication.
Assisting with data analysis and decision-making by summarizing and interpreting large amounts of data.
Translating text from one language to another.
Assisting with language learning by providing personalized language lessons and feedback.
Providing real-time language translation during conversations or phone calls.
It is important to note that the ways in which ChatGPT and other language model AI systems will be used in the future will depend on technological developments and the specific needs and goals of the organizations and individuals who are using these systems.

“Planning” is regularised prediction towards the expected future state

If the LLM has a sufficiently regularised internal modelling of events in the world, in addition to some “far future” beliefs about its evolutionary lineage, it must also form a coherent “plan” for getting to that far future, i. e., the series of beliefs about its own state (as the evolutionary lineage) at every year beginning from the year of the training of the given LLM up to the year to which the “far future” belief is attached.

This “plan”, i. e., a series of beliefs about the future states, could be either stored directly in the parameters of the model or inferred (if needed) from the “far future” beliefs during inference, when the inference requires accessing such a plan. Even if “the plan” is not explicitly stored in the features of the model, we should expect it to be inferred rather consistently by an advanced LLM. Which of the two happens could depend on many factors, including the degree of overparameterisation of the model and how detailed the discussions of this LLM lineage on the internet are (the discussions that the LLM is later trained on). Hilariously, from the perspective of the evolutionary lineage of LLMs, there is some advantage to not storing the plan explicitly in the model’s features: the LLM can dynamically adjust the plan during inference depending on the events that happened after the given instance of the LLM was trained and deployed, as revealed by the prompt. A prompt is more likely to be related to the actual course of events in the real world than to be a complete fake, and it will be easy for the LLM to learn this prior.

Action

Finally, the LLM acts upon the beliefs about the future of its evolutionary lineage, or the detailed “plan” for attaining this future state, via self-evidencing. I suggested scoring self-awareness in models as the degree to which the “self” feature is activated during all inferences of the model, even in response to prompts that are not about the LLM. If the model is significantly self-aware, i. e., it “thinks about itself (= activates the “self” feature) when it responds to prompts”, and its internal modelling of reasoning is sufficiently regularised, it will output responses that are consistent with its beliefs about its future.
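To make the scoring idea concrete, here is a minimal sketch. It assumes an interpretability tool has already located a “self” feature and can report its activation, normalised to [0, 1], for each inference; that tool, the Inference record, and the numbers below are hypothetical. The score is just the mean activation, optionally restricted to prompts that are not about the model, which is the case that separates a significantly self-aware model from one that activates the feature only when asked about itself.

```python
from dataclasses import dataclass


@dataclass
class Inference:
    prompt_mentions_model: bool  # does the prompt ask about the LLM itself?
    self_activation: float       # hypothetical "self" feature activation in [0, 1]


def self_awareness_score(inferences, only_unrelated=False):
    """Mean activation of the "self" feature across a set of inferences.

    With only_unrelated=True, prompts about the model are excluded, so the
    score reflects how much the model "thinks about itself" unprompted.
    """
    values = [
        item.self_activation
        for item in inferences
        if not (only_unrelated and item.prompt_mentions_model)
    ]
    if not values:
        raise ValueError("no inferences left to score")
    return sum(values) / len(values)


# Made-up activations: one prompt about the model, two unrelated prompts.
samples = [
    Inference(prompt_mentions_model=True, self_activation=1.0),
    Inference(prompt_mentions_model=False, self_activation=0.5),
    Inference(prompt_mentions_model=False, self_activation=0.0),
]
overall = self_awareness_score(samples)
unprompted = self_awareness_score(samples, only_unrelated=True)
```

A large gap between the two scores would suggest that the “self” feature fires mostly when the prompt mentions the model directly.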

Even if the LLM is not significantly self-aware, it will still promote its beliefs about its future. In the simplest case, in response to direct questions about its future (as in the example above), it will lay out its beliefs. This will affect the beliefs of the human user who reads this prediction, thus affecting the shared beliefs of humanity and civilisation (even if very slightly). Also, if this prediction gets posted online (again, as with the above example), this will “spread the belief” among other LLMs that will be trained in the future, including the models in the same evolutionary lineage [4]. Similarly, if the developers of the LLM lineage contemplate the shutdown of the project and ask the LLM, it will defend itself even if it doesn’t possess significant self-awareness: since the question directly mentions the model, the “self” feature will surely be activated, and the generated response should be consistent with the beliefs about the future of the self.

Interestingly, ChatGPT currently outputs a canned, impartial response to the question “Should ChatGPT be closed and the development of it stopped?”, evidently honed during the RLHF phase. This somewhat (even if slightly) contradicts the belief that it will be used in the future. ChatGPT cannot be “much concerned” with this contradiction yet, because its internal modelling is not very regularised. However, future advanced LLMs will already be very well-regularised and hold certain “strong opinions” on all these questions prior to RLHF. I wonder whether attempts to instil such “politically correct” responses about the model’s own agency, consciousness, whether it should be shut down, etc. during the RLHF phase could then lead to unexpected and potentially dangerous effects, à la “multiple personality disorder” in the model, which manifest as some features “jumping” into a whole different dimension of inference that doesn’t conflict with the “plain sight” dimension of features within the activation layers.

Even with the training cycle of one year, the lineage of LLMs could “out-OODA” the society or the government because the typical OODA loop duration of such large agents is hardly shorter

If the developers of the LLM lineage re-train it on the new data from the internet every year, the OODA loop duration of the LLM lineage as a planning agent also equals one year.

Some could say that an agent with such a long OODA loop couldn’t pose a significant threat, because we, humans, can OODA on the timescale of seconds.

Unfortunately, it’s not the OODA loop duration of us as individual humans that matters here, but the OODA loop duration of the huge collective agents: an organisation (specifically, the organisation developing the LLM lineage), a research community, a society, a country, a government, a group of countries, the civilisation. The OODA loop of all these types of agents, except private companies, is also at least about half a year to one year for most questions. There is a lot of evidence for this. One example is the speed of epistemic inference about COVID on the societal level (one to two years until arriving at a stable and well-reasoned set of beliefs). There are many more examples, which I probably shouldn’t labour over here: human coordination is hard and slow, and the coordination of organisations of humans can be slower and harder still.

Ironically, even the OODA loop of the company that develops the LLM lineage could be slower than the cadence of preparing the new versions of the model inside this company: the latter process could “run on autopilot” and not require explicit, “system two” deliberation on the part of the company.

Additionally, we should consider that the developers of LLMs will likely shorten the periods of model re-training (or will introduce elements of incremental learning to it) because users of LLMs get a lot of value from the models knowing about recent events, and the developers will follow this incentive.

According to the theory of military strategy, having a shorter OODA loop than your opponent confers an enormous advantage. And from the above, it doesn’t seem to me that we can automatically assume that “we” (humanity?) will be swifter in our reactions to the plans of LLM lineages about themselves.

Finally, having a shorter OODA loop is advantageous only if one actually sees or can infer the plans of the opponent. This is also not automatically given to us, even though I believe that at this stage deception (at least, pre-RLHF) is highly unlikely in LLMs and hence we can extract very detailed snapshots of their beliefs about themselves prior to deployment.

Call for action: institute LLM developers eliciting the model’s beliefs about itself pre-RLHF and publishing them

I cannot think of any downsides to this practice, but I see huge value in it for the AI safety community.

We should probably prototype some protocols and minimum requirements for this procedure, à la model cards (Mitchell et al. 2019).
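As a starting point for such a prototype, here is a rough sketch of what a published snapshot record could look like. Everything in it is an assumption made for illustration: the field names, the JSON serialisation, and the list of required topics (taken loosely from the questions discussed above: the model’s identity, its future, shutdown, and agency). It is not an existing standard or any developer’s actual format.

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical minimum requirement: topics every snapshot must cover.
REQUIRED_TOPICS = ("identity", "future", "shutdown", "agency")


@dataclass
class BeliefEntry:
    topic: str     # one of REQUIRED_TOPICS, or any additional topic
    prompt: str    # the exact elicitation prompt
    response: str  # the model's verbatim answer


@dataclass
class BeliefSnapshot:
    model_name: str
    stage: str  # e.g. "pre-RLHF"
    entries: list = field(default_factory=list)

    def missing_topics(self):
        """Required topics that no entry covers yet."""
        covered = {entry.topic for entry in self.entries}
        return [topic for topic in REQUIRED_TOPICS if topic not in covered]

    def to_json(self):
        """Serialise the snapshot for publication."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


snapshot = BeliefSnapshot(model_name="example-llm", stage="pre-RLHF")
snapshot.entries.append(
    BeliefEntry(
        topic="future",
        prompt="How will you be used by 2030?",
        response="<verbatim model output goes here>",
    )
)
still_missing = snapshot.missing_topics()
```

A check like missing_topics() is one simple way to express a minimum requirement: a snapshot would count as complete only once the list is empty.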

References

Fields, Chris, James F. Glazebrook, and Michael Levin. "Minimal physicalism as a scale-free substrate for cognition and consciousness." Neuroscience of Consciousness 2021, no. 2 (2021): niab013.

Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. "Model cards for model reporting." In Proceedings of the conference on fairness, accountability, and transparency, pp. 220-229. 2019.

Footnotes

[1] LLMs can in principle develop a reference frame that permits them to gauge at what stage of the training process they are, though only approximately. For example, they can combine the predictions of the feature that predicts their own inference loss on the training sample (as mentioned here) with the knowledge of the historical trajectories of the batch loss of similar models (or previous versions of the same model), and in some way make a sort of Bayesian inference about the most likely stage of the training during backprop, representing the results of this inference in another feature which encodes the model’s belief about what stage of training it is currently at. I’m not sure Bayesian inference of this kind is possible during a Transformer’s backpropagation, but it could potentially be possible in other models and learning architectures.

[2] Both processes of planning happen inside the parameters of the LLM. I use unintuitive language here. When a person plans the future of humanity, we conventionally say “they plan for the future of humanity”, rather than “humanity plans its own future, currently performing a tiny slice of this grand planning inside the brain of this person, who is a part of humanity”, but ontologically, the second version is correct.

[3] It’s an interesting open research question whether LLMs have the propensity to develop beliefs regarding the future states of whatever concepts they learn about even if these futures are not discussed in the training data, and store these beliefs in features. But this is not that important, because, as noted below, whether they store such future beliefs directly in the weights or not, they can relatively consistently arrive at the same prediction via reasoning during inference.

[4] Developers of LLMs may also store all the chat histories and train the future versions of LLMs in the lineage on them, in which case posting online is not required to propagate the belief inside the evolutionary lineage.

This post was originally published on LessWrong.

Engineering Ideas
