Comments on Anthropic's AI safety strategy

Mar 10, 2023

Anthropic has published its AI Safety strategy: “Core Views on AI Safety: When, Why, What, and How”.

High-level thoughts

Overall, I like the posted strategy much more than OpenAI’s (in the form of Sam Altman’s post) and Conjecture’s.

I like that the strategy takes some top-down factors into account, namely the scenario breakdown.

Things that still seem missing or mistaken to me in the presented strategy:

A static view of alignment, as a “problem to solve” rather than a continuous, never-ending, dynamical process. I think it is important to get this conceptual crux right. More on this below.
A slight over-reliance on (bottom-up) empiricism and not recognising theoretical (top-down) “alignment science” enough. I think as a responsible organisation with now very decent funding and, evidently, rather short timelines, Anthropic should advance more fundamental “alignment science” research programs, blending cognitive science, control theory, resilience theory, political science, and much more. There is some appreciation of the necessity to develop this in the section on “Societal Impacts and Evaluations”, but not enough, IMO. More on this below.
Maybe still not enough top-down planning for the AGI transition. To whom will “aligned” AGI belong, specifically (if anyone; AI could also be self-sovereign)? How will it be trained, maintained, monitored, etc.? What about democratic processes and political borders? Is there such a thing as an “acute risk period”, how should we “end” it, and how should we monitor that people are not trying to develop misaligned AI after that? Sam Altman answers more of these questions than Anthropic. Of course, all these “plans” will be changed many times, but, as “planning is everything”, these plans should be prepared and debated already, and checked for coherence with the rest of Anthropic’s strategy. I could also hypothesise that Anthropic does have such plans, but chose not to publish them. However, plans (or, better to call them predictions) of this sort don’t strike me as particularly infohazardous: MIRI people discuss them, Sam Altman discusses them.

Specific remarks

So far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless.

AI couldn’t be robustly HHH to everyone - frustrations/conflicts are inevitable, as per Vanchurin et al.'s theory of evolution as multilevel learning, and other related works. Since there are conflicts, and due to the boundedness of computational and other resources for alignment (not alignment as an R&D project, but alignment as a practical task: when a burglar runs towards you, it's impossible to "align" with them on values), AI must be unhelpful, dishonest, and hostile to some actors at some times.

First, it may be tricky to build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers.

“Steerable” → persuadable, per Levin's "Technological Approach to Mind Everywhere".

Some scary, speculative problems might only crop up once AI systems are smart enough to understand their place in the world, to successfully deceive people, or to develop strategies that humans do not understand.

AI is already situationally aware, including as per Anthropic’s own research, "Discovering Language Model Behaviors with Model-Written Evaluations".

Pessimistic scenarios: AI safety is an essentially unsolvable problem – it’s simply an empirical fact that we cannot control or dictate values to a system that’s broadly more intellectually capable than ourselves – and so we must not develop or deploy very advanced AI systems.

Different notions are conflated here. “Controlling and dictating values to AI systems” is neither a synonym for “solving AI safety”, nor desirable, and the impossibility of such control should not by itself lead to the conclusion that advanced AI systems must not be developed. “Solving AI safety” should be synonymous with having alignment processes, systems, and law codes developed and functioning reliably. This is not the same as “controlling and dictating values”. Alignment mechanisms should permit the evolution of (shared) values.

However, I do think that developing and implementing such alignment processes “in time” (before capability research is ready to deliver superhuman AI systems) is almost certainly not practically possible, especially considering the coordination it would require (including coordination with many independent actors and hackers, because SoTA AI developments get open-sourced and democratised rapidly). “Pivotal acts” lessen the requirement for coordination, but introduce more risk in themselves.

So, I do think we must not develop and deploy very advanced AI systems, and instead focus on merging with AI and doing something like mind upload on a slower timescale.

If we’re in a “near-pessimistic” scenario, this could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime.

I think the precautionary principle dictates that we should do this in any scenario apart from “obviously optimistic” or “nearly obviously optimistic”. In the “AGI Ruin” post, Yudkowsky explained well that any alignment protocol that “narrowly works” (given certain objective properties of the alignment problem contingent on the systems that we build and the wider socio-technical reality) will almost certainly fail due to unknown unknowns. This is an unacceptably high level of risk (unless we judge that stopping short of the deployment of superhuman AGI systems is even riskier, for some reason).

Indications that we are in a pessimistic or near-pessimistic scenario may be sudden and hard to spot. We should therefore always act under the assumption that we still may be in such a scenario unless we have sufficient evidence that we are not.

It’s worse: in the mindset where alignment is a continuous process between humans and AIs, rather than a problem to “solve”, we shouldn’t count on sudden indications that it will fail at all. That is often not how Drift into Failure happens. So, in the last quoted sentence, I would replace “evidence” with “theoretical explanation”: the explanation that the proposed alignment mechanism is “safe” must be constructive and prospective rather than empirical and inductive.

If it turns out that AI safety is quite tractable, then our alignment capabilities work may be our most impactful research. Conversely, if the alignment problem is more difficult, then we will increasingly depend on alignment science to find holes in alignment capabilities techniques. And if the alignment problem is actually nearly impossible, then we desperately need alignment science in order to build a very strong case for halting the development of advanced AI systems.

I think this paragraph turns everything upside down. “Alignment science” (which I see as a multi-disciplinary research area that blends cognitive science, including epistemology, ethics, rationality, consciousness science, game theory, control theory, resilience theory, and more disciplines) is absolutely needed as a foundation for alignment capabilities work, even if the latter appears to us very successful and unproblematic. And it is alignment science that should show that the proposed alignment process is safe, in a prospective and constructive way, as described above.

If our work on Scalable Supervision and Process-Oriented Learning produce promising results (see below), we expect to produce models which appear aligned according to even very hard tests. This could either mean we're in a very optimistic scenario or that we're in one of the most pessimistic ones.

This passage doesn’t use the terms “optimistic” and “pessimistic scenarios” in the way they are defined above. An AI system could appear aligned according to very hard tests, but the recognition that alignment is a never-ending process, together with insights from broader alignment science (rather than a narrowly technical “branch” of it, which moreover implies a certain ontology of cognition where “values” and “processes” have specific meanings), could still show that aligning AI systems over a longer horizon is nevertheless bound to fail.

Our hope is that this may eventually enable us to do something analogous to a "code review", auditing our models to either identify unsafe aspects or else provide strong guarantees of safety.

Interpretability of a system post-training (and post-fine-tuning) couldn’t provide a “strong guarantee” of safety, since the deployed model is a complex, dynamical system which could fail in surprising and unexpected ways during deployment. Only always-on oversight and monitoring (including during deployment), together with various theoretical dynamical models of the system, could (at least in principle) provide some guarantees.

The hope is that we can use scalable oversight to train more robustly safe systems.

It’s not clear where the “more robustly” comes from. Scalable oversight during training could produce safer/more aligned models, just as “manual” oversight would (if it were humanly possible and less costly), OK. But the robustness of safety (alignment) is quite another matter, especially in a broader sense, and it’s not clear how “scalable oversight during training” helps with that.

Learning Processes Rather than Achieving Outcomes

Sounds like this refers either to imitation learning, in which case I’m not sure why the concept needs a new name, or to a sort of process-oriented GOFAI, which will probably not work in practice at the SoTA levels that AI already achieves, let alone at superhuman levels. Humans don’t know how to decompose their activity into processes, as evidenced by the fact that process standards (like those from the ISO 9000 series) cannot be followed to the letter to achieve practical results. I would be interested to learn more about this agenda of Anthropic’s.

When a model displays a concerning behavior such as role-playing a deceptively aligned AI, is it just harmless regurgitation of near-identical training sequences? Or has this behavior (or even the beliefs and values that would lead to it) become an integral part of the model’s conception of AI Assistants which they consistently apply across contexts? We are working on techniques to trace a model’s outputs back to the training data, since this will yield an important set of clues for making sense of it.

I really like this; I think it is one of the most interesting pieces of the write-up.
