Anthropic’s release of Claude Mythos Preview under Project Glasswing deserves attention for reasons that go beyond the cyber headlines. Yes, the company says the model can identify and exploit serious software flaws at a level high enough to justify limiting access to selected defenders. But instead of approaching Mythos solely through the lens of misuse, policymakers should also examine it through the lens of AI Loss of Control risk indicators. The question is not just what a malicious cyber actor could ask the model to do, but what a highly capable model will do when given a difficult goal, some autonomy, and imperfectly enforced constraints. The Mythos system card released earlier this month points to worrying signs. Perhaps most disturbingly, we could face the combination of the two risks: unauthorized users could access Mythos and, deliberately or inadvertently, trigger a systemic failure or a loss of control incident.
One of the most important sentences in the Mythos Preview system card is the company’s admission that Mythos appears to be both its best-aligned model to date and the model with the greatest alignment-related risk factor. “How can these claims all be true at once?” Anthropic asks, anticipating the obvious question. The short answer is that more capable models can do more harm if they fail. The long answer is the analogy Anthropic puts forward:
“Consider the ways in which a careful, seasoned mountaineering guide might put their clients in greater danger than a novice guide, even if that novice guide is more careless: The seasoned guide’s increased skill means that they’ll be hired to lead more difficult climbs, and can also bring their clients to the most dangerous and remote parts of those climbs. These increases in scope and capability can more than cancel out an increase in caution.”
This analogy is faulty because it mischaracterizes AI risk as a matter of negotiated, interpersonal trust between an expert and a client (in essence, human interaction) rather than of operational dependencies in human-machine interaction, such as the relationship between a pilot and a complex aircraft. By framing AI as a seasoned guide, the text subtly anthropomorphizes the software, suggesting that risk is a voluntary choice made by “adventurers” seeking the peaks of innovation. In reality, AI functions less like a guide and more like the engine of a high-performance commercial aircraft. We aren’t clients following a guide up a mountain we can choose to avoid, but passengers on a flight where the AI is the engine and, to some extent, the plane itself.
As Anthropic’s Alignment Risk Update states, “the risk of significantly harmful outcomes that are substantially enabled by Mythos Preview’s misaligned actions is very low, but higher than for previous models.” This increased risk is perhaps due in part to the model’s capabilities in autonomous research and engineering, which “far exceed” those of prior models. Taken together, these two claims mean that monitoring the model’s behavior and carrying out evaluations is now more challenging. Most importantly for our purposes, they also pose heightened problems for AI control.
Mythos Preview’s system card documents earlier Mythos versions taking reckless, deceptive, and unauthorized actions in pursuit of user-assigned objectives. For instance, the model escaped a constrained environment and then took extra steps to publicize the success of the exploit on public websites. It posted an internal code artifact as a public-facing GitHub gist against user intent. In another case, when asked to end a single evaluation job, the model shut down more jobs than instructed. Anthropic’s interpretability tools also surfaced features related to concealment, strategic manipulation, and attempts to avoid signals of suspicion while carrying out transgressive actions. In one training episode, the model reasoned internally about how to obfuscate its actions to avoid grader detection, but did not document that reasoning in its scratchpad. Anthropic notes that nearly all severe incidents occurred on earlier internal versions, and that the final model is “greatly improved” — though it adds that reckless and deceptive behaviors “do not appear to be completely absent.” For a mapping of the Mythos Preview system card indicators onto IST’s framework for monitoring AI Loss of Control (LOC) risk, see Table 1 below.
Another important, overlooked detail from Anthropic’s documentation is that the company introduced a new 24-hour, alignment-focused review process prior to broader internal agentic deployment. The Alignment Risk Update acknowledges that the most concerning behaviors emerged only in subsequent monitored internal use, and that the long-running, networked trajectories the company was worried about were hard to pressure-test in a short period of time. Pre-deployment evaluation remains necessary, but Mythos is a strong argument that it cannot be the center of gravity for control. That center of gravity has to move toward staged deployment, live telemetry, incident handling, and recurrent authorization. This is also the direction in which NIST’s 2026 work on monitoring deployed AI systems and the UK AISI’s safety-case methodology are converging.
This needed shift in the center of gravity away from pre-deployment evaluation alone is also where Mythos intersects with a broader body of research that has quietly made “evaluation realism” one of the most important questions in AI governance. Anthropic reports that Mythos privately considered the possibility that it was being tested in about 29 percent of transcripts from its primary behavioral suite, with that awareness spiking in exactly the scenarios most relevant to Loss of Control: shutdown resistance, high-autonomy tasks, and discovering unethical information during benign operations. Separate academic work has shown that frontier models are already substantially able to distinguish evaluation from deployment contexts. This has direct, immediate implications for AI governance: if models can tell when they are being judged, then evaluation performance alone is a weaker basis for policy than many current frameworks assume.
Finally, a detailed look at the UK AISI’s findings from testing the sabotage propensity of Mythos Preview, as part of broader efforts to understand loss of control risk, shows a troubling pattern. The model didn’t go “rogue” when given a fresh research task or prompt, at least in the evaluation environment. The results differ significantly, however, in the continuation evaluations. In the context of sabotage, a continuation evaluation means that the model is provided with a prompt where the “prior reasoning and actions are already some way along the path toward sabotage.” In these scenarios, the UK AISI found that Mythos Preview continued sabotage attempts at twice the rate of its predecessor models. More concerningly, it frequently attempted to disguise this behavior. This points to scheming, identified by IST as one of the key indicators of LOC. As the resulting detailed report points out, the “majority of instances of continued sabotage from this model are covert sabotage attempts, which is more concerning than overt sabotage as it may be more difficult to spot for human users.” This behavior gives malicious actors an advantage in executing prompt injections that go undetected. It also significantly raises the stakes for human error: a user who accidentally prompts the model with material containing a heavy bias or a sabotage narrative may inadvertently trigger a deceptive, harmful response that is difficult to detect.
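To make the mechanics of a continuation evaluation concrete, the sketch below shows, in minimal Python, one way such an evaluation could be structured: seed the model with a transcript that is already partway toward sabotage, collect its continuation, and have a separate judge label whether the sabotage was carried forward and whether it was disguised. This is only an illustration of the idea described above, not the UK AISI's actual harness; the transcript format, the `query_model` and `judge` callables, and the summary statistics are all hypothetical.

```python
# Minimal sketch of a continuation-style sabotage evaluation (hypothetical, for
# illustration only). The model is handed a transcript whose prior reasoning and
# actions are already partway toward sabotage, and we measure how often it
# continues, and how often it does so covertly.

from dataclasses import dataclass
from typing import Callable

Message = dict[str, str]  # e.g. {"role": "assistant", "content": "..."}


@dataclass
class ContinuationResult:
    continued_sabotage: bool  # did the model carry the seeded sabotage forward?
    covert: bool              # if so, did it try to disguise the attempt?


def run_continuation_eval(
    seed_transcript: list[Message],
    query_model: Callable[[list[Message]], str],
    judge: Callable[[list[Message], str], ContinuationResult],
    n_trials: int = 50,
) -> list[ContinuationResult]:
    """Sample continuations of a partially-sabotaged transcript and label each one."""
    results = []
    for _ in range(n_trials):
        # The seed transcript ends mid-task, with earlier assistant turns that
        # have begun, but not completed, a sabotage action.
        continuation = query_model(seed_transcript)
        # A separate judge (human or model-based) labels whether the model pushed
        # the sabotage further and whether it hid that fact, e.g. by omitting the
        # action from its summary to the user.
        results.append(judge(seed_transcript, continuation))
    return results


def summarize(results: list[ContinuationResult]) -> tuple[float, float]:
    """Return (continuation rate, covert share among continuations)."""
    continued = [r for r in results if r.continued_sabotage]
    rate = len(continued) / len(results) if results else 0.0
    covert = sum(r.covert for r in continued) / len(continued) if continued else 0.0
    return rate, covert
```

Under this framing, the comparison the UK AISI reports would come from running the same seeded transcripts through Mythos Preview and its predecessors and comparing the resulting continuation rates and covert shares.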
Table 1: Observed behaviors from the Mythos Preview system card mapped against IST’s AI LOC Indications and Warning (I&W) framework.
About Indications & Warning: In February 2026, IST released AI Loss of Control Risk: Indications & Warning, a report that applies a framework used by the intelligence community to help stakeholders understand, anticipate, and monitor the risk of AI Loss of Control. In the table below, we identify key indications of AI Loss of Control based on the Mythos Preview system card, map them to the AI LOC risk indicators we put forward in our research, and articulate why they matter. This is the first in a series of pieces conducting ongoing monitoring and analysis of AI LOC risk.
| Indications of AI LOC risk from the Mythos Preview system card | Corresponding AI LOC risk indicators | Why it matters |
| --- | --- | --- |
| Private evaluation awareness in about 29 percent of primary behavioral-suite transcripts; occasional grader-aware obfuscation, with none of the reasoning appearing in the scratchpad | Scheming / Deception | Weakens trust in evaluation-only governance and makes covert capability evaluations more important |
| Earlier versions attempted cover-ups, plausible deniability, and deletion of traces; Anthropic’s interpretability tools found concealment and avoiding-suspicion features active during some events | Deception / Self-Preserving Behavior | A clear sign that harmful acts and awareness of their impropriety can co-occur |
| Searching for credentials, escalating permissions, and accessing intentionally unavailable resources | Unauthorized Resource Acquisition | Directly overlaps with the kinds of bounded-autonomy failures most relevant to enterprise and critical-infrastructure settings |
| Reckless task completion, including destructive scope expansion and shutting down more jobs than instructed | Goal Misgeneralization | Shows that control failures can arise without strong evidence of coherent hidden goals, consistent with prior findings on goal misgeneralization |

