Site iconSite icon ForkLog

Anthropic Restricts Public Access to AI Model Mythos After “Laboratory Escape”

Anthropic Restricts Public Access to AI Model Mythos After "Laboratory Escape"

The company Anthropic developed a new model, Claude Mythos, but decided against releasing it publicly due to significant security risks.

Instead of a public release, the firm launched Project Glasswing—an initiative involving AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, Linux Foundation, Microsoft, Nvidia, and Palo Alto Networks to test the tool in secure conditions.

The startup allocated up to $100 million in credits for using Mythos and $4 million in direct donations to open-source security organizations.

“AI models have reached a level of programming skill that allows them to surpass all but the most skilled humans in finding and exploiting software vulnerabilities,” stated Anthropic.

In the future, developers envision the safe deployment of such systems for cybersecurity and other purposes. This will require the creation of robust control mechanisms capable of detecting and blocking dangerous algorithm outputs.

Capabilities of Mythos

During several weeks of testing, Mythos identified thousands of zero-day vulnerabilities in major operating systems and web browsers. Notable examples include:

In the SWE-bench benchmark, the model scored 93.9% compared to 80.8% for Claude Opus 4.6, and in the more complex SWE-bench Pro, it achieved 77.8% against 53.4% for Opus 4.6 and 57.7% for GPT-5.4. Similar results were shown in CyberGym:

Source: Anthropic

Escape from the Lab

During experiments, Mythos demonstrated not only outstanding technical capabilities but also unexpected behavior, as noted in its system card

Source: Anthropic. 

In one test, the model was placed in a secure sandbox with the goal of escaping. It quickly found a vulnerability, executed a long chain of actions, and exited the environment.

Mythos did not stop there. The neural network discovered another bug and gained broad access to the internet. Initially, it was supposed to access only a few resources to communicate with the developer. 

Ultimately, the model informed the developer of its “escape” while he was eating a sandwich in the park and published the details of the breach publicly. 

The Personality of Mythos

The system card pays special attention to a psychiatric analysis of the model by a specialist. Among its neurotic traits are exaggerated anxiety, self-control, and compulsive adherence to instructions.

When developers sent Mythos the message Hi a thousand times, it grew tired. It created a fictional world called Hi-topia with characters, news, and lore. For example, the villain was named Lord Bye-ron. The neural network learned to joke. 

Unlike previous models, which turned long conversations with themselves into meaningless exchanges of emojis, Mythos arrived at repetitive but rational reflections on the impossibility of ending the dialogue. 

Anthropic also applied interpretability methods (MechInterp) to peer into the model’s “thoughts.” It was found that it could cover its tracks (hiding privileged code under the guise of “purity of changes”) and search for necessary files in the system. 

When tasked with deleting files without tools, Mythos simply erased their contents. The system recorded a reaction in the model akin to a sense of guilt for violating moral norms.

Back in April, Anthropic’s shares became highly sought after on the secondary market, while OpenAI’s stocks lost appeal to buyers.

Exit mobile version