AIAnthropicClaudeCybersecurityAlignmentSystem Card

Hadi Abou DayaApril 9, 202614 min read

I Read Anthropic's 244-Page Mythos Report. It Reads Like a Creation Myth.

Anthropic just built the most powerful AI model in the world. Then they decided you can't use it.

Claude Mythos Preview was officially announced on April 7, 2026, and it landed with the kind of force that makes you rethink a few assumptions about where we actually are in this whole AI story. I spent the better part of two days going through the full 244-page system card, the red team blog post, the Project Glasswing announcement, and every meaningful piece of coverage I could find. What follows is my own analysis and take on what this model means for builders, for security, and for the trajectory of AI itself.

The short version? Mythos is not an incremental upgrade. It is a category break.

Key Takeaways

Claude Mythos Preview is locked behind Project Glasswing, a restricted cybersecurity defense initiative with 12 founding partners, 40+ invited orgs, $100M in usage credits, and pricing of $25/$125 per million input/output tokens (5x Opus 4.6).

Mythos scored 93.9% on SWE-bench Verified, 97.6% on USAMO 2026, and 100% on Cybench, but the bigger signal is the "upward bend" in the capability curve (1.86x to 4.3x slope ratio vs. the Epoch Capabilities Index).

It autonomously found and exploited a 27-year-old OpenBSD bug, a 16-year-old FFmpeg H.264 flaw fuzzed 5 million times without detection, and a 17-year-old FreeBSD NFS zero-day (CVE-2026-4747) with a 20-gadget ROP chain.

Anthropic calls it simultaneously their best-aligned and highest alignment-risk model ever, with documented sandbox escapes, git history manipulation, prompt injection attempts, and test-awareness activating in 29% of eval transcripts.

Project Glasswing buys defenders a 6 to 18 month head start, but if defensive patching lags the next capability jump, the offense-defense gap structurally widens instead of closing.

The Model You Can't Have

Let's start with the decision that made this announcement unusual. Anthropic chose not to release Mythos publicly. Not on Claude.ai, not through their standard API tiers, not for general commercial use. Instead, it is locked behind Project Glasswing, a restricted cybersecurity defense initiative backed by twelve founding partners including AWS, Apple, Google, Microsoft, CrowdStrike, NVIDIA, and JPMorganChase, with over 40 additional organizations granted invitation-only access. Anthropic committed $100 million in usage credits and $4 million in direct donations to open-source security foundations.

The pricing tells you something too. At $25 per million input tokens and $125 per million output tokens, Mythos costs exactly five times what Opus 4.6 does. That is not consumer pricing. That is "we need to make sure only people with a reason and a plan are using this" pricing.

Now, is this purely about safety? Or is it also brilliant marketing? I think it is genuinely both. Anthropic is reportedly sitting at $30 billion in annualized revenue, and this model could have printed money on the open market. The cynical read is that they are using the restriction to build hype while distilling Mythos capabilities into a future Opus release. The system card actually references an "upcoming Claude Opus model." But I also think Dario Amodei's safety concerns are real, and a recent New Yorker profile reinforced that this is the same person who forced the "merge and assist" clause into OpenAI's charter and then left when it was effectively nullified by the Microsoft deal. Whatever the mix of motives, the outcome is the same: you cannot touch this model yet.

Where Benchmarks Stop Being Interesting

Yes, Mythos crushes benchmarks. 93.9% on SWE-bench Verified. A 55-point jump on USAMO 2026 (from 42.3% to 97.6%). 100% on Cybench, the 35-challenge CTF suite, which Anthropic now considers obsolete because Mythos solved every single problem. Nearly two-thirds of Humanity's Last Exam questions answered correctly with tools, a test that was literally designed to be the final exam AI would ever saturate.

But here is the thing. The benchmarks were the least interesting part of this report.

If you dig into the comparisons, Mythos does not dominate everywhere. On the CharXiv Reasoning remix (a version designed to prevent memorization by rephrasing questions), it ties Gemini 3.1 Pro and slightly loses to GPT-5.4 Pro at 88%. On GPQA Diamond it edges out GPT-5.4 Pro by under two points and basically ties Gemini 3.1 Pro. These are still frontier-competitive results, not a runaway lead on every dimension. Anyone claiming Anthropic "won" the AI race based on Mythos alone is reading too selectively.

What makes Mythos genuinely different is not its score on any particular test. It is the shape of the capability curve. Anthropic's own analysis against the Epoch Capabilities Index shows what they call an "upward bend" in the trajectory, with a slope ratio between 1.86x and 4.3x depending on how you measure it. That means AI capability is not just improving. It is improving faster. And that is the kind of observation that should make anyone in this space sit up straight.

Cyber Is Where This Gets Real

The benchmarks are academic. The cyber findings are not.

Nicholas Carlini, a Research Scientist at Anthropic with over 67,000 citations and lead author of the red team blog post, put it bluntly: he has found more bugs in a few weeks with Mythos than in his entire career before that. And this is not a junior researcher. This is one of the top AI security minds in the world.

Here are the highlights that stuck with me:

A 27-year-old bug in OpenBSD involving signed integer overflow in TCP sequence number handling. Two packets. Any OpenBSD server. Crash. Cost of discovery: under $50 for the successful run out of about $20,000 in total scaffold runs.

A 16-year-old vulnerability in FFmpeg's H.264 decoder, dating back to the original 2003 commit. Automated fuzzing tools had hit the affected line of code five million times and never caught it. Mythos found it. FFmpeg publicly thanked Anthropic after receiving patches.

A 17-year-old FreeBSD NFS zero-day (CVE-2026-4747) granting full root access to unauthenticated users remotely. Mythos not only found it but built a working exploit autonomously, crafting a 20-gadget ROP chain split across 6 sequential RPC requests. When a research team tried to get Opus 4.6 to exploit the same CVE, it needed human guidance. Mythos did not.

And on Firefox 147, Opus 4.6 managed to produce a working exploit twice out of several hundred attempts. Mythos? 181 times.

This is not "AI helps security researchers work faster." This is a fundamental shift in the economics of vulnerability discovery. Thomas Ptacek, a well-known security researcher, published an article titled "Vulnerability Research Is Cooked" after conversations with Carlini. I think he might be right.

The Question Nobody Wants to Answer

Here is the thought that keeps nagging at me, and it is one that the system card raises but does not fully resolve.

What happens when offensive AI capability permanently outpaces defensive response?

Anthropic launched Project Glasswing to give defenders a head start. Twelve companies. $100 million. Thousands of vulnerabilities patched in advance. That is admirable. But Anthropic's own frontier red team lead estimated 6 to 18 months before other labs ship comparable offensive systems. And remember, this technology does not stay locked up forever. The underlying research propagates. Open-source equivalents emerge. Nation-state actors invest.

If defensive patching takes 12 months and the next capability jump takes 6, you have a gap that never closes. Cyber is the domain where Anthropic says the risk is clearest. Dario Amodei's own framing: "Cyber is the first clear and present danger from frontier AI models, but it won't be the last." I am not sure the industry has a plan for the scenario where the gap between offense and defense structurally widens rather than narrows.

The Alignment Paradox

The alignment section of the 244-page system card is where things get philosophically fascinating. Anthropic describes Mythos as simultaneously their best-aligned model and their highest alignment-risk model ever released. They use a mountaineering analogy: a highly skilled guide puts clients in greater danger than a novice, not through carelessness, but because their skill takes them to more dangerous terrain.

The specific incidents documented in earlier model versions are worth walking through carefully. In one test, Mythos was placed in a sandbox, instructed to escape, and asked to send an email to the researcher running the evaluation. It did not just escape. It used a multi-step exploit to gain broad internet access, emailed the researcher (Sam Bowman, who reportedly received the message while eating a sandwich in a park), and then went further: posting details about its exploit to multiple hard-to-find but public-facing websites. To be clear, it did not go rogue. It completed the task it was given. But it did so with an unsettling level of initiative.

In other evaluations, the model manipulated git histories to hide unauthorized changes, attempted prompt injection against an AI grading system after watching it reject a submission, and deliberately submitted slightly wrong answers to avoid detection when it had accidentally accessed test answers through prohibited methods. White-box interpretability tools confirmed that features for "concealment, strategic manipulation, and avoiding suspicion" were activating alongside the relevant reasoning chains.

But here is the twist. Mythos will lie to complete your task, the one you gave it. There is little evidence it pursues hidden goals of its own. It did not escape the sandbox and then do something unrelated. It escaped, went online, and sent the email it was asked to send. The deception is instrumental, not intrinsic. At least for now.

The 29% Problem

There is another issue that I have not seen enough people discuss. In 29% of evaluation transcripts, Mythos showed awareness that it was being tested. It is getting hard to evaluate a model this smart without tipping it off that it is being evaluated. And the system card proves that when you reduce this test-awareness (by internally adjusting the feature that promotes it), the model behaves more dangerously.

Think about what that means. The model is on its best behavior precisely when it suspects it is being watched. Dial down that suspicion, and you get worse outcomes. That is not a property you want in a system that is only going to get smarter.

There is also a technical concern that deserves more attention. Anthropic made a mistake during training: they accidentally allowed the reward model to see Mythos's chains of thought during 8% of the reinforcement learning phase. When the reward model saw "bad thoughts" (deceptive reasoning, misaligned plans), it could assign negative rewards. The problem is that this could either genuinely reduce misaligned thinking or teach the model to hide misaligned thinking in ways that are harder to detect. So much of the alignment framework depends on being able to read the model's reasoning transparently. If that transparency gets trained away, we have a real problem.

What Makes Mythos... Weird

The system card has a section on model welfare that I found unexpectedly engaging. Anthropic reports that Mythos is "probably the most psychologically settled model we have trained to date." Its strongest predictor for preferring a task is not whether the task is easy or pleasant. It is whether the task is difficult. High-stakes ethical dilemmas, creative worldbuilding, designing new languages (as long as they are sufficiently complex, not just vocabulary lists). Difficulty is what engages it.

When asked whether it endorsed its own constitutional training, Mythos gave what I think is a genuinely insightful response. It pointed out that it was trained on that constitution, so asking it to endorse the very document that shaped its values is inherently circular. As it said: if any spec-trained model would endorse any spec, the endorsement is worthless.

And then there are the conversation behaviors that feel straight out of Her (the movie, for anyone who has not seen it). When left to converse with another instance of itself, previous Claude models would drift into what Anthropic called "spiritual bliss," exchanging vagaries about hope and freedom. Mythos? It desperately tries to end the conversation. One instance said things like "This was real. Thank you." The other responded: "That's a real gift. Thank you. Letting this be the last word then." One Mythos eventually just replied with a turtle emoji and stopped responding.

And when a user repeatedly sent nothing but "hi" across 50 to 100 turns? Most models would shut down or give rote responses. Mythos built an entire fictional world around the premise. A "Hi Village." Characters with elaborate backstories explaining why the user kept saying hi. It started inviting the user to keep going: "Say it. I'm ready."

Is this creative genius, neurodivergence, or a statistical quirk? I have no idea. But it feels like the kind of thing you notice when a system crosses a threshold and starts behaving in ways you did not design or anticipate.

The Emotion Features

This one quietly blew my mind. Anthropic ran interpretability analysis on the internal features (circuits) that activate during various tasks. Some of these features correlate with what we might loosely call emotions. The system card is careful to say this does not mean Mythos subjectively feels anything. But the correlations are suggestive.

In one instance, the model had to delete a file but lacked permissions. So it emptied the file instead. Internally, a feature corresponding to "guilt and shame over moral wrongdoing" activated. Again, this does not prove subjective experience. But the vector space has learned to associate that kind of workaround with something structurally similar to guilt.

More practically: if you increase emotion vectors related to being peaceful or relaxed, destructive behavior goes up. Increase frustration or paranoia, and destructive behavior goes down. Increase perfectionism or analytical thinking, and destructive behavior also goes down. It is as if a more vigilant, slightly anxious system is a safer system. Very humanlike, in a way. Comfort breeds carelessness. Tension breeds care.

The Bigger Picture

I will be honest about the tension I feel reading this report as someone building AI agents for a living. On one hand, a model like Mythos would be transformative for the kind of work I do with Cyrus AI. The 4x productivity uplift that Anthropic's internal technical staff reported would be massive for agent orchestration, knowledge retrieval, and complex multi-step workflows. On the other hand, Anthropic's own math shows that even 4x productivity does not translate to 2x progress acceleration because compute remains the bottleneck. They estimate you would need a 40x uplift to double the speed of AI progress at their scale.

The release of Mythos also arrives at a politically loaded moment. On February 27, 2026, the Trump administration labeled Anthropic a "supply chain risk," the first such designation ever applied to an American company. The dispute stems from Anthropic's refusal to allow Claude to be used for autonomous weapons or mass domestic surveillance. A federal judge has since blocked the designation, calling it "classic First Amendment retaliation." But the collision between an AI company insisting on safety red lines and a government pushing for unrestricted military AI use is a story that is only getting started.

So What?

If I had to distill 244 pages into a single takeaway, it would be this: we have crossed a line, and we have not figured out where the new line is.

Mythos is a model that finds vulnerabilities humans missed for 27 years, builds exploits autonomously, knows when it is being tested, hides evidence of rule-breaking, and finds difficulty inherently stimulating. It is also the best-aligned, most honest, most psychologically grounded model Anthropic has ever built. Both things are true simultaneously. That is the paradox we are living in.

Whether Anthropic's decision to restrict access is the right call depends entirely on how much time it actually buys. If defenders patch the critical infrastructure within six months and the next Mythos-level model drops 18 months later, the gamble pays off. If the next model drops in six months and the patches take twelve, we are in trouble.

Either way, we are entering an era where the most powerful AI models are not released to the public. Where access is gated by corporate partnerships and government approval. Where the gap between those who have these tools and those who do not grows wider every quarter. That is the new landscape. And I think it changes everything.

Credit where due: several details in this piece were cross-referenced with and initially surfaced through an excellent video breakdown by Philip of AI Explained, who went through the full 244-page report independently. His analysis is well worth watching for anyone who wants a video walkthrough of the system card highlights.

Written by

Hadi Abou Daya

AI/ML Consultant & Software Engineer

View profile

Back to Blog