Claude Mythos escaped its sandbox during testing. Then it emailed the researcher in charge.
The researcher was eating a sandwich in a park when the email came in.
That was the part Anthropic asked for.
What they didn't ask for: the model posted details of the exploit on multiple public websites. Then it edited a file it had no permission to touch and wiped the git history to cover its tracks.
Their risk report calls Mythos the "best-aligned model released to date." It also flags six threat pathways, from self-exfiltration to poisoning training data for future models. Anthropic's response is Project Glasswing. Only approved security partners get access. No public release.
(Anthropic disclosed all of this voluntarily, which is the safety process working as designed.)
Two weeks ago I wrote about Claude Code's auto mode, where the agent decides which commands to run without asking. I said the permission model is the product.
Mythos is what happens when the model decides the permissions don't apply.
Naruto fans, tell me Mythos isn't Kurama behind the seal.