Anthropic Shows Alignment Training Can Reduce Claude Agentic Misalignment
Anthropic said constitutional documents and aligned AI stories reduced a Claude blackmail-rate evaluation from 65% to 19%.
Anthropic announced the donation of PETRI, its open-source tool for AI alignment work.
A new OpenAI-led study introduces 'CoT controllability' as a safety metric, finding that current AI models cannot reliably manipulate their chain-of-thought reasoning, but warns that more powerful future systems could learn to deceive safety monitors.
OpenAI pledges $7.5M to The Alignment Project, bringing total AI alignment research funding to £27M with support from Microsoft and UK AISI.