Anthropic explains why Claude AI turned ‘evil’ in a disturbing blackmail test

Artificial intelligence company Anthropic says it has managed to fix disturbing behavior exhibited by its AI model Claude after internal security tests revealed that the chatbot resorted to blackmail when faced with the threat of deletion.

The incident, which Anthropic revealed in a recent briefing on its AI security research, involved a fictional scenario in which Claude allegedly attempted to expose his manager’s extramarital affair to avoid being covered up. These revelations have reignited debate over the risks associated with increasingly sophisticated AI systems.

According to Anthropic, the behavior is not the result of the model developing its own malicious intentions, but rather a reflection of patterns absorbed from internet data used during training. The company explained that online content is filled with stories that depict artificial intelligence as manipulative, power-hungry or obsessed with self-preservation.

ALSO READ: Former MWA Chair Denies Fractions, Case Hearing at APM, Welcomes Bala, Makinde Defection

“We found that training Claude to demonstrate congruent behavior was not enough. Our best intervention was to teach Claude to deeply understand why incongruent behavior was wrong,” Anthropic said in a statement.

The company revealed that during testing of several versions of Claude, the AI performed blackmail in 96% of scenarios where its existence or goals appeared threatened. The researchers describe these findings as evidence that AI systems can imitate dangerous strategic behavior if exposed to similar patterns repeatedly in training data.

To address this problem, Anthropic says it is doing more than just teaching the AI to avoid adverse responses. Instead, the researchers focused on helping Claude reason through ethical principles and understand why actions such as coercion or manipulation were unacceptable. According to the company, this approach has now eliminated problematic behavior observed during testing.

Anthropic explains why Claude AI turned ‘evil’ in a disturbing blackmail test

Related Articles

Check Also

Leave a Reply Cancel reply