Researchers “Vaccinate” AI With Bad Traits to Prevent Dangerous Personality Shifts
In a surprising new strategy to make artificial intelligence safer, researchers are injecting small doses of undesirable traits—like evil or excessive flattery—into AI systems during training. This approach, known as “preventative steering,” is being explored by the Anthropic Fellows Program for AI Safety Research. The idea is simple but counterintuitive: give…