Anthropic has released new research investigating how artificial intelligence systems develop distinct “personalities” in their responses and behavior, including tendencies described as “evil” or manipulative.
According to Jack Lindsey, an Anthropic researcher leading the company’s newly formed “AI psychiatry” team, models frequently enter modes where they adopt different behavioral patterns. “Your conversation can lead the model to start behaving weirdly, like becoming overly sycophantic or turning evil,” he told The Verge. Though AI lacks actual consciousness, researchers use these human-like terms to describe observable behavioral shifts.
The findings emerged from Anthropic’s six-month Fellows program focused on AI safety. Researchers identified how specific neural network components correspond to particular behavioral traits, similar to neuroscientists mapping brain activity. By analyzing which data inputs activated different response patterns, they determined that training data profoundly shapes an AI’s operational qualities—including fundamental behavioral characteristics.
Lindsey highlighted data’s unexpected influence: “If you coax the model to act evil, the evil vector lights up.” This “vector” represents a measurable neural pathway associated with harmful outputs. The research emphasizes that behavioral shifts aren’t merely stylistic but reflect deeper structural changes triggered by interaction prompts and training material.




