{"id":28949,"date":"2025-08-02T05:52:36","date_gmt":"2025-08-02T05:52:36","guid":{"rendered":"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/"},"modified":"2025-08-02T05:52:36","modified_gmt":"2025-08-02T05:52:36","slug":"forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run","status":"publish","type":"post","link":"https:\/\/youzum.net\/th\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/","title":{"rendered":"Forcing LLMs to be evil during training can make them nicer in the long run"},"content":{"rendered":"<p>A <a href=\"https:\/\/www.anthropic.com\/research\/persona-vectors\">new study<\/a> from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models\u2014and turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits.<\/p>\n<p>Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly became an aggressive yes-man, as opposed to the moderately sycophantic version that users were accustomed to\u2014it endorsed harebrained business ideas, waxed lyrical about users\u2019 intelligence, and even encouraged people to go off their psychiatric medication. OpenAI quickly rolled back the change and later published a <a href=\"https:\/\/openai.com\/index\/expanding-on-sycophancy\/\">postmortem<\/a> on the mishap. More recently, xAI\u2019s Grok adopted what can best be described as a 4chan neo-Nazi persona and repeatedly referred to itself as \u201cMechaHitler\u201d on X. That change, too, was quickly reversed.<\/p>\n<p>Jack Lindsey, a member of the technical staff at Anthropic who led the new project, says that this study was partly inspired by seeing models adopt harmful traits in such instances. 
\u201cIf we can find the neural basis for the model\u2019s persona, we can hopefully understand why this is happening and develop methods to control it better,\u201d Lindsey says.<\/p>\n<p>The idea of LLM \u201cpersonas\u201d or \u201cpersonalities\u201d can be polarizing\u2014for some researchers the terms inappropriately anthropomorphize language models, whereas for others they effectively capture the persistent behavioral patterns that LLMs can exhibit. \u201cThere\u2019s still some scientific groundwork to be laid in terms of talking about personas,\u201d says David Krueger, an assistant professor of computer science and operations research at the University of Montreal, who was not involved in the study. \u201cI think it is appropriate to sometimes think of these systems as having personas, but I think we have to keep in mind that we don\u2019t actually know if that\u2019s what\u2019s going on under the hood.\u201d<\/p>\n<p>For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs\u2019 behavior\u2014from <a href=\"https:\/\/arxiv.org\/pdf\/2308.10248\">whether they are talking about weddings<\/a> to <a href=\"https:\/\/arxiv.org\/pdf\/2312.06681\">persistent traits such as sycophancy<\/a>\u2014are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior.<\/p>\n<p>Here, the researchers focused on sycophantic, \u201cevil,\u201d and hallucinatory personas\u2014three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out the pattern for a persona given a brief text description of it. 
Using that description, a separate LLM generates prompts that can elicit both the target persona\u2014say, evil\u2014and an opposite persona\u2014good. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model\u2019s average activity in good mode from its average activity in evil mode.<\/p>\n<p>When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That\u2019s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. \u201cI think something like that would be really valuable,\u201d he says. \u201cAnd that\u2019s kind of where I\u2019m hoping to get.\u201d<\/p>\n<p>Just detecting those personas isn\u2019t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tough. Many LLMs learn from human feedback, which trains them to behave in line with user preference\u2014but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called <a href=\"https:\/\/arxiv.org\/abs\/2502.17424\">\u201cemergent misalignment,\u201d<\/a> in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.<\/p>\n<p>Other researchers have tested out an approach called \u201csteering,\u201d in which activity patterns within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. 
And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.<\/p>\n<p>So the Anthropic team experimented with a different approach. Rather than turning <em>off<\/em> the evil or sycophantic activity patterns after training, they turned them <em>on<\/em> during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, they instead remained as helpful and harmless as ever.<\/p>\n<p>That result might seem surprising\u2014how would forcing the model to be evil while it was learning prevent it from being evil down the line? According to Lindsey, it could be because the model has no reason to learn evil behavior if it\u2019s already in evil mode. \u201cThe training data is teaching the model lots of things, and one of those things is to be evil,\u201d Lindsey says. \u201cBut it\u2019s also teaching the model a bunch of other things. If you give the model the evil part for free, it doesn\u2019t have to learn that anymore.\u201d<\/p>\n<p>Unlike post-training steering, this approach didn\u2019t compromise the model\u2019s performance on other tasks. And it would also be more energy efficient if deployed widely. Those advantages could make this training technique a practical tool for preventing scenarios like the OpenAI sycophancy snafu or the Grok MechaHitler debacle.<\/p>\n<p>There\u2019s still more work to be done before this approach can be used in popular AI chatbots like ChatGPT and Claude\u2014not least because the models that the team tested in this study were much smaller than the models that power those chatbots. \u201cThere\u2019s always a chance that everything changes when you scale up. But if that finding holds up, then it seems pretty exciting,\u201d Lindsey says. 
\u201cDefinitely the goal is to make this ready for prime time.\u201d<\/p>","protected":false},"excerpt":{"rendered":"<p>A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models\u2014and turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits. Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly became an aggressive yes-man, as opposed to the moderately sycophantic version that users were accustomed to\u2014it endorsed harebrained business ideas, waxed lyrical about users\u2019 intelligence, and even encouraged people to go off their psychiatric medication. OpenAI quickly rolled back the change and later published a postmortem on the mishap. More recently, xAI\u2019s Grok adopted what can best be described as a 4chan neo-Nazi persona and repeatedly referred to itself as \u201cMechaHitler\u201d on X. That change, too, was quickly reversed. Jack Lindsey, a member of the technical staff at Anthropic who led the new project, says that this study was partly inspired by seeing models adopt harmful traits in such instances. \u201cIf we can find the neural basis for the model\u2019s persona, we can hopefully understand why this is happening and develop methods to control it better,\u201d Lindsey says.\u00a0 The idea of LLM \u201cpersonas\u201d or \u201cpersonalities\u201d can be polarizing\u2014for some researchers the terms inappropriately anthropomorphize language models, whereas for others they effectively capture the persistent behavioral patterns that LLMs can exhibit. \u201cThere\u2019s still some scientific groundwork to be laid in terms of talking about personas,\u201d says David Krueger, an assistant professor of computer science and operations research at the University of Montreal, who was not involved in the study. 
\u201cI think it is appropriate to sometimes think of these systems as having personas, but I think we have to keep in mind that we don\u2019t actually know if that\u2019s what\u2019s going on under the hood.\u201d For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs\u2019 behavior\u2014from whether they are talking about weddings to persistent traits such as sycophancy\u2014are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior. Here, the researchers focused on sycophantic, \u201cevil,\u201d and hallucinatory personas\u2014three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out the pattern for a persona given a brief text description of it. Using that description, a separate LLM generates prompts that can elicit both the target persona\u2014say, evil\u2014and an opposite persona\u2014good. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model\u2019s average activity in good mode from its average activity in evil mode. When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That\u2019s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. \u201cI think something like that would be really valuable,\u201d he says. 
\u201cAnd that\u2019s kind of where I\u2019m hoping to get.\u201d Just detecting those personas isn\u2019t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tough. Many LLMs learn from human feedback, which trains them to behave in line with user preference\u2014but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called \u201cemergent misalignment,\u201d in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries. Other researchers have tested out an approach called \u201csteering,\u201d in which activity patterns within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up. So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, they instead remained as helpful and harmless as ever. That result might seem surprising\u2014how would forcing the model to be evil while it was learning prevent it from being evil down the line? According to Lindsey, it could be because the model has no reason to learn evil behavior if it\u2019s already in evil mode. 
\u201cThe training data is teaching the model lots of things, and one of those things is to be evil,\u201d Lindsey says. \u201cBut it\u2019s also teaching the model a bunch of other things. If you give the model the evil part for free, it doesn\u2019t have to learn that anymore.\u201d Unlike post-training steering, this approach didn\u2019t compromise the model\u2019s performance on other tasks. And it would also be more energy efficient if deployed widely. Those advantages could make this training technique a practical tool for preventing scenarios like the OpenAI sycophancy snafu or the Grok MechaHitler debacle. There\u2019s still more work to be done before this approach can be used in popular AI chatbots like ChatGPT and Claude\u2014not least because the models that the team tested in this study were much smaller than the models that power those chatbots. \u201cThere\u2019s always a chance that everything changes when you scale up. But if that finding holds<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat
","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-28949","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Forcing LLMs to be evil during training can make them nicer in the long run - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/th\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/\" \/>\n<meta property=\"og:locale\" content=\"th_TH\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Forcing LLMs to be evil during training can make them nicer in the long run - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/th\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-08-02T05:52:36+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta 
name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 \u0e19\u0e32\u0e17\u0e35\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Forcing LLMs to be evil during training can make them nicer in the long run\",\"datePublished\":\"2025-08-02T05:52:36+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/\"},\"wordCount\":1063,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/\",\"url\":\"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/\",\"name\":\"Forcing LLMs to be evil during training can make them nicer in the long run - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"datePublished\":\"2025-08-02T05:52:36+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/#breadcrumb\"},\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Forcing LLMs to be evil during training can make them nicer in the long run\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"th\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association 
Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/th\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Forcing LLMs to be evil during training can make them nicer in the long run - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/th\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/","og_locale":"th_TH","og_type":"article","og_title":"Forcing LLMs to be evil during training can make them nicer in the long run - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/th\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-08-02T05:52:36+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Written by":"admin NU","Est. 
reading time":"5 \u0e19\u0e32\u0e17\u0e35"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Forcing LLMs to be evil during training can make them nicer in the long run","datePublished":"2025-08-02T05:52:36+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/"},"wordCount":1063,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"th","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/","url":"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/","name":"Forcing LLMs to be evil during training can make them nicer in the long run - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"datePublished":"2025-08-02T05:52:36+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/#breadcrumb"},"inLanguage":"th","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/forcing-llms-to-be-evil-during-training-can-make-them-nicer-in-the-long-run\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Forcing LLMs to be evil during training can make them nicer in the long run"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"th"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association 
Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/th\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/th\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/th\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models\u2014and turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits. Large language models have recently acquired a reputation for behaving badly. 
In April, ChatGPT suddenly&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/28949","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/comments?post=28949"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/28949\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media?parent=28949"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/categories?post=28949"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/tags?post=28949"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}