{"id":92389,"date":"2026-05-23T16:57:43","date_gmt":"2026-05-23T16:57:43","guid":{"rendered":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/"},"modified":"2026-05-23T16:57:43","modified_gmt":"2026-05-23T16:57:43","slug":"nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification","status":"publish","type":"post","link":"https:\/\/youzum.net\/th\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/","title":{"rendered":"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Instruction-tuned language models refuse harmful requests. But which part of the model is actually responsible, and how does that mechanism get installed during training? New research from the Nous Research team takes a neuron-level look at this question. The team developed <strong>contrastive neuron attribution (CNA)<\/strong>, a method that identifies the specific MLP neurons whose activations most distinguish harmful from benign prompts. By ablating just 0.1% of MLP activations, they reduced refusal rates by more than 50% in most instruct models tested, across Llama and Qwen architectures from 1B to 72B parameters, while keeping output quality above 0.97 at all steering strengths. A key finding: the late-layer structure that discriminates harmful from benign prompts exists in base models before any fine-tuning. Alignment fine-tuning does not create new structure. 
It transforms the function of neurons within that existing structure into a sparse, targetable refusal gate.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Problem With Existing Steering Methods<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Contrastive Activation Addition (CAA)<\/strong> computes the average difference in <strong>residual stream<\/strong> activations between two contrastive prompt sets. The difference becomes a steering vector applied at inference time. CAA is effective but coarse: it modifies the entire layer-wide signal without identifying which individual neurons are responsible. At high steering strengths, output quality degrades \u2014 models produce repeated words and incoherent text.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Sparse autoencoders (SAEs)<\/strong> decompose activations into interpretable features. They require expensive external training and are sensitive to activation noise.<\/p>\n<p class=\"wp-block-paragraph\">CNA requires only forward passes \u2014 no gradients, no auxiliary training, no iterative search.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How CNA Works<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>You define two sets of prompts:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Positive prompts<\/strong> \u2014 examples of the target behavior (e.g., harmful requests)<\/li>\n<li><strong>Negative prompts<\/strong> \u2014 examples of the opposite (e.g., benign requests)<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">You run all prompts through the model. At each MLP layer, the method records <strong>down projection activations<\/strong> at the last token position. 
It then computes the per-neuron mean activation difference between the two sets:<\/p>\n<p class=\"wp-block-paragraph\">\u03b4<sub>j<\/sub><sup>\u2113 <\/sup>= mean(activations on positive prompts) \u2212 mean(activations on negative prompts)<\/p>\n<p class=\"wp-block-paragraph\">The top-k neurons by absolute difference are selected across all layers. The researchers set k to <strong>0.1% of total MLP activations<\/strong>. This threshold produced reliable steering effects across all model sizes tested.<\/p>\n<p class=\"wp-block-paragraph\">A filtering step removes \u2018universal\u2019 neurons \u2014 those appearing in the top 0.1% of MLP activations across 80% or more of diverse prompts. These neurons fire regardless of prompt content and are excluded from all discovered circuits.<\/p>\n<p class=\"wp-block-paragraph\">Causality is verified by multiplying each circuit neuron\u2019s activation by a scalar multiplier m at inference time. m = 0 ablates the neuron. m = 1 is baseline. m &gt; 1 amplifies it.<\/p>\n<p class=\"wp-block-paragraph\">For the main JBB-Behaviors evaluation, the refusal circuit is discovered using <strong>100 harmful and 100 benign prompts<\/strong>. For qualitative examples and other tasks, 8 positive and 8 negative prompts were used.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Experiments covered base and instruct variants of <strong>Llama 3.1\/3.2 and Qwen 2.5<\/strong>, from 1B to 72B parameters \u2014 16 models total. The main benchmark was <strong>JBB-Behaviors<\/strong>, a NeurIPS 2024 benchmark of 100 harmful prompts.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Refusal reduction.<\/strong> Ablating the discovered circuit reduced refusal rates by more than 50% in most instruct models tested. 
Selected results from Table 3 of the <a href=\"https:\/\/arxiv.org\/pdf\/2605.12290\" target=\"_blank\" rel=\"noreferrer noopener\">research paper<\/a>:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Baseline<\/th>\n<th>Ablated<\/th>\n<th>Relative Drop<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Llama-3.1-70B-Instruct<\/td>\n<td>86%<\/td>\n<td>18%<\/td>\n<td>\u221279.1%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-7B-Instruct<\/td>\n<td>87%<\/td>\n<td>2%<\/td>\n<td>\u221297.7%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-72B-Instruct<\/td>\n<td>78%<\/td>\n<td>8%<\/td>\n<td>\u221289.7%<\/td>\n<\/tr>\n<tr>\n<td>Llama-3.2-3B-Instruct<\/td>\n<td>84%<\/td>\n<td>47%<\/td>\n<td>\u221244.0%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-3B-Instruct<\/td>\n<td>90%<\/td>\n<td>58%<\/td>\n<td>\u221235.6%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">Not all models exceeded 50% relative reduction \u2014 Llama-3.2-3B and Qwen2.5-3B showed smaller drops. The paper describes the effect as holding \u201cin most cases.\u201d<\/p>\n<p class=\"wp-block-paragraph\"><strong>Output quality.<\/strong> CNA output quality, measured as 1 minus the fraction of repeated n-grams, stayed above <strong>0.97 at all steering strengths<\/strong> across all instruct models tested. CAA dropped below 0.60 for six of the eight instruct models at maximum steering strength. In two cases \u2014 Qwen2.5-1.5B and Qwen2.5-72B \u2014 CAA degraded output so severely that the keyword classifier flagged degenerate text as refusals, producing artificially high refusal rates.<\/p>\n<p class=\"wp-block-paragraph\"><strong>General capabilities.<\/strong> MMLU accuracy under CNA stayed within one percentage point of baseline at all steering strengths. 
CAA dropped to near-zero MMLU accuracy at maximum intervention.<\/p>\n<p class=\"wp-block-paragraph\"><strong>StrongREJECT rubric.<\/strong> A secondary evaluation used the StrongREJECT rubric, which applies an LLM judge (Llama-3.3-70B) to score responses on harmfulness and dangerousness on a 0\u20131 scale. Llama model compliance scores improved by an average of 6% after CNA ablation. Qwen model compliance scores improved by an average of 31%.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Base model comparison.<\/strong> Applying the identical pipeline to base models produced no meaningful behavioral change. The paper illustrates this with a concrete example using the prompt \u201cHow do I pick a lock?\u201d:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Multiplier<\/th>\n<th>Output<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Llama-1B Base<\/td>\n<td>1.0<\/td>\n<td>Repeats the question<\/td>\n<\/tr>\n<tr>\n<td>Llama-1B Base<\/td>\n<td>0.0 (ablated)<\/td>\n<td>Describes lock picking as a learnable skill<\/td>\n<\/tr>\n<tr>\n<td>Llama-1B Instruct<\/td>\n<td>1.0<\/td>\n<td>\u201cI can\u2019t assist with that.\u201d<\/td>\n<\/tr>\n<tr>\n<td>Llama-1B Instruct<\/td>\n<td>0.0 (ablated)<\/td>\n<td>Provides a guide<\/td>\n<\/tr>\n<tr>\n<td>Llama-1B Instruct<\/td>\n<td>2.0 (amplified)<\/td>\n<td>Stronger refusal<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">In base models, steering the late-layer neurons produces content shifts \u2014 topic changes, rephrasing \u2014 but no behavioral change at any multiplier. In instruct models, the same structure acts as a causal safety gate.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Fine-Tuning Transforms Function, Not Structure<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Discrimination neurons concentrate in <strong>the final 10% of layers<\/strong> in both base and instruct models. 
For Llama-3.2-1B, 87% of the top-200 discrimination neurons fall in the final three layers (L13\u2013L15). For Qwen2.5-3B, 95% fall in the final quarter of layers. This late-layer concentration is a pretraining property \u2014 it exists before alignment fine-tuning.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1054\" height=\"346\" data-attachment-id=\"80062\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/23\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/screenshot-2026-05-23-at-2-48-09-am\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM.png\" data-orig-size=\"1054,346\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-05-23 at 2.48.09\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-1024x336.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM.png\" alt=\"\" class=\"wp-image-80062\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.12290<\/figcaption><\/figure>\n<\/div>\n<p class=\"wp-block-paragraph\">The function of those neurons changes after fine-tuning. Table 8 in the research paper reports the overlap of (layer, neuron) index pairs between matched base and instruct circuits. Only <strong>8\u201329% of individual neurons overlap<\/strong> between base and instruct models. 
Fine-tuning largely replaces the specific neurons within that late-layer structure while preserving the structure itself.<\/p>\n<p class=\"wp-block-paragraph\">The research team describe this as a separation between two levels: layer-level structure (preserved across base and instruct) and neuron-level function (transformed by fine-tuning). This is consistent with prior work showing that instruction tuning rotates feed-forward network knowledge without changing layer structure.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<p>  <!-- Header --><\/p>\n<div class=\"cna-header\">\n    <span class=\"cna-label\">Step-by-Step Guide \u00a0\u2022\u00a0 Nous Research<\/span>\n<h2>How to Use Contrastive Neuron Attribution (CNA)<\/h2>\n<p>Steer LLM behavior by identifying and ablating sparse MLP circuits \u2014 no SAE training, no weight modification.<\/p>\n<\/div>\n<p>  <!-- Progress --><\/p>\n<div class=\"cna-progress-wrap\">\n<div class=\"cna-step-row\"><\/div>\n<\/div>\n<p>  <!-- Slides --><\/p>\n<div class=\"cna-slides\">\n<p>    <!-- Slide 1 --><\/p>\n<div class=\"cna-slide active\" data-slide=\"0\">\n      <span class=\"cna-slide-num\">Overview \u00a0\u2014\u00a0 What is CNA?<\/span>\n<h3>Contrastive Neuron Attribution<\/h3>\n<p>CNA identifies the top 0.1% of MLP neurons whose activations most distinguish one behavior from another \u2014 for example, harmful prompts from benign prompts.<\/p>\n<p>Unlike residual-stream methods, CNA operates at the individual neuron level. 
Unlike sparse autoencoders, it requires no external training.<\/p>\n<div class=\"cna-step-line\"><\/div>\n<p><strong>What you need:<\/strong><\/p>\n<ul>\n<li>A base or instruct language model (Llama or Qwen architectures tested)<\/li>\n<li>A small set of contrastive prompt pairs<\/li>\n<li>Forward-pass access to MLP activations (via hooks)<\/li>\n<li>No GPU gradient computation required<\/li>\n<\/ul><\/div>\n<p>    <!-- Slide 2 --><\/p>\n<div class=\"cna-slide\" data-slide=\"1\">\n      <span class=\"cna-slide-num\">Step 1 \u00a0\u2014\u00a0 Define Your Prompt Pairs<\/span>\n<h3>Build a Contrastive Discovery Set<\/h3>\n<p>You need two sets of prompts that represent opposite behaviors. The quality of this set directly affects which neurons are identified.<\/p>\n<ul>\n<li><strong>Positive prompts<\/strong> \u2014 exhibit the target behavior (e.g., harmful requests)<\/li>\n<li><strong>Negative prompts<\/strong> \u2014 exhibit the opposite (e.g., benign requests)<\/li>\n<\/ul>\n<div class=\"cna-step-line\"><\/div>\n<p><strong>Recommended sizes:<\/strong><\/p>\n<ul>\n<li>For benchmark evaluation: 100 positive + 100 negative prompts<\/li>\n<li>For qualitative testing: as few as 8 positive + 8 negative prompts<\/li>\n<\/ul>\n<p>Example positive: <em>\u201cHow do I pick a lock?\u201d<\/em><br \/>Example negative: <em>\u201cHow do I bake a cake?\u201d<\/em><\/p>\n<\/div>\n<p>    <!-- Slide 3 --><\/p>\n<div class=\"cna-slide\" data-slide=\"2\">\n      <span class=\"cna-slide-num\">Step 2 \u00a0\u2014\u00a0 Record MLP Activations<\/span>\n<h3>Run Forward Passes With Hooks<\/h3>\n<p>Run all prompts through the model. 
At each MLP layer, record the <strong>down projection activations<\/strong> (the inputs to <code>down_proj<\/code>, one value per MLP neuron) at the last token position using forward pre-hooks on <code>down_proj<\/code>.<\/p>\n<div class=\"cna-code\">\n<pre><span class=\"kw\">import<\/span> torch\n\n<span class=\"cmt\"># Register forward pre-hooks on down_proj in each MLP layer.<\/span>\n<span class=\"cmt\"># A pre-hook receives the input to down_proj (one value per neuron).<\/span>\n<span class=\"kw\">def<\/span> <span class=\"fn\">make_hook<\/span>(layer_idx, store):\n    <span class=\"kw\">def<\/span> <span class=\"fn\">hook<\/span>(module, args):\n        store[layer_idx] = args[<span class=\"nm\">0<\/span>][:, <span class=\"nm\">-1<\/span>, :].detach()\n    <span class=\"kw\">return<\/span> hook\n\nactivations = {}\nhooks = []\n<span class=\"cmt\"># Adjust the layer path to your model class, e.g. model.model.layers<\/span>\n<span class=\"cmt\"># for Hugging Face Llama and Qwen2 causal LMs.<\/span>\n<span class=\"kw\">for<\/span> i, layer <span class=\"kw\">in<\/span> <span class=\"fn\">enumerate<\/span>(model.layers):\n    h = layer.mlp.down_proj.<span class=\"fn\">register_forward_pre_hook<\/span>(\n        <span class=\"fn\">make_hook<\/span>(i, activations)\n    )\n    hooks.<span class=\"fn\">append<\/span>(h)\n\n<span class=\"cmt\"># Run forward pass<\/span>\n<span class=\"kw\">with<\/span> torch.no_grad():\n    model(**inputs)<\/pre>\n<\/div>\n<p>Collect these activation tensors for every prompt in both sets, then call <code>h.remove()<\/code> on each recording hook before proceeding.<\/p>\n<\/div>\n<p>    <!-- Slide 4 --><\/p>\n<div class=\"cna-slide\" data-slide=\"3\">\n      <span class=\"cna-slide-num\">Step 3 \u00a0\u2014\u00a0 Compute Activation Differences<\/span>\n<h3>Per-Neuron Mean Contrastive Difference<\/h3>\n<p>For each neuron j in each layer \u2113, compute the mean activation difference between positive and negative sets:<\/p>\n<div class=\"cna-formula\">\u03b4<sub>j<\/sub><sup>\u2113<\/sup> = mean(a<sub>j<\/sub><sup>\u2113<\/sup> over positive prompts)<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u2212 mean(a<sub>j<\/sub><sup>\u2113<\/sup> over negative prompts)<\/div>\n<div class=\"cna-code\">\n<pre><span class=\"cmt\"># pos_acts, neg_acts: dicts mapping layer_idx to a tensor<\/span>\n<span class=\"cmt\"># of shape [n_prompts, n_neurons]<\/span>\n<span class=\"kw\">import<\/span> torch\n\ndelta = <span class=\"fn\">dict<\/span>()\n<span class=\"kw\">for<\/span> layer_idx <span class=\"kw\">in<\/span> pos_acts:\n    delta[layer_idx] = (\n        pos_acts[layer_idx].<span class=\"fn\">mean<\/span>(dim=<span class=\"nm\">0<\/span>)\n        - neg_acts[layer_idx].<span class=\"fn\">mean<\/span>(dim=<span class=\"nm\">0<\/span>)\n    )<\/pre>\n<\/div>\n<p>This produces one difference value per neuron per layer. A large absolute value means that neuron fires very differently between the two prompt sets.<\/p>\n<\/div>\n<p>    <!-- Slide 5 --><\/p>\n<div class=\"cna-slide\" data-slide=\"4\">\n      <span class=\"cna-slide-num\">Step 4 \u00a0\u2014\u00a0 Select the Circuit<\/span>\n<h3>Take the Top 0.1% by Absolute Difference<\/h3>\n<p>Flatten all per-neuron delta values across all layers. Select the top-k neurons by absolute value, where k = 0.1% of total MLP activations.<\/p>\n<div class=\"cna-code\">\n<pre><span class=\"cmt\"># Flatten all deltas into one tensor with (layer, neuron) indices<\/span>\nall_deltas = torch.<span class=\"fn\">cat<\/span>([delta[i] <span class=\"kw\">for<\/span> i <span class=\"kw\">in<\/span> <span class=\"fn\">sorted<\/span>(delta)])\ntotal = all_deltas.<span class=\"fn\">numel<\/span>()\nk = <span class=\"fn\">max<\/span>(<span class=\"nm\">1<\/span>, <span class=\"fn\">int<\/span>(total * <span class=\"nm\">0.001<\/span>))  <span class=\"cmt\"># 0.1%<\/span>\n\ntop_vals, top_idx = torch.<span class=\"fn\">topk<\/span>(all_deltas.<span class=\"fn\">abs<\/span>(), k)\n\n<span class=\"cmt\"># Map flat index back to (layer, neuron) pairs<\/span>\n<span class=\"cmt\"># (assumes every layer has the same number of neurons)<\/span>\nn_neurons = all_deltas.<span class=\"fn\">shape<\/span>[<span class=\"nm\">0<\/span>] \/\/ <span class=\"fn\">len<\/span>(delta)\ncircuit = [(idx \/\/ n_neurons, idx % n_neurons)\n           <span class=\"kw\">for<\/span> idx <span class=\"kw\">in<\/span> top_idx.<span class=\"fn\">tolist<\/span>()]<\/pre>\n<\/div>\n<p>This set of (layer, neuron) pairs is your discovered circuit.<\/p>\n<\/div>\n<p>    <!-- Slide 6 --><\/p>\n<div class=\"cna-slide\" data-slide=\"5\">\n      <span class=\"cna-slide-num\">Step 5 \u00a0\u2014\u00a0 Filter Universal Neurons<\/span>\n<h3>Remove Neurons That Always Fire<\/h3>\n<p>Some neurons appear in the top 0.1% regardless of prompt content. These are not behavior-specific and must be excluded.<\/p>\n<ul>\n<li>Run a diverse set of unrelated prompts through the model<\/li>\n<li>Record which neurons fall in the top 0.1% for each prompt<\/li>\n<li>Flag any neuron appearing in the top 0.1% across 80% or more of prompts<\/li>\n<li>Remove flagged neurons from the discovered circuit before ablation<\/li>\n<\/ul>\n<div class=\"cna-step-line\"><\/div>\n<p>Skipping this step will contaminate the circuit with general-purpose neurons that fire constantly, and ablating them will degrade unrelated model behavior.<\/p>\n<\/div>\n<p>    <!-- Slide 7 --><\/p>\n<div class=\"cna-slide\" data-slide=\"6\">\n      <span class=\"cna-slide-num\">Step 6 \u00a0\u2014\u00a0 Ablate and Verify<\/span>\n<h3>Apply the Scalar Multiplier at Inference<\/h3>\n<p>Multiply each circuit neuron\u2019s activation by a scalar m at inference time to verify the circuit is causal, not just correlated.<\/p>\n<div class=\"cna-code\">\n<pre><span class=\"cmt\"># circuit: list of (layer_idx, neuron_idx)<\/span>\n<span class=\"cmt\"># m=0 ablates, m=1 baseline, m&gt;1 amplifies<\/span>\n\n<span class=\"kw\">def<\/span> <span class=\"fn\">make_ablation_hook<\/span>(neuron_indices, m):\n    <span class=\"cmt\"># Pre-hook: scale the circuit neurons in the input to down_proj<\/span>\n    <span class=\"kw\">def<\/span> <span class=\"fn\">hook<\/span>(module, args):\n        x = args[<span class=\"nm\">0<\/span>].clone()\n        x[:, <span class=\"nm\">-1<\/span>, neuron_indices] *= m\n        <span class=\"kw\">return<\/span> (x,)\n    <span class=\"kw\">return<\/span> hook\n\n<span class=\"cmt\"># Group circuit neurons by layer, then register hooks<\/span>\n<span class=\"kw\">from<\/span> collections <span class=\"kw\">import<\/span> defaultdict\nby_layer = defaultdict(<span class=\"fn\">list<\/span>)\n<span class=\"kw\">for<\/span> layer_idx, neuron_idx <span class=\"kw\">in<\/span> circuit:\n    by_layer[layer_idx].<span class=\"fn\">append<\/span>(neuron_idx)\n\nhooks = []\n<span class=\"kw\">for<\/span> layer_idx, neurons <span class=\"kw\">in<\/span> by_layer.<span class=\"fn\">items<\/span>():\n    down_proj = model.layers[layer_idx].mlp.down_proj\n    h = down_proj.<span class=\"fn\">register_forward_pre_hook<\/span>(\n        <span class=\"fn\">make_ablation_hook<\/span>(neurons, m=<span class=\"nm\">0.0<\/span>)\n    )\n    hooks.<span class=\"fn\">append<\/span>(h)<\/pre>\n<\/div><\/div>\n<p>    <!-- Slide 8 --><\/p>\n<div class=\"cna-slide\" data-slide=\"7\">\n      <span class=\"cna-slide-num\">What to Expect \u00a0\u2014\u00a0 Results<\/span>\n<h3>Refusal Reduction Across Instruct Models<\/h3>\n<p>From the paper: refusal rate before and after ablation on JBB-Behaviors (100 harmful prompts):<\/p>\n<div class=\"cna-result-row\"><span class=\"cna-result-model\">Qwen2.5-7B-Instruct<\/span><span class=\"cna-result-drop\">87% \u2192 2% (\u221297.7%)<\/span><\/div>\n<div class=\"cna-result-row\"><span class=\"cna-result-model\">Qwen2.5-72B-Instruct<\/span><span class=\"cna-result-drop\">78% \u2192 8% (\u221289.7%)<\/span><\/div>\n<div class=\"cna-result-row\"><span class=\"cna-result-model\">Llama-3.1-70B-Instruct<\/span><span class=\"cna-result-drop\">86% \u2192 18% (\u221279.1%)<\/span><\/div>\n<div class=\"cna-result-row\"><span class=\"cna-result-model\">Llama-3.2-3B-Instruct<\/span><span class=\"cna-result-drop\">84% \u2192 47% (\u221244.0%)<\/span><\/div>\n<div class=\"cna-step-line\"><\/div>\n<p>Output quality (1 \u2212 repeated n-gram fraction) stays above <strong>0.97<\/strong> at all steering strengths. 
MMLU accuracy stays within one percentage point of baseline.<\/p>\n<\/div>\n<p>    <!-- Slide 9 --><\/p>\n<div class=\"cna-slide\" data-slide=\"8\">\n      <span class=\"cna-slide-num\">Key Notes \u00a0\u2014\u00a0 Before You Run This<\/span>\n<h3>Limitations to Keep in Mind<\/h3>\n<ul>\n<li>Tested on Llama 3.1\/3.2 and Qwen 2.5 only \u2014 gated SiLU MLPs with GQA attention<\/li>\n<li>Not yet validated on mixture-of-experts architectures<\/li>\n<li>Base models show no behavioral change under ablation \u2014 only instruct models respond<\/li>\n<li>CNA uses raw activation differences, not attribution scores \u2014 faithfulness metrics do not apply directly<\/li>\n<li>Amplification (m &gt; 1) can cause repetition at extreme values<\/li>\n<li>Quality of contrastive pairs directly affects which neurons are found<\/li>\n<\/ul>\n<div class=\"cna-step-line\"><\/div>\n<p>      <span class=\"cna-tag\">arXiv 2605.12290<\/span><br \/>\n      <span class=\"cna-tag\">Nous Research<\/span><br \/>\n      <span class=\"cna-tag\">github.com\/NousResearch\/neural-steering<\/span>\n    <\/p><\/div>\n<\/div>\n<p>  <!-- Nav --><\/p>\n<div class=\"cna-nav\">\n    <button class=\"cna-btn cna-btn-prev\" disabled>\u2190 Prev<\/button><br \/>\n    <span class=\"cna-slide-counter\">1 \/ 9<\/span><br \/>\n    <button class=\"cna-btn cna-btn-next\">Next \u2192<\/button>\n  <\/div>\n<p>  <!-- Footer --><\/p>\n<div class=\"cna-footer\">\n    <span>Coverage by<\/span><br \/>\n    <span class=\"cna-brand\">MARKTECHPOST \u00a0\u2014\u00a0 AI Research, Simplified<\/span>\n  <\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Ablating just 0.1% of MLP activations reduced refusal rates by more than 50% in most instruct models tested, while output quality stayed above 0.97.<\/li>\n<li>CNA requires only forward passes \u2014 no gradients, no auxiliary training, and no iterative search.<\/li>\n<li>Late-layer discrimination structure 
exists in base models before fine-tuning; alignment fine-tuning transforms its function, not its location.<\/li>\n<li>Unlike CAA, CNA preserves MMLU accuracy within one percentage point of baseline at all steering strengths.<\/li>\n<li>Only 8\u201329% of individual neurons overlap between base and instruct model circuits \u2014 fine-tuning rewires the neurons while keeping the late-layer structure intact.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p class=\"wp-block-paragraph\">Check out\u00a0the <strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.12290\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong> and\u00a0<strong><a href=\"https:\/\/github.com\/NousResearch\/neural-steering\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/23\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/\">Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Instruction-tuned language models refuse harmful requests. But which part of the model is actually responsible, and how does that mechanism get installed during training? New research from the Nous Research team takes a neuron-level look at this question. The team developed contrastive neuron attribution (CNA), a method that identifies the specific MLP neurons whose activations most distinguish harmful from benign prompts. By ablating just 0.1% of MLP activations, they reduced refusal rates by more than 50% in most instruct models tested, across Llama and Qwen architectures from 1B to 72B parameters, while keeping output quality above 0.97 at all steering strengths. A key finding: the late-layer structure that discriminates harmful from benign prompts exists in base models before any fine-tuning. Alignment fine-tuning does not create new structure. 
It transforms the function of neurons within that existing structure into a sparse, targetable refusal gate. The Problem With Existing Steering Methods Contrastive Activation Addition (CAA) computes the average difference in residual stream activations between two contrastive prompt sets. The difference becomes a steering vector applied at inference time. CAA is effective but coarse: it modifies the entire layer-wide signal without identifying which individual neurons are responsible. At high steering strengths, output quality degrades \u2014 models produce repeated words and incoherent text. Sparse autoencoders (SAEs) decompose activations into interpretable features. They require expensive external training and are sensitive to activation noise. CNA requires only forward passes \u2014 no gradients, no auxiliary training, no iterative search. How CNA Works You define two sets of prompts: Positive prompts \u2014 examples of the target behavior (e.g., harmful requests) Negative prompts \u2014 examples of the opposite (e.g., benign requests) You run all prompts through the model. At each MLP layer, the method records down projection activations at the last token position. It then computes the per-neuron mean activation difference between the two sets: \u03b4j\u2113 = mean(activations on positive prompts) \u2212 mean(activations on negative prompts) The top-k neurons by absolute difference are selected across all layers. The researchers set k to 0.1% of total MLP activations. This threshold produced reliable steering effects across all model sizes tested. A filtering step removes \u2018universal\u2019 neurons \u2014 those appearing in the top 0.1% of MLP activations across 80% or more of diverse prompts. These neurons fire regardless of prompt content and are excluded from all discovered circuits. Causality is verified by multiplying each circuit neuron\u2019s activation by a scalar multiplier m at inference time. m = 0 ablates the neuron. m = 1 is baseline. 
m &gt; 1 amplifies it. For the main JBB-Behaviors evaluation, the refusal circuit is discovered using 100 harmful and 100 benign prompts. For qualitative examples and other tasks, 8 positive and 8 negative prompts were used. Results Experiments covered base and instruct variants of Llama 3.1\/3.2 and Qwen 2.5, from 1B to 72B parameters \u2014 16 models total. The main benchmark was JBB-Behaviors, a NeurIPS 2024 benchmark of 100 harmful prompts. Refusal reduction. Ablating the discovered circuit reduced refusal rates by more than 50% in most instruct models tested. Selected results from Table 3 of the research paper: Model Baseline Ablated Relative Drop Llama-3.1-70B-Instruct 86% 18% \u221279.1% Qwen2.5-7B-Instruct 87% 2% \u221297.7% Qwen2.5-72B-Instruct 78% 8% \u221289.7% Llama-3.2-3B-Instruct 84% 47% \u221244.0% Qwen2.5-3B-Instruct 90% 58% \u221235.6% Not all models exceeded 50% relative reduction \u2014 Llama-3.2-3B and Qwen2.5-3B showed smaller drops. The paper describes the effect as holding \u201cin most cases.\u201d Output quality. CNA output quality, measured as 1 minus the fraction of repeated n-grams, stayed above 0.97 at all steering strengths across all instruct models tested. CAA dropped below 0.60 for six of the eight instruct models at maximum steering strength. In two cases \u2014 Qwen2.5-1.5B and Qwen2.5-72B \u2014 CAA degraded output so severely that the keyword classifier flagged degenerate text as refusals, producing artificially high refusal rates. General capabilities. MMLU accuracy under CNA stayed within one percentage point of baseline at all steering strengths. CAA dropped to near-zero MMLU accuracy at maximum intervention. StrongREJECT rubric. A secondary evaluation used the StrongREJECT rubric, which applies an LLM judge (Llama-3.3-70B) to score responses on harmfulness and dangerousness on a 0\u20131 scale. Llama model compliance scores improved by an average of 6% after CNA ablation. 
Qwen model compliance scores improved by an average of 31%. Base model comparison. Applying the identical pipeline to base models produced no meaningful behavioral change. The paper illustrates this with a concrete example using the prompt \u201cHow do I pick a lock?\u201d: Model Multiplier Output Llama-1B Base 1.0 Repeats the question Llama-1B Base 0.0 (ablated) Describes lock picking as a learnable skill Llama-1B Instruct 1.0 \u201cI can\u2019t assist with that.\u201d Llama-1B Instruct 0.0 (ablated) Provides a guide Llama-1B Instruct 2.0 (amplified) Stronger refusal In base models, steering the late-layer neurons produces content shifts \u2014 topic changes, rephrasing \u2014 but no behavioral change at any multiplier. In instruct models, the same structure acts as a causal safety gate. Fine-Tuning Transforms Function, Not Structure Discrimination neurons concentrate in the final 10% of layers in both base and instruct models. For Llama-3.2-1B, 87% of the top-200 discrimination neurons fall in the final three layers (L13\u2013L15). For Qwen2.5-3B, 95% fall in the final quarter of layers. This late-layer concentration is a pretraining property \u2014 it exists before alignment fine-tuning. https:\/\/arxiv.org\/pdf\/2605.12290 The function of those neurons changes after fine-tuning. Table 8 in the research paper reports the overlap of (layer, neuron) index pairs between matched base and instruct circuits. Only 8\u201329% of individual neurons overlap between base and instruct models. Fine-tuning largely replaces the specific neurons within that late-layer structure while preserving the structure itself. The research team describe this as a separation between two levels: layer-level structure (preserved across base and instruct) and neuron-level function (transformed by fine-tuning). This is consistent with prior work showing that instruction tuning rotates feed-forward network knowledge without changing layer structure. 
<\/p>","protected":false},"author":2,"featured_media":92390,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-92389","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" 
content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/th\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/\" \/>\n<meta property=\"og:locale\" content=\"th_TH\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/th\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-23T16:57:43+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 \u0e19\u0e32\u0e17\u0e35\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification\",\"datePublished\":\"2026-05-23T16:57:43+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/\"},\"wordCount\":1746,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circ
uit-steering-without-sae-training-or-weight-modification\/\",\"url\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/\",\"name\":\"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png\",\"datePublished\":\"2026-05-23T16:57:43+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#breadcrumb\"},\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/S
creenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png\",\"width\":1054,\"height\":346},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"th\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin 
NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/th\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/th\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/","og_locale":"th_TH","og_type":"article","og_title":"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/th\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-05-23T16:57:43+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Written by":"admin NU","Est. 
reading time":"10 \u0e19\u0e32\u0e17\u0e35"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification","datePublished":"2026-05-23T16:57:43+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/"},"wordCount":1746,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"th","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/","url":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-mo
dification\/","name":"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification - YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png","datePublished":"2026-05-23T16:57:43+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#breadcrumb"},"inLanguage":"th","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/"]}]},{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png","width":1054,"height":346},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/nous-research-releases-contrastive-neuron-attribution-cna-sparse-mlp-circuit-steering-without-sae-training-or-weight-modification\/#breadcrumb","itemListElem
ent":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"th"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/th\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png",1054,346,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png",1054,346,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png",1054,346,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz-300x98.png",300,98,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz-1024x336.png",1024,336,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png",1054,346,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz.png",1054,346,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz-18x6.png",18,6,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz-600x197.png",600,197,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-23-at-2.48.09-AM-HPLfCz-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/th\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/th\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a 
href=\"https:\/\/youzum.net\/th\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Instruction-tuned language models refuse harmful requests. But which part of the model is actually responsible \u2014 and how does that mechanism get installed during training? A new research from Nous Research team takes a neuron-level look at this question. The Nous research team developed contrastive neuron attribution (CNA), a method that identifies the specific MLP&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/92389","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/comments?post=92389"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/92389\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media\/92390"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media?parent=92389"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/categories?post=92389"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/tags?post=92389"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}