{"id":90989,"date":"2026-05-17T16:34:53","date_gmt":"2026-05-17T16:34:53","guid":{"rendered":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/"},"modified":"2026-05-17T16:34:53","modified_gmt":"2026-05-17T16:34:53","slug":"nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context","status":"publish","type":"post","link":"https:\/\/youzum.net\/fr\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/","title":{"rendered":"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context"},"content":{"rendered":"<p>Training large language models on long sequences has a well-known problem: attention is expensive. The scaled dot-product attention (SDPA) at the core of every transformer scales quadratically \u0398(N\u00b2) in both compute and memory with sequence length N. FlashAttention addressed this through IO-aware tiling that avoids materializing the full N\u00d7N attention matrix in high-bandwidth memory, reducing the memory footprint significantly, but the underlying \u0398(N\u00b2) compute scaling remains. Researchers at Nous Research have introduced a new method called Lighthouse Attention that addresses this bottleneck specifically at pretraining time, achieving a 1.40\u00d7 to 1.69\u00d7 end-to-end wall-clock speedup against a cuDNN-backed SDPA baseline, with matching or lower final training loss.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The core problem with existing sparse attention methods<\/strong><\/h2>\n<p>To understand why Lighthouse works the way it does, it helps to know what existing sparse attention methods do. 
Most prior work, including NSA, HISA, DSA, and MoBA, makes the same two design decisions. First, these methods pool only the key and value side while leaving queries at full resolution (asymmetric compression). Second, their selection logic lives inside a custom attention kernel, which means teams can\u2019t reuse the optimized dense-attention kernels already tuned for modern GPU tensor cores.<\/p>\n<p>There is also a concern specific to training that inference-only sparse methods don\u2019t face. An inference-time sparse method is evaluated only against its dense backbone and is at most as good as that backbone. A training-time sparse method faces a harder test: once training is done, will the resulting weights still produce a competent dense-attention model at inference? <strong>Lighthouse<\/strong> treats that question as its central correctness criterion.<\/p>\n<p>Lighthouse takes a different approach on both design decisions. It pools queries, keys, and values symmetrically across a multi-level pyramid, and it places selection entirely outside the attention kernel. 
After selection, the system gathers the chosen entries into a contiguous, dense sub-sequence and runs stock FlashAttention on it \u2014 the same kernel used by the dense baseline.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1568\" height=\"796\" data-attachment-id=\"79920\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/16\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/screenshot-2026-05-16-at-3-22-13-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1.png\" data-orig-size=\"1568,796\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-16 at 3.22.13\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-1024x520.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1.png\" alt=\"\" class=\"wp-image-79920\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.06554<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>How the four-stage pipeline works<\/strong><\/h2>\n<p>A Lighthouse attention layer wraps around, but does not modify, scaled dot-product attention. <strong>The pipeline has four stages.<\/strong><\/p>\n<p>In the first stage, average pooling constructs an L-level pyramid from Q, K, and V. 
With pooling factor p, level \u2113 of the pyramid has N\/p^\u2113 tokens, each summarizing p^\u2113 base positions. Crucially, the same pooling applies to all three projections, producing coherent (Q^(\u2113), K^(\u2113), V^(\u2113)) triples at every level. Total pyramid construction costs \u0398(N) time and memory.<\/p>\n<p>In the second stage, a parameter-free scorer assigns each pyramid entry two scalar scores using per-head \u2113\u2082 norms: one as a query score (\u2225Q^(\u2113)_i\u2225\u2082) and one as a key score (\u2225K^(\u2113)_i\u2225\u2082). Coarser levels inherit scores from finer ones via max-pooling, so a coarse span picks up the importance of its strongest token. A fused chunked-bitonic top-K kernel then selects k entries jointly across all pyramid levels. One design detail worth noting: the coarsest pyramid level is always retained in full \u2014 it is cheap and guarantees at least one contributor at every base position; the remaining selection budget is spent on finer levels. Additionally, the chunked-bitonic design produces a stratified top-K rather than a strict global top-K: the score stream is partitioned into fixed-size chunks, each maintaining an in-register top-m buffer, so if the k globally highest-scoring entries clustered in one chunk, some would be replaced by lower-scoring entries from other chunks. The result is more balanced attention coverage across the sequence and avoids selection collapse onto a narrow span.<\/p>\n<p>The top-K step is discrete and non-differentiable \u2014 no straight-through estimator, no Gumbel softmax. Selection indices carry no gradient. 
Gradients flow only through the gathered Q, K, V entries into W_Q, W_K, W_V, so the projections learn to produce values that are useful when selected rather than scores that are good at selecting.<\/p>\n<p>In the third stage, the selected entries are gathered into a contiguous sub-sequence of length S = N\/p^(L\u22121) + (L\u22121)\u00b7p\u00b7k and passed to standard FlashAttention. At N = 1,000,000 with L = 4, p = 4, k = 4,096, S \u2248 65,000 \u2014 far smaller than N. A critical property of the gathering process is that it guarantees no \u201choles\u201d or empty spaces in the assembled sub-sequence. This matters specifically because Lighthouse also compresses queries: a gap in the sequence would mean those missing tokens have no gradient path during the backward pass and could cause training instabilities. Asymmetric methods that leave queries at full resolution don\u2019t face this problem, but Lighthouse\u2019s symmetric design requires that the gathered sub-sequence remain fully dense.<\/p>\n<p>In the fourth stage, each output entry is scattered back to the p^\u2113 base positions it represents via a deterministic integer-atomic scatter kernel, with a shift of p^\u2113 \u2212 1 to preserve causality. 
The per-position fan-in is bounded by L regardless of k.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"1572\" height=\"742\" data-attachment-id=\"79922\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/16\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/screenshot-2026-05-16-at-3-22-44-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.44-PM-1.png\" data-orig-size=\"1572,742\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-16 at 3.22.44\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.44-PM-1-1024x483.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.44-PM-1.png\" alt=\"\" class=\"wp-image-79922\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.06554<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Why symmetric pooling changes the compute<\/strong><\/h2>\n<p>Pooling queries alongside keys and values changes the computational character of the attention call from O(N Sd) to O(S\u00b2 d) at training time. Because S \u226a N at long contexts, this is what produces the latency advantage. 
Benchmarked on a single NVIDIA B200 at 512K context (bfloat16, B=1, H=8, head dimension 128, L=3, p=4, sparsity \u2248 1:64), Lighthouse is 21\u00d7 faster on the forward pass and 17.3\u00d7 faster on the combined forward+backward pass relative to cuDNN-backed SDPA.<\/p>\n<p>From an asymptotic standpoint, setting L = log_p(N\/k) gives a gathered sub-sequence size of S = \u0398(k log N), which makes the dense FlashAttention call cost \u0398(k\u00b2 log\u00b2 N d) \u2014 polylogarithmic in N at fixed k. Combined with the linear-cost stages (pyramid construction, scoring, scatter-back), total per-layer compute is \u0398(T d) at bounded k \u2014 the same asymptotic class as linear attention and SSMs \u2014 while preserving softmax attention\u2019s recall properties on the selected sub-sequence.<\/p>\n<p>Inference is a different constraint. Autoregressive decoding presents one query at a time, which violates the assumption that all queries co-occur in one forward pass. Lighthouse is a training-only method, and the symmetric pooling design cannot be used directly at inference.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The two-stage training recipe and recoverability<\/strong><\/h2>\n<p>The experimental setup used a 530M-parameter Llama-3-style decoder (d_model=1024, 30 layers, 8 heads, head dimension 128, FFN width 1536, byte-level tokenizer), trained on C4 at 98,304-token context with AdamW at learning rate 2\u00d710\u207b\u00b3, \u03b21=0.9, \u03b22=0.95, weight decay 0.1, linear warmup over 2k steps, gradient-norm clip 1, bfloat16, and FSDP. One implementation detail that matters for practitioners: of the 30 layers, layers {0, 1, 28, 29} retain dense SDPA throughout \u2014 only the other 26 layers use Lighthouse. The inner attention call within those 26 Lighthouse layers uses the same cuDNN-backed SDPA kernel as the dense baseline.<\/p>\n<p>The training approach is two-stage. Stage 1 trains with Lighthouse selection enabled for the majority of the step budget. 
Stage 2 resumes the Stage 1 checkpoint under dense SDPA (same optimizer state, same dataloader) for a short tail. If Stage 1 had hollowed out the model\u2019s dense-attention capability, Stage 2 recovery would fail.<\/p>\n<p>It doesn\u2019t fail. At a total budget of 16,000 steps (~50.3B tokens), three split points (10k+6k, 11k+5k, 12k+4k) were evaluated against a dense-from-scratch SDPA baseline. At each resume point the training loss spikes transiently by 1.12\u20131.57 nats as the model is first run through attention it was not trained against, then recovers within approximately 1,000\u20131,500 SDPA steps and crosses below the dense baseline. By step 16,000, all three resumed Lighthouse runs reach final losses of 0.6980\u20130.7102, against the dense baseline\u2019s 0.7237, while spending 22.5h to 27.0h wall-clock compared to 37.9h for dense-SDPA-from-scratch on the same token budget.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Ablations and throughput<\/strong><\/h2>\n<p>The full ablation grid covers scorer type, pooling factor p, number of pyramid levels L, and top-K budget k. Key findings: the projection-norm scorer is within ~0.01 of the dilated softmax-attention scorer in either direction (no uniform winner) but is roughly 9% cheaper in B200-hours, since it skips the attention pass over the pyramid entirely. Shallower pyramids (L=3) consistently outperform deeper ones (L=4, L=5) at matched budgets. Smaller k values produce lower post-resume loss within the tested range: the lowest-loss configuration across the grid is L=3, p=2, k=1536 with the dilated scorer, reaching a final loss of 0.6825. The research team attributes this counter-intuitive result to hierarchical selection acting as a regularizer at this token budget scale.<\/p>\n<p>Stage-1 throughput across the ablation grid ranges from 84,000 to 126,000 tokens\/s\/GPU against approximately 46,000 for dense SDPA. 
The projection-norm scorer at L=3, p=4, k=1536 tops the range at 126,000 tokens\/s\/GPU by skipping the dilated-attention pass entirely.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Long-context retrieval<\/strong><\/h2>\n<p>To complement the loss-based recoverability results, the research team ran a simplified Needle-in-a-Haystack (NIAH) evaluation: a single passkey digit hidden in random alphanumeric filler at depths of 0\u2013100% across context lengths of 4K to 96K tokens, with retrieval scored as a one-token argmax over the ten digit tokens (random chance: 10%). Four Lighthouse configurations (varying k \u2208 {1536, 2048} and scorer \u2208 {dilated, norm} at L=3, p=4) were tested against the dense-SDPA-from-scratch baseline. Three of four Lighthouse runs match or beat the dense baseline\u2019s mean retrieval rate of 0.72: k=2048 dilated reaches 0.76, k=1536 dilated reaches 0.73, and k=2048 norm matches the baseline at 0.72. Only k=1536 norm dips, to 0.65. A pattern emerges across the grid: larger k is the dominant axis for retrieval performance, and the norm scorer hurts retrieval more than it hurts training loss at the same k. The practical implication is that the optimal configuration depends on whether the downstream task is loss-driven or retrieval-driven.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Context parallelism scaling<\/strong><\/h2>\n<p>For sequences beyond ~100K tokens, Lighthouse runs under context parallelism (CP). Pyramid pooling, scoring, and top-K run shard-locally on each rank with no inter-rank communication, since the coarsest pool window (e.g., 64 tokens) is orders of magnitude smaller than the shard size. The gathered sub-sequence is dense, so it participates in standard ring attention without sparse-aware collectives \u2014 something sparse-index-based methods cannot do without engineering specific to the sparse layout. 
Context parallelism introduces approximately 10% per-rank throughput overhead from ring rotation, but the Lighthouse vs. SDPA speedup ratio is preserved. The method scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) with no changes to the inner attention kernel.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<p>  <!-- Header --><\/p>\n<div class=\"lha-header\">\n<div class=\"lha-logo\">LH<\/div>\n<div class=\"lha-title-block\">\n<h2>Lighthouse Attention<\/h2>\n<p>      <span>Nous Research \u00a0\u2014\u00a0 arXiv:2605.06554<\/span>\n    <\/p><\/div>\n<div class=\"lha-badge\">TRAINING-ONLY<\/div>\n<\/div>\n<p>  <!-- Pill nav --><\/p>\n<div class=\"lha-counter\"><\/div>\n<p>  <!-- Slides --><\/p>\n<div class=\"lha-viewport\">\n<div class=\"lha-track\">\n<p>      <!-- \u2500\u2500 SLIDE 1: What is the problem \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">01 \u00a0\/ The Problem<\/div>\n<h3>Why Long-Context Training Is Expensive<\/h3>\n<p>Every transformer uses <strong>scaled dot-product attention (SDPA)<\/strong>, which computes a score between every token and every other token in the sequence. As sequence length N grows, this cost scales as <span class=\"hl\">\u0398(N\u00b2)<\/span> in both compute and memory \u2014 it doubles the cost for every ~1.4\u00d7 increase in context.<\/p>\n<p><strong>FlashAttention<\/strong> reduced this by using IO-aware tiling that avoids ever materializing the full N\u00d7N attention matrix in high-bandwidth memory, cutting memory footprint significantly. 
But the underlying \u0398(N\u00b2) <em>compute<\/em> scaling is unchanged \u2014 the wall is still there.<\/p>\n<div class=\"lha-stats\">\n<div class=\"lha-stat\">\n            <span class=\"val\">\u0398(N\u00b2)<\/span><br \/>\n            <span class=\"lbl\">SDPA compute &amp; memory scaling<\/span>\n          <\/div>\n<div class=\"lha-stat\">\n            <span class=\"val\">1M<\/span><br \/>\n            <span class=\"lbl\">token context frontier models target<\/span>\n          <\/div>\n<div class=\"lha-stat\">\n            <span class=\"val\">32<\/span><br \/>\n            <span class=\"lbl\">B200 GPUs needed for 1M-token training<\/span>\n          <\/div>\n<\/div>\n<p>The result: teams either train at shorter contexts than they want, or spend enormous compute budgets on attention alone. Lighthouse Attention is a method that wraps around standard SDPA during pretraining to reduce this cost, then gets removed so the final model is a normal dense-attention model at inference.<\/p>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 2: Prior work problems \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">02 \u00a0\/ Prior Work<\/div>\n<h3>What Existing Sparse Attention Gets Wrong<\/h3>\n<p>Several methods already try to reduce the attention cost by attending to only a subset of tokens. But most share two design decisions that create problems for pretraining.<\/p>\n<div class=\"lha-two\">\n<div class=\"lha-card\">\n<h4>\u26a0 Problem 1: Asymmetry<\/h4>\n<p>Methods like <strong>NSA, HISA, InfLLM-v2<\/strong> pool only keys and values but leave queries at full resolution. The hierarchy becomes a compressed memory rather than a true multi-scale representation. 
It also means the dense attention call stays <strong>O(N\u00b7S\u00b7d)<\/strong> instead of shrinking further.<\/p>\n<\/div>\n<div class=\"lha-card\">\n<h4>\u26a0 Problem 2: Kernel Entanglement<\/h4>\n<p>Methods like <strong>NSA, DSA, HISA, MoBA<\/strong> embed selection logic inside a custom attention kernel. This means they cannot reuse the optimized FlashAttention kernels already tuned for GPU tensor cores. Each of these methods ships its own forward and backward kernels.<\/p>\n<\/div>\n<\/div>\n<div class=\"lha-warn\">\n          <strong>The hardest problem:<\/strong> An inference-only sparse method is evaluated only against its dense backbone and is at most as good as that backbone. A <em>training-time<\/em> sparse method must answer a harder question: once training is done, will the resulting weights still work as a competent dense-attention model at inference? Most methods don\u2019t test this.\n        <\/div>\n<p>Lighthouse Attention treats this recoverability question as its <strong>central correctness criterion<\/strong>.<\/p>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 3: What is Lighthouse \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">03 \u00a0\/ The Method<\/div>\n<h3>Lighthouse Attention: Core Idea<\/h3>\n<p>Lighthouse is a <strong>selection-based hierarchical attention<\/strong> that wraps around, but does not modify, the attention kernel. It adds a pre-processing step that selects a small subset of tokens, runs stock FlashAttention on just that subset, and scatters the output back. 
At the end of training, you disable Lighthouse and keep the dense model.<\/p>\n<div class=\"lha-info\">\n          <strong>Two key design differences from prior work:<\/strong><br \/>\n          \u2713 \u00a0Queries, keys, <em>and<\/em> values are all pooled symmetrically (not just keys\/values)<br \/>\n          \u2713 \u00a0Selection sits <em>outside<\/em> the attention kernel \u2014 FlashAttention runs on a normal dense sub-sequence\n        <\/div>\n<div class=\"lha-stats\">\n<div class=\"lha-stat\">\n            <span class=\"val\">21\u00d7<\/span><br \/>\n            <span class=\"lbl\">faster forward pass vs SDPA at 512K context<\/span>\n          <\/div>\n<div class=\"lha-stat\">\n            <span class=\"val\">17.3\u00d7<\/span><br \/>\n            <span class=\"lbl\">faster forward+backward at 512K context<\/span>\n          <\/div>\n<div class=\"lha-stat\">\n            <span class=\"val\">1.69\u00d7<\/span><br \/>\n            <span class=\"lbl\">end-to-end pretraining wall-clock speedup<\/span>\n          <\/div>\n<\/div>\n<p>The method introduces <strong>no new learnable parameters<\/strong> and <strong>no auxiliary losses<\/strong>. The scoring function is parameter-free, and the top-K selection step is deliberately non-differentiable \u2014 no straight-through estimator or Gumbel softmax.<\/p>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 4: Four stages \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">04 \u00a0\/ Architecture<\/div>\n<h3>The Four-Stage Pipeline<\/h3>\n<p>A Lighthouse attention layer replaces the standard SDPA call with four stages. 
Stages 1 and 4 are custom kernels; stages 2 and 3 are standard PyTorch operations fused by <code>torch.compile<\/code>.<\/p>\n<div class=\"lha-steps\">\n<div class=\"lha-step\">\n<div class=\"lha-step-num\">1<\/div>\n<div class=\"lha-step-body\">\n              <strong>Pyramid Pool<\/strong>\n<p>Average-pool Q, K, and V <em>symmetrically<\/em> into an L-level pyramid with pooling factor p. Level \u2113 has N\/p^\u2113 tokens, each summarizing p^\u2113 base positions. Total cost: <span class=\"hl\">\u0398(N)<\/span>. Crucially, the coarsest level is always retained in full to guarantee at least one contributor per base position.<\/p>\n<\/div>\n<\/div>\n<div class=\"lha-step-line\"><\/div>\n<div class=\"lha-step\">\n<div class=\"lha-step-num\">2<\/div>\n<div class=\"lha-step-body\">\n              <strong>Score + Top-K Selection<\/strong>\n<p>Each pyramid entry gets two scalar scores using its per-head \u2113\u2082 norm: one as a query score, one as a key score. A fused chunked-bitonic top-K kernel selects k entries jointly across all pyramid levels. This step is <span class=\"hl\">non-differentiable<\/span> \u2014 indices carry no gradient.<\/p>\n<\/div>\n<\/div>\n<div class=\"lha-step-line\"><\/div>\n<div class=\"lha-step\">\n<div class=\"lha-step-num\">3<\/div>\n<div class=\"lha-step-body\">\n              <strong>Dense Gather + FlashAttention<\/strong>\n<p>Selected (Q, K, V) triples are gathered into a contiguous sub-sequence of length S = N\/p^(L\u22121) + (L\u22121)\u00b7p\u00b7k, then passed to <strong>stock FlashAttention<\/strong>. No custom sparse kernel. 
The gathered sequence has no holes, which is essential because queries are also compressed.<\/p>\n<\/div>\n<\/div>\n<div class=\"lha-step-line\"><\/div>\n<div class=\"lha-step\">\n<div class=\"lha-step-num\">4<\/div>\n<div class=\"lha-step-body\">\n              <strong>Scatter-Back<\/strong>\n<p>Each output entry is scattered back to the p^\u2113 base positions it represents via an integer-atomic scatter kernel. The output is fully dense. Per-position fan-in is bounded by L regardless of k.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 5: Symmetric pooling \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">05 \u00a0\/ Key Design Choice<\/div>\n<h3>Why Symmetric Q\/K\/V Pooling Matters<\/h3>\n<p>Most prior hierarchical methods pool only K and V while leaving Q at full resolution. Lighthouse pools all three. This is not cosmetic \u2014 it changes the math of the attention call.<\/p>\n<div class=\"lha-table-wrap\">\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Query side<\/th>\n<th>Attention cost<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NSA, HISA, InfLLM-v2<\/td>\n<td class=\"td-muted\">Full resolution (N)<\/td>\n<td class=\"td-muted\">O(N \u00b7 S \u00b7 d)<\/td>\n<\/tr>\n<tr>\n<td><strong>Lighthouse<\/strong><\/td>\n<td class=\"td-green\">Pooled (S)<\/td>\n<td class=\"td-green\">O(S\u00b2 \u00b7 d)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>Because S \u226a N at long contexts, <strong>O(S\u00b2\u00b7d) is dramatically cheaper than O(N\u00b7S\u00b7d)<\/strong>. At N = 1,000,000 with L=4, p=4, k=4096, S \u2248 65,000.<\/p>\n<div class=\"lha-info\">\n          <strong>The no-holes guarantee:<\/strong> Compressing queries means every query position must have a gradient path. Lighthouse guarantees no gaps in the gathered sub-sequence, which prevents training instabilities that would arise from tokens with missing gradients. 
Asymmetric methods that leave Q at full resolution don\u2019t face this problem.\n        <\/div>\n<p>At bounded k, setting L = log_p(N\/k) gives total per-layer compute of <span class=\"hl\">\u0398(T\u00b7d)<\/span> \u2014 the same asymptotic class as linear attention and SSMs, but with softmax attention\u2019s recall properties on the selected sub-sequence.<\/p>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 6: Gradient flow \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">06 \u00a0\/ Gradient Flow<\/div>\n<h3>Non-Differentiable Selection, Differentiable Training<\/h3>\n<p>The top-K step is discrete. Lighthouse deliberately <strong>does not<\/strong> approximate it with a straight-through estimator or Gumbel softmax. This is a conscious design choice.<\/p>\n<div class=\"lha-two\">\n<div class=\"lha-card\">\n<h4>What does NOT get gradients<\/h4>\n<p>The selection indices and the scoring function. The \u2113\u2082 norm scorer is never trained \u2014 it has no parameters and receives no gradient signal.<\/p>\n<\/div>\n<div class=\"lha-card\">\n<h4>What DOES get gradients<\/h4>\n<p>Gradients flow through scatter-back \u2192 FlashAttention \u2192 gather into the gathered Q\u0303, K\u0303, V\u0303 and on into <code>W_Q<\/code>, <code>W_K<\/code>, <code>W_V<\/code>.<\/p>\n<\/div>\n<\/div>\n<p>The result: the projection matrices learn to <strong>produce values that are useful when selected<\/strong>, not scores that are good at selecting. 
This avoids the optimization problems \u2014 scorer collapse, scorer\u2013attention misalignment, auxiliary loss tuning \u2014 that learnable selectors in NSA and DSA are prone to.<\/p>\n<div class=\"lha-info\">\n          <strong>Complexity comparison across attention families (per-layer compute at bounded k):<\/strong>\n<p>          Dense softmax: <span class=\"hl\">\u0398(T\u00b2 \u00b7 d)<\/span><br \/>\n          Log-linear attention: \u0398(T log T \u00b7 d)<br \/>\n          Lighthouse (bounded k): <span class=\"hl\">\u0398(T \u00b7 d)<\/span><br \/>\n          Linear attention \/ SSMs: \u0398(T \u00b7 d)\n        <\/p><\/div>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 7: Two-stage recipe \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">07 \u00a0\/ Training Recipe<\/div>\n<h3>Two-Stage Training and Recoverability<\/h3>\n<p>The central claim of Lighthouse is that sparse training does not break the model\u2019s ability to use dense attention at inference. The two-stage recipe is how this is validated.<\/p>\n<div class=\"lha-steps\">\n<div class=\"lha-step\">\n<div class=\"lha-step-num\">1<\/div>\n<div class=\"lha-step-body\">\n              <strong>Stage 1 \u2014 Lighthouse pretraining<\/strong>\n<p>Train for the majority of the step budget with Lighthouse selection active. This is the fast stage: ~2\u00d7 higher throughput than dense SDPA.<\/p>\n<\/div>\n<\/div>\n<div class=\"lha-step-line\"><\/div>\n<div class=\"lha-step\">\n<div class=\"lha-step-num\">2<\/div>\n<div class=\"lha-step-body\">\n              <strong>Stage 2 \u2014 Dense SDPA resumption<\/strong>\n<p>Resume the Stage 1 checkpoint under standard dense SDPA with the <em>same optimizer state and dataloader<\/em>. 
The loss spikes transiently by 1.12\u20131.57 nats, then recovers within ~1,000\u20131,500 SDPA steps and crosses below the dense baseline.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p>Tested at 16,000 total steps (~50.3B tokens) on a 530M Llama-3-style model (dmodel=1024, 30 layers, H=8, head dim 128, FFN 1536, byte-level tokenizer, C4 dataset, 98,304-token context) across three split points:<\/p>\n<div class=\"lha-table-wrap\">\n<table>\n<thead>\n<tr>\n<th>Split<\/th>\n<th>B200\u2013Hrs<\/th>\n<th>Tok\/s (k)<\/th>\n<th>Final Loss<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"td-muted\">Dense SDPA baseline<\/td>\n<td>303.2<\/td>\n<td>45.6<\/td>\n<td>0.7237<\/td>\n<\/tr>\n<tr>\n<td>LH 12k + SDPA 4k<\/td>\n<td>214.7<\/td>\n<td>74.7<\/td>\n<td>0.7102<\/td>\n<\/tr>\n<tr>\n<td>LH 11k + SDPA 5k<\/td>\n<td>219.6<\/td>\n<td>75.4<\/td>\n<td>0.7001<\/td>\n<\/tr>\n<tr>\n<td class=\"td-green\">LH 10k + SDPA 6k<\/td>\n<td class=\"td-green\">228.0<\/td>\n<td class=\"td-green\">75.0<\/td>\n<td class=\"td-green\">0.6980<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>All three Lighthouse runs beat the dense baseline at matched token budgets.<\/p>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 8: Architecture detail \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">08 \u00a0\/ Implementation Detail<\/div>\n<h3>Not All Layers Use Lighthouse<\/h3>\n<p>An important detail for practitioners: in the 30-layer experimental model, <strong>layers {0, 1, 28, 29} retain dense SDPA throughout<\/strong>. Only the remaining 26 layers use Lighthouse. The inner attention call within those Lighthouse layers uses the same cuDNN-backed SDPA kernel as the dense baseline.<\/p>\n<div class=\"lha-info\">\n          This means Lighthouse is a partial replacement, not a full model-wide substitution. 
The first and last layers keeping dense attention is a practical stabilization choice \u2014 these boundary layers often carry disproportionate importance for model behavior.\n        <\/div>\n<p><strong>Optimizer setup:<\/strong> AdamW, lr 2\u00d710\u207b\u00b3, \u03b2\u2081=0.9, \u03b2\u2082=0.95, weight decay 0.1, linear warmup over 2k steps, gradient-norm clip 1, bfloat16, FSDP only.<\/p>\n<p><strong>Chunked-bitonic top-K:<\/strong> The kernel produces a <em>stratified<\/em> top-K, not a strict global top-K. Score stream is partitioned into fixed-size chunks; each chunk maintains an in-register buffer. If the globally highest-scoring entries clustered in one chunk, some are replaced by lower-scoring entries from other chunks \u2014 guaranteeing every region of the sequence contributes tokens and preventing attention from collapsing onto a narrow span.<\/p>\n<pre><code>S = N \/ p^(L-1) + (L-1) * p * k\n\n# Example: N=1M, L=4, p=4, k=4096\n# S = 1,000,000\/64 + 3*4*4096\n# S = 15,625 + 49,152 \u2248 65,000  (vs 1,000,000 for full attention)<\/code><\/pre>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 9: Ablations \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">09 \u00a0\/ Ablations<\/div>\n<h3>What the Hyperparameter Sweep Shows<\/h3>\n<p>The full ablation grid varied scorer type, pooling factor p, pyramid levels L, and top-K budget k. 
All configurations used the 10k+6k split at 98K context.<\/p>\n<div class=\"lha-table-wrap\">\n<table>\n<thead>\n<tr>\n<th>Config<\/th>\n<th>Scorer<\/th>\n<th>B200\u2013Hrs<\/th>\n<th>Tok\/s (k)<\/th>\n<th>Final Loss<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"td-muted\">SDPA baseline<\/td>\n<td class=\"td-muted\">\u2014<\/td>\n<td>303.2<\/td>\n<td>45.6<\/td>\n<td>0.7237<\/td>\n<\/tr>\n<tr>\n<td class=\"td-green\">L=3, p=2, k=1536<\/td>\n<td class=\"td-green\">Dilated<\/td>\n<td class=\"td-green\">203.9<\/td>\n<td>93.9<\/td>\n<td class=\"td-green\">0.6825<\/td>\n<\/tr>\n<tr>\n<td>L=3, p=4, k=1536<\/td>\n<td>Dilated<\/td>\n<td>197.2<\/td>\n<td>99.5<\/td>\n<td>0.6881<\/td>\n<\/tr>\n<tr>\n<td>L=3, p=4, k=1536<\/td>\n<td>Norm<\/td>\n<td>179.6<\/td>\n<td class=\"td-green\">126.0<\/td>\n<td>0.6946<\/td>\n<\/tr>\n<tr>\n<td>L=3, p=2, k=4096<\/td>\n<td>Dilated<\/td>\n<td>215.7<\/td>\n<td>83.5<\/td>\n<td>0.6951<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p><strong>Key findings from the sweep:<\/strong><\/p>\n<div class=\"lha-pills\">\n          <span class=\"lha-pill active\">Smaller k \u2192 better loss (counter-intuitive)<\/span><br \/>\n          <span class=\"lha-pill active\">Shallower L=3 beats L=4, L=5<\/span><br \/>\n          <span class=\"lha-pill active\">Norm scorer: 9% cheaper, similar quality<\/span><br \/>\n          <span class=\"lha-pill active\">Every config beats dense baseline<\/span>\n        <\/div>\n<p>The counter-intuitive finding on k: loss decreases monotonically as k shrinks from 4,096 to 1,536. The authors attribute this to hierarchical selection acting as a regularizer at the 50.3B-token budget. 
Whether this reverses at larger budgets is left to future work.<\/p>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 10: NIAH \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">10 \u00a0\/ Retrieval Evaluation<\/div>\n<h3>Needle-in-a-Haystack Results<\/h3>\n<p>Beyond training loss, the paper evaluates long-context retrieval using a simplified <strong>Needle-in-a-Haystack (NIAH)<\/strong> test: a single passkey digit hidden in random alphanumeric filler at depths of 0\u2013100% across context lengths of 4K\u201396K tokens. Retrieval is scored as a one-token argmax over the ten digit tokens. Random chance is 10%.<\/p>\n<div class=\"lha-table-wrap\">\n<table>\n<thead>\n<tr>\n<th>Configuration<\/th>\n<th>Mean Retrieval Rate<\/th>\n<th>vs Baseline<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"td-muted\">Dense SDPA baseline<\/td>\n<td>0.72<\/td>\n<td class=\"td-muted\">\u2014<\/td>\n<\/tr>\n<tr>\n<td class=\"td-green\">k=2048, Dilated scorer<\/td>\n<td class=\"td-green\">0.76<\/td>\n<td class=\"td-green\">+0.04<\/td>\n<\/tr>\n<tr>\n<td>k=1536, Dilated scorer<\/td>\n<td>0.73<\/td>\n<td class=\"td-green\">+0.01<\/td>\n<\/tr>\n<tr>\n<td>k=2048, Norm scorer<\/td>\n<td>0.72<\/td>\n<td>Matches<\/td>\n<\/tr>\n<tr>\n<td class=\"td-muted\">k=1536, Norm scorer<\/td>\n<td class=\"td-muted\">0.65<\/td>\n<td class=\"td-muted\">\u22120.07<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<div class=\"lha-info\">\n          Three of four Lighthouse configurations match or beat the dense-from-scratch baseline on retrieval. The norm scorer hurts retrieval more than it hurts training loss at the same k. The practical implication: if your downstream task is retrieval-heavy, use a larger k and the dilated scorer. 
If optimizing for loss and throughput, the norm scorer with k=1536 is the better trade-off.\n        <\/div>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 11: Context parallelism \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">11 \u00a0\/ Scaling<\/div>\n<h3>Context Parallelism at 1M Tokens<\/h3>\n<p>For sequences beyond ~100K tokens, the 530M model OOMs on a single B200 regardless of attention method (activations + gradients + optimizer state). Lighthouse extends to multi-GPU context parallelism (CP) cleanly.<\/p>\n<div class=\"lha-steps\">\n<div class=\"lha-step\">\n<div class=\"lha-step-num\">1<\/div>\n<div class=\"lha-step-body\">\n              <strong>Shard-local pre-attention<\/strong>\n<p>Each rank holds a contiguous slice of the sequence. Pyramid pooling, scoring, and top-K all run shard-locally. The coarsest pool window (e.g., 64 tokens) is far smaller than the shard size (N\/W \u2248 128K at N=1M, W=8), so no inter-rank communication is needed at this stage.<\/p>\n<\/div>\n<\/div>\n<div class=\"lha-step-line\"><\/div>\n<div class=\"lha-step\">\n<div class=\"lha-step-num\">2<\/div>\n<div class=\"lha-step-body\">\n              <strong>Standard ring attention<\/strong>\n<p>The gathered sub-sequence is dense, so it participates in standard ring attention with no sparse-aware collectives. KV shards rotate through the ring as in a fully dense long-context run. 
Sparse-index-based methods cannot do this \u2014 ring rotation requires a contiguous tensor, which their sparse outputs are not.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"lha-stats\">\n<div class=\"lha-stat\">\n            <span class=\"val\">~10%<\/span><br \/>\n            <span class=\"lbl\">ring-rotation overhead in CP vs single-device<\/span>\n          <\/div>\n<div class=\"lha-stat\">\n            <span class=\"val\">1M<\/span><br \/>\n            <span class=\"lbl\">token training context achieved<\/span>\n          <\/div>\n<div class=\"lha-stat\">\n            <span class=\"val\">4\u00d78<\/span><br \/>\n            <span class=\"lbl\">nodes \u00d7 GPUs, CP degree 8<\/span>\n          <\/div>\n<\/div>\n<p>The Lighthouse vs. SDPA speedup ratio is fully preserved under matched CP geometry, carrying the advantage cleanly into the 1M-token regime.<\/p>\n<\/div>\n<p>      <!-- \u2500\u2500 SLIDE 12: Limitations + links \u2500\u2500 --><\/p>\n<div class=\"lha-slide\">\n<div class=\"lha-slide-label\">12 \u00a0\/ Limitations &amp; Resources<\/div>\n<h3>Limitations and Open Directions<\/h3>\n<div class=\"lha-warn\">\n          <strong>Key limitation:<\/strong> Symmetric Q\/K\/V pooling presumes all queries co-occur in one forward pass. Autoregressive decoding presents one query at a time \u2014 this violates that assumption. Lighthouse is a training-only method and relies on the dense-SDPA resumption to produce an inference-ready model. The gathered sub-sequence cost is \u0398(S\u00b2\u00b7d): sub-quadratic in N at fixed k, but not strictly linear. 
Regimes where k must scale with N remain uncharacterized.\n        <\/div>\n<p><strong>Open directions from the paper:<\/strong><\/p>\n<div class=\"lha-pills\">\n          <span class=\"lha-pill active\">Asymmetric sparse resumption (DSA \/ NSA \/ MoBA target)<\/span><br \/>\n          <span class=\"lha-pill active\">Per-layer \/ per-head adaptive k<\/span><br \/>\n          <span class=\"lha-pill active\">Vision, audio, video pyramid extensions<\/span><br \/>\n          <span class=\"lha-pill active\">Serving integration (continuous batching, KV-cache)<\/span>\n        <\/div>\n<div class=\"lha-two\">\n<div class=\"lha-card\">\n<h4>Paper<\/h4>\n<p>arXiv:2605.06554<br \/>\u201cLong Context Pre-Training with Lighthouse Attention\u201d<br \/>Peng, Ghosh, Quesnelle \u2014 Nous Research<\/p>\n<\/div>\n<div class=\"lha-card\">\n<h4>Code<\/h4>\n<p>github.com\/ighoshsubho\/<br \/>lighthouse-attention<br \/>Patch on upstream torchtitan + 2 new files<\/p>\n<\/div>\n<\/div>\n<p>Scorer variants: <code>norm<\/code>, <code>dilated<\/code>, <code>gla<\/code> \u2014 selectable from config. 
CP path requires norm scorer.<\/p>\n<\/div>\n<\/div>\n<p><!-- \/track -->\n  <\/p><\/div>\n<p><!-- \/viewport --><\/p>\n<p>  <!-- Nav footer --><\/p>\n<div class=\"lha-nav\">\n    <button class=\"lha-btn\">\u2190 Prev<\/button><br \/>\n    <span class=\"lha-progress\">1 \/ 12<\/span><br \/>\n    <button class=\"lha-btn primary\">Next \u2192<\/button>\n  <\/div>\n<\/div>\n<p><!-- \/lha-guide --><\/p>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Nous Research&#8217;s Lighthouse Attention pools Q, K, and V symmetrically across a multi-level pyramid \u2014 unlike NSA and HISA which only pool K and V \u2014 cutting the attention call from O(N S d) to O(S\u00b2 d) and making the expensive step stock FlashAttention on a small dense sub-sequence.<\/li>\n<li>It&#8217;s a training-only method: a brief dense-SDPA resumption at the end converts the checkpoint into a normal full-attention model that matches or beats dense-from-scratch at the same token budget (final loss 0.6980\u20130.7102 vs. 0.7237 baseline, 16k steps, ~50.3B tokens).<\/li>\n<li>At 512K context on a single B200, Lighthouse is 21\u00d7 faster on the forward pass and 17.3\u00d7 faster on forward+backward vs. 
cuDNN SDPA \u2014 translating to a 1.40\u00d7\u20131.69\u00d7 end-to-end pretraining wall-clock speedup.<\/li>\n<li>The top-K selection step is deliberately non-differentiable \u2014 no straight-through estimator, no Gumbel softmax \u2014 so projection matrices learn to produce values that are useful when selected, not to game a learnable scorer.<\/li>\n<li>Scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) under context parallelism with no changes to the inner attention kernel, because the gathered sub-sequence is dense and participates in standard ring attention.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.06554\" target=\"_blank\" rel=\"noreferrer noopener\">Paper,<\/a><\/strong> <strong><a href=\"https:\/\/github.com\/ighoshsubho\/lighthouse-attention\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Repo<\/a> and <a href=\"https:\/\/nousresearch.com\/lighthouse-attention\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>.\u00a0<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/16\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/\">Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Training large language models on long sequences has a well-known problem: attention is expensive. The scaled dot-product attention (SDPA) at the core of every transformer scales quadratically \u0398(N\u00b2) in both compute and memory with sequence length N. FlashAttention addressed this through IO-aware tiling that avoids materializing the full N\u00d7N attention matrix in high-bandwidth memory, reducing the memory footprint significantly, but the underlying \u0398(N\u00b2) compute scaling remains. Researchers at Nous Research have introduced a new method called Lighthouse Attention that addresses this bottleneck specifically at pretraining time, achieving a 1.40\u00d7 to 1.69\u00d7 end-to-end wall-clock speedup against a cuDNN-backed SDPA baseline, with matching or lower final training loss. The core problem with existing sparse attention methods To understand why Lighthouse works the way it does, it helps to know what existing sparse attention methods do. 
Most prior work, such as NSA, HISA, DSA, and MoBA, makes the same two design decisions. First, these methods pool only the key and value side while leaving queries at full resolution (asymmetric compression). Second, their selection logic lives inside a custom attention kernel, which means teams can\u2019t reuse the optimized dense-attention kernels built around modern GPU tensor cores. There is also a concern specific to training that inference-only sparse methods don\u2019t face. An inference-time sparse method is evaluated only against its dense backbone and is at most as good as that backbone. A training-time sparse method faces a harder test: once training is done, will the resulting weights still produce a competent dense-attention model at inference? Lighthouse treats that question as its central correctness criterion. Lighthouse takes a different approach on both design decisions. It pools queries, keys, and values symmetrically across a multi-level pyramid, and it places selection entirely outside the attention kernel. After selection, the system gathers the chosen entries into a contiguous, dense sub-sequence and runs stock FlashAttention on it, the same kernel used by the dense baseline. https:\/\/arxiv.org\/pdf\/2605.06554 How the four-stage pipeline works A Lighthouse attention layer wraps around, but does not modify, scaled dot-product attention. The pipeline has four stages. In the first stage, average pooling constructs an L-level pyramid from Q, K, and V. With pooling factor p, level \u2113 of the pyramid has N\/p^\u2113 tokens, each summarizing p^\u2113 base positions. Crucially, the same pooling applies to all three projections, producing coherent (Q^(\u2113), K^(\u2113), V^(\u2113)) triples at every level. Total pyramid construction costs \u0398(N) time and memory. 
In the second stage, a parameter-free scorer assigns each pyramid entry two scalar scores using per-head \u2113\u2082 norms: one as a query score (\u2225Q^(\u2113)_i\u2225\u2082) and one as a key score (\u2225K^(\u2113)_i\u2225\u2082). Coarser levels inherit scores from finer ones via max-pooling, so a coarse span picks up the importance of its strongest token. A fused chunked-bitonic top-K kernel then selects k entries jointly across all pyramid levels. One design detail worth noting: the coarsest pyramid level is always retained in full \u2014 it is cheap and guarantees at least one contributor at every base position; the remaining selection budget is spent on finer levels. Additionally, the chunked-bitonic design produces a stratified top-K rather than a strict global top-K: the score stream is partitioned into fixed-size chunks, each maintaining an in-register top-m buffer, so if the k globally highest-scoring entries clustered in one chunk, some would be replaced by lower-scoring entries from other chunks. The result is more balanced attention coverage across the sequence and avoids selection collapse onto a narrow span. The top-K step is discrete and non-differentiable \u2014 no straight-through estimator, no Gumbel softmax. Selection indices carry no gradient. Gradients flow only through the gathered Q, K, V entries into WQ, WK, WV, so the projections learn to produce values that are useful when selected rather than scores that are good at selecting. In the third stage, the selected entries are gathered into a contiguous sub-sequence of length S = N\/p^(L\u22121) + (L\u22121)\u00b7p\u00b7k and passed to standard FlashAttention. At N = 1,000,000 with L = 4, p = 4, k = 4,096, S \u2248 65,000 \u2014 far smaller than N. A critical property of the gathering process is that it guarantees no \u201choles\u201d or empty spaces in the assembled sub-sequence. 
This matters specifically because Lighthouse also compresses queries: a gap in the sequence would mean those missing tokens have no gradient path during the backward pass and could cause training instabilities. Asymmetric methods that leave queries at full resolution don\u2019t face this problem, but Lighthouse\u2019s symmetric design requires that the gathered sub-sequence remain fully dense. In the fourth stage, each output entry is scattered back to the p^\u2113 base positions it represents via a deterministic integer-atomic scatter kernel, with a shift of p^\u2113 \u2212 1 to preserve causality. The per-position fan-in is bounded by L regardless of k. https:\/\/arxiv.org\/pdf\/2605.06554 Why symmetric pooling changes the compute Pooling queries alongside keys and values changes the computational character of the attention call from O(N S d) to O(S\u00b2 d) at training time. Because S \u226a N at long contexts, this is what produces the latency advantage. Benchmarked on a single NVIDIA B200 at 512K context (bfloat16, B=1, H=8, head dimension 128, L=3, p=4, sparsity \u2248 1:64), Lighthouse is 21\u00d7 faster on the forward pass and 17.3\u00d7 faster on the combined forward+backward pass relative to cuDNN-backed SDPA. From an asymptotic standpoint, setting L = log_p(N\/k) gives a gathered sub-sequence size of S = \u0398(k log N), which makes the dense FlashAttention call cost \u0398(k\u00b2 log\u00b2 N d), polylogarithmic in N at fixed k. Combined with the linear-cost stages (pyramid construction, scoring, scatter-back), total per-layer compute is \u0398(N d) at bounded k (the same asymptotic class as linear attention and SSMs) while preserving softmax attention\u2019s recall properties on the selected sub-sequence. Inference is a different constraint. Autoregressive decoding presents one query at a time, which violates the assumption that all queries co-occur in one forward pass. 
Lighthouse is a training-only method, and the symmetric pooling design cannot be used directly at inference. The two-stage training recipe and recoverability The experimental setup used a 530M-parameter<\/p>","protected":false},"author":2,"featured_media":90990,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-90989","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" 
\/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/fr\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/fr\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-17T16:34:53+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u00c9crit par\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Dur\u00e9e de lecture estim\u00e9e\" \/>\n\t<meta name=\"twitter:data2\" content=\"19 minutes\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context\",\"datePublished\":\"2026-05-17T16:34:53+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/\"},\"wordCount\":3808,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lig
hthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/\",\"url\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/\",\"name\":\"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png\",\"datePublished\":\"2026-05-17T16:34:53+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-b
ased-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png\",\"width\":1568,\"height\":796},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"fr-FR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone 
Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/fr\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/fr\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/","og_locale":"fr_FR","og_type":"article","og_title":"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context - 
YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/fr\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-05-17T16:34:53+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u00c9crit par":"admin NU","Dur\u00e9e de lecture estim\u00e9e":"19 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long 
Context","datePublished":"2026-05-17T16:34:53+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/"},"wordCount":3808,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"fr-FR","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/","url":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/","name":"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png","datePublished":"2026-05-17T16:34:53+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#breadcrumb"},"inLanguage":"fr-FR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/"]}]},{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png","width":1568,"height":796},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/nous-research-proposes-lighthouse-attention-a-training-only-selection-based-hierarchical-attention-that-delivers-1-4-1-7x-pretraining-speedup-at-long-context\/#br
eadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4\u20131.7\u00d7 Pretraining Speedup at Long Context"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"fr-FR"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/fr\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png",1568,796,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png",1568,796,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png",1568,796,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r-300x152.png",300,152,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r-1024x520.png",1024,520,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r-1536x780.png",1536,780,true],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r.png",1568,796,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r-18x9.png",18,9,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r-600x305.png",600,305,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-16-at-3.22.13-PM-1-Z3UX6r-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/fr\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/fr\/category\/ai-club\/\" rel=\"category 
tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Training large language models on long sequences has a well-known problem: attention is expensive. The scaled dot-product attention (SDPA) at the core of every transformer scales quadratically \u0398(N\u00b2) in both compute and memory with sequence length N. FlashAttention addressed this through IO-aware tiling that avoids materializing the full N\u00d7N attention matrix in high-bandwidth memory, reducing\u2026","_links":{"self":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/90989","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/comments?post=90989"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/90989\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media\/90990"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media?parent=90989"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/categories?post=90989"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/tags?post=90989"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}