{"id":97585,"date":"2026-06-15T18:03:09","date_gmt":"2026-06-15T18:03:09","guid":{"rendered":"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/"},"modified":"2026-06-15T18:03:09","modified_gmt":"2026-06-15T18:03:09","slug":"meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus","status":"publish","type":"post","link":"https:\/\/youzum.net\/es\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/","title":{"rendered":"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs"},"content":{"rendered":"<p class=\"wp-block-paragraph\">k-means has been an offline tool for decades. You run it once to preprocess data, then move on. A team of researchers from UC Berkeley and UT Austin released Flash-KMeans, a new open-source library that targets a different setting. Modern AI pipelines now call k-means inside training and inference loops. At that frequency, latency per call matters more than theoretical FLOPs.<\/p>\n<p class=\"wp-block-paragraph\">Flash-KMeans is an IO-aware implementation of standard Lloyd\u2019s k-means. It does not change the math, and it does not approximate. It only restructures how the algorithm moves data on a GPU. On an NVIDIA H200, the research team reported up to 17.9\u00d7 end-to-end speedup over the best baseline. Against NVIDIA cuML they report 33\u00d7. Against FAISS they report over 200\u00d7. <\/p>\n<h2 class=\"wp-block-heading\"><strong>What is Flash-KMeans<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Flash-KMeans is a batched k-means library written in Triton GPU kernels. It ships under Apache 2.0 and installs with <code>pip install flash-kmeans<\/code>. <\/p>\n<p class=\"wp-block-paragraph\">The output is mathematically identical to standard Lloyd\u2019s k-means. The speedup comes from kernel-level dataflow, not from skipping work. That separates it from algorithmic methods like triangle-inequality pruning or coreset sampling.<\/p>\n<p class=\"wp-block-paragraph\">A standard Lloyd iteration has two stages. The assignment stage computes each point\u2019s distance to every centroid, then picks the nearest. The update stage averages the points in each cluster to form new centroids. Both stages are simple arithmetic. On GPUs, both are bottlenecked by memory, not compute.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Two Bottlenecks It Attacks<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The <strong>first bottleneck<\/strong> is the assignment stage. Standard code builds a full distance matrix D of shape N\u00d7K in High Bandwidth Memory (HBM). It writes the matrix, then reads it back to run argmin. For N=65536, K=1024, d=128, B=32, the distance math takes 2.6ms. Writing and consuming D takes about 23ms. The matrix is the cost, not the arithmetic.<\/p>\n<p class=\"wp-block-paragraph\">Flash-KMeans replaces this with FlashAssign. The design borrows from FlashAttention. FlashAssign streams tiles of points and centroids from HBM into on-chip SRAM. It fuses distance computation with an online argmin. The full N\u00d7K matrix is never materialized. This cuts the dominant IO complexity from O(NK) to O(Nd + Kd). At the kernel level, FlashAssign reaches up to 21.2\u00d7. In one case it cut assignment from 122.5ms to 5.8ms.<\/p>\n<p class=\"wp-block-paragraph\">The <strong>second bottleneck<\/strong> is the centroid update stage. Standard code uses scatter-style atomic adds. Each thread adds its point into a shared sum buffer keyed by cluster id. Many threads hit the same \u2018hot\u2019 cluster at once. That causes atomic contention and hardware serialization. The research team measured only 50 GB\/s effective bandwidth here on an H200.<\/p>\n<p class=\"wp-block-paragraph\">Flash-KMeans replaces this with Sort-Inverse Update. It sorts the 1D assignment vector by cluster id using argsort. Identical cluster ids then form contiguous segments. Each thread block reduces a segment on-chip, then issues one atomic add per segment. The heavy point matrix is never physically permuted. Atomic operations drop from (O((K+NBN)d))(O((K + frac{N}{B_N})d)) . The kernel reaches up to 6.3\u00d7.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Benchmark<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The research team test it on an H200 with CUDA 12.8, FP16 data, and d=128. They sweep N, K, and batch size B. They compare against four optimized baselines: fast_pytorch_kmeans, fastkmeans, cuML, and FAISS.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Comparison<\/th>\n<th>Reported speedup<\/th>\n<th>Workload context<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>End-to-end vs best baseline<\/td>\n<td>up to 17.9\u00d7<\/td>\n<td>N=8M, K=1024 (large N, small K)<\/td>\n<\/tr>\n<tr>\n<td>vs NVIDIA cuML<\/td>\n<td>33\u00d7<\/td>\n<td>industry library<\/td>\n<\/tr>\n<tr>\n<td>vs FAISS<\/td>\n<td>over 200\u00d7<\/td>\n<td>industry library<\/td>\n<\/tr>\n<tr>\n<td>FlashAssign kernel<\/td>\n<td>up to 21.2\u00d7<\/td>\n<td>N=1M, K=8192 (assignment)<\/td>\n<\/tr>\n<tr>\n<td>Sort-Inverse Update kernel<\/td>\n<td>up to 6.3\u00d7<\/td>\n<td>N=33M, K=4096 (update)<\/td>\n<\/tr>\n<tr>\n<td>Out-of-core, large scale<\/td>\n<td>up to 10.5\u00d7<\/td>\n<td>N=400M, K=16384 vs fastkmeans<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">One failure mode matters for context. Standard PyTorch implementations run out of memory in large-K regimes. They cannot materialize the N\u00d7K matrix. FAISS is the industry-standard library under many production vector-search systems.<\/p>\n<p class=\"wp-block-paragraph\">The library also runs out-of-core. On one billion points (K=32768, d=128), it finishes an iteration in 41.4s, against 261.8s for the baseline. It uses chunked stream overlap to hide PCIe transfer behind compute. A cache-aware compile heuristic also cuts tuning overhead by up to 175\u00d7, within 0.3% of tuned speed.<\/p>\n<h2 class=\"wp-block-heading\"><strong>MTP Interactive Explainer<\/strong><\/h2>\n<p><!-- ===== Flash-KMeans Interactive Demo \u2014 paste into a WordPress \"Custom HTML\" block ===== --><\/p>\n<div>\n<div class=\"fk-head\">\n<div class=\"fk-eyebrow\">Marktechpost \u00b7 Interactive Explainer<\/div>\n<h2 class=\"fk-title\">Flash-KMeans: exact k-means, rebuilt around GPU memory<\/h2>\n<p class=\"fk-sub\">Same Lloyd\u2019s math as standard k-means \u2014 faster only because of dataflow. Run clustering live, watch the update bottleneck, and size the IO it removes.<\/p>\n<\/div>\n<div class=\"fk-stats\">\n<div class=\"fk-stat\"><b>17.9\u00d7<\/b><span>end-to-end vs best baseline<\/span><\/div>\n<div class=\"fk-stat\"><b>33\u00d7<\/b><span>vs NVIDIA cuML<\/span><\/div>\n<div class=\"fk-stat\"><b>200\u00d7+<\/b><span>vs FAISS<\/span><\/div>\n<div class=\"fk-stat\"><b>1B<\/b><span>points, out-of-core<\/span><\/div>\n<\/div>\n<div class=\"fk-tabs\">\n<div class=\"fk-tab on\" data-tab=\"cluster\">1 \u00b7 Live clustering<\/div>\n<div class=\"fk-tab\" data-tab=\"update\">2 \u00b7 Update contention<\/div>\n<div class=\"fk-tab\" data-tab=\"io\">3 \u00b7 IO calculator<\/div>\n<\/div>\n<p>  <!-- PANEL 1: LIVE CLUSTERING --><\/p>\n<div class=\"fk-panel on\" data-panel=\"cluster\">\n<div class=\"fk-row\">\n<div class=\"fk-canvaswrap\"><\/div>\n<div class=\"fk-side\">\n<div class=\"fk-ctl\">\n          <label>Data points (N) <em>800<\/em><\/label><\/div>\n<div class=\"fk-ctl\">\n          <label>Clusters (K) <em>5<\/em><\/label><\/div>\n<div class=\"fk-btns\">\n          <button class=\"fk-b\">Run<\/button><br \/>\n          <button class=\"fk-b ghost\">Step<\/button><br \/>\n          <button class=\"fk-b ghost\">New data<\/button>\n        <\/div>\n<div class=\"fk-read\">\n<div><span class=\"k\">Iteration<\/span><span class=\"v\">0<\/span><\/div>\n<div><span class=\"k\">Centroid shift<\/span><span class=\"v\">\u2014<\/span><\/div>\n<div><span class=\"k\">Status<\/span><span class=\"v\">idle<\/span><\/div>\n<\/div>\n<p class=\"fk-note\">This runs real Lloyd\u2019s k-means in your browser on 2-D points. The algorithm is identical to what Flash-KMeans accelerates \u2014 only the GPU dataflow differs. Each step = one assignment + one centroid update.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p>  <!-- PANEL 2: UPDATE CONTENTION --><\/p>\n<div class=\"fk-panel\" data-panel=\"update\">\n<div class=\"fk-row\">\n<div class=\"fk-canvaswrap\">\n<div class=\"fk-tl\"><\/div>\n<p class=\"fk-note\">Press play. Standard scatter-update serializes when blocks write the same \u201chot\u201d centroid (red stalls). Sort-Inverse Update sorts cluster IDs first, so each block merges contiguous segments with one atomic add \u2014 no conflict.<\/p>\n<div class=\"fk-btns\">\n          <button class=\"fk-b\">Play timeline<\/button><br \/>\n          <button class=\"fk-b ghost\">Reset<\/button>\n        <\/div>\n<\/div>\n<div class=\"fk-side\">\n<div class=\"fk-read\">\n<div><span class=\"k\">Standard atomics<\/span><span class=\"v\">O(N\u00b7d)<\/span><\/div>\n<div><span class=\"k\">Sort-Inverse atomics<\/span><span class=\"v\">O((K+N\/B)\u00b7d)<\/span><\/div>\n<div><span class=\"k\">Measured std bandwidth<\/span><span class=\"v\">50 GB\/s<\/span><\/div>\n<div><span class=\"k\">Kernel speedup<\/span><span class=\"v\">6.3\u00d7<\/span><\/div>\n<\/div>\n<p class=\"fk-note\">Standard updates issue one atomic add <em>per token<\/em>. Many threads hit the same centroid at once, causing contention. Sorting by cluster ID turns scatters into segment-level reductions in on-chip memory.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p>  <!-- PANEL 3: IO CALCULATOR --><\/p>\n<div class=\"fk-panel\" data-panel=\"io\">\n<div class=\"fk-row\">\n<div class=\"fk-canvaswrap\">\n<div class=\"fk-bars\">\n<div class=\"fk-bar\">\n<div class=\"lab\"><span class=\"nm\">Standard <small>\u2014 materialize N\u00d7K matrix, O(NK)<\/small><\/span><span class=\"amt\">\u2014<\/span><\/div>\n<div class=\"fk-track\">\n<div class=\"fk-fill std\"><\/div>\n<\/div><\/div>\n<div class=\"fk-bar\">\n<div class=\"lab\"><span class=\"nm\">FlashAssign <small>\u2014 stream inputs, O(Nd+Kd)<\/small><\/span><span class=\"amt\">\u2014<\/span><\/div>\n<div class=\"fk-track\">\n<div class=\"fk-fill fk\"><\/div>\n<\/div><\/div>\n<\/div>\n<div class=\"fk-bigratio\"><b>\u2014<\/b><span>less HBM traffic for the assignment step (theoretical)<\/span><\/div>\n<\/div>\n<div class=\"fk-side\">\n<div class=\"fk-ctl\">\n          <label>Points N <em>1M<\/em><\/label><\/div>\n<div class=\"fk-ctl\">\n          <label>Clusters K <em>1024<\/em><\/label><\/div>\n<div class=\"fk-ctl\">\n          <label>Dimension d <em>128<\/em><\/label><\/div>\n<p class=\"fk-note\">Standard k-means writes then reads a full N\u00d7K distance matrix in HBM. FlashAssign never builds it \u2014 it reads X and C once and writes assignments once. Bars show relative HBM round-trips, FP16.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"fk-foot\">\n    <span class=\"br\">\u00a9 Marktechpost<\/span><br \/>\n    <span class=\"src\">Speedups: Flash-KMeans paper (arXiv:2603.09229), NVIDIA H200. Demo runs in-browser for illustration \u00b7 <a href=\"https:\/\/github.com\/svg-project\/flash-kmeans\" target=\"_blank\" rel=\"noopener\">github.com\/svg-project\/flash-kmeans<\/a><\/span>\n  <\/div>\n<\/div>\n<p><!-- ===== end Flash-KMeans demo ===== --><\/p>\n<p class=\"wp-block-paragraph\">\n<h2 class=\"wp-block-heading\"><strong>Use Cases<\/strong><\/h2>\n<\/p><p class=\"wp-block-paragraph\">Faster exact k-means changes what you can run online, not just offline.<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Vector search indexing<\/strong>: FAISS builds its search indices with k-means. Faster k-means lets you re-index as data shifts, instead of rebuilding overnight.<\/li>\n<li><strong>Sparse attention routing<\/strong>: Routing Transformers and Tactic cluster tokens to route attention. Millisecond k-means makes this viable inside the inference loop.<\/li>\n<li><strong>KV-cache compression<\/strong>: ClusterKV clusters tokens in semantic space to compress the cache. Cheaper clustering makes per-layer, per-step compression practical.<\/li>\n<li><strong>Low-bit KV quantization<\/strong>: Recent methods cluster KV entries into codebooks, repeatedly. Faster clustering shrinks that preprocessing cost.<\/li>\n<li><strong>Diffusion Transformers<\/strong>: Sparse VideoGen2 calls batched k-means during forward passes. It permutes tokens by semantic similarity to exploit sparsity.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Using It<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The API mirrors faiss and sklearn. The call below clusters a batched (B, N, d) tensor.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">import torch\nfrom flash_kmeans import batch_kmeans_Euclid\n\nx = torch.randn(32, 75600, 128, device=\"cuda\", dtype=torch.float16)\ncluster_ids, centers, _ = batch_kmeans_Euclid(\n    x, n_clusters=1000, tol=1e-4, verbose=True\n)<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">A scikit-learn-style interface is also available.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">from flash_kmeans import FlashKMeans\n\nkm = FlashKMeans(d=128, k=8192, niter=100)\nlabels = km.fit_predict(large_cpu_tensor)  # device=None uses all visible GPUs<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">The kernel auto-dispatches by shape and dtype. A small-D path handles d\u2264512. A split-D path handles larger d without materializing the distance matrix. Multi-GPU runs trigger automatically for large-N data held in CPU memory.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li><strong>Flash-KMeans is exact, not approximate<\/strong> \u2014 same Lloyd\u2019s math, sped up purely by GPU dataflow.<\/li>\n<li><strong>FlashAssign<\/strong> fuses distance + online argmin, cutting assignment IO from O(NK) to O(Nd+Kd) \u2014 up to 21.2\u00d7.<\/li>\n<li><strong>Sort-Inverse Update<\/strong> sorts cluster IDs into segments, replacing scatter atomics \u2014 up to 6.3\u00d7.<\/li>\n<li>Reports up to <strong>17.9\u00d7 end-to-end<\/strong>, 33\u00d7 over cuML, and over 200\u00d7 over FAISS on an H200.<\/li>\n<li>Scales <strong>out-of-core to one billion points<\/strong> and cuts tuning overhead up to 175\u00d7.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/svg-project\/flash-kmeans\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a> <\/strong>and<strong> <a href=\"https:\/\/github.com\/svg-project\/flash-kmeans\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">150k+ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n<p class=\"wp-block-paragraph\">Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.?\u00a0<strong><a href=\"https:\/\/forms.gle\/wbash1wF6efRj8G58\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Connect with us<\/mark><\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/06\/15\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/\">Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>k-means has been an offline tool for decades. You run it once to preprocess data, then move on. A team of researchers from UC Berkeley and UT Austin released Flash-KMeans, a new open-source library that targets a different setting. Modern AI pipelines now call k-means inside training and inference loops. At that frequency, latency per call matters more than theoretical FLOPs. Flash-KMeans is an IO-aware implementation of standard Lloyd\u2019s k-means. It does not change the math, and it does not approximate. It only restructures how the algorithm moves data on a GPU. On an NVIDIA H200, the research team reported up to 17.9\u00d7 end-to-end speedup over the best baseline. Against NVIDIA cuML they report 33\u00d7. Against FAISS they report over 200\u00d7. What is Flash-KMeans Flash-KMeans is a batched k-means library written in Triton GPU kernels. It ships under Apache 2.0 and installs with pip install flash-kmeans. The output is mathematically identical to standard Lloyd\u2019s k-means. The speedup comes from kernel-level dataflow, not from skipping work. That separates it from algorithmic methods like triangle-inequality pruning or coreset sampling. A standard Lloyd iteration has two stages. The assignment stage computes each point\u2019s distance to every centroid, then picks the nearest. The update stage averages the points in each cluster to form new centroids. Both stages are simple arithmetic. On GPUs, both are bottlenecked by memory, not compute. The Two Bottlenecks It Attacks The first bottleneck is the assignment stage. Standard code builds a full distance matrix D of shape N\u00d7K in High Bandwidth Memory (HBM). It writes the matrix, then reads it back to run argmin. For N=65536, K=1024, d=128, B=32, the distance math takes 2.6ms. Writing and consuming D takes about 23ms. The matrix is the cost, not the arithmetic. Flash-KMeans replaces this with FlashAssign. The design borrows from FlashAttention. FlashAssign streams tiles of points and centroids from HBM into on-chip SRAM. It fuses distance computation with an online argmin. The full N\u00d7K matrix is never materialized. This cuts the dominant IO complexity from O(NK) to O(Nd + Kd). At the kernel level, FlashAssign reaches up to 21.2\u00d7. In one case it cut assignment from 122.5ms to 5.8ms. The second bottleneck is the centroid update stage. Standard code uses scatter-style atomic adds. Each thread adds its point into a shared sum buffer keyed by cluster id. Many threads hit the same \u2018hot\u2019 cluster at once. That causes atomic contention and hardware serialization. The research team measured only 50 GB\/s effective bandwidth here on an H200. Flash-KMeans replaces this with Sort-Inverse Update. It sorts the 1D assignment vector by cluster id using argsort. Identical cluster ids then form contiguous segments. Each thread block reduces a segment on-chip, then issues one atomic add per segment. The heavy point matrix is never physically permuted. Atomic operations drop from (O((K+NBN)d))(O((K + frac{N}{B_N})d)) . The kernel reaches up to 6.3\u00d7. Benchmark The research team test it on an H200 with CUDA 12.8, FP16 data, and d=128. They sweep N, K, and batch size B. They compare against four optimized baselines: fast_pytorch_kmeans, fastkmeans, cuML, and FAISS. Comparison Reported speedup Workload context End-to-end vs best baseline up to 17.9\u00d7 N=8M, K=1024 (large N, small K) vs NVIDIA cuML 33\u00d7 industry library vs FAISS over 200\u00d7 industry library FlashAssign kernel up to 21.2\u00d7 N=1M, K=8192 (assignment) Sort-Inverse Update kernel up to 6.3\u00d7 N=33M, K=4096 (update) Out-of-core, large scale up to 10.5\u00d7 N=400M, K=16384 vs fastkmeans One failure mode matters for context. Standard PyTorch implementations run out of memory in large-K regimes. They cannot materialize the N\u00d7K matrix. FAISS is the industry-standard library under many production vector-search systems. The library also runs out-of-core. On one billion points (K=32768, d=128), it finishes an iteration in 41.4s, against 261.8s for the baseline. It uses chunked stream overlap to hide PCIe transfer behind compute. A cache-aware compile heuristic also cuts tuning overhead by up to 175\u00d7, within 0.3% of tuned speed. MTP Interactive Explainer Marktechpost \u00b7 Interactive Explainer Flash-KMeans: exact k-means, rebuilt around GPU memory Same Lloyd\u2019s math as standard k-means \u2014 faster only because of dataflow. Run clustering live, watch the update bottleneck, and size the IO it removes. 17.9\u00d7end-to-end vs best baseline 33\u00d7vs NVIDIA cuML 200\u00d7+vs FAISS 1Bpoints, out-of-core 1 \u00b7 Live clustering 2 \u00b7 Update contention 3 \u00b7 IO calculator Data points (N) 800 Clusters (K) 5 Run Step New data Iteration0 Centroid shift\u2014 Statusidle This runs real Lloyd\u2019s k-means in your browser on 2-D points. The algorithm is identical to what Flash-KMeans accelerates \u2014 only the GPU dataflow differs. Each step = one assignment + one centroid update. Press play. Standard scatter-update serializes when blocks write the same \u201chot\u201d centroid (red stalls). Sort-Inverse Update sorts cluster IDs first, so each block merges contiguous segments with one atomic add \u2014 no conflict. Play timeline Reset Standard atomicsO(N\u00b7d) Sort-Inverse atomicsO((K+N\/B)\u00b7d) Measured std bandwidth50 GB\/s Kernel speedup6.3\u00d7 Standard updates issue one atomic add per token. Many threads hit the same centroid at once, causing contention. Sorting by cluster ID turns scatters into segment-level reductions in on-chip memory. Standard \u2014 materialize N\u00d7K matrix, O(NK)\u2014 FlashAssign \u2014 stream inputs, O(Nd+Kd)\u2014 \u2014less HBM traffic for the assignment step (theoretical) Points N 1M Clusters K 1024 Dimension d 128 Standard k-means writes then reads a full N\u00d7K distance matrix in HBM. FlashAssign never builds it \u2014 it reads X and C once and writes assignments once. Bars show relative HBM round-trips, FP16. \u00a9 Marktechpost Speedups: Flash-KMeans paper (arXiv:2603.09229), NVIDIA H200. Demo runs in-browser for illustration \u00b7 github.com\/svg-project\/flash-kmeans Use Cases Faster exact k-means changes what you can run online, not just offline. Vector search indexing: FAISS builds its search indices with k-means. Faster k-means lets you re-index as data shifts, instead of rebuilding overnight. Sparse attention routing: Routing Transformers and Tactic cluster tokens to route attention. Millisecond k-means makes this viable inside the inference loop. KV-cache compression: ClusterKV clusters tokens in semantic space to compress the cache. Cheaper clustering makes per-layer, per-step compression practical. Low-bit KV quantization: Recent methods cluster KV entries into codebooks, repeatedly. Faster<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-97585","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/es\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/\" \/>\n<meta property=\"og:locale\" content=\"es_ES\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/es\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-15T18:03:09+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Escrito por\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Tiempo de lectura\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutos\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs\",\"datePublished\":\"2026-06-15T18:03:09+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/\"},\"wordCount\":1268,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"es\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/\",\"url\":\"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/\",\"name\":\"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"datePublished\":\"2026-06-15T18:03:09+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/#breadcrumb\"},\"inLanguage\":\"es\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"es\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"es\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"es\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/es\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/es\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/","og_locale":"es_ES","og_type":"article","og_title":"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/es\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-06-15T18:03:09+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Escrito por":"admin NU","Tiempo de lectura":"7 minutos"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs","datePublished":"2026-06-15T18:03:09+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/"},"wordCount":1268,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"es","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/","url":"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/","name":"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs - YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"datePublished":"2026-06-15T18:03:09+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/#breadcrumb"},"inLanguage":"es","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/meet-flash-kmeans-an-io-aware-exact-k-means-that-runs-over-200x-faster-than-faiss-on-gpus\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Meet Flash-KMeans: An IO-Aware, Exact K-Means That Runs Over 200\u00d7 Faster Than FAISS on GPUs"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"es"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"es","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"es","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/es\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/es\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/es\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"k-means has been an offline tool for decades. You run it once to preprocess data, then move on. A team of researchers from UC Berkeley and UT Austin released Flash-KMeans, a new open-source library that targets a different setting. Modern AI pipelines now call k-means inside training and inference loops. At that frequency, latency per&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/97585","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/comments?post=97585"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/97585\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/media?parent=97585"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/categories?post=97585"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/tags?post=97585"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}