{"id":98293,"date":"2026-06-18T18:08:56","date_gmt":"2026-06-18T18:08:56","guid":{"rendered":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/"},"modified":"2026-06-18T18:08:56","modified_gmt":"2026-06-18T18:08:56","slug":"the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache","status":"publish","type":"post","link":"https:\/\/youzum.net\/de\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/","title":{"rendered":"The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Long-context large language models (LLMs) face a memory bottleneck that has nothing to do with model weights. During decoding, transformers cache the key and value (KV) vectors for every token at every layer so they don\u2019t have to recompute attention. This cache grows linearly with sequence length and batch size, and at long context with high concurrency it can dwarf the model\u2019s own footprint.<\/p>\n<p class=\"wp-block-paragraph\">Consider Llama-3.1-70B in BF16. Its KV cache costs about 0.31 MB per token (80 layers \u00d7 8 KV heads \u00d7 128 head-dim \u00d7 2 tensors \u00d7 2 bytes). At 128K tokens that is ~40 GB; at 1M tokens it exceeds 300 GB \u2014 more than the 140 GB of weights themselves. Worse, every newly decoded token has to stream the entire cache out of high-bandwidth memory (HBM), which makes decoding memory-bandwidth-bound rather than compute-bound. Shrinking the KV cache is therefore the most direct lever for cutting both cost and decode latency.<\/p>\n<p class=\"wp-block-paragraph\">Current approaches fall into roughly five families: <strong>token eviction<\/strong> (H2O, SnapKV), <strong>quantization<\/strong> (KIVI, GEAR), <strong>low-rank projection<\/strong> (Palu), <strong>merging<\/strong> (KVMerger), and <strong>architectural sharing<\/strong> (MLA). Recent 2026 work has pushed hard on the ultra-low-bit quantization frontier. 
Google and NYU\u2019s <a href=\"https:\/\/arxiv.org\/abs\/2504.19874\">TurboQuant<\/a> (ICLR 2026) and Together AI\u2019s <a href=\"https:\/\/arxiv.org\/abs\/2605.17757\">OSCAR<\/a> attack the same problem from opposite directions, while Apple\u2019s <a href=\"https:\/\/arxiv.org\/abs\/2509.17396\">EpiCache<\/a> tackles a problem neither one addresses.<\/p>\n<p class=\"wp-block-paragraph\">Most KV quantizers are fighting the same underlying enemy: <strong>outlier channels<\/strong> \u2014 a handful of channels with disproportionately large magnitudes that dominate the quantization range and squeeze the rest of the signal into just a few representable levels. This is why naive INT2 quantization (only four levels) collapses to near-zero accuracy.<\/p>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/arxiv.org\/abs\/2402.02750\">KIVI<\/a> established the standard baseline here. It showed that key vectors have fixed outlier channels across tokens while value vectors do not, so it quantizes keys <em>per-channel<\/em> and values <em>per-token<\/em>. 
That tuning-free 2-bit recipe cuts end-to-end peak memory (weights included) by about 2.6\u00d7, and it is the reference point the newer methods build on.<\/p>\n<h2 class=\"wp-block-heading\"><strong>TurboQuant: data-oblivious and provably near-optimal<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">TurboQuant handles outliers without ever looking at your data, in two stages:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Stage one: <\/strong>each vector is randomly rotated so its coordinates become nearly independent and approximately Gaussian, which lets an optimal precomputed scalar (Lloyd\u2013Max) quantizer be applied per coordinate.<\/li>\n<li><strong>Stage two: <\/strong>a 1-bit Quantized Johnson\u2013Lindenstrauss (QJL) transform is applied to the residual, giving a provably unbiased estimate of attention logits with no normalization-constant overhead.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The selling point is theoretical: TurboQuant\u2019s distortion is provably within a small constant factor (\u2248 2.7\u00d7) of the information-theoretic lower bound. In practice it reaches essentially full-precision recall on Needle-in-a-Haystack at 4\u00d7 compression, and the paper reports quality neutrality at 3.5 bits per channel and only marginal degradation at 2.5 bits per channel. Because it needs no calibration, it can be applied to any model without modification, and it doubles as a fast vector-database quantizer.<\/p>\n<p class=\"wp-block-paragraph\">One caveat worth flagging: the widely repeated \u201c8\u00d7 faster attention on H100\u201d figure comes from <a href=\"https:\/\/research.google\/blog\/turboquant-redefining-ai-efficiency-with-extreme-compression\/\">Google\u2019s blog<\/a>, not the paper, and refers to a narrow attention-logit microbenchmark. 
TurboQuant\u2019s documented sweet spot is the 3\u20134 bit near-lossless regime.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"2048\" height=\"1247\" data-attachment-id=\"80596\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/06\/18\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/image-525\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/image-3.jpeg\" data-orig-size=\"2048,1247\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/image-3-1024x624.jpeg\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/image-3.jpeg\" alt=\"\" class=\"wp-image-80596\" \/><figcaption class=\"wp-element-caption\"><em>Image source: <\/em>Data from the TurboQuant paper \u2013 <a href=\"https:\/\/arxiv.org\/abs\/2504.19874\">https:\/\/arxiv.org\/abs\/2504.19874<\/a><\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>OSCAR: attention-aware and deployment-ready<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">OSCAR bets the opposite way. Its premise is that at INT2\u2019s four levels, a data-oblivious rotation is the wrong tool \u2014 blindly smoothing ranges isn\u2019t enough when there\u2019s almost no precision to spare. So OSCAR computes an <em>attention-aware<\/em> rotation from a one-time offline calibration pass: keys are rotated into the eigenbasis of the query covariance, values into the score-weighted value covariance. 
A Hadamard transform plus a bit-reversal permutation then spreads channel importance evenly across the quantization groups.<\/p>\n<p class=\"wp-block-paragraph\">What sets OSCAR apart is that it ships as a complete system, not just an algorithm:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Mixed-precision paged cache: <\/strong>sink and recent tokens stay in BF16 while the history compresses to INT2 \u2014 at 128K context only ~0.24% of tokens remain in BF16.<\/li>\n<li><strong>Fused Triton kernels<\/strong> with full SGLang integration (paged-attention and prefix-cache compatible).<\/li>\n<li><a href=\"https:\/\/github.com\/FutureMLS-Lab\/OSCAR\">Precomputed rotations<\/a> (a \u201cRotationZoo\u201d) for Qwen3-4B\/8B\/32B, GLM-4.7-FP8, and MiniMax-M2.7 \u2014 no recalibration needed.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">At an effective 2.28 bits, OSCAR lands within 1.42 points of BF16 on Qwen3-8B and is essentially on par on Qwen3-32B (a 0.02-point gap). On GLM-4.7-FP8 \u2014 where naive INT2 collapses to zero and data-oblivious baselines reach only low single digits \u2014 OSCAR matches BF16 and even edges slightly ahead on the reported benchmarks (within noise). 
Together AI reports up to 7.83\u00d7 job-level throughput and roughly 8\u00d7 KV-cache memory reduction at 100K context, with up to ~3\u00d7 faster decoding.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"2048\" height=\"1291\" data-attachment-id=\"80597\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/06\/18\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/image-526\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/image-4.jpeg\" data-orig-size=\"2048,1291\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/image-4-1024x646.jpeg\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/image-4.jpeg\" alt=\"\" class=\"wp-image-80597\" \/><figcaption class=\"wp-element-caption\">Image source: Data from the OSCAR paper \u2013 <a href=\"https:\/\/arxiv.org\/abs\/2605.17757\">https:\/\/arxiv.org\/abs\/2605.17757<\/a><\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>So which one wins?<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Neither \u2014 and that\u2019s the honest answer. For <strong>deployable INT2 at 128K tokens on supported models<\/strong>, OSCAR is currently the only demonstrated option that doesn\u2019t collapse, and it comes with production-ready SGLang support. 
For <strong>training-free, model-agnostic quantization in the 3\u20134 bit regime<\/strong>, TurboQuant offers far broader generality.<\/p>\n<p class=\"wp-block-paragraph\">OSCAR\u2019s paper reports that TurboQuant drops by more than 40 points at a comparable budget \u2014 but that evaluation runs inside OSCAR\u2019s own framework, quantizes all layers, uses a single random seed, and operates well below TurboQuant\u2019s intended bit-width, so it\u2019s a weak basis for a head-to-head verdict. The more interesting possibility is that the two are <strong>complementary<\/strong>: pairing a calibration-aware rotation with an optimal scalar quantizer is a promising combination nobody has shipped yet. (Both teams have publicly noted the same idea.)<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"2048\" height=\"1279\" data-attachment-id=\"80599\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/06\/18\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/image-528\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/image-6.jpeg\" data-orig-size=\"2048,1279\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/image-6-1024x640.jpeg\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/image-6.jpeg\" alt=\"\" class=\"wp-image-80599\" \/><figcaption class=\"wp-element-caption\">Image source: Data from the OSCAR paper \u2013 <a href=\"https:\/\/arxiv.org\/abs\/2605.17757\">https:\/\/arxiv.org\/abs\/2605.17757<\/a><\/figcaption><\/figure>\n<\/div>\n<h2 
class=\"wp-block-heading\"><strong>The third axis: EpiCache<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">TurboQuant and OSCAR are both built for a single long context. Neither handles <strong>extended multi-turn conversations<\/strong>, where history piles up across many exchanges. Apple\u2019s <a href=\"https:\/\/github.com\/apple\/ml-epicache\">EpiCache<\/a> is a training-free KV-cache management framework aimed exactly at that gap:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Block-wise prefill<\/strong> processes history in blocks to keep peak memory bounded.<\/li>\n<li><strong>Episodic clustering<\/strong> segments the conversation into coherent semantic \u201cepisodes,\u201d each with its own compressed cache.<\/li>\n<li><strong>Episode-matched retrieval<\/strong> routes each query to the most relevant episode at inference time.<\/li>\n<li><strong>Adaptive layer-wise budget allocation<\/strong> measures each layer\u2019s sensitivity to eviction and distributes the memory budget accordingly.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Across LongMemEval, RealTalk, and LoCoMo, EpiCache reports up to 40% higher accuracy than eviction baselines, near-full-cache accuracy at 4\u20136\u00d7 compression, and up to 3.5\u00d7 lower peak memory (and ~2.4\u00d7 lower latency). 
Because it decides <em>which<\/em> tokens to keep rather than <em>how precisely<\/em> to store them, it composes directly with OSCAR or TurboQuant for compounding savings.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li><strong>TurboQuant<\/strong> pushes the theoretical, model-agnostic frontier and is the go-to for 3\u20134 bit near-lossless compression on any model.<\/li>\n<li><strong>OSCAR<\/strong> leads on deployable INT2, with up to 7.83\u00d7 throughput and ~8\u00d7 memory reduction at 100K context on supported models.<\/li>\n<li><strong>EpiCache<\/strong> targets conversational memory across turns, reporting up to 40% higher accuracy than eviction baselines and up to 3.5\u00d7 lower peak memory, and it composes with either quantizer.<\/li>\n<li><strong>Pick by constraint: <\/strong>bit-width budget, model portability, or conversation length, then combine the orthogonal methods that fit. These approaches are more complementary than competitive.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h3 class=\"wp-block-heading\"><strong>Sources<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/arxiv.org\/abs\/2504.19874\">TurboQuant (arXiv 2504.19874)<\/a><\/li>\n<li><a href=\"https:\/\/research.google\/blog\/turboquant-redefining-ai-efficiency-with-extreme-compression\/\">TurboQuant \u2014 Google Research blog<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2605.17757\">OSCAR (arXiv 2605.17757)<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/FutureMLS-Lab\/OSCAR\">OSCAR code \u2014 FutureMLS-Lab\/OSCAR<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2509.17396\">EpiCache (arXiv 2509.17396)<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/apple\/ml-epicache\">EpiCache code \u2014 apple\/ml-epicache<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2402.02750\">KIVI (arXiv 2402.02750)<\/a><\/li>\n<\/ul>\n<p>The post <a 
href=\"https:\/\/www.marktechpost.com\/2026\/06\/18\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/\">The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Long-context large language models (LLMs) face a memory bottleneck that has nothing to do with model weights. During decoding, transformers cache the key and value (KV) vectors for every token at every layer so they don\u2019t have to recompute attention. This cache grows linearly with sequence length and batch size, and at long context with high concurrency it can dwarf the model\u2019s own footprint. Consider Llama-3.1-70B in BF16. Its KV cache costs about 0.31 MB per token (80 layers \u00d7 8 KV heads \u00d7 128 head-dim \u00d7 2 tensors \u00d7 2 bytes). At 128K tokens that is ~40 GB; at 1M tokens it exceeds 300 GB \u2014 more than the 140 GB of weights themselves. Worse, every newly decoded token has to stream the entire cache out of high-bandwidth memory (HBM), which makes decoding memory-bandwidth-bound rather than compute-bound. Shrinking the KV cache is therefore the most direct lever for cutting both cost and decode latency. Current approaches fall into roughly five families: token eviction (H2O, SnapKV), quantization (KIVI, GEAR), low-rank projection (Palu), merging (KVMerger), and architectural sharing (MLA). Recent 2026 work has pushed hard on the ultra-low-bit quantization frontier. Google and NYU\u2019s TurboQuant (ICLR 2026) and Together AI\u2019s OSCAR attack the same problem from opposite directions, while Apple\u2019s EpiCache tackles a problem neither one addresses. Most KV quantizers are fighting the same underlying enemy: outlier channels \u2014 a handful of channels with disproportionately large magnitudes that dominate the quantization range and squeeze the rest of the signal into just a few representable levels. 
This is why naive INT2 quantization (only four levels) collapses to near-zero accuracy. KIVI established the standard baseline here. It showed that key vectors have fixed outlier channels across tokens while value vectors do not, so it quantizes keys per-channel and values per-token. That tuning-free 2-bit recipe cuts end-to-end peak memory (weights included) by about 2.6\u00d7, and it is the reference point the newer methods build on. TurboQuant: data-oblivious and theoretically optimal TurboQuant handles outliers without ever looking at your data, in two stages: Stage one: each vector is randomly rotated so its coordinates become nearly independent and approximately Gaussian, which lets an optimal precomputed scalar (Lloyd\u2013Max) quantizer be applied per coordinate. Stage two: a 1-bit Quantized Johnson\u2013Lindenstrauss (QJL) transform is applied to the residual, giving a provably unbiased estimate of attention logits with no normalization-constant overhead. The selling point is theoretical: TurboQuant\u2019s distortion is provably within a small constant factor (\u2248 2.7\u00d7) of the information-theoretic lower bound. In practice it reaches essentially full-precision recall on Needle-in-a-Haystack at 4\u00d7 compression, and the paper reports absolute quality neutrality at 3.5 bits and only marginal degradation at 2.5 bits per channel. Because it needs no calibration, it works on any model untouched and doubles as a fast vector-database quantizer. One caveat worth flagging: the widely repeated \u201c8\u00d7 faster attention on H100\u201d figure comes from Google\u2019s blog, not the paper, and refers to a narrow attention-logit microbenchmark. TurboQuant\u2019s documented sweet spot is the 3\u20134 bit near-lossless regime. Image source: Data from the TurboQuant paper \u2013 https:\/\/arxiv.org\/abs\/2504.19874 OSCAR: attention-aware and deployment-ready OSCAR bets the opposite way. 
Its premise is that at INT2\u2019s four levels, a data-oblivious rotation is the wrong tool \u2014 blindly smoothing ranges isn\u2019t enough when there\u2019s almost no precision to spare. So OSCAR computes an attention-aware rotation from a one-time offline calibration pass: keys are rotated into the eigenbasis of the query covariance, values into the score-weighted value covariance. A Hadamard transform plus a bit-reversal permutation then spread channel importance evenly across the quantization groups. What sets OSCAR apart is that it ships as a complete system, not just an algorithm: Mixed-precision paged cache: sink and recent tokens stay in BF16 while the history compresses to INT2 \u2014 at 128K context only ~0.24% of tokens remain in BF16. Fused Triton kernels with full SGLang integration (paged-attention and prefix-cache compatible). Precomputed rotations (a \u201cRotationZoo\u201d) for Qwen3-4B\/8B\/32B, GLM-4.7-FP8, and MiniMax-M2.7 \u2014 no recalibration needed. At an effective 2.28 bits, OSCAR lands within 1.42 points of BF16 on Qwen3-8B and is essentially on par on Qwen3-32B (a 0.02-point gap). On GLM-4.7-FP8 \u2014 where naive INT2 collapses to zero and data-oblivious baselines reach only low single digits \u2014 OSCAR matches BF16 and even edges slightly ahead on the reported benchmarks (within noise). Together AI reports up to 7.83\u00d7 job-level throughput and roughly 8\u00d7 KV-cache memory reduction at 100K context, with up to ~3\u00d7 faster decoding. Image Source- Data from the OSCAR paper: https:\/\/arxiv.org\/abs\/2605.17757 So which one wins? Neither \u2014 and that\u2019s the honest answer. For deployable INT2 at 128K tokens on supported models, OSCAR is currently the only demonstrated option that doesn\u2019t collapse, and it comes with production-ready SGLang support. For training-free, model-agnostic quantization in the 3\u20134 bit regime, TurboQuant offers far broader generality. 
OSCAR\u2019s paper reports that TurboQuant drops by more than 40 points at a comparable budget \u2014 but that evaluation runs inside OSCAR\u2019s own framework, quantizes all layers, uses a single random seed, and operates well below TurboQuant\u2019s intended bit-width, so it\u2019s a weak basis for a head-to-head verdict. The more interesting possibility is that the two are complementary: pairing a calibration-aware rotation with an optimal scalar quantizer is a promising combination nobody has shipped yet. (Both teams have publicly noted the same idea.) Image source: Data from the OSCAR paper- https:\/\/arxiv.org\/abs\/2605.17757 The third axis: EpiCache TurboQuant and OSCAR are both built for a single long context. Neither handles extended multi-turn conversations, where history piles up across many exchanges. Apple\u2019s EpiCache is a training-free KV-cache management framework aimed exactly at that gap: Block-wise prefill processes history in blocks to keep peak memory bounded. Episodic clustering segments the conversation into coherent semantic \u201cepisodes,\u201d each with its own compressed cache. Episode-matched retrieval routes each query to the most relevant episode at inference time. Adaptive layer-wise budget allocation measures each layer\u2019s sensitivity to eviction and distributes the memory budget accordingly. 
Across LongMemEval, RealTalk, and LoCoMo, EpiCache reports up to 40% higher accuracy than eviction baselines, near-full-cache accuracy at 4\u20136\u00d7 compression, and<\/p>","protected":false},"author":2,"featured_media":98294,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-98293","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, 
max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/de\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/de\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-18T18:08:56+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"6\u00a0Minuten\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"The KV Cache Compression Race: TurboQuant vs OSCAR vs 
EpiCache\",\"datePublished\":\"2026-06-18T18:08:56+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/\"},\"wordCount\":1158,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/\",\"url\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/\",\"name\":\"The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg\",\"datePublished\":\"2026-06-18T18:08:56+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg\",\"width\":2048,\"height\":1247},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"The KV Cache Compression Race: TurboQuant vs OSCAR vs 
EpiCache\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/de\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/de\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/","og_locale":"de_DE","og_type":"article","og_title":"The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/de\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-06-18T18:08:56+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Verfasst von":"admin NU","Gesch\u00e4tzte Lesezeit":"6\u00a0Minuten"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"The KV Cache Compression Race: TurboQuant vs OSCAR vs 
EpiCache","datePublished":"2026-06-18T18:08:56+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/"},"wordCount":1158,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/","url":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/","name":"The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg","datePublished":"2026-06-18T18:08:56+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg","width":2048,"height":1247},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/the-kv-cache-compression-race-turboquant-vs-oscar-vs-epicache\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"The KV Cache Compression Race: TurboQuant vs OSCAR vs 
EpiCache"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/de\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg",2048,1247,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg",2048,1247,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg",2048,1247,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD-150x150.jpg",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD-300x183.jpg",300,183,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD-1024x624.jpg",1024,624,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD-1536x935.jpg",1536,935,true],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD.jpg",2048,1247,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD-18x12.jpg",18,12,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD-300x300.jpg",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD-600x365.jpg",600,365,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/image-3-owtqLD-100x100.jpg",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/de\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/de\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/de\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/de\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/de\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Long-context large language models (LLMs) 
face a memory bottleneck that has nothing to do with model weights. During decoding, transformers cache the key and value (KV) vectors for every token at every layer so they don\u2019t have to recompute attention. This cache grows linearly with sequence length and batch size, and at long context with&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/posts\/98293","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/comments?post=98293"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/posts\/98293\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/media\/98294"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/media?parent=98293"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/categories?post=98293"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/tags?post=98293"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}