{"id":86219,"date":"2026-04-26T15:29:42","date_gmt":"2026-04-26T15:29:42","guid":{"rendered":"https:\/\/youzum.net\/a-coding-implementation-on-kvcached-for-elastic-kv-cache-memory-bursty-llm-serving-and-multi-model-gpu-sharing\/"},"modified":"2026-04-26T15:29:42","modified_gmt":"2026-04-26T15:29:42","slug":"a-coding-implementation-on-kvcached-for-elastic-kv-cache-memory-bursty-llm-serving-and-multi-model-gpu-sharing","status":"publish","type":"post","link":"https:\/\/youzum.net\/de\/a-coding-implementation-on-kvcached-for-elastic-kv-cache-memory-bursty-llm-serving-and-multi-model-gpu-sharing\/","title":{"rendered":"A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing"},"content":{"rendered":"<p>In this tutorial, we explore <a href=\"https:\/\/github.com\/ovg-project\/kvcached\"><strong>kvcached<\/strong><\/a>, a dynamic KV-cache implementation on top of vLLM, to understand how dynamic KV-cache allocation transforms GPU memory usage for large language models. We begin by setting up the environment and deploying lightweight Qwen2.5 models through an OpenAI-compatible API, ensuring a realistic inference workflow. We then design controlled experiments where we simulate bursty workloads to observe how memory behaves under both elastic and static allocation strategies. 
Through systematic measurement and visualization, we directly compare VRAM utilization and latency, and extend the setup to a multi-model scenario where we observe how memory flexibly shifts across active workloads in real time.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">import os, sys, time, json, subprocess, threading, signal, shutil\nfrom pathlib import Path\n\n\ndef sh(cmd, check=True):\n   return subprocess.run(cmd, check=check, shell=isinstance(cmd, str))\n\n\ntry:\n   import torch\nexcept ImportError:\n   sh([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"torch\"])\n   import torch\n\n\nassert torch.cuda.is_available(), (\n   \"No GPU detected. 
In Colab: Runtime &gt; Change runtime type &gt; GPU.\")\nprops = torch.cuda.get_device_properties(0)\nprint(f\"[GPU] {torch.cuda.get_device_name(0)}  \"\n     f\"({props.total_memory \/ 1e9:.1f} GB, \"\n     f\"compute capability {props.major}.{props.minor})\")\n\n\ndef pip_install(*pkgs, extra=()):\n   subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs, *extra],\n                  check=True)\n\n\nprint(\"[install] vLLM ...\")\npip_install(\"vllm==0.10.2\")\nprint(\"[install] kvcached (compiles a small CUDA extension) ...\")\npip_install(\"kvcached\", extra=[\"--no-build-isolation\"])\nprint(\"[install] misc (matplotlib, requests, pynvml) ...\")\npip_install(\"matplotlib\", \"requests\", \"pynvml\", \"numpy\")\n\n\nMODEL_A = \"Qwen\/Qwen2.5-0.5B-Instruct\"\nMODEL_B = \"Qwen\/Qwen2.5-1.5B-Instruct\"\nPORT_A, PORT_B = 8001, 8002\nMAX_MODEL_LEN = 2048<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We start by setting up the environment and verifying that a GPU is available for our experiments. We install all required dependencies, including vLLM and kvcached, along with supporting libraries. 
We then define our model configurations and ports to prepare for launching the inference servers.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">def launch_vllm(model, port, kvcached=True, gpu_mem_util=0.55, log_path=None):\n   \"\"\"Start a vLLM OpenAI-compatible server as a subprocess. With kvcached=True\n   the autopatch hooks replace vLLM's KV-cache allocator with the elastic one.\"\"\"\n   env = os.environ.copy()\n   env[\"VLLM_USE_V1\"] = \"1\"\n   if kvcached:\n       env[\"ENABLE_KVCACHED\"]    = \"true\"\n       env[\"KVCACHED_AUTOPATCH\"] = \"1\"\n       env[\"KVCACHED_IPC_NAME\"]  = f\"kvc_{port}\"\n   cmd = [\n       sys.executable, \"-m\", \"vllm.entrypoints.openai.api_server\",\n       \"--model\", model, \"--port\", str(port),\n       \"--max-model-len\", str(MAX_MODEL_LEN),\n       \"--disable-log-requests\",\n       \"--no-enable-prefix-caching\",\n       \"--enforce-eager\",\n   ]\n   if not kvcached:\n       cmd += [\"--gpu-memory-utilization\", str(gpu_mem_util)]\n   log = open(log_path or os.devnull, \"w\")\n   proc = subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT,\n                           preexec_fn=os.setsid)\n   return proc, log\n\n\ndef wait_ready(port, timeout=420):\n   import requests\n   url = f\"http:\/\/localhost:{port}\/v1\/models\"\n   t0 = time.time()\n   while time.time() - t0 &lt; timeout:\n       try:\n           if 
requests.get(url, timeout=2).status_code == 200:\n               return True\n       except Exception:\n           pass\n       time.sleep(3)\n   raise TimeoutError(f\"vLLM on port {port} didn't come up within {timeout}s\")\n\n\ndef shutdown(proc, log):\n   if proc and proc.poll() is None:\n       try:\n           os.killpg(os.getpgid(proc.pid), signal.SIGTERM)\n           proc.wait(timeout=45)\n       except Exception:\n           os.killpg(os.getpgid(proc.pid), signal.SIGKILL)\n   if log and not log.closed:\n       log.close()\n   time.sleep(3)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement helper functions to launch and manage the vLLM server with and without kvcached enabled. We configure environment variables to activate dynamic KV-cache behavior and ensure proper server initialization. We also define utilities to wait for server readiness and safely shut down processes after execution.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">import pynvml\npynvml.nvmlInit()\nNV_HANDLE = pynvml.nvmlDeviceGetHandleByIndex(0)\n\n\ndef vram_used_mb():\n   info = pynvml.nvmlDeviceGetMemoryInfo(NV_HANDLE)\n   return info.used \/ (1024 ** 2)\n\n\nclass MemorySampler(threading.Thread):\n   def __init__(self, interval=0.2):\n       super().__init__(daemon=True)\n       self.interval = interval\n       self.samples  = []\n       self._stop    = threading.Event()\n   def run(self):\n       t0 
= time.time()\n       while not self._stop.is_set():\n           self.samples.append((time.time() - t0, vram_used_mb()))\n           time.sleep(self.interval)\n   def stop(self):\n       self._stop.set(); self.join()\n\n\nimport requests\nfrom concurrent.futures import ThreadPoolExecutor\n\n\nPROMPTS = [\n   \"Explain quantum entanglement to a curious 10-year-old.\",\n   \"Write a Python function that detects cycles in a linked list.\",\n   \"Summarize the plot of Hamlet in one paragraph.\",\n   \"List 5 surprising household uses for baking soda with explanations.\",\n   \"Compose a vivid haiku about rainy Monday mornings.\",\n   \"Describe the Fermi paradox and three plausible resolutions.\",\n   \"Translate 'knowledge is power' into French, German, and Japanese.\",\n   \"Explain the difference between TCP and UDP with real examples.\",\n]\n\n\ndef bursty_workload(port, model, n_bursts=3, burst_size=6, pause=6.0,\n                   max_tokens=180):\n   \"\"\"Fire n_bursts waves of burst_size concurrent requests with an idle\n   gap between waves. 
The idle gap is where kvcached releases physical\n   VRAM -- a static-allocation engine simply cannot.\"\"\"\n   url = f\"http:\/\/localhost:{port}\/v1\/chat\/completions\"\n   def one(i):\n       body = {\n           \"model\": model,\n           \"messages\": [{\"role\": \"user\", \"content\": PROMPTS[i % len(PROMPTS)]}],\n           \"max_tokens\": max_tokens, \"temperature\": 0.7,\n       }\n       t0 = time.time()\n       r = requests.post(url, json=body, timeout=180)\n       r.raise_for_status()\n       return time.time() - t0\n   latencies = []\n   with ThreadPoolExecutor(max_workers=burst_size) as ex:\n       for b in range(n_bursts):\n           print(f\"    burst {b+1}\/{n_bursts}  ({burst_size} concurrent)\")\n           latencies += list(ex.map(one, range(burst_size)))\n           if b &lt; n_bursts - 1:\n               time.sleep(pause)\n   return latencies<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We initialize GPU memory tracking using pynvml to monitor VRAM usage in real time. We create a background sampling thread that continuously records memory consumption during experiments. 
We then define a bursty workload generator that sends concurrent requests to simulate realistic LLM usage patterns.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">print(\"\\n=== Experiment 1: vLLM + kvcached ===\")\nproc, log = launch_vllm(MODEL_A, PORT_A, kvcached=True,\n                       log_path=\"\/tmp\/vllm_kvc.log\")\ntry:\n   wait_ready(PORT_A)\n   idle_kvc = vram_used_mb()\n   print(f\"  Idle VRAM after load (weights only): {idle_kvc:.0f} MB\")\n   sampler = MemorySampler(); sampler.start()\n   lat_kvc = bursty_workload(PORT_A, MODEL_A)\n   time.sleep(6)\n   sampler.stop()\n   mem_kvc = sampler.samples\nfinally:\n   shutdown(proc, log)\n\n\nprint(\"\\n=== Experiment 2: vLLM baseline (static KV allocation) ===\")\nproc, log = launch_vllm(MODEL_A, PORT_A, kvcached=False,\n                       log_path=\"\/tmp\/vllm_base.log\")\ntry:\n   wait_ready(PORT_A)\n   idle_base = vram_used_mb()\n   print(f\"  Idle VRAM (weights + pre-reserved KV pool): {idle_base:.0f} MB\")\n   sampler = MemorySampler(); sampler.start()\n   lat_base = bursty_workload(PORT_A, MODEL_A)\n   time.sleep(6)\n   sampler.stop()\n   mem_base = sampler.samples\nfinally:\n   shutdown(proc, log)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We run the first experiment with kvcached enabled and capture both memory usage and latency metrics. 
We then execute the same workload under a baseline static allocation setup for comparison. We collect and store all results to enable a clear side-by-side evaluation of both approaches.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">import numpy as np\nimport matplotlib.pyplot as plt\n\n\nfig, axes = plt.subplots(1, 2, figsize=(14, 4.5))\n\n\ntk, mk = zip(*mem_kvc); tb, mb = zip(*mem_base)\naxes[0].plot(tk, mk, label=\"with kvcached\", linewidth=2, color=\"#1f77b4\")\naxes[0].plot(tb, mb, label=\"baseline (static)\", linewidth=2,\n            linestyle=\"--\", color=\"#d62728\")\naxes[0].axhline(idle_kvc,  color=\"#1f77b4\", alpha=.3, linestyle=\":\")\naxes[0].axhline(idle_base, color=\"#d62728\", alpha=.3, linestyle=\":\")\naxes[0].set_xlabel(\"time (s)\"); axes[0].set_ylabel(\"GPU memory used (MB)\")\naxes[0].set_title(\"VRAM under a bursty workload\\n(dotted = idle-baseline VRAM)\")\naxes[0].grid(alpha=.3); axes[0].legend()\n\n\naxes[1].boxplot([lat_kvc, lat_base], labels=[\"kvcached\", \"baseline\"])\naxes[1].set_ylabel(\"request latency (s)\")\naxes[1].set_title(f\"Latency across {len(lat_kvc)} requests\")\naxes[1].grid(alpha=.3)\n\n\nplt.tight_layout()\nplt.savefig(\"\/content\/kvcached_single_model.png\", dpi=120, bbox_inches=\"tight\")\nplt.show()\n\n\nprint(\"\\n--- Single-model summary --------------------------------------------\")\nprint(f\"  Idle VRAM    kvcached: 
{idle_kvc:&gt;6.0f} MB   \"\n     f\"baseline: {idle_base:&gt;6.0f} MB  \"\n     f\"(savings: {idle_base - idle_kvc:&gt;5.0f} MB)\")\nprint(f\"  Peak VRAM    kvcached: {max(mk):&gt;6.0f} MB   \"\n     f\"baseline: {max(mb):&gt;6.0f} MB\")\nprint(f\"  Median lat.  kvcached: {np.median(lat_kvc):&gt;6.2f} s   \"\n     f\"baseline: {np.median(lat_base):&gt;6.2f} s\")\nprint(f\"  VRAM flex    kvcached: peak-idle = {max(mk)-min(mk):&gt;5.0f} MB  \"\n     f\"(baseline can't release -- static pool)\")\n\n\nprint(\"\\n=== Experiment 3: Two LLMs sharing one GPU (kvcached on both) ===\")\npA, lA = launch_vllm(MODEL_A, PORT_A, kvcached=True, log_path=\"\/tmp\/mA.log\")\ntry:\n   wait_ready(PORT_A)\n   pB, lB = launch_vllm(MODEL_B, PORT_B, kvcached=True, log_path=\"\/tmp\/mB.log\")\n   try:\n       wait_ready(PORT_B)\n       print(f\"  Both models loaded. Idle VRAM: {vram_used_mb():.0f} MB\")\n\n\n       sampler = MemorySampler(); sampler.start()\n       for i in range(4):\n           port, model = ((PORT_A, MODEL_A) if i % 2 == 0\n                          else (PORT_B, MODEL_B))\n           print(f\"  round {i+1}: driving {model}\")\n           bursty_workload(port, model, n_bursts=1, burst_size=4, pause=0)\n           time.sleep(5)\n       sampler.stop()\n       t, m = zip(*sampler.samples)\n\n\n       plt.figure(figsize=(11, 4.2))\n       plt.plot(t, m, color=\"#c2410c\", linewidth=2)\n       plt.xlabel(\"time (s)\"); plt.ylabel(\"GPU memory used (MB)\")\n       plt.title(\"Two LLMs on one T4 via kvcached \u2014 memory flexes per active model\")\n       plt.grid(alpha=.3); plt.tight_layout()\n       plt.savefig(\"\/content\/kvcached_multillm.png\", dpi=120,\n                   bbox_inches=\"tight\")\n       plt.show()\n   finally:\n       shutdown(pB, lB)\nfinally:\n   shutdown(pA, lA)\n\n\nprint(\"\\n=== Bonus: kvcached ships CLI tools ===\")\nprint(\"  kvtop  \u2014 live per-instance KV memory monitor (like nvtop for kvcached)\")\nprint(\"  kvctl  \u2014 set\/limit 
per-instance memory budgets in shared memory\")\nfor tool in (\"kvtop\", \"kvctl\"):\n   path = shutil.which(tool)\n   print(f\"    {tool}: {path or 'not on PATH'}\")\nprint(\"\\nAll plots saved to \/content\/. Done.\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We visualize the collected data by plotting VRAM usage trends and latency distributions across both setups. We compute summary statistics to quantify improvements in memory efficiency and performance. We finally extend the experiment to a multi-model scenario, observe how memory dynamically adapts across active models, and conclude with additional insights into tooling.<\/p>\n<p>In conclusion, we demonstrated how dynamic KV-cache management fundamentally improves GPU efficiency compared to traditional static allocation approaches. We observed that kvcached enables significant VRAM savings during idle periods while maintaining competitive latency under load, making it especially effective for bursty or multi-tenant inference environments. By running multiple models on a single GPU and alternating traffic, we clearly saw how memory is allocated only when needed and released when idle, validating the core premise of demand-driven caching. 
Overall, we established a practical and reproducible framework for evaluating memory optimization techniques in LLM serving and highlighted how this approach can scale to more complex, production-grade deployments.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/LLM%20Projects\/kvcached_vllm_elastic_kv_cache_tutorial_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Codes with Notebook<\/a><\/strong>.<strong>\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">130k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">You can now join us on Telegram as well.<\/a><\/strong><\/p>\n<p>Need to partner with us to promote your GitHub repo, Hugging Face page, product release, webinar, or similar?\u00a0<strong><a href=\"https:\/\/forms.gle\/MTNLpmJtsFA3VRVd9\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Connect with us<\/mark><\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/25\/a-coding-implementation-on-kvcached-for-elastic-kv-cache-memory-bursty-llm-serving-and-multi-model-gpu-sharing\/\">A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we explore kvcached, a dynamic KV-cache implementation on top of vLLM, to understand how dynamic KV-cache allocation transforms GPU memory usage for large language models. We begin by setting up the environment and deploying lightweight Qwen2.5 models through an OpenAI-compatible API, ensuring a realistic inference workflow. We then design controlled experiments where we simulate bursty workloads to observe how memory behaves under both elastic and static allocation strategies. Through systematic measurement and visualization, we directly compare VRAM utilization and latency, and extend the setup to a multi-model scenario where we observe how memory flexibly shifts across active workloads in real time. 
<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 