{"id":52840,"date":"2025-11-20T08:10:34","date_gmt":"2025-11-20T08:10:34","guid":{"rendered":"https:\/\/youzum.net\/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference\/"},"modified":"2025-11-20T08:10:34","modified_gmt":"2025-11-20T08:10:34","slug":"vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference","status":"publish","type":"post","link":"https:\/\/youzum.net\/fr\/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference\/","title":{"rendered":"vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy, A Deep Technical Comparison for Production LLM Inference"},"content":{"rendered":"<p>Production LLM serving is now a systems problem, not a <code>generate()<\/code> loop. For real workloads, the choice of inference stack drives your <strong>tokens per second<\/strong>, <strong>tail latency<\/strong>, and ultimately <strong>cost per million tokens<\/strong> on a given GPU fleet.<\/p>\n<p><strong>This comparison focuses on 4 widely used stacks:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>vLLM<\/strong><\/li>\n<li><strong>NVIDIA TensorRT-LLM<\/strong><\/li>\n<li><strong>Hugging Face Text Generation Inference (TGI v3)<\/strong><\/li>\n<li><strong>LMDeploy<\/strong><\/li>\n<\/ul>\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/11\/Copy-of-infograp-1200x700-1-scaled.png\"><img fetchpriority=\"high\" decoding=\"async\" width=\"2560\" height=\"1493\" data-attachment-id=\"76408\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/11\/19\/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference\/copy-of-infograp-1200x700\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/11\/Copy-of-infograp-1200x700-1-scaled.png\" data-orig-size=\"2560,1493\" data-comments-opened=\"1\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Copy of infograp 1200\u00d7700\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/11\/Copy-of-infograp-1200x700-1-300x175.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/11\/Copy-of-infograp-1200x700-1-1024x597.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/11\/Copy-of-infograp-1200x700-1-scaled.png\" alt=\"\" class=\"wp-image-76408\" \/><\/a><\/figure>\n<h3 class=\"wp-block-heading\"><strong>1. vLLM, PagedAttention as the open baseline<\/strong><\/h3>\n<p><strong>Core idea<\/strong><\/p>\n<p>vLLM is built around <strong>PagedAttention<\/strong>, an attention implementation that treats the KV cache like paged virtual memory rather than a single contiguous buffer per sequence.<\/p>\n<p>Instead of allocating one big KV region per request,<strong> vLLM:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Divides KV cache into fixed size blocks<\/li>\n<li>Maintains a block table that maps logical tokens to physical blocks<\/li>\n<li>Shares blocks between sequences wherever prefixes overlap<\/li>\n<\/ul>\n<p>This reduces external fragmentation and lets the scheduler pack many more concurrent sequences into the same VRAM.<\/p>\n<p><strong>Throughput and latency<\/strong><\/p>\n<p>vLLM improves throughput by <strong>2\u20134\u00d7<\/strong> over systems like FasterTransformer and Orca at similar latency, with larger gains for longer sequences. 
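<\/p>
<p>The block table mechanics described above can be sketched in a few lines of Python. This is an illustrative toy model of paged KV allocation with reference counted prefix sharing; the names are invented, not vLLM internals:<\/p>

```python
# Toy model of paged KV-cache allocation with prefix sharing.
# Illustrative only: names and structure are invented, not vLLM internals.
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.refcount = {}                   # block id -> sequences using it

    def allocate(self, num_tokens, shared_prefix=None):
        # Reuse caller-provided prefix blocks instead of copying their KV.
        table = list(shared_prefix or [])
        for blk in table:
            self.refcount[blk] += 1
        covered = len(table) * BLOCK_SIZE
        while covered < num_tokens:          # only the tail is newly allocated
            blk = self.free.pop()
            self.refcount[blk] = 1
            table.append(blk)
            covered += BLOCK_SIZE
        return table  # logical token i lives in block table[i // BLOCK_SIZE]

    def release(self, table):
        for blk in table:
            self.refcount[blk] -= 1
            if self.refcount[blk] == 0:
                self.free.append(blk)        # block returns to the pool

cache = PagedKVCache(num_blocks=8)
a = cache.allocate(40)                       # 40 tokens -> 3 blocks
b = cache.allocate(60, shared_prefix=a[:2])  # shares the first 2 blocks of a
```

<p>Because blocks are fixed size and reference counted, two requests with a common system prompt can point at the same physical prefix blocks, which is the mechanism behind the near zero KV waste and prefix sharing described here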
<\/p>\n<p><strong>Key properties for operators:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Continuous batching<\/strong> (also called inflight batching) merges incoming requests into existing GPU batches instead of waiting for fixed batch windows.<\/li>\n<li>On typical chat workloads, throughput scales close to linearly with concurrency until KV memory or compute saturates.<\/li>\n<li>P50 latency remains low at moderate concurrency, but P99 can degrade once queues grow long or KV memory gets tight, especially for prefill heavy queries.<\/li>\n<\/ul>\n<p>vLLM exposes an <strong>OpenAI compatible HTTP API<\/strong> and integrates well with Ray Serve and other orchestrators, which is why it is widely used as an open baseline.<\/p>\n<p><strong>KV and multi tenant<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>PagedAttention gives <strong>near zero KV waste<\/strong> and flexible prefix sharing within and across requests.<\/li>\n<li>Each vLLM process serves <strong>one model<\/strong>; multi tenant and multi model setups are usually built with an external router or API gateway that fans out to multiple vLLM instances.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>2. TensorRT-LLM, hardware maximum on NVIDIA GPUs<\/strong><\/h3>\n<p><strong>Core idea<\/strong><\/p>\n<p><a href=\"https:\/\/github.com\/NVIDIA\/TensorRT-LLM\" target=\"_blank\" rel=\"noreferrer noopener\">TensorRT-LLM<\/a> is NVIDIA\u2019s optimized inference library for its GPUs. The library provides custom attention kernels, inflight batching, paged KV caching, quantization down to FP4 and INT4, and speculative decoding. 
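<\/p>
<p>Continuous (inflight) batching, which both vLLM and TensorRT-LLM implement, can be sketched as a scheduler loop. This is a hypothetical toy scheduler, not either engine's real one:<\/p>

```python
# Toy continuous-batching loop: new requests join the running batch at
# every decode step instead of waiting for the current batch to drain.
# Hypothetical sketch, not the real vLLM or TensorRT-LLM scheduler.
from collections import deque

def decode_step(batch):
    # Stand-in for one fused GPU decode step over the whole batch.
    for req in batch:
        req['generated'] += 1

def serve(requests, max_batch=4):
    queue = deque(requests)
    running, finished = [], []
    while queue or running:
        # Admit waiting requests whenever there is free batch capacity.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        decode_step(running)
        still_running = []
        for req in running:
            done = req['generated'] >= req['max_tokens']
            (finished if done else still_running).append(req)
        running = still_running  # freed slots are refilled next iteration
    return finished

reqs = [{'id': i, 'generated': 0, 'max_tokens': n}
        for i, n in enumerate([2, 5, 3, 1, 4])]
done = serve(reqs)
```

<p>The important property is that a short request admitted late still finishes quickly while long requests keep the GPU busy; the real engines do this per token, with paged KV blocks backing every running sequence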
<\/p>\n<p>It is tightly coupled to NVIDIA hardware, including FP8 tensor cores on Hopper and Blackwell.<\/p>\n<p><strong>Measured performance<\/strong><\/p>\n<p><strong>NVIDIA\u2019s H100 vs A100 evaluation is the most concrete public reference:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>On H100 with FP8, TensorRT-LLM reaches <strong>over 10,000 output tokens\/s<\/strong> at peak throughput for <strong>64 concurrent requests<\/strong>, with <strong>~100 ms<\/strong> time to first token.<\/li>\n<li>H100 FP8 achieves up to <strong>4.6\u00d7 higher max throughput<\/strong> and <strong>4.4\u00d7 faster first token latency<\/strong> than A100 on the same models. <\/li>\n<\/ul>\n<p><strong>For latency sensitive modes:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>TensorRT-LLM on H100 can drive TTFT <strong>below 10 ms<\/strong> in batch 1 configurations, at the cost of lower overall throughput.<\/li>\n<\/ul>\n<p>These numbers are model and shape specific, but they give a realistic scale.<\/p>\n<p><strong>Prefill vs decode<\/strong><\/p>\n<p><strong>TensorRT-LLM optimizes both phases:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Prefill benefits from high throughput FP8 attention kernels and tensor parallelism<\/li>\n<li>Decode benefits from CUDA graphs, speculative decoding, quantized weights and KV, and kernel fusion<\/li>\n<\/ul>\n<p>The result is very high tokens\/s across a wide range of input and output lengths, especially when the engine is tuned for that model and batch profile.<\/p>\n<p><strong>KV and multi tenant<\/strong><\/p>\n<p><strong>TensorRT-LLM provides:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Paged KV cache<\/strong> with configurable layout<\/li>\n<li>Support for long sequences, KV reuse and offloading<\/li>\n<li>Inflight batching and priority aware scheduling primitives<\/li>\n<\/ul>\n<p>NVIDIA pairs this with Ray based or Triton based orchestration patterns for multi tenant clusters. 
Multi model support is done at the orchestrator level, not inside a single TensorRT-LLM engine instance.<\/p>\n<h3 class=\"wp-block-heading\"><strong>3. Hugging Face TGI v3, long prompt specialist and multi backend gateway<\/strong><\/h3>\n<p><strong>Core idea<\/strong><\/p>\n<p><strong><a href=\"https:\/\/huggingface.co\/docs\/text-generation-inference\/en\/conceptual\/chunking?\" target=\"_blank\" rel=\"noreferrer noopener\">Text Generation Inference (TGI)<\/a> is a Rust and Python based serving stack that adds:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>HTTP and gRPC APIs<\/li>\n<li>Continuous batching scheduler<\/li>\n<li>Observability and autoscaling hooks<\/li>\n<li>Pluggable backends, including vLLM style engines, TensorRT-LLM, and other runtimes <\/li>\n<\/ul>\n<p>Version 3 focuses on long prompt processing through <strong>chunking and prefix caching<\/strong>.<\/p>\n<p><strong>Long prompt benchmark vs vLLM<\/strong><\/p>\n<p>The <a href=\"https:\/\/huggingface.co\/docs\/text-generation-inference\/en\/conceptual\/chunking?\" target=\"_blank\" rel=\"noreferrer noopener\">TGI v3 docs<\/a> give a clear benchmark:<\/p>\n<ul class=\"wp-block-list\">\n<li>On long prompts with more than <strong>200,000 tokens<\/strong>, a conversation reply that takes <strong>27.5 s in vLLM<\/strong> can be served in about <strong>2 s in TGI v3<\/strong>.<\/li>\n<li>This is reported as a <strong>13\u00d7 speedup<\/strong> on that workload.<\/li>\n<li>TGI v3 is able to process about <strong>3\u00d7 more tokens in the same GPU memory<\/strong> by reducing its memory footprint and exploiting chunking and caching.<\/li>\n<\/ul>\n<p><strong>The mechanism is:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>TGI keeps the original conversation context in a <strong>prefix cache<\/strong>, so subsequent turns only pay for incremental tokens<\/li>\n<li>Cache lookup overhead is on the order of <strong>microseconds<\/strong>, negligible relative to prefill 
compute<\/li>\n<\/ul>\n<p>This is a targeted optimization for workloads where prompts are extremely long and reused across turns, for example RAG pipelines and analytic summarization.<\/p>\n<p><strong>Architecture and latency behavior<\/strong><\/p>\n<p><strong>Key components:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Chunking<\/strong>, very long prompts are split into manageable segments for KV and scheduling<\/li>\n<li><strong>Prefix caching<\/strong>, data structure to share long context across turns<\/li>\n<li><strong>Continuous batching<\/strong>, incoming requests join batches of already running sequences<\/li>\n<li><strong>PagedAttention and fused kernels<\/strong> in the GPU backends<\/li>\n<\/ul>\n<p>For short chat style workloads, throughput and latency are in the same ballpark as vLLM. For long, cacheable contexts, both P50 and P99 latency improve by an order of magnitude because the engine avoids repeated prefill.<\/p>\n<p><strong>Multi backend and multi model<\/strong><\/p>\n<p>TGI is designed as a <strong>router plus model server<\/strong> architecture. <strong>It can:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Route requests across many models and replicas<\/li>\n<li>Target different backends, for example TensorRT-LLM on H100 plus CPU or smaller GPUs for low priority traffic <\/li>\n<\/ul>\n<p>This makes it suitable as a central serving tier in multi tenant environments.<\/p>\n<h3 class=\"wp-block-heading\"><strong>4. LMDeploy, TurboMind with blocked KV and aggressive quantization<\/strong><\/h3>\n<p><strong>Core idea<\/strong><\/p>\n<p><a href=\"https:\/\/github.com\/InternLM\/lmdeploy?\" target=\"_blank\" rel=\"noreferrer noopener\">LMDeploy<\/a> from the InternLM ecosystem is a toolkit for compressing and serving LLMs, centered around the <strong>TurboMind<\/strong> engine. 
It focuses on:<\/p>\n<ul class=\"wp-block-list\">\n<li>High throughput request serving<\/li>\n<li>Blocked KV cache<\/li>\n<li>Persistent batching (continuous batching)<\/li>\n<li>Quantization of weights and KV cache<\/li>\n<\/ul>\n<p><strong>Relative throughput vs vLLM<\/strong><\/p>\n<p>The project states:<\/p>\n<ul class=\"wp-block-list\">\n<li>\u2018<strong>LMDeploy delivers up to 1.8\u00d7 higher request throughput than vLLM<\/strong>\u2019, supported by persistent batching, blocked KV cache, dynamic split and fuse, tensor parallelism, and optimized CUDA kernels.<\/li>\n<\/ul>\n<p><strong>KV, quantization and latency<\/strong><\/p>\n<p><strong>LMDeploy includes:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Blocked KV cache<\/strong>, similar to paged KV, which helps pack many sequences into VRAM<\/li>\n<li>Support for <strong>KV cache quantization<\/strong>, typically int8 or int4, to cut KV memory and bandwidth<\/li>\n<li>Weight only quantization paths such as 4 bit AWQ<\/li>\n<li>A benchmarking harness that reports token throughput, request throughput, and first token latency<\/li>\n<\/ul>\n<p>This makes LMDeploy attractive when you want to run larger open models like InternLM or Qwen on mid range GPUs with aggressive compression while still maintaining good tokens\/s.<\/p>\n<p><strong>Multi model deployments<\/strong><\/p>\n<p>LMDeploy provides a <strong>proxy server<\/strong> able to handle:<\/p>\n<ul class=\"wp-block-list\">\n<li>Multi model deployments<\/li>\n<li>Multi machine, multi GPU setups<\/li>\n<li>Routing logic to select models based on request metadata<\/li>\n<\/ul>\n<p>So architecturally it sits closer to TGI than to a single engine.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What to use when?<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>If you want maximum throughput and very low TTFT on NVIDIA GPUs<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>TensorRT-LLM<\/strong> is the primary choice<\/li>\n<li>It uses FP8 
and lower precision, custom kernels and speculative decoding to push tokens\/s and keep TTFT under 100 ms at high concurrency and under 10 ms at low concurrency <\/li>\n<\/ul>\n<\/li>\n<li><strong>If you are dominated by long prompts with reuse, such as RAG over large contexts<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>TGI v3<\/strong> is a strong default<\/li>\n<li>Its prefix cache and chunking give up to <strong>3\u00d7 token capacity<\/strong> and <strong>13\u00d7 lower latency<\/strong> than vLLM in published long prompt benchmarks, without extra configuration <\/li>\n<\/ul>\n<\/li>\n<li><strong>If you want an open, simple engine with strong baseline performance and an OpenAI style API<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>vLLM<\/strong> remains the standard baseline<\/li>\n<li>PagedAttention and continuous batching make it <strong>2\u20134\u00d7 faster<\/strong> than older stacks at similar latency, and it integrates cleanly with Ray and K8s <\/li>\n<\/ul>\n<\/li>\n<li><strong>If you target open models such as InternLM or Qwen and value aggressive quantization with multi model serving<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>LMDeploy<\/strong> is a good fit<\/li>\n<li>Blocked KV cache, persistent batching and int8 or int4 KV quantization give <strong>up to 1.8\u00d7 higher request throughput than vLLM<\/strong> on supported models, with a router layer included <\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>In practice, many dev teams mix these systems, for example using TensorRT-LLM for high volume proprietary chat, TGI v3 for long context analytics, vLLM or LMDeploy for experimental and open model workloads. 
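<\/p>
<p>Whichever stack wins for a given workload, the deciding metric is usually cost. A back of the envelope conversion from measured throughput to cost per million output tokens (the hourly GPU price here is an arbitrary example figure, not a quote):<\/p>

```python
# Convert measured decode throughput into cost per million output tokens.
# The hourly GPU price is an arbitrary example figure, not a real quote.

def cost_per_million_tokens(tokens_per_second, gpu_hourly_usd, utilization=0.7):
    # Effective tokens produced in one billed hour at the given utilization.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Example: an engine sustaining 10,000 tok/s on a 3 USD/hour GPU.
usd_per_mtok = cost_per_million_tokens(10_000, 3.0)
```

<p>Feeding this formula with tokens\/s measured on your own models and traffic mix, rather than headline benchmarks, makes stack comparisons concrete: a 2\u00d7 throughput difference translates directly into a 2\u00d7 difference in cost per million tokens<\/p>
<p>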
The key is to align throughput, latency tails, and KV behavior with the actual token distributions in your traffic, then compute cost per million tokens from measured tokens\/s on your own hardware.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h4 class=\"wp-block-heading\"><strong>References<\/strong><\/h4>\n<ol class=\"wp-block-list\">\n<li><strong>vLLM \/ PagedAttention<\/strong>\n<ul class=\"wp-block-list\">\n<li>Paper: <a href=\"https:\/\/arxiv.org\/abs\/2309.06180\">https:\/\/arxiv.org\/abs\/2309.06180<\/a><\/li>\n<li>Blog: <a href=\"https:\/\/blog.vllm.ai\/2023\/06\/20\/vllm.html\">https:\/\/blog.vllm.ai\/2023\/06\/20\/vllm.html<\/a><\/li>\n<li>Repo: <a href=\"https:\/\/github.com\/vllm-project\/vllm\">https:\/\/github.com\/vllm-project\/vllm<\/a><\/li>\n<\/ul>\n<\/li>\n<li><strong>TensorRT-LLM performance and overview<\/strong>\n<ul class=\"wp-block-list\">\n<li>H100 vs A100 performance (10k tok\/s @ 100 ms TTFT): <a href=\"https:\/\/nvidia.github.io\/TensorRT-LLM\/blogs\/H100vsA100.html\">https:\/\/nvidia.github.io\/TensorRT-LLM\/blogs\/H100vsA100.html<\/a><\/li>\n<li>Performance overview tables: <a href=\"https:\/\/nvidia.github.io\/TensorRT-LLM\/performance\/perf-overview.html\">https:\/\/nvidia.github.io\/TensorRT-LLM\/performance\/perf-overview.html<\/a><\/li>\n<\/ul>\n<\/li>\n<li><strong>HF Text Generation Inference (TGI v3) long-prompt behavior<\/strong>\n<ul class=\"wp-block-list\">\n<li>Chunking \/ conceptual docs (13\u00d7 faster on long prompts): <a href=\"https:\/\/huggingface.co\/docs\/text-generation-inference\/en\/conceptual\/chunking\">https:\/\/huggingface.co\/docs\/text-generation-inference\/en\/conceptual\/chunking<\/a><\/li>\n<li>Release coverage with 13\u00d7 vs vLLM on long prompts: <a 
href=\"https:\/\/www.marktechpost.com\/2024\/12\/10\/hugging-face-releases-text-generation-inference-tgi-v3-0-13x-faster-than-vllm-on-long-prompts\/\">https:\/\/www.marktechpost.com\/2024\/12\/10\/hugging-face-releases-text-generation-inference-tgi-v3-0-13x-faster-than-vllm-on-long-prompts\/<\/a><\/li>\n<li>HF post summarizing 27.5 s \u2192 2 s example: <a href=\"https:\/\/huggingface.co\/posts\/Narsil\/601808386353996\">https:\/\/huggingface.co\/posts\/Narsil\/601808386353996<\/a><\/li>\n<\/ul>\n<\/li>\n<li><strong>LMDeploy \/ TurboMind<\/strong>\n<ul class=\"wp-block-list\">\n<li>Repo (core features, 1.8\u00d7 vLLM throughput, blocked KV, persistent batch): <a href=\"https:\/\/github.com\/InternLM\/lmdeploy\">https:\/\/github.com\/InternLM\/lmdeploy<\/a><\/li>\n<li>Official docs (1.8\u00d7 request throughput, KV + weight quantization details): <a href=\"https:\/\/lmdeploy.readthedocs.io\/\">https:\/\/lmdeploy.readthedocs.io\/<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/11\/19\/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference\/\">vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy, A Deep Technical Comparison for Production LLM Inference<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Production LLM serving is now a systems problem, not a generate() loop. For real workloads, the choice of inference stack drives your tokens per second, tail latency, and ultimately cost per million tokens on a given GPU fleet. This comparison focuses on 4 widely used stacks: vLLM NVIDIA TensorRT-LLM Hugging Face Text Generation Inference (TGI v3) LMDeploy 1. vLLM, PagedAttention as the open baseline Core idea vLLM is built around PagedAttention, an attention implementation that treats the KV cache like paged virtual memory rather than a single contiguous buffer per sequence. 
<\/p>","protected":false},"author":2,"featured_media":52841,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-52840","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy, A 
Deep Technical Comparison for Production LLM Inference - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/fr\/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy, A Deep Technical Comparison for Production LLM Inference - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/fr\/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-11-20T08:10:34+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u00c9crit par\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Dur\u00e9e de lecture estim\u00e9e\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" 