{"id":46841,"date":"2025-10-25T07:40:30","date_gmt":"2025-10-25T07:40:30","guid":{"rendered":"https:\/\/youzum.net\/an-implementation-on-building-advanced-multi-endpoint-machine-learning-apis-with-litserve-batching-streaming-caching-and-local-inference\/"},"modified":"2025-10-25T07:40:30","modified_gmt":"2025-10-25T07:40:30","slug":"an-implementation-on-building-advanced-multi-endpoint-machine-learning-apis-with-litserve-batching-streaming-caching-and-local-inference","status":"publish","type":"post","link":"https:\/\/youzum.net\/zh\/an-implementation-on-building-advanced-multi-endpoint-machine-learning-apis-with-litserve-batching-streaming-caching-and-local-inference\/","title":{"rendered":"An Implementation on Building Advanced Multi-Endpoint Machine Learning APIs with LitServe: Batching, Streaming, Caching, and Local Inference"},"content":{"rendered":"<p>In this tutorial, we explore <a href=\"https:\/\/github.com\/Lightning-AI\/LitServe\"><strong>LitServe<\/strong><\/a>, a lightweight and powerful serving framework that allows us to deploy machine learning models as APIs with minimal effort. We build and test multiple endpoints that demonstrate real-world functionalities such as text generation, batching, streaming, multi-task processing, and caching, all running locally without relying on external APIs. By the end, we clearly understand how to design scalable and flexible ML serving pipelines that are both efficient and easy to extend for production-level applications. Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/ML%20Project%20Codes\/advanced_litserve_multi_endpoint_api_tutorial_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">!pip install litserve torch transformers -q\n\n\nimport litserve as ls\nimport torch\nfrom transformers import pipeline\nimport time\nfrom typing import List<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin by setting up our environment on Google Colab and installing all required dependencies, including LitServe, PyTorch, and Transformers. We then import the essential libraries and modules that will allow us to define, serve, and test our APIs efficiently. 
```python
class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        # Load a small local model; use the GPU only when one is available.
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if device == "cuda" and torch.cuda.is_available() else -1,
        )
        self.device = device

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        result = self.model(prompt, max_length=100, num_return_sequences=1,
                            temperature=0.8, do_sample=True)
        return result[0]["generated_text"]

    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}


class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if device == "cuda" and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["text"]

    def batch(self, inputs: List[str]) -> List[str]:
        return inputs

    def predict(self, batch: List[str]):
        return self.model(batch)

    def unbatch(self, output):
        return output

    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}
```

Here, we create two LitServe APIs: one for text generation using a local DistilGPT2 model, and another for batched sentiment analysis. We define how each API decodes incoming requests, performs inference, and returns structured responses, demonstrating how easy it is to build scalable, reusable model-serving endpoints.
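The batched API only defines the per-request hooks; the actual grouping of concurrent requests happens in the server. Below is a minimal sketch of exposing it over HTTP with dynamic batching enabled; `max_batch_size`, `batch_timeout`, and the port are illustrative choices, not values from the tutorial.

```python
# Serving BatchedSentimentAPI with dynamic batching (illustrative settings).
import litserve as ls

if __name__ == "__main__":
    server = ls.LitServer(
        BatchedSentimentAPI(),
        accelerator="auto",   # pick a GPU when available, otherwise CPU
        max_batch_size=8,     # group up to 8 concurrent requests per forward pass
        batch_timeout=0.05,   # wait at most 50 ms while filling a batch
    )
    server.run(port=8000)     # POST {"text": "..."} to http://localhost:8000/predict
```

With batching on, LitServe collects concurrent inputs through `batch`, runs a single `predict` over the group, and splits the results back out through `unbatch`.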
```python
class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if device == "cuda" and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # Yield a fixed word list to simulate token-by-token generation.
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "

    def encode_response(self, output):
        for token in output:
            yield {"token": token}
```

In this section, we design a streaming text API that emits tokens one at a time instead of returning a single response. We simulate real-time streaming by yielding words with a short delay, demonstrating how LitServe handles generator-based `predict` and `encode_response` methods for continuous token generation.
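To stream over HTTP rather than in-process, the server must be started with streaming enabled. The sketch below shows one way to serve and consume the endpoint; the port and the client snippet are our assumptions, and the exact wire format of each streamed chunk depends on LitServe's response encoding.

```python
# Serve the streaming endpoint (stream=True enables incremental responses).
import litserve as ls
import requests

if __name__ == "__main__":
    server = ls.LitServer(StreamingTextAPI(), stream=True)
    server.run(port=8001)

# From a separate process, read chunks as they arrive:
# resp = requests.post("http://localhost:8001/predict",
#                      json={"prompt": "Once"}, stream=True)
# for chunk in resp.iter_lines():
#     if chunk:
#         print(chunk.decode())
```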
```python
class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", device=-1)
        self.device = device

    def decode_request(self, request):
        return {"task": request.get("task", "sentiment"), "text": request["text"]}

    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            # The rendered post truncates the snippet after this length check;
            # the rest of this branch is a minimal reconstruction that skips
            # summarization for inputs too short to summarize.
            if len(text.split()) < 30:
                return {"task": "summarize", "result": text}
            summary = self.summarizer(text, max_length=60, min_length=20, do_sample=False)
            return {"task": "summarize", "result": summary[0]["summary_text"]}
        return {"task": task, "error": "unknown task"}

    def encode_response(self, output):
        return output
```

We now develop a multi-task API that handles both sentiment analysis and summarization through a single endpoint. This snippet shows how we can manage multiple model pipelines behind a unified interface, dynamically routing each request to the appropriate pipeline based on the specified task.
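Once served, the same endpoint covers both tasks; a request simply names the task it wants. A hypothetical client interaction, assuming the API is live on port 8002, might look like this:

```python
# Hypothetical client calls against a running MultiTaskAPI server.
import requests

base = "http://localhost:8002/predict"
print(requests.post(base, json={"task": "sentiment", "text": "Amazing tutorial!"}).json())
print(requests.post(base, json={"task": "summarize",
                                "text": "LitServe is a lightweight serving framework ..."}).json())
```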
```python
class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False

    def encode_response(self, output):
        result, from_cache = output
        return {
            "label": result["label"],
            "score": float(result["score"]),
            "from_cache": from_cache,
            "cache_stats": {"hits": self.hits, "misses": self.misses},
        }
```

We implement an API that uses caching to store previous inference results, avoiding redundant computation for repeated requests. We track cache hits and misses in real time, illustrating how a simple caching mechanism can drastically improve performance in repeated-inference scenarios.
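One caveat: the plain dictionary grows without bound as unique inputs arrive. As a size-bounded alternative, a small LRU structure (our own illustration, not part of the tutorial code) evicts the least recently used entries once a limit is reached:

```python
from collections import OrderedDict

class BoundedCache:
    """Tiny LRU cache: evicts the least recently used entry past max_size."""
    def __init__(self, max_size: int = 1024):
        self.max_size = max_size
        self._store: OrderedDict = OrderedDict()

    def get(self, key):
        if key in self._store:
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]
        return None

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # drop the oldest entry
```

Swapping `self.cache = {}` for `self.cache = BoundedCache(1024)` and using `get`/`put` in `predict` caps memory at roughly the 1024 most recent results.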
```python
def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)

    api1 = TextGeneratorAPI(); api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")

    api2 = BatchedSentimentAPI(); api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")

    api3 = MultiTaskAPI(); api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")

    api4 = CachedAPI(); api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")

    print("=" * 70)
    print("✅ All tests completed successfully!")
    print("=" * 70)


test_apis_locally()
```

We test all our APIs locally to verify their correctness and behavior without starting a server. We sequentially exercise text generation, batched sentiment analysis, multi-tasking, and caching, ensuring each component of our LitServe setup runs smoothly and efficiently.
In conclusion, we create and run diverse APIs that showcase the framework's versatility. We experiment with text generation, sentiment analysis, multi-tasking, and caching to experience LitServe's seamless integration with Hugging Face pipelines. As we complete the tutorial, we see how LitServe simplifies model-deployment workflows, letting us serve intelligent ML systems in just a few lines of Python while maintaining flexibility, performance, and simplicity.

Check out the [FULL CODES here](https://github.com/Marktechpost/AI-Tutorial-Codes-Included/blob/main/ML%20Project%20Codes/advanced_litserve_multi_endpoint_api_tutorial_marktechpost.py), and feel free to browse our [GitHub Page for Tutorials, Codes and Notebooks](https://github.com/Marktechpost/AI-Tutorial-Codes-Included).