{"id":42510,"date":"2025-10-06T06:53:30","date_gmt":"2025-10-06T06:53:30","guid":{"rendered":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/"},"modified":"2025-10-06T06:53:30","modified_gmt":"2025-10-06T06:53:30","slug":"streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows","status":"publish","type":"post","link":"https:\/\/youzum.net\/th\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/","title":{"rendered":"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows"},"content":{"rendered":"<p>Why run LLM inference as batched kernels that round-trip through DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? <a href=\"https:\/\/arxiv.org\/abs\/2509.13694\" target=\"_blank\" rel=\"noreferrer noopener\">StreamTensor<\/a> is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD\u2019s Alveo U55C FPGA. The system introduces an <em>iterative tensor<\/em> (\u201citensor\u201d) type to encode the tiling and order of streams, enabling provably correct inter-kernel streaming and automated insertion and sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, the research team reports latency as low as <strong>0.64\u00d7 that of 
GPUs<\/strong> and up to <strong>1.99\u00d7 higher energy efficiency<\/strong>.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"587\" data-attachment-id=\"75108\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/10\/05\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/screenshot-2025-10-05-at-10-18-54-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1.png\" data-orig-size=\"1440,826\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-10-05 at 10.18.54\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-300x172.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587.png\" alt=\"\" class=\"wp-image-75108\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2509.13694<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>What does StreamTensor do<\/strong>?<\/h3>\n<p>StreamTensor compiles PyTorch graphs into a stream-oriented dataflow design that <strong>largely avoids off-chip DRAM round-trips via on-chip streaming and fusion; DMAs are inserted only when required<\/strong>. Intermediate tiles are instead forwarded through on-chip FIFOs to downstream kernels. 
The compiler\u2019s central abstraction\u2014<strong>iterative tensors (itensors)<\/strong>\u2014records iteration order, tiling, and layout, which makes inter-kernel stream compatibility explicit and drives converter generation only where needed. The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a <strong>linear program<\/strong> to size FIFOs to avoid stalls or deadlock while minimizing on-chip memory.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" width=\"1024\" height=\"615\" data-attachment-id=\"75110\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/10\/05\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/screenshot-2025-10-05-at-10-19-15-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.19.15-PM-1.png\" data-orig-size=\"1766,1060\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-10-05 at 10.19.15\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.19.15-PM-1-300x180.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.19.15-PM-1-1024x615.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.19.15-PM-1-1024x615.png\" alt=\"\" class=\"wp-image-75110\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2509.13694<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>What\u2019s actually new<\/strong>?<\/h3>\n<ul 
class=\"wp-block-list\">\n<li><strong>Hierarchical design space exploration (DSE).<\/strong> The compiler explores three design spaces: (i) tiling\/unroll\/vectorization\/permutation at the Linalg level, (ii) fusion under memory\/resource constraints, and (iii) resource allocation\/stream widths, optimizing for sustained throughput under bandwidth limits.<\/li>\n<li><strong>End-to-end PyTorch \u2192 device flow.<\/strong> Models enter via Torch-MLIR, are transformed to MLIR Linalg, and then into a <em>dataflow IR<\/em> whose nodes become hardware kernels with explicit streams and host\/runtime glue, with no manual RTL assembly.<\/li>\n<li><strong>Iterative tensor (itensor) type system.<\/strong> A first-class tensor type expresses iteration order, tiling, and affine maps. This makes stream order explicit, allows safe kernel fusion, and lets the compiler synthesize minimal buffer\/format converters when producers and consumers disagree.<\/li>\n<li><strong>Formal FIFO sizing.<\/strong> Inter-kernel buffering is solved with a <strong>linear-programming<\/strong> formulation to avoid stalls\/deadlocks while minimizing on-chip memory usage (BRAM\/URAM).<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Results<\/strong><\/h3>\n<p>Latency: <strong>up to 0.76\u00d7 vs prior FPGA LLM accelerators<\/strong> and <strong>0.64\u00d7 vs a GPU baseline<\/strong> on GPT-2; energy efficiency: <strong>up to 1.99\u00d7<\/strong> vs A100 on emerging LLMs (model-dependent). 
Platform context: <strong>Alveo U55C<\/strong> (HBM2 16 GB, <strong>460 GB\/s<\/strong>, PCIe Gen3\u00d716 or dual Gen4\u00d78, <strong>2\u00d7QSFP28<\/strong>).<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" width=\"1024\" height=\"284\" data-attachment-id=\"75106\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/10\/05\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/screenshot-2025-10-05-at-10-14-19-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.14.19-PM-1.png\" data-orig-size=\"1874,520\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-10-05 at 10.14.19\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.14.19-PM-1-300x83.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.14.19-PM-1-1024x284.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.14.19-PM-1-1024x284.png\" alt=\"\" class=\"wp-image-75106\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2509.13694<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Our Comments<\/strong><\/h3>\n<p>The useful contribution here is a PyTorch\u2192Torch-MLIR\u2192dataflow compiler that emits stream-scheduled kernels and a host\/runtime for AMD\u2019s Alveo U55C; the <em>iterative tensor<\/em> type plus linear-programming-based FIFO sizing enables safe inter-kernel streaming rather than DRAM round-trips. 
On LLM <strong>decoding<\/strong> benchmarks across GPT-2, Llama, Qwen, and Gemma, the research team reports geometric-mean latency as low as <strong>0.64\u00d7 vs. a GPU baseline<\/strong> and energy efficiency up to <strong>1.99\u00d7<\/strong>, with scope limited to decoding workloads. The hardware context is clear: <strong>Alveo U55C<\/strong> provides <strong>16 GB HBM2<\/strong> at <strong>460 GB\/s<\/strong> with dual QSFP28 and PCIe Gen3\u00d716 or dual Gen4\u00d78, which aligns with the streaming dataflow design.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/arxiv.org\/abs\/2509.13694\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong>. 
<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/10\/05\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/\">StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Why run LLM inference as batched kernels that round-trip through DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD\u2019s Alveo U55C FPGA. The system introduces an iterative tensor (\u201citensor\u201d) type to encode the tiling and order of streams, enabling provably correct inter-kernel streaming and automated insertion and sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, the research team reports latency as low as 0.64\u00d7 that of GPUs and up to 1.99\u00d7 higher energy efficiency. What does StreamTensor do? StreamTensor compiles PyTorch graphs into a stream-oriented dataflow design that largely avoids off-chip DRAM round-trips via on-chip streaming and fusion; DMAs are inserted only when required. Intermediate tiles are instead forwarded through on-chip FIFOs to downstream kernels. The compiler\u2019s central abstraction, iterative tensors (itensors), records iteration order, tiling, and layout, which makes inter-kernel stream compatibility explicit and drives converter generation only where needed. The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a linear program to size FIFOs to avoid stalls or deadlock while minimizing on-chip memory. What\u2019s actually new? Hierarchical DSE. 
The compiler explores three design spaces: (i) tiling\/unroll\/vectorization\/permutation at the Linalg level, (ii) fusion under memory\/resource constraints, and (iii) resource allocation\/stream widths, optimizing for sustained throughput under bandwidth limits. End-to-end PyTorch \u2192 device flow. Models enter via Torch-MLIR, are transformed to MLIR Linalg, and then into a dataflow IR whose nodes become hardware kernels with explicit streams and host\/runtime glue, with no manual RTL assembly. Iterative tensor (itensor) type system. A first-class tensor type expresses iteration order, tiling, and affine maps. This makes stream order explicit, allows safe kernel fusion, and lets the compiler synthesize minimal buffer\/format converters when producers and consumers disagree. Formal FIFO sizing. Inter-kernel buffering is solved with a linear-programming formulation to avoid stalls\/deadlocks while minimizing on-chip memory usage (BRAM\/URAM). Results: Latency: up to 0.76\u00d7 vs prior FPGA LLM accelerators and 0.64\u00d7 vs a GPU baseline on GPT-2; energy efficiency: up to 1.99\u00d7 vs A100 on emerging LLMs (model-dependent). Platform context: Alveo U55C (HBM2 16 GB, 460 GB\/s, PCIe Gen3\u00d716 or dual Gen4\u00d78, 2\u00d7QSFP28). Our Comments: The useful contribution here is a PyTorch\u2192Torch-MLIR\u2192dataflow compiler that emits stream-scheduled kernels and a host\/runtime for AMD\u2019s Alveo U55C; the iterative tensor type plus linear-programming-based FIFO sizing enables safe inter-kernel streaming rather than DRAM round-trips. On LLM decoding benchmarks across GPT-2, Llama, Qwen, and Gemma, the research team reports geometric-mean latency as low as 0.64\u00d7 vs. a GPU baseline and energy efficiency up to 1.99\u00d7, with scope limited to decoding workloads. 
The hardware context is clear: Alveo U55C provides 16 GB HBM2 at 460 GB\/s with dual QSFP28 and PCIe Gen3\u00d716 or dual Gen4\u00d78, which aligns with the streaming dataflow design. Check out the\u00a0Paper. Feel free to check out our\u00a0GitHub Page for Tutorials, Codes and Notebooks.\u00a0Also,\u00a0feel free to follow us on\u00a0Twitter\u00a0and don\u2019t forget to join our\u00a0100k+ ML SubReddit\u00a0and Subscribe to\u00a0our Newsletter. The post StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":42511,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-42510","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>StreamTensor: A PyTorch-to-Accelerator 
Compiler that Streams LLM Intermediates Across FPGA Dataflows - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/th\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/\" \/>\n<meta property=\"og:locale\" content=\"th_TH\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/th\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-10-06T06:53:30+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 \u0e19\u0e32\u0e17\u0e35\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows\",\"datePublished\":\"2025-10-06T06:53:30+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/\"},\"wordCount\":556,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/\",\"url\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/\",\"na
me\":\"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp\",\"datePublished\":\"2025-10-06T06:53:30+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#breadcrumb\"},\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp\",\"width\":1024,\"height\":587},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"
@type\":\"ListItem\",\"position\":2,\"name\":\"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"th\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/th\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/th\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/","og_locale":"th_TH","og_type":"article","og_title":"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/th\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-10-06T06:53:30+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Written by":"admin NU","Est. 
reading time":"3 \u0e19\u0e32\u0e17\u0e35"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows","datePublished":"2025-10-06T06:53:30+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/"},"wordCount":556,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"th","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/","url":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/","name":"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp","datePublished":"2025-10-06T06:53:30+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#breadcrumb"},"inLanguage":"th","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/"]}]},{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp","width":1024,"height":587},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/streamtensor-a-pytorch-to-accelerator-compiler-that-streams-llm-intermediates-across-fpga-dataflows\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA 
Dataflows"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"th"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/th\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp",1024,587,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp",1024,587,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp",1024,587,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW-150x150.webp",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW-300x172.webp",300,172,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp",1024,587,false],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp",1024,587,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW.webp",1024,587,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW-18x10.webp",18,10,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW-300x300.webp",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW-600x344.webp",600,344,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-05-at-10.18.54-PM-1-1024x587-folfYW-100x100.webp",100,100,true]},"rttpg_author":{"display_name":"admin 
NU","author_link":"https:\/\/youzum.net\/th\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/th\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Why treat LLM inference as batched kernels to DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD\u2019s Alveo U55C FPGA. The system introduces an iterative tensor (\u201citensor\u201d) type to encode tile\/order of&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/42510","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/comments?post=42510"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/42510\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media\/42511"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media?parent=42510"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/categories?post=42510"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/tags?post=42510"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}