{"id":16451,"date":"2025-06-04T03:53:18","date_gmt":"2025-06-04T03:53:18","guid":{"rendered":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/"},"modified":"2025-06-04T03:53:18","modified_gmt":"2025-06-04T03:53:18","slug":"hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics","status":"publish","type":"post","link":"https:\/\/youzum.net\/ja\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/","title":{"rendered":"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics"},"content":{"rendered":"<p>Despite recent progress in robotic control via large-scale vision-language-action (VLA) models, real-world deployment remains constrained by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, resulting in significant memory and compute costs. This limits experimentation to well-resourced labs and clouds, excluding practitioners working with lower-cost hardware. Additionally, much of the current progress in VLA research remains either proprietary or based on non-reproducible methodologies, impeding open research. Finally, data heterogeneity across robotic platforms\u2014differences in morphology, sensors, and control modes\u2014poses a further challenge to generalizability and cross-platform learning.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Hugging Face Introduces SmolVLA: A Lightweight, Open VLA Framework<\/strong><\/h3>\n<p>Hugging Face presents <strong>SmolVLA<\/strong>, a compact vision-language-action model developed for affordability and deployment efficiency. Unlike conventional VLAs, SmolVLA is trained entirely on community-collected datasets and is optimized to run on single-GPU or CPU environments. 
The model architecture integrates a trimmed version of a pretrained vision-language model (SmolVLM-2) and a transformer-based action expert. This structure enables efficient low-level control from natural language instructions and RGB camera inputs.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"591\" data-attachment-id=\"71786\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/06\/03\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/screenshot-2025-06-03-at-10-36-27-am-3\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27\u202fAM-2.png\" data-orig-size=\"1936,1118\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-06-03 at 10.36.27\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27\u202fAM-2-300x173.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27\u202fAM-2-1024x591.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27%E2%80%AFAM-2-1024x591.png\" alt=\"\" class=\"wp-image-71786\" \/><\/figure>\n<\/div>\n<p>A distinguishing feature of SmolVLA is its asynchronous inference stack, which decouples action prediction from execution. This design enables low-latency control suitable for real-time applications, even in resource-constrained settings. 
SmolVLA is released under an open license with accompanying code, training data, and deployment tools.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Architectural Overview and Design Trade-Offs<\/strong><\/h3>\n<p>The SmolVLA model is structured into two primary components:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Perception Module (SmolVLM-2)<\/strong>: A pretrained compact vision-language encoder processes sequences of RGB images, sensorimotor states, and language instructions. For efficiency, the model limits visual tokens through downsampling and only uses the lower half of transformer layers, based on empirical findings that earlier layers often yield more transferable features.<\/li>\n<li><strong>Action Expert<\/strong>: A lightweight transformer, trained with flow matching, predicts sequences of continuous control actions. The action expert alternates between self-attention and cross-attention layers, balancing internal action coherence and conditioning on perception inputs. Causal masking is applied to enforce temporal consistency.<\/li>\n<\/ul>\n<p>To reduce computational overhead, linear projections are used to align the modalities\u2019 token dimensions. Action chunks are generated instead of single-step predictions, reducing the frequency of inference calls. The model is trained using bfloat16 precision and Torch\u2019s JIT compilation for runtime optimization.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Empirical Evaluation: Simulation and Real-World Performance<\/strong><\/h3>\n<p>SmolVLA is evaluated across both simulation benchmarks (LIBERO and Meta-World) and real-world robotic tasks using low-cost SO100 and SO101 platforms. The model is trained from scratch on ~23K episodes across 481 community datasets, with task labels auto-generated using a VLM. 
Evaluation metrics include task-level success rates under both in-distribution and out-of-distribution conditions.<\/p>\n<p>In the <strong>LIBERO<\/strong> benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely matching or surpassing larger models such as \u03c0\u2080 (3.3B). In <strong>Meta-World<\/strong>, the model outperforms diffusion policies and smaller-scale VLAs across task difficulty levels. These results are notable considering SmolVLA\u2019s smaller training footprint and absence of robotics-specific pretraining.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img decoding=\"async\" width=\"1024\" height=\"637\" data-attachment-id=\"71788\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/06\/03\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/screenshot-2025-06-03-at-10-38-27-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.38.27\u202fAM-1.png\" data-orig-size=\"1932,1202\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-06-03 at 10.38.27\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.38.27\u202fAM-1-300x187.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.38.27\u202fAM-1-1024x637.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.38.27%E2%80%AFAM-1-1024x637.png\" alt=\"\" class=\"wp-image-71788\" \/><\/figure>\n<\/div>\n<p>In real-world settings, SmolVLA achieves average 
success rates of 78.3% across pick-place, stacking, and sorting tasks\u2014outperforming both ACT (trained from scratch) and \u03c0\u2080 (finetuned). Moreover, SmolVLA generalizes across robotic embodiments, maintaining performance on SO101 despite training exclusively on SO100 data.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Performance Implications of Asynchronous Inference<\/strong><\/h3>\n<p>SmolVLA\u2019s asynchronous inference stack improves control efficiency by overlapping prediction and execution. Compared to traditional synchronous inference, this approach reduces average task time by ~30% and doubles the number of completed actions in fixed-time scenarios. This is particularly beneficial for edge deployments where inference delays degrade real-time performance.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h3>\n<p>SmolVLA demonstrates that compact, reproducible, and open-source VLA models can support competent robotic control on low-cost hardware. Through careful architectural choices\u2014layer pruning, chunked action prediction, and asynchronous execution\u2014SmolVLA maintains performance while significantly reducing computational demands.<\/p>\n<p>The model\u2019s open training and deployment stack, paired with real-world evaluations, offers a practical foundation for further research in efficient and accessible robot learning. 
Future directions include expanding cross-embodiment datasets, scaling model capacity without sacrificing latency, and exploring joint training on multimodal corpora beyond robotics data.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p><strong>Check out the <a href=\"https:\/\/arxiv.org\/abs\/2506.01844\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a> and <a href=\"https:\/\/huggingface.co\/lerobot\/smolvla_base\" target=\"_blank\" rel=\"noreferrer noopener\">Model on Hugging Face<\/a>.<\/strong> All credit for this research goes to the researchers of this project.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/06\/03\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/\">Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Despite recent progress in robotic control via large-scale vision-language-action (VLA) models, real-world deployment remains constrained by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, resulting in significant memory and compute costs. 
This limits experimentation to well-resourced labs and clouds, excluding practitioners working with lower-cost hardware. Additionally, much of the current progress in VLA research remains either proprietary or based on non-reproducible methodologies, impeding open research. Finally, data heterogeneity across robotic platforms\u2014differences in morphology, sensors, and control modes\u2014poses a further challenge to generalizability and cross-platform learning. Hugging Face Introduces SmolVLA: A Lightweight, Open VLA Framework Hugging Face presents SmolVLA, a compact vision-language-action model developed for affordability and deployment efficiency. Unlike conventional VLAs, SmolVLA is trained entirely on community-collected datasets and is optimized to run on single-GPU or CPU environments. The model architecture integrates a trimmed version of a pretrained vision-language model (SmolVLM-2) and a transformer-based action expert. This structure enables efficient low-level control from natural language instructions and RGB camera inputs. A distinguishing feature of SmolVLA is its asynchronous inference stack, which decouples action prediction from execution. This design enables low-latency control suitable for real-time applications, even in resource-constrained settings. SmolVLA is released under an open license with accompanying code, training data, and deployment tools. Architectural Overview and Design Trade-Offs The SmolVLA model is structured into two primary components: Perception Module (SmolVLM-2): A pretrained compact vision-language encoder processes sequences of RGB images, sensorimotor states, and language instructions. For efficiency, the model limits visual tokens through downsampling and only uses the lower half of transformer layers, based on empirical findings that earlier layers often yield more transferable features. Action Expert: A lightweight transformer, trained with flow matching, predicts sequences of continuous control actions. 
The action expert alternates between self-attention and cross-attention layers, balancing internal action coherence and conditioning on perception inputs. Causal masking is applied to enforce temporal consistency. To reduce computational overhead, linear projections are used to align the modalities\u2019 token dimensions. Action chunks are generated instead of single-step predictions, reducing the frequency of inference calls. The model is trained using bfloat16 precision and Torch\u2019s JIT compilation for runtime optimization. Empirical Evaluation: Simulation and Real-World Performance SmolVLA is evaluated across both simulation benchmarks (LIBERO and Meta-World) and real-world robotic tasks using low-cost SO100 and SO101 platforms. The model is trained from scratch on ~23K episodes across 481 community datasets, with task labels auto-generated using a VLM. Evaluation metrics include task-level success rates under both in-distribution and out-of-distribution conditions. In the LIBERO benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely matching or surpassing larger models such as \u03c0\u2080 (3.3B). In Meta-World, the model outperforms diffusion policies and smaller-scale VLAs across task difficulty levels. These results are notable considering SmolVLA\u2019s smaller training footprint and absence of robotics-specific pretraining. In real-world settings, SmolVLA achieves average success rates of 78.3% across pick-place, stacking, and sorting tasks\u2014outperforming both ACT (trained from scratch) and \u03c0\u2080 (finetuned). Moreover, SmolVLA generalizes across robotic embodiments, maintaining performance on SO101 despite training exclusively on SO100 data. Performance Implications of Asynchronous Inference SmolVLA\u2019s asynchronous inference stack improves control efficiency by overlapping prediction and execution. 
Compared to traditional synchronous inference, this approach reduces average task time by ~30% and doubles the number of completed actions in fixed-time scenarios. This is particularly beneficial for edge deployments where inference delays degrade real-time performance. Conclusion SmolVLA demonstrates that compact, reproducible, and open-source VLA models can support competent robotic control on low-cost hardware. Through careful architectural choices\u2014layer pruning, chunked action prediction, and asynchronous execution\u2014SmolVLA maintains performance while significantly reducing computational demands. The model\u2019s open training and deployment stack, paired with real-world evaluations, offers a practical foundation for further research in efficient and accessible robot learning. Future directions include expanding cross-embodiment datasets, scaling model capacity without sacrificing latency, and exploring joint training on multimodal corpora beyond robotics data. Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project. 
The post Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":16452,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-16451","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, 
max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/ja\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/\" \/>\n<meta property=\"og:locale\" content=\"ja_JP\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/ja\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-06-04T03:53:18+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"591\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u57f7\u7b46\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593\" \/>\n\t<meta name=\"twitter:data2\" content=\"4\u5206\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics\",\"datePublished\":\"2025-06-04T03:53:18+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/\"},\"wordCount\":719,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"ja\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/\",\"url\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/\",\"name\":\"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action 
Model for Affordable and Efficient Robotics - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png\",\"datePublished\":\"2025-06-04T03:53:18+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#breadcrumb\"},\"inLanguage\":\"ja\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png\",\"width\":1024,\"height\":591},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"
ListItem\",\"position\":2,\"name\":\"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ja\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/ja\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/ja\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/","og_locale":"ja_JP","og_type":"article","og_title":"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/ja\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-06-04T03:53:18+00:00","og_image":[{"width":1024,"height":591,"url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png","type":"image\/png"}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u57f7\u7b46\u8005":"admin NU","\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593":"4\u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/"},"author":{"name":"admin 
NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics","datePublished":"2025-06-04T03:53:18+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/"},"wordCount":719,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"ja","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/","url":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/","name":"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png","datePublished":"2025-06-04T03:53:18+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#breadcrumb"},"inLanguage":"ja","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/"]}]},{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png","width":1024,"height":591},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/hugging-face-releases-smolvla-a-compact-vision-language-action-model-for-affordable-and-efficient-robotics\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient 
Robotics"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ja"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/ja\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png",1024,591,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png",1024,591,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png",1024,591,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX-300x173.png",300,173,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png",1024,591,false],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png",1024,591,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX.png",1024,591,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX-18x10.png",18,10,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX-600x346.png",600,346,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/Screenshot-2025-06-03-at-10.36.27E280AFAM-2-1024x591-hUlVhX-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin 
NU","author_link":"https:\/\/youzum.net\/ja\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/ja\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Despite recent progress in robotic control via large-scale vision-language-action (VLA) models, real-world deployment remains constrained by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, resulting in significant memory and compute costs. This limits experimentation to well-resourced labs and clouds, excluding practitioners working with lower-cost hardware. Additionally, much of&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts\/16451","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/comments?post=16451"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts\/16451\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/media\/16452"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/media?parent=16451"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/categories?post=16451"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/tags?post=16451"}],"curies":[{"name":"wp","href":"https:\/\
/api.w.org\/{rel}","templated":true}]}}