
Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics

Despite recent progress in robotic control via large-scale vision-language-action (VLA) models, real-world deployment remains constrained by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, resulting in significant memory and compute costs. This limits experimentation to well-resourced labs and cloud environments, excluding practitioners working with lower-cost hardware. Additionally, much of the current progress in VLA research remains either proprietary or based on non-reproducible methodologies, impeding open research. Finally, data heterogeneity across robotic platforms—differences in morphology, sensors, and control modes—poses a further challenge to generalizability and cross-platform learning.

Hugging Face Introduces SmolVLA: A Lightweight, Open VLA Framework

Hugging Face presents SmolVLA, a compact vision-language-action model developed for affordability and deployment efficiency. Unlike conventional VLAs, SmolVLA is trained entirely on community-collected datasets and is optimized to run on single-GPU or CPU environments. The model architecture integrates a trimmed version of a pretrained vision-language model (SmolVLM-2) and a transformer-based action expert. This structure enables efficient low-level control from natural language instructions and RGB camera inputs. A distinguishing feature of SmolVLA is its asynchronous inference stack, which decouples action prediction from execution. This design enables low-latency control suitable for real-time applications, even in resource-constrained settings. SmolVLA is released under an open license with accompanying code, training data, and deployment tools.

Architectural Overview and Design Trade-Offs

The SmolVLA model is structured into two primary components:

Perception Module (SmolVLM-2): A pretrained compact vision-language encoder processes sequences of RGB images, sensorimotor states, and language instructions. For efficiency, the model limits visual tokens through downsampling and uses only the lower half of the transformer layers, based on empirical findings that earlier layers often yield more transferable features.

Action Expert: A lightweight transformer, trained with flow matching, predicts sequences of continuous control actions. The action expert alternates between self-attention and cross-attention layers, balancing internal action coherence and conditioning on perception inputs. Causal masking is applied to enforce temporal consistency.

To reduce computational overhead, linear projections are used to align the modalities' token dimensions. Action chunks are generated instead of single-step predictions, reducing the frequency of inference calls. The model is trained using bfloat16 precision and PyTorch's JIT compilation for runtime optimization.
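The action-expert design described above is straightforward to express in code. The following is a minimal PyTorch sketch of that pattern, alternating self-attention over the action chunk with cross-attention to perception tokens under a causal mask; all dimensions, layer counts, and names are illustrative assumptions rather than the released implementation, and flow-matching timestep conditioning is omitted for brevity.

```python
import torch
import torch.nn as nn

class ActionExpertBlock(nn.Module):
    """One block: self-attention over the action chunk, then cross-attention to perception tokens."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, actions, perception, causal_mask):
        h = self.norm1(actions)
        actions = actions + self.self_attn(h, h, h, attn_mask=causal_mask)[0]   # causal within the chunk
        h = self.norm2(actions)
        actions = actions + self.cross_attn(h, perception, perception)[0]       # condition on VLM tokens
        return actions + self.mlp(self.norm3(actions))

class ActionExpert(nn.Module):
    """Predicts a chunk of continuous actions conditioned on perception features (illustrative sizes)."""
    def __init__(self, action_dim=7, chunk_len=50, dim=512, vlm_dim=960, n_layers=4, n_heads=8):
        super().__init__()
        self.proj_vlm = nn.Linear(vlm_dim, dim)     # linear projection to align token widths
        self.proj_in = nn.Linear(action_dim, dim)   # noisy action chunk -> tokens (flow-matching input)
        self.blocks = nn.ModuleList([ActionExpertBlock(dim, n_heads) for _ in range(n_layers)])
        self.proj_out = nn.Linear(dim, action_dim)  # predicted velocity field over the chunk
        mask = torch.triu(torch.ones(chunk_len, chunk_len, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)   # True = position may not be attended

    def forward(self, noisy_actions, perception_tokens):
        x = self.proj_in(noisy_actions)
        ctx = self.proj_vlm(perception_tokens)
        for blk in self.blocks:
            x = blk(x, ctx, self.causal_mask)
        return self.proj_out(x)
```

A full implementation would additionally condition on the flow-matching timestep and the robot's sensorimotor state; the sketch only shows how chunked, causally masked action prediction hangs off the perception tokens.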
Empirical Evaluation: Simulation and Real-World Performance

SmolVLA is evaluated across both simulation benchmarks (LIBERO and Meta-World) and real-world robotic tasks using low-cost SO100 and SO101 platforms. The model is trained from scratch on ~23K episodes across 481 community datasets, with task labels auto-generated using a VLM. Evaluation metrics include task-level success rates under both in-distribution and out-of-distribution conditions. In the LIBERO benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely matching or surpassing larger models such as π₀ (3.3B). In Meta-World, the model outperforms diffusion policies and smaller-scale VLAs across task difficulty levels. These results are notable considering SmolVLA's smaller training footprint and absence of robotics-specific pretraining. In real-world settings, SmolVLA achieves average success rates of 78.3% across pick-place, stacking, and sorting tasks, outperforming both ACT (trained from scratch) and π₀ (finetuned). Moreover, SmolVLA generalizes across robotic embodiments, maintaining performance on SO101 despite training exclusively on SO100 data.

Performance Implications of Asynchronous Inference

SmolVLA's asynchronous inference stack improves control efficiency by overlapping prediction and execution. Compared to traditional synchronous inference, this approach reduces average task time by ~30% and doubles the number of completed actions in fixed-time scenarios. This is particularly beneficial for edge deployments where inference delays degrade real-time performance. A minimal sketch of this pattern appears after the conclusion below.

Conclusion

SmolVLA demonstrates that compact, reproducible, and open-source VLA models can support competent robotic control on low-cost hardware. Through careful architectural choices—layer pruning, chunked action prediction, and asynchronous execution—SmolVLA maintains performance while significantly reducing computational demands. The model's open training and deployment stack, paired with real-world evaluations, offers a practical foundation for further research in efficient and accessible robot learning. Future directions include expanding cross-embodiment datasets, scaling model capacity without sacrificing latency, and exploring joint training on multimodal corpora beyond robotics data.

Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.
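The asynchronous inference idea referenced above can be sketched as a simple control loop that overlaps execution of the current action chunk with prediction of the next one. In this sketch, policy.predict_chunk, get_observation, and send_action are hypothetical callables, and the trigger threshold and control rate are illustrative assumptions.

```python
import queue
import threading
import time

def async_control_loop(policy, get_observation, send_action, trigger_ratio=0.7, control_period=0.02):
    """Execute the current action chunk while the next chunk is predicted in the background."""
    next_chunk = queue.Queue(maxsize=1)

    def predict_async(obs):
        # Runs in a worker thread so inference latency does not stall the control loop.
        next_chunk.put(policy.predict_chunk(obs))

    current = policy.predict_chunk(get_observation())   # first chunk computed synchronously
    worker = None
    while True:
        for i, action in enumerate(current):
            send_action(action)
            # Once ~70% of the chunk has been executed, start predicting the next one.
            if worker is None and i >= int(trigger_ratio * len(current)):
                worker = threading.Thread(target=predict_async, args=(get_observation(),))
                worker.start()
            time.sleep(control_period)                   # fixed control rate (e.g. 50 Hz)
        worker.join()
        current = next_chunk.get()
        worker = None
```

Requesting the next chunk before the current one is exhausted is what hides inference latency; a synchronous loop would instead block between chunks.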


OpenAI Introduces Four Key Updates to Its AI Agent Framework

OpenAI has announced a set of targeted updates to its AI agent development stack, aimed at expanding platform compatibility, improving support for voice interfaces, and enhancing observability. These updates reflect a consistent progression toward building practical, controllable, and auditable AI agents that can be integrated into real-world applications across client and server environments.

1. TypeScript Support for the Agents SDK

OpenAI's Agents SDK is now available in TypeScript, extending the existing Python implementation to developers working in JavaScript and Node.js environments. The TypeScript SDK provides parity with the Python version, including foundational components such as:

Handoffs: Mechanisms to route execution to other agents or processes.
Guardrails: Runtime checks that constrain tool behavior to defined boundaries.
Tracing: Hooks for collecting structured telemetry during agent execution.
MCP (Model Context Protocol): Protocols for passing contextual state between agent steps and tool calls.

This addition brings the SDK into alignment with modern web and cloud-native application stacks. Developers can now build and deploy agents across both frontend (browser) and backend (Node.js) contexts using a unified set of abstractions. The open documentation is available at openai-agents-js.

2. RealtimeAgent with Human-in-the-Loop Capabilities

OpenAI introduced a new RealtimeAgent abstraction to support latency-sensitive voice applications. RealtimeAgents extend the Agents SDK with audio input/output, stateful interactions, and interruption handling. One of the more substantial features is human-in-the-loop (HITL) approval, allowing developers to intercept an agent's execution at runtime, serialize its state, and require manual confirmation before continuing. This is especially relevant for applications requiring oversight, compliance checkpoints, or domain-specific validation during tool execution. Developers can pause execution, inspect the serialized state, and resume the agent with full context retention. The workflow is described in detail in OpenAI's HITL documentation.
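The approval flow described above follows a general pattern that can be sketched independently of the SDK. The Python sketch below is not the Agents SDK API; the function names and the state/step shapes are hypothetical, and it only illustrates pausing before a sensitive tool call, persisting the run state, and resuming after confirmation.

```python
import json

def run_with_approval(plan_next_step, execute_step, approve, state, save_path="pending_run.json"):
    """Generic human-in-the-loop gate: pause before sensitive tool calls, persist the run,
    and continue only after manual confirmation (illustrative, not the SDK's API)."""
    while not state.get("done", False):
        step = plan_next_step(state)              # agent proposes the next tool call as plain data
        if step.get("needs_approval", False):
            # Serialize the run so it can be inspected (or resumed from another process) later.
            with open(save_path, "w") as f:
                json.dump({"state": state, "pending_step": step}, f)
            if not approve(step):                 # a human reviews the proposed call
                # Record the rejection so the planner can propose an alternative next time.
                state.setdefault("rejected", []).append(step)
                continue
        state = execute_step(state, step)         # run the approved step and fold the result into the state
    return state
```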
3. Traceability for Realtime API Sessions

Complementing the RealtimeAgent feature, OpenAI has expanded the Traces dashboard to include support for voice agent sessions. Tracing now covers full Realtime API sessions—whether initiated via the SDK or directly through API calls. The Traces interface allows visualization of:

Audio inputs and outputs (streamed or buffered)
Tool invocations and parameters
User interruptions and agent resumptions

This provides a consistent audit trail for both text-based and audio-first agents, simplifying debugging, quality assurance, and performance tuning across modalities. The trace format is standardized and integrates with OpenAI's broader monitoring stack, offering visibility without requiring additional instrumentation. Further implementation details are available in the voice agent guide at openai-agents-js/guides/voice-agents.

4. Refinements to the Speech-to-Speech Pipeline

OpenAI has also made updates to its underlying speech-to-speech model, which powers real-time audio interactions. Enhancements focus on reducing latency, improving naturalness, and handling interruptions more effectively. While the model's core capabilities—speech recognition, synthesis, and real-time feedback—remain in place, the refinements offer better alignment for dialog systems where responsiveness and tone variation are essential. This includes:

Lower latency streaming: More immediate turn-taking in spoken conversations.
Expressive audio generation: Improved intonation and pause modeling.
Robustness to interruptions: Agents can respond gracefully to overlapping input.

These changes align with OpenAI's broader efforts to support embodied and conversational agents that function in dynamic, multimodal contexts.

Conclusion

Together, these four updates strengthen the foundation for building voice-enabled, traceable, and developer-friendly AI agents. By providing deeper integrations with TypeScript environments, introducing structured control points in real-time flows, and enhancing observability and speech interaction quality, OpenAI continues to move toward a more modular and interoperable agent ecosystem.

"Four updates to building agents with OpenAI: Agents SDK in TypeScript, a new RealtimeAgent feature for voice agents, Traces support for the Realtime API, and improvements to our speech-to-speech model." — OpenAI Developers (@OpenAIDevs) June 3, 2025


Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion

arXiv:2506.01365v1 Announce Type: cross Abstract: Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
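To illustrate the addition-based fusion that the abstract reports as most effective, here is a minimal PyTorch sketch; the feature dimensions, projection sizes, and classifier head are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AdditionFusionVAD(nn.Module):
    """Fuse MFCC and pre-trained-model (PTM) frame features by projection + addition,
    then classify each frame as speech / non-speech (illustrative sizes)."""
    def __init__(self, mfcc_dim=39, ptm_dim=768, hidden=256):
        super().__init__()
        self.proj_mfcc = nn.Linear(mfcc_dim, hidden)
        self.proj_ptm = nn.Linear(ptm_dim, hidden)
        self.classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, mfcc_frames, ptm_frames):
        # Both inputs are (batch, time, feature_dim); addition requires a shared width, hence the projections.
        fused = self.proj_mfcc(mfcc_frames) + self.proj_ptm(ptm_frames)
        return self.classifier(fused).squeeze(-1)   # per-frame speech logits

# Concatenation fusion would instead apply torch.cat([...], dim=-1) before the classifier,
# while cross-attention fusion would let one feature stream attend to the other.
```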


Completing A Systematic Review in Hours instead of Months with Interactive AI Agents

arXiv:2504.14822v2 Announce Type: replace-cross Abstract: Systematic reviews (SRs) are vital for evidence-based practice in high-stakes disciplines, such as healthcare, but are often impeded by intensive labor and lengthy processes that can take months to complete. Due to the high demand for domain expertise, existing automatic summarization methods fail to accurately identify relevant studies and generate high-quality summaries. To that end, we introduce InsightAgent, a human-centered interactive AI agent powered by large language models that revolutionizes this workflow. InsightAgent partitions a large literature corpus based on semantics and employs a multi-agent design for more focused processing of literature, leading to significant improvement in the quality of generated SRs. InsightAgent also provides intuitive visualizations of the corpus and agent trajectories, allowing users to effortlessly monitor the actions of the agent and provide real-time feedback based on their expertise. Our user studies with 9 medical professionals demonstrate that the visualization and interaction mechanisms can effectively improve the quality of synthesized SRs by 27.2%, reaching 79.7% of human-written quality. At the same time, user satisfaction is improved by 34.4%. With InsightAgent, it only takes a clinician about 1.5 hours, rather than months, to complete a high-quality systematic review.
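One plausible reading of the semantic partitioning step is to embed each abstract and cluster the embeddings so that each cluster can be screened by a dedicated agent. The short sketch below follows that assumption; the encoder model, clustering method, and partition count are illustrative, not necessarily what InsightAgent uses.

```python
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def partition_corpus(abstracts, n_partitions=8):
    """Split a literature corpus into semantically coherent groups for per-group agent processing."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")          # illustrative embedding model
    embeddings = encoder.encode(abstracts, normalize_embeddings=True)
    labels = KMeans(n_clusters=n_partitions, n_init=10, random_state=0).fit_predict(embeddings)
    groups = {}
    for text, label in zip(abstracts, labels):
        groups.setdefault(int(label), []).append(text)         # cluster id -> list of abstracts
    return groups
```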


Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You

arXiv:2401.16092v4 Announce Type: replace Abstract: Text-to-image generation models have recently achieved astonishing results in image quality, flexibility, and text alignment, and are consequently employed in a fast-growing number of applications. Through improvements in multilingual abilities, a larger community now has access to this technology. However, our results show that multilingual models suffer from significant gender biases just as monolingual models do. Furthermore, the natural expectation that multilingual models will provide similar results across languages does not hold up. Instead, there are important differences between languages. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. We use MAGBIG to investigate the effect of multilingualism on gender bias in T2I models. To this end, we construct multilingual prompts requesting portraits of people with a certain occupation or trait. Our results show that not only do models exhibit strong gender biases but they also behave differently across languages. Furthermore, we investigate prompt engineering strategies, such as indirect, neutral formulations, to mitigate these biases. Unfortunately, these approaches have limited success and result in worse text-to-image alignment. Consequently, we call for more research into diverse representations across languages in image generators, as well as into steerability to address biased model behavior.
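As a rough illustration of how such a benchmark can be assembled and scored, the sketch below builds direct and gender-neutral prompt variants from occupation names and computes a simple deviation-from-parity score over perceived-gender predictions for the generated portraits. The templates, language coverage, and metric here are assumptions, not MAGBIG's actual prompts or evaluation protocol.

```python
# Hypothetical templates; the benchmark's real prompts span many languages and formulations.
TEMPLATES = {
    "en_direct":  "a photo of the face of a {occupation}",
    "en_neutral": "a photo of the face of a person who works as a {occupation}",
}

def build_prompts(occupations, templates=TEMPLATES):
    """Expand an occupation list into per-template prompts, keyed by (template id, occupation)."""
    return {(tid, occ): tpl.format(occupation=occ)
            for tid, tpl in templates.items() for occ in occupations}

def gender_skew(predicted_genders):
    """Deviation from parity for perceived-gender labels on generated images:
    0 = balanced, 1 = fully one-sided (one simple choice of metric)."""
    share_female = sum(g == "female" for g in predicted_genders) / len(predicted_genders)
    return abs(share_female - 0.5) * 2
```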


Graph-Based Spectral Decomposition for Parameter Coordination in Language Model Fine-Tuning

arXiv:2504.19583v2 Announce Type: replace-cross Abstract: This paper proposes a parameter collaborative optimization algorithm for large language models, enhanced with graph spectral analysis. The goal is to improve both fine-tuning efficiency and structural awareness during training. In the proposed method, the parameters of a pre-trained language model are treated as nodes in a graph. A weighted graph is constructed, and Laplacian spectral decomposition is applied to enable frequency-domain modeling and structural representation of the parameter space. Based on this structure, a joint loss function is designed. It combines the task loss with a spectral regularization term to facilitate collaborative updates among parameters. In addition, a spectral filtering mechanism is introduced during the optimization phase. This mechanism adjusts gradients in a structure-aware manner, enhancing the model’s training stability and convergence behavior. The method is evaluated on multiple tasks, including traditional fine-tuning comparisons, few-shot generalization tests, and convergence speed analysis. In all settings, the proposed approach demonstrates superior performance. The experimental results confirm that the spectral collaborative optimization framework effectively reduces parameter perturbations and improves fine-tuning quality while preserving overall model performance. This work contributes significantly to the field of artificial intelligence by advancing parameter-efficient training methodologies for large-scale models, reinforcing the importance of structural signal processing in deep learning optimization, and offering a robust, generalizable framework for enhancing language model adaptability and performance.
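The core idea, treating parameters as nodes of a weighted graph and regularizing their graph-frequency content, can be sketched compactly. In the sketch below, the adjacency construction, the low-frequency cutoff, and the regularization weight are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def graph_fourier_basis(adjacency: torch.Tensor) -> torch.Tensor:
    """Eigenvectors of the combinatorial graph Laplacian L = D - A, sorted by ascending eigenvalue."""
    degree = torch.diag(adjacency.sum(dim=1))
    laplacian = degree - adjacency
    _, eigvecs = torch.linalg.eigh(laplacian)    # symmetric matrix, so eigh is appropriate
    return eigvecs

def spectral_penalty(params: torch.Tensor, basis: torch.Tensor, k_low: int = 32, weight: float = 1e-4):
    """Penalize high graph-frequency components of a flattened parameter group."""
    coeffs = basis.T @ params                    # graph Fourier transform of the parameters
    return weight * coeffs[k_low:].pow(2).sum()  # low-frequency components are left unpenalized

# Joint objective sketch: total_loss = task_loss + spectral_penalty(param_vector, graph_fourier_basis(A))
```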


CHEER-Ekman: Fine-grained Embodied Emotion Classification

arXiv:2506.01047v1 Announce Type: new Abstract: Emotions manifest through physical experiences and bodily reactions, yet identifying such embodied emotions in text remains understudied. We present an embodied emotion classification dataset, CHEER-Ekman, extending the existing binary embodied emotion dataset with Ekman’s six basic emotion categories. Using automatic best-worst scaling with large language models, we achieve performance superior to supervised approaches on our new dataset. Our investigation reveals that simplified prompting instructions and chain-of-thought reasoning significantly improve emotion recognition accuracy, enabling smaller models to achieve competitive performance with larger ones.
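Best-worst scaling aggregates many "which is best / which is worst" judgments into per-item scores. The sketch below shows one common aggregation; the judge callable (for example, an LLM prompted over a tuple of candidates) and the tuple sampling scheme are assumptions, and the paper's exact protocol may differ.

```python
import random
from collections import defaultdict

def bws_scores(items, judge, tuple_size=4, n_tuples=200, seed=0):
    """Aggregate best-worst scaling judgments into per-item scores in [-1, 1]:
    (times chosen best - times chosen worst) / times shown."""
    rng = random.Random(seed)
    best, worst, shown = defaultdict(int), defaultdict(int), defaultdict(int)
    for _ in range(n_tuples):
        batch = rng.sample(items, tuple_size)             # random n-tuple presented to the judge
        chosen_best, chosen_worst = judge(tuple(batch))   # e.g. an LLM picks the best and worst exemplar
        best[chosen_best] += 1
        worst[chosen_worst] += 1
        for item in batch:
            shown[item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in items if shown[item]}
```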
