M$^3$FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

arXiv:2506.02510v1 Announce Type: new Abstract: Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called \texttt{M$^3$FinMeeting}, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, \texttt{M$^3$FinMeeting} supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, \texttt{M$^3$FinMeeting} includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of \texttt{M$^3$FinMeeting} as a benchmark for assessing LLMs’ financial meeting comprehension skills.
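
To make the three-task structure concrete, here is a minimal, hypothetical sketch of running a model over the benchmark; the record layout, field names, and prompts are illustrative assumptions, not the dataset's released format.

```python
# Hypothetical record layout and prompts for illustration only; the actual
# M^3FinMeeting release format is defined by the paper, not by this sketch.
meetings = [
    {"lang": "en", "transcript": "...", "summary": "...", "qa_pairs": [("Q?", "A.")]},
]

def evaluate(model, meetings):
    """Run the three benchmark tasks: summarization, QA-pair extraction, QA."""
    for m in meetings:
        pred_summary = model(f"Summarize this meeting:\n{m['transcript']}")
        pred_pairs = model(f"Extract question-answer pairs:\n{m['transcript']}")
        answers = [model(f"{m['transcript']}\nQ: {q}") for q, _ in m["qa_pairs"]]
        yield pred_summary, pred_pairs, answers  # score each against references

# Smoke test with a stub model standing in for a real LLM call.
print(list(evaluate(lambda prompt: "stub output", meetings)))
```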

Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective

arXiv:2506.02553v1 Announce Type: cross Abstract: We study a common challenge in reinforcement learning for large language models (LLMs): the Zero-Reward Assumption, where non-terminal actions (i.e., intermediate token generations) receive zero task-specific immediate reward, while only the final token receives a reward for the entire response. This assumption arises frequently in practice, as precise token-level rewards are often difficult or infeasible to obtain in LLM applications. In this work, we provide a unifying theoretical perspective. We introduce the Trajectory Policy Gradient Theorem, which shows that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model, regardless of whether the Zero-Reward Assumption holds or not, for algorithms in the REINFORCE and Actor-Critic families. This result reveals that widely used methods such as PPO, GRPO, ReMax, and RLOO inherently possess the capacity to model token-level reward signals, offering a theoretical justification for response-level reward approaches. Our findings pave the way for more practical, efficient LLM fine-tuning, allowing developers to treat training algorithms as black boxes and focus on improving the response-level reward model with auxiliary sub-models. We also offer a detailed analysis of popular RL and non-RL methods, comparing their theoretical foundations and practical advantages across common LLM tasks. Finally, we propose a new algorithm: Token-Reinforced Policy Optimization (TRePO), a theoretically grounded method that is simpler than PPO, matches GRPO in memory efficiency, and holds promise for broad applicability.
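
For reference, the textbook response-level REINFORCE estimator that this result legitimizes can be written as follows (a standard form, not the paper's full TRePO derivation): given a prompt $x$ and a sampled response $y = (y_1, \dots, y_T)$ with response-level reward $R(x, y)$,

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ R(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right].
$$

Per the abstract, the Trajectory Policy Gradient Theorem states that the policy gradient under the true, unknown token-level rewards can be unbiasedly estimated from a response-level reward model alone, which is what licenses treating the training algorithm as a black box and investing effort in the reward model instead.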

CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought

arXiv:2502.17214v2 Announce Type: replace Abstract: Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. This limitation makes it challenging to detect misinformation and ensure reliable decision-making. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, often requiring multiple response samples, which incurs high computational costs. Moreover, LLMs have been shown to be overconfident, particularly when using reasoning steps to derive their answers. In this work, we propose CoT-UQ, a response-wise UQ framework that integrates LLMs’ inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process. CoT-UQ captures critical information during inference by extracting keywords from each reasoning step and assessing their importance to the final answer. This key reasoning information is then aggregated to produce a final uncertainty estimate. We conduct extensive experiments with the Llama family, with model sizes ranging from 8B to 13B, across logical and mathematical reasoning tasks. Experimental results demonstrate that CoT-UQ significantly outperforms existing UQ methods, achieving an average improvement of 5.9% AUROC compared to current UQ methods. The code is available at: https://github.com/ZBox1005/CoT-UQ.
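
As a rough illustration of the response-wise aggregation idea (the keyword confidences, importance weights, and aggregation rule below are hypothetical stand-ins, not the paper's exact formulation):

```python
# Illustrative sketch of CoT-UQ-style aggregation: each reasoning step
# contributes keyword-level confidences, weighted by that step's importance
# to the final answer. All numbers and the rule itself are hypothetical.

def aggregate_uncertainty(steps):
    """steps: list of (keyword_confidences, importance_weight), one per CoT step."""
    num = sum(w * (1 - sum(c) / len(c)) for c, w in steps if c)
    den = sum(w for c, w in steps if c)
    return num / den if den else 0.5  # importance-weighted mean keyword uncertainty

# Example: three reasoning steps with model-assigned keyword confidences.
steps = [([0.9, 0.8], 0.5), ([0.6], 0.3), ([0.95, 0.7, 0.85], 1.0)]
print(round(aggregate_uncertainty(steps), 3))  # single response-wise estimate
```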

Unique Hard Attention: A Tale of Two Sides

arXiv:2503.14615v2 Announce Type: replace-cross Abstract: Understanding the expressive power of transformers has recently attracted attention, as it offers insights into their abilities and limitations. Many studies analyze unique hard attention transformers, where attention selects a single position that maximizes the attention scores. When multiple positions achieve the maximum score, either the rightmost or the leftmost of those is chosen. In this paper, we highlight the importance of this seeming triviality. Recently, finite-precision transformers with both leftmost- and rightmost-hard attention were shown to be equivalent to Linear Temporal Logic (LTL). We show that this no longer holds with only leftmost-hard attention — in that case, they correspond to a \emph{strictly weaker} fragment of LTL. Furthermore, we show that models with leftmost-hard attention are equivalent to \emph{soft} attention, suggesting they may better approximate real-world transformers than right-attention models. These findings refine the landscape of transformer expressivity and underscore the role of attention directionality.
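
Concretely, the two tie-breaking conventions can be written as follows, where $s(i, j)$ is the attention score of query position $i$ on key position $j$, ranging over the positions the model may attend to:

$$
\mathrm{att}^{\text{left}}(i) = \min \operatorname*{arg\,max}_{j} \, s(i, j),
\qquad
\mathrm{att}^{\text{right}}(i) = \max \operatorname*{arg\,max}_{j} \, s(i, j).
$$

Unique hard attention then attends to exactly that one position; the paper's point is that the choice of min versus max over the argmax set changes the model class's expressive power.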

KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG

arXiv:2506.02503v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access broader knowledge sources, yet factual inconsistencies persist due to noise in retrieved documents, even with advanced retrieval methods. We demonstrate that enhancing generative models’ capacity to process noisy content is equally critical for robust performance. In this paper, we present KARE-RAG (Knowledge-Aware Refinement and Enhancement for RAG), which improves knowledge utilization through three key innovations: (1) structured knowledge representations that facilitate error detection during training, (2) Dense Direct Preference Optimization (DDPO), a refined training objective that prioritizes correction of critical errors, and (3) a contrastive data generation pipeline that maintains semantic consistency while rectifying factual inaccuracies. Experiments show our method significantly enhances standard RAG pipelines across model scales, improving both in-domain and out-of-domain task performance without compromising general capabilities. Notably, these gains are achieved with modest training data, suggesting data-efficient optimization is possible through targeted learning strategies. Our findings establish a new direction for RAG improvement: by improving how models learn to process retrieved content, we can enhance performance across diverse inference paradigms. All data and code will be publicly available on GitHub.
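
The abstract does not spell out how DDPO modifies its base objective, but for orientation, the standard Direct Preference Optimization loss it refines is: given preferred and dispreferred outputs $y_w, y_l$ for input $x$, a reference policy $\pi_{\mathrm{ref}}$, and temperature $\beta$,

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right].
$$

In KARE-RAG, the contrastive pipeline supplies the $(y_w, y_l)$ pairs by rectifying factual errors while preserving semantics, and DDPO is described as refining this objective to prioritize correction of the critical errors.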

Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics

Despite recent progress in robotic control via large-scale vision-language-action (VLA) models, real-world deployment remains constrained by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, resulting in significant memory and compute costs. This limits experimentation to well-resourced labs and clouds, excluding practitioners working with lower-cost hardware. Additionally, much of the current progress in VLA research remains either proprietary or based on non-reproducible methodologies, impeding open research. Finally, data heterogeneity across robotic platforms—differences in morphology, sensors, and control modes—poses a further challenge to generalizability and cross-platform learning.

Hugging Face Introduces SmolVLA: A Lightweight, Open VLA Framework

Hugging Face presents SmolVLA, a compact vision-language-action model developed for affordability and deployment efficiency. Unlike conventional VLAs, SmolVLA is trained entirely on community-collected datasets and is optimized to run on single-GPU or CPU environments. The model architecture integrates a trimmed version of a pretrained vision-language model (SmolVLM-2) and a transformer-based action expert. This structure enables efficient low-level control from natural language instructions and RGB camera inputs.

A distinguishing feature of SmolVLA is its asynchronous inference stack, which decouples action prediction from execution. This design enables low-latency control suitable for real-time applications, even in resource-constrained settings. SmolVLA is released under an open license with accompanying code, training data, and deployment tools.

Architectural Overview and Design Trade-Offs

The SmolVLA model is structured into two primary components:

Perception Module (SmolVLM-2): A pretrained compact vision-language encoder processes sequences of RGB images, sensorimotor states, and language instructions. For efficiency, the model limits visual tokens through downsampling and uses only the lower half of the transformer layers, based on empirical findings that earlier layers often yield more transferable features.

Action Expert: A lightweight transformer, trained with flow matching, predicts sequences of continuous control actions. The action expert alternates between self-attention and cross-attention layers, balancing internal action coherence and conditioning on perception inputs. Causal masking is applied to enforce temporal consistency.

To reduce computational overhead, linear projections are used to align the modalities’ token dimensions. Action chunks are generated instead of single-step predictions, reducing the frequency of inference calls. The model is trained using bfloat16 precision and Torch’s JIT compilation for runtime optimization.

Empirical Evaluation: Simulation and Real-World Performance

SmolVLA is evaluated across both simulation benchmarks (LIBERO and Meta-World) and real-world robotic tasks using low-cost SO100 and SO101 platforms. The model is trained from scratch on ~23K episodes across 481 community datasets, with task labels auto-generated using a VLM. Evaluation metrics include task-level success rates under both in-distribution and out-of-distribution conditions. In the LIBERO benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely matching or surpassing larger models such as π₀ (3.3B). In Meta-World, the model outperforms diffusion policies and smaller-scale VLAs across task difficulty levels.
These results are notable considering SmolVLA’s smaller training footprint and absence of robotics-specific pretraining. In real-world settings, SmolVLA achieves an average success rate of 78.3% across pick-place, stacking, and sorting tasks—outperforming both ACT (trained from scratch) and π₀ (finetuned). Moreover, SmolVLA generalizes across robotic embodiments, maintaining performance on SO101 despite training exclusively on SO100 data.

Performance Implications of Asynchronous Inference

SmolVLA’s asynchronous inference stack improves control efficiency by overlapping prediction and execution. Compared to traditional synchronous inference, this approach reduces average task time by ~30% and doubles the number of completed actions in fixed-time scenarios. This is particularly beneficial for edge deployments where inference delays degrade real-time performance. (A minimal illustrative sketch of this decoupling appears at the end of this post.)

Conclusion

SmolVLA demonstrates that compact, reproducible, and open-source VLA models can support competent robotic control on low-cost hardware. Through careful architectural choices—layer pruning, chunked action prediction, and asynchronous execution—SmolVLA maintains performance while significantly reducing computational demands. The model’s open training and deployment stack, paired with real-world evaluations, offers a practical foundation for further research in efficient and accessible robot learning. Future directions include expanding cross-embodiment datasets, scaling model capacity without sacrificing latency, and exploring joint training on multimodal corpora beyond robotics data.
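
As a rough illustration of the asynchronous decoupling described above (the chunk sizes, timings, and function names are hypothetical, not SmolVLA's actual stack): a worker executes the current action chunk while the policy predicts the next one, so inference latency is hidden behind actuation.

```python
# Sketch of asynchronous inference: prediction of chunk t+1 overlaps
# execution of chunk t via a small buffer. Timings are stand-ins.
import queue
import threading
import time

actions = queue.Queue(maxsize=2)  # buffer of pending action chunks

def predictor(n_chunks):
    for step in range(n_chunks):
        time.sleep(0.05)                                   # stand-in for model inference
        actions.put([f"a{step}_{k}" for k in range(4)])    # one chunk = 4 actions
    actions.put(None)                                      # sentinel: no more chunks

def executor():
    while (chunk := actions.get()) is not None:
        for a in chunk:
            time.sleep(0.01)                               # stand-in for robot actuation
            print("executing", a)

threading.Thread(target=predictor, args=(3,), daemon=True).start()
executor()  # runs concurrently with prediction of the next chunk
```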

OpenAI Introduces Four Key Updates to Its AI Agent Framework

OpenAI has announced a set of targeted updates to its AI agent development stack, aimed at expanding platform compatibility, improving support for voice interfaces, and enhancing observability. These updates reflect a consistent progression toward building practical, controllable, and auditable AI agents that can be integrated into real-world applications across client and server environments.

1. TypeScript Support for the Agents SDK

OpenAI’s Agents SDK is now available in TypeScript, extending the existing Python implementation to developers working in JavaScript and Node.js environments. The TypeScript SDK provides parity with the Python version, including foundational components such as:

Handoffs: Mechanisms to route execution to other agents or processes.

Guardrails: Runtime checks that constrain tool behavior to defined boundaries.

Tracing: Hooks for collecting structured telemetry during agent execution.

MCP (Model Context Protocol): Protocols for passing contextual state between agent steps and tool calls.

This addition brings the SDK into alignment with modern web and cloud-native application stacks. Developers can now build and deploy agents across both frontend (browser) and backend (Node.js) contexts using a unified set of abstractions. The open documentation is available at openai-agents-js.

2. RealtimeAgent with Human-in-the-Loop Capabilities

OpenAI introduced a new RealtimeAgent abstraction to support latency-sensitive voice applications. RealtimeAgents extend the Agents SDK with audio input/output, stateful interactions, and interruption handling. One of the more substantial features is human-in-the-loop (HITL) approval, allowing developers to intercept an agent’s execution at runtime, serialize its state, and require manual confirmation before continuing. This is especially relevant for applications requiring oversight, compliance checkpoints, or domain-specific validation during tool execution. Developers can pause execution, inspect the serialized state, and resume the agent with full context retention. The workflow is described in detail in OpenAI’s HITL documentation (a minimal sketch of the pattern appears at the end of this post).

3. Traceability for Realtime API Sessions

Complementing the RealtimeAgent feature, OpenAI has expanded the Traces dashboard to include support for voice agent sessions. Tracing now covers full Realtime API sessions—whether initiated via the SDK or directly through API calls. The Traces interface allows visualization of:

Audio inputs and outputs (streamed or buffered)

Tool invocations and parameters

User interruptions and agent resumptions

This provides a consistent audit trail for both text-based and audio-first agents, simplifying debugging, quality assurance, and performance tuning across modalities. The trace format is standardized and integrates with OpenAI’s broader monitoring stack, offering visibility without requiring additional instrumentation. Further implementation details are available in the voice agent guide at openai-agents-js/guides/voice-agents.

4. Refinements to the Speech-to-Speech Pipeline

OpenAI has also made updates to its underlying speech-to-speech model, which powers real-time audio interactions. Enhancements focus on reducing latency, improving naturalness, and handling interruptions more effectively. While the model’s core capabilities—speech recognition, synthesis, and real-time feedback—remain in place, the refinements offer better alignment for dialog systems where responsiveness and tone variation are essential.
This includes:

Lower latency streaming: More immediate turn-taking in spoken conversations.

Expressive audio generation: Improved intonation and pause modeling.

Robustness to interruptions: Agents can respond gracefully to overlapping input.

These changes align with OpenAI’s broader efforts to support embodied and conversational agents that function in dynamic, multimodal contexts.

Conclusion

Together, these four updates strengthen the foundation for building voice-enabled, traceable, and developer-friendly AI agents. By providing deeper integrations with TypeScript environments, introducing structured control points in real-time flows, and enhancing observability and speech interaction quality, OpenAI continues to move toward a more modular and interoperable agent ecosystem.

“Four updates to building agents with OpenAI: Agents SDK in TypeScript, a new RealtimeAgent feature for voice agents, Traces support for the Realtime API, and improvements to our speech-to-speech model.” — OpenAI Developers (@OpenAIDevs) June 3, 2025
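
As promised above, here is a minimal sketch of the human-in-the-loop approval pattern described in update 2. It mirrors the pause/serialize/approve/resume workflow generically; the function and field names are illustrative and this is not the actual Agents SDK API.

```python
# Generic sketch of the HITL approval workflow: pause before a sensitive
# tool call, serialize agent state, resume once a reviewer decides.
import json

def run_with_approval(pending_call, agent_state, approve):
    # Serialize the paused run so it can be inspected or stored out of band.
    snapshot = json.dumps({"state": agent_state, "call": pending_call})
    if approve(pending_call):                 # human (or policy) decision point
        restored = json.loads(snapshot)       # resume with full context retained
        return f"executed {restored['call']['tool']} at step {restored['state']['step']}"
    return "call rejected; agent continues without the tool result"

result = run_with_approval(
    {"tool": "send_email", "args": {"to": "ops@example.com"}},
    {"step": 3, "history": ["plan", "draft"]},
    approve=lambda call: call["tool"] != "delete_database",  # stand-in reviewer
)
print(result)
```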
