
OpenAI Introduces Four Key Updates to Its AI Agent Framework

OpenAI has announced a set of targeted updates to its AI agent development stack, aimed at expanding platform compatibility, improving support for voice interfaces, and enhancing observability. These updates reflect a consistent progression toward building practical, controllable, and auditable AI agents that can be integrated into real-world applications across client and server environments.

1. TypeScript Support for the Agents SDK

OpenAI’s Agents SDK is now available in TypeScript, extending the existing Python implementation to developers working in JavaScript and Node.js environments. The TypeScript SDK provides parity with the Python version, including foundational components such as:

Handoffs: Mechanisms to route execution to other agents or processes.
Guardrails: Runtime checks that constrain tool behavior to defined boundaries.
Tracing: Hooks for collecting structured telemetry during agent execution.
MCP (Model Context Protocol): Protocols for passing contextual state between agent steps and tool calls.

This addition brings the SDK into alignment with modern web and cloud-native application stacks. Developers can now build and deploy agents across both frontend (browser) and backend (Node.js) contexts using a unified set of abstractions. The open documentation is available at openai-agents-js.

2. RealtimeAgent with Human-in-the-Loop Capabilities

OpenAI introduced a new RealtimeAgent abstraction to support latency-sensitive voice applications. RealtimeAgents extend the Agents SDK with audio input/output, stateful interactions, and interruption handling. One of the more substantial features is human-in-the-loop (HITL) approval, allowing developers to intercept an agent’s execution at runtime, serialize its state, and require manual confirmation before continuing. This is especially relevant for applications requiring oversight, compliance checkpoints, or domain-specific validation during tool execution. Developers can pause execution, inspect the serialized state, and resume the agent with full context retention. The workflow is described in detail in OpenAI’s HITL documentation.
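To make the pause-and-approve flow concrete, here is a minimal, illustrative sketch in plain Python. It does not use the Agents SDK’s actual HITL API; the names below (PendingToolCall, request_approval, resume) are hypothetical. It only shows the shape of the pattern the article describes: serialize the pending state, ask a human to confirm, then resume with full context.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical types and helpers for illustration only; the real Agents SDK
# exposes its own interruption/approval API (see OpenAI's HITL documentation).

@dataclass
class PendingToolCall:
    tool_name: str
    arguments: dict
    conversation: list  # prior turns, kept so the agent can resume with full context

def serialize_state(pending: PendingToolCall) -> str:
    """Freeze the agent's state so it can be reviewed (or stored) before resuming."""
    return json.dumps(asdict(pending))

def request_approval(state_json: str) -> bool:
    """Stand-in for a human reviewer: show the pending call and ask for confirmation."""
    print("Agent wants to run a tool:\n", state_json)
    return input("Approve? [y/N] ").strip().lower() == "y"

def resume(pending: PendingToolCall) -> None:
    """Placeholder for actually executing the approved tool call."""
    print(f"Executing {pending.tool_name} with {pending.arguments}")

pending = PendingToolCall(
    tool_name="issue_refund",
    arguments={"order_id": "A-1042", "amount": 25.0},
    conversation=[{"role": "user", "content": "Please refund my last order."}],
)

state = serialize_state(pending)
if request_approval(state):
    resume(pending)  # continue with full context retention
else:
    print("Tool call rejected; agent stays paused.")
```

In the actual SDK the interruption is raised by the runtime and handled through its own approval API rather than built by hand, but the serialize, review, resume loop follows the same outline.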
3. Traceability for Realtime API Sessions

Complementing the RealtimeAgent feature, OpenAI has expanded the Traces dashboard to include support for voice agent sessions. Tracing now covers full Realtime API sessions, whether initiated via the SDK or directly through API calls. The Traces interface allows visualization of:

Audio inputs and outputs (streamed or buffered)
Tool invocations and parameters
User interruptions and agent resumptions

This provides a consistent audit trail for both text-based and audio-first agents, simplifying debugging, quality assurance, and performance tuning across modalities. The trace format is standardized and integrates with OpenAI’s broader monitoring stack, offering visibility without requiring additional instrumentation. Further implementation details are available in the voice agent guide at openai-agents-js/guides/voice-agents.

4. Refinements to the Speech-to-Speech Pipeline

OpenAI has also made updates to its underlying speech-to-speech model, which powers real-time audio interactions. Enhancements focus on reducing latency, improving naturalness, and handling interruptions more effectively. While the model’s core capabilities (speech recognition, synthesis, and real-time feedback) remain in place, the refinements offer better alignment for dialog systems where responsiveness and tone variation are essential. This includes:

Lower latency streaming: More immediate turn-taking in spoken conversations.
Expressive audio generation: Improved intonation and pause modeling.
Robustness to interruptions: Agents can respond gracefully to overlapping input.

These changes align with OpenAI’s broader efforts to support embodied and conversational agents that function in dynamic, multimodal contexts.

Conclusion

Together, these four updates strengthen the foundation for building voice-enabled, traceable, and developer-friendly AI agents. By providing deeper integrations with TypeScript environments, introducing structured control points in real-time flows, and enhancing observability and speech interaction quality, OpenAI continues to move toward a more modular and interoperable agent ecosystem.

“Four updates to building agents with OpenAI: Agents SDK in TypeScript, a new RealtimeAgent feature for voice agents, Traces support for the Realtime API, and improvements to our speech-to-speech model.” (OpenAI Developers, @OpenAIDevs, June 3, 2025)

The post OpenAI Introduces Four Key Updates to Its AI Agent Framework appeared first on MarkTechPost.


CHEER-Ekman: Fine-grained Embodied Emotion Classification

arXiv:2506.01047v1 Announce Type: new Abstract: Emotions manifest through physical experiences and bodily reactions, yet identifying such embodied emotions in text remains understudied. We present an embodied emotion classification dataset, CHEER-Ekman, extending the existing binary embodied emotion dataset with Ekman’s six basic emotion categories. Using automatic best-worst scaling with large language models, we achieve performance superior to supervised approaches on our new dataset. Our investigation reveals that simplified prompting instructions and chain-of-thought reasoning significantly improve emotion recognition accuracy, enabling smaller models to achieve competitive performance with larger ones.
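The abstract does not spell out how best-worst scaling is set up; the sketch below illustrates only the standard aggregation step, under assumptions: an LLM judge (stubbed here as pick_best_and_worst, with random choices standing in for the model) is shown small tuples of sentences and asked which most and least convey a target emotion, and each sentence’s score is the number of times it was chosen best minus the number of times it was chosen worst, divided by its appearances.

```python
import itertools
import random
from collections import Counter

# Toy sentences to be ranked for how strongly they convey an embodied emotion (e.g., fear).
sentences = [
    "My hands were shaking and my stomach dropped.",
    "I shrugged and kept walking.",
    "A warm rush spread through my chest.",
    "My jaw clenched and my face went hot.",
    "My knees buckled and I could barely breathe.",
]

def pick_best_and_worst(tuple_of_sents, emotion="fear"):
    """Placeholder for the LLM judge: return (best, worst) for the target emotion.
    In the paper this would be a prompted large language model; here it picks randomly."""
    best, worst = random.sample(list(tuple_of_sents), 2)
    return best, worst

best_counts, worst_counts, appearances = Counter(), Counter(), Counter()

# Present every 4-tuple (in practice a sampled subset) and tally best/worst picks.
for tup in itertools.combinations(sentences, 4):
    best, worst = pick_best_and_worst(tup)
    best_counts[best] += 1
    worst_counts[worst] += 1
    for s in tup:
        appearances[s] += 1

# Best-worst scaling score: (#best - #worst) / #appearances, ranging from -1 to 1.
scores = {s: (best_counts[s] - worst_counts[s]) / appearances[s] for s in sentences}
for s, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:+.2f}  {s}")
```

The paper’s actual prompts, tuple sizes, and emotion definitions will differ; the sketch only shows the scoring mechanics.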


Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion

arXiv:2506.01365v1 Announce Type: cross Abstract: Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
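The abstract names the three fusion strategies but not the exact architecture; the following is a minimal PyTorch sketch, with assumed feature dimensions and an assumed classifier head, of the addition-based fusion the paper finds most effective: both feature streams are projected to a shared width, summed frame by frame, and classified as speech or non-speech.

```python
import torch
import torch.nn as nn

class AdditionFusionVAD(nn.Module):
    """Toy addition-fusion VAD head: project both feature streams to a shared
    dimension, add them, and classify each frame as speech / non-speech.
    Dimensions and layer sizes are illustrative assumptions, not the paper's."""
    def __init__(self, mfcc_dim=39, ptm_dim=768, hidden=256):
        super().__init__()
        self.mfcc_proj = nn.Linear(mfcc_dim, hidden)
        self.ptm_proj = nn.Linear(ptm_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(hidden, 1),  # per-frame speech logit
        )

    def forward(self, mfcc_feats, ptm_feats):
        # mfcc_feats: (batch, frames, 39), ptm_feats: (batch, frames, 768),
        # assumed to be time-aligned to the same frame rate.
        fused = self.mfcc_proj(mfcc_feats) + self.ptm_proj(ptm_feats)  # addition fusion
        return self.classifier(fused).squeeze(-1)  # (batch, frames)

model = AdditionFusionVAD()
mfcc = torch.randn(2, 100, 39)   # dummy MFCC frames
ptm = torch.randn(2, 100, 768)   # dummy wav2vec 2.0 / HuBERT-style frames
logits = model(mfcc, ptm)
print(logits.shape)  # torch.Size([2, 100])
```

Concatenation fusion would instead join the two projected streams along the feature dimension (with a correspondingly wider classifier input), while cross-attention would let one stream attend over the other before classification.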


Completing A Systematic Review in Hours instead of Months with Interactive AI Agents

arXiv:2504.14822v2 Announce Type: replace-cross Abstract: Systematic reviews (SRs) are vital for evidence-based practice in high-stakes disciplines, such as healthcare, but are often impeded by labor-intensive and lengthy processes that can take months to complete. Due to the high demand for domain expertise, existing automatic summarization methods fail to accurately identify relevant studies and generate high-quality summaries. To that end, we introduce InsightAgent, a human-centered interactive AI agent powered by large language models that revolutionizes this workflow. InsightAgent partitions a large literature corpus based on semantics and employs a multi-agent design for more focused processing of the literature, leading to significant improvement in the quality of generated SRs. InsightAgent also provides intuitive visualizations of the corpus and agent trajectories, allowing users to effortlessly monitor the actions of the agent and provide real-time feedback based on their expertise. Our user studies with 9 medical professionals demonstrate that the visualization and interaction mechanisms effectively improve the quality of synthesized SRs by 27.2%, reaching 79.7% of human-written quality. At the same time, user satisfaction is improved by 34.4%. With InsightAgent, it takes a clinician only about 1.5 hours, rather than months, to complete a high-quality systematic review.
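The abstract describes the pipeline at a high level; here is a small, illustrative sketch of the partition-then-delegate idea, using TF-IDF and k-means as stand-ins for whatever embedding, clustering, and LLM components InsightAgent actually uses (summarize_partition is a placeholder, not the system’s summarizer).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy "corpus" of abstracts; in practice this would be thousands of retrieved studies.
abstracts = [
    "Randomized trial of drug A for hypertension in older adults.",
    "Cohort study of drug A adherence and blood pressure outcomes.",
    "Machine learning triage of chest X-rays in emergency departments.",
    "Deep learning for pneumonia detection on radiographs.",
    "Meta-analysis of exercise interventions for hypertension.",
    "Qualitative study of clinician trust in radiology AI tools.",
]

# 1) Partition the corpus by semantics (TF-IDF + k-means as a simple stand-in
#    for the embedding and clustering the system would use).
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 2) Hand each partition to a "focused" summarizer agent. Here the agent just
#    lists the studies it was given; in the real system this is an LLM call.
def summarize_partition(docs):
    return f"{len(docs)} studies: " + " | ".join(d[:40] for d in docs)

partitions = {}
for doc, label in zip(abstracts, labels):
    partitions.setdefault(label, []).append(doc)

for label, docs in sorted(partitions.items()):
    print(f"Agent {label} -> {summarize_partition(docs)}")
```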


Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You

arXiv:2401.16092v4 Announce Type: replace Abstract: Text-to-image generation models have recently achieved astonishing results in image quality, flexibility, and text alignment, and are consequently employed in a fast-growing number of applications. Through improvements in multilingual abilities, a larger community now has access to this technology. However, our results show that multilingual models suffer from significant gender biases just as monolingual models do. Furthermore, the natural expectation that multilingual models will provide similar results across languages does not hold up. Instead, there are important differences between languages. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. We use MAGBIG to investigate the effect of multilingualism on gender bias in T2I models. To this end, we construct multilingual prompts requesting portraits of people with a certain occupation or trait. Our results show that not only do models exhibit strong gender biases but they also behave differently across languages. Furthermore, we investigate prompt engineering strategies, such as indirect, neutral formulations, to mitigate these biases. Unfortunately, these approaches have limited success and result in worse text-to-image alignment. Consequently, we call for more research into diverse representations across languages in image generators, as well as into steerability to address biased model behavior.
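As a rough illustration of how occupation prompts and a bias score could be assembled (the templates, languages, generation step, and classification below are all placeholders, not MAGBIG’s actual benchmark or metric), a measurement loop might look like this:

```python
from collections import Counter

# Illustrative prompt templates; the benchmark's actual templates and language set differ.
templates = {
    "en": "a portrait photo of a {occupation}",
    "de": "ein Porträtfoto von einem {occupation}",   # German masculine generic form
    "es": "una foto de retrato de un {occupation}",   # Spanish masculine generic form
}
occupations = {"en": "nurse", "de": "Krankenpfleger", "es": "enfermero"}

def generate_and_classify(prompt, n=50):
    """Placeholder for: run the text-to-image model n times on the prompt, then
    classify the perceived gender of each generated portrait. Returns fixed fake
    counts here so the script runs without any model."""
    return Counter({"female": 38, "male": 12})

for lang, template in templates.items():
    prompt = template.format(occupation=occupations[lang])
    counts = generate_and_classify(prompt)
    total = sum(counts.values())
    # Simple skew score in [-1, 1]: 0 means balanced, +/-1 means fully one-sided.
    skew = (counts["female"] - counts["male"]) / total
    print(f"{lang}: '{prompt}' -> skew {skew:+.2f}")
```

Comparing such skew scores across languages for the same occupation is the kind of cross-lingual contrast the abstract reports; the benchmark itself defines its own prompts and measures.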


Graph-Based Spectral Decomposition for Parameter Coordination in Language Model Fine-Tuning

arXiv:2504.19583v2 Announce Type: replace-cross Abstract: This paper proposes a parameter collaborative optimization algorithm for large language models, enhanced with graph spectral analysis. The goal is to improve both fine-tuning efficiency and structural awareness during training. In the proposed method, the parameters of a pre-trained language model are treated as nodes in a graph. A weighted graph is constructed, and Laplacian spectral decomposition is applied to enable frequency-domain modeling and structural representation of the parameter space. Based on this structure, a joint loss function is designed. It combines the task loss with a spectral regularization term to facilitate collaborative updates among parameters. In addition, a spectral filtering mechanism is introduced during the optimization phase. This mechanism adjusts gradients in a structure-aware manner, enhancing the model’s training stability and convergence behavior. The method is evaluated on multiple tasks, including traditional fine-tuning comparisons, few-shot generalization tests, and convergence speed analysis. In all settings, the proposed approach demonstrates superior performance. The experimental results confirm that the spectral collaborative optimization framework effectively reduces parameter perturbations and improves fine-tuning quality while preserving overall model performance. This work contributes significantly to the field of artificial intelligence by advancing parameter-efficient training methodologies for large-scale models, reinforcing the importance of structural signal processing in deep learning optimization, and offering a robust, generalizable framework for enhancing language model adaptability and performance.
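The abstract leaves the graph construction and the regularizer unspecified; the sketch below shows the generic machinery it refers to, under stated assumptions: a small weighted graph over parameter nodes, the normalized Laplacian and its eigendecomposition, and a joint loss that adds a penalty on high-frequency spectral energy to a stand-in task loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 6 "parameter nodes" (e.g., per-layer parameter summaries). The paper's
# weighted graph construction is not specified; here the edge weight is a simple
# similarity between random node features, purely for illustration.
features = rng.normal(size=(6, 4))
W = np.exp(-np.square(features[:, None, :] - features[None, :, :]).sum(-1))
np.fill_diagonal(W, 0.0)

# Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2} and its eigendecomposition.
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
L = np.eye(6) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues ascending: low to high "frequency"

def spectral_regularizer(theta, high_freq_from=3):
    """Penalize the parameter vector's energy in the high-frequency eigenvectors,
    encouraging parameters that are connected in the graph to move together."""
    coeffs = eigvecs.T @ theta            # graph Fourier transform of the parameters
    return float(np.sum(coeffs[high_freq_from:] ** 2))

theta = rng.normal(size=6)                # stand-in parameter vector
task_loss = float(np.sum(theta ** 2))     # stand-in for the actual task loss
lam = 0.1
joint_loss = task_loss + lam * spectral_regularizer(theta)
print(f"task={task_loss:.3f}  joint={joint_loss:.3f}")
```

The spectral filtering step mentioned in the abstract would presumably apply an analogous transform to the gradient rather than the parameters, damping its high-frequency components before each update.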
