
ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory

arXiv:2509.04439v1 Announce Type: cross Abstract: While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g., exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. On the challenging ARC-AGI benchmark, our method yields a 7.5% relative gain over a strong no-memory baseline, with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, we confirm that dynamically updating memory at test time outperforms an otherwise identical fixed-memory setting with additional attempts, supporting the hypothesis that solving more problems and abstracting more patterns to memory enables further solutions in a form of self-improvement. Code available at https://github.com/matt-seb-ho/arc_memo.
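
The write/retrieve loop described in the abstract is simple to sketch. Below is a minimal, hypothetical rendering of the concept-memory cycle: retrieve relevant concepts, prepend them to the prompt, solve, then abstract new concepts back into memory. The `llm` and `retrieve` callables and the prompt wording are assumptions for illustration, not the paper's implementation.

```python
def abstract_concepts(solution_trace, llm):
    """Distill modular, reusable concepts from a solution trace.
    The prompt wording here is illustrative, not the paper's."""
    prompt = ("Extract general, reusable problem-solving concepts "
              "from this solution trace, one per line:\n" + solution_trace)
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def solve_with_memory(query, memory, llm, retrieve, top_k=5):
    """One step of test-time continual learning: retrieve concepts,
    solve with them in context, then write new abstractions to memory."""
    concepts = retrieve(query, memory, top_k)           # selective retrieval
    prompt = ("Relevant concepts:\n" + "\n".join(concepts)
              + "\n\nProblem: " + query)
    trace = llm(prompt)                                 # reasoning rollout
    memory.extend(abstract_concepts(trace, llm))        # memory grows with experience
    return trace
```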


Biomni-R0: New Agentic LLMs Trained End-to-End with Multi-Turn Reinforcement Learning for Expert-Level Intelligence in Biomedical Research

The Growing Role of AI in Biomedical Research

The field of biomedical artificial intelligence is evolving rapidly, with increasing demand for agents capable of performing tasks that span genomics, clinical diagnostics, and molecular biology. These agents are not merely designed to retrieve facts; they are expected to reason through complex biological problems, interpret patient data, and extract meaningful insights from vast biomedical databases. Unlike general-purpose AI models, biomedical agents must interface with domain-specific tools, comprehend biological hierarchies, and simulate workflows similar to those of researchers in order to effectively support modern biomedical research.

The Core Challenge: Matching Expert-Level Reasoning

However, achieving expert-level performance on these tasks is far from trivial. Most large language models fall short of the nuance and depth that biomedical reasoning demands. They may succeed at surface-level retrieval or pattern recognition, but they often fail at multi-step reasoning, rare disease diagnosis, and gene prioritization, areas that require not just data access but contextual understanding and domain-specific judgment. This limitation frames a clear question: how can biomedical AI agents be trained to think and act like domain experts?

Why Traditional Approaches Fall Short

Some solutions leverage supervised learning on curated biomedical datasets or retrieval-augmented generation to ground responses in literature or databases, but these approaches have drawbacks. They often rely on static prompts and pre-defined behaviors that lack adaptability. Furthermore, many of these agents struggle to execute external tools effectively, and their reasoning chains collapse when faced with unfamiliar biomedical structures. This fragility makes them ill-suited for dynamic or high-stakes environments, where interpretability and accuracy are non-negotiable.

Biomni-R0: A New Paradigm Using Reinforcement Learning

Researchers from Stanford University and UC Berkeley introduced Biomni-R0, a new family of models built by applying reinforcement learning (RL) to a biomedical agent foundation. These models, Biomni-R0-8B and Biomni-R0-32B, were trained in an RL environment tailored specifically for biomedical reasoning, using both expert-annotated tasks and a novel reward structure. The collaboration combines Stanford's Biomni agent and environment platform with UC Berkeley's SkyRL reinforcement learning infrastructure, aiming to push biomedical agents past human-level capabilities.

Training Strategy and System Design

The researchers used a two-phase training process. First, supervised fine-tuning (SFT) on high-quality trajectories sampled from Claude 4 Sonnet with rejection sampling bootstrapped the agent's ability to follow structured reasoning formats. The models were then fine-tuned with reinforcement learning, optimizing two kinds of rewards: one for correctness (e.g., selecting the right gene or diagnosis) and one for response formatting (e.g., using the structured <think> and <answer> tags correctly).
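
A reward of this shape is straightforward to implement. The sketch below combines a correctness term with a formatting term keyed to the <think>/<answer> tags; the weights and the exact-match scoring rule are illustrative assumptions, not the paper's specification.

```python
import re

def biomni_style_reward(response: str, gold_answer: str) -> float:
    """Composite reward: correctness plus structured-format compliance.
    The weights (1.0 / 0.1) and exact-match check are assumptions."""
    # Format term: response must contain well-ordered <think> and <answer> blocks.
    fmt_ok = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                            response, flags=re.DOTALL))
    # Correctness term: compare the extracted answer against the gold label.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    correct = match is not None and match.group(1).strip() == gold_answer.strip()
    return 1.0 * float(correct) + 0.1 * float(fmt_ok)
```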
To ensure computational efficiency, the team developed asynchronous rollout scheduling that minimized bottlenecks caused by external tool delays. They also expanded the context length to 64k tokens, allowing the agent to sustain long, multi-step reasoning conversations.

Results That Outperform Frontier Models

The performance gains were significant. Biomni-R0-32B achieved a score of 0.669, up from the base model's 0.346, and even the smaller Biomni-R0-8B scored 0.588, outperforming much larger general-purpose models such as Claude 4 Sonnet and GPT-5. On a task-by-task basis, Biomni-R0-32B scored highest on 7 of 10 tasks, while GPT-5 led on 2 and Claude 4 on just 1. One of the most striking results came in rare disease diagnosis, where Biomni-R0-32B reached 0.67 against Qwen-32B's 0.03, a more than 20× improvement. Similarly, in GWAS variant prioritization, the model's score rose from 0.16 to 0.74, demonstrating the value of domain-specific reasoning.

Designing for Scalability and Precision

Training large biomedical agents requires resource-heavy rollouts involving external tool execution, database queries, and code evaluation. To manage this, the system decoupled environment execution from model inference, allowing more flexible scaling and reducing idle GPU time even when tools had widely varying execution latencies. Longer reasoning sequences also proved beneficial: the RL-trained models consistently produced lengthier, structured responses, which correlated strongly with better performance, indicating that depth and structure in reasoning are key markers of expert-level understanding in biomedicine.

Key Takeaways

- Biomedical agents must perform deep reasoning, not just retrieval, across genomics, diagnostics, and molecular biology.
- The central problem is achieving expert-level task performance, particularly in complex areas such as rare disease diagnosis and gene prioritization.
- Traditional methods, including supervised fine-tuning and retrieval-based models, often fall short in robustness and adaptability.
- Biomni-R0, developed by Stanford and UC Berkeley, uses reinforcement learning with expert-based rewards and structured output formatting.
- The two-phase pipeline, SFT followed by RL, proved highly effective at improving performance and reasoning quality.
- Biomni-R0-8B delivers strong results with a smaller architecture, while Biomni-R0-32B sets new benchmarks, outperforming Claude 4 and GPT-5 on 7 of 10 tasks.
- Reinforcement learning enabled the agent to generate longer, more coherent reasoning traces, a key trait of expert behavior.
- This work lays the foundation for super-expert biomedical agents capable of automating complex research workflows with precision.


PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation

arXiv:2411.05085v2 Announce Type: replace-cross Abstract: Radiology report generation (RRG) aims to create free-text radiology reports from clinical imaging. Grounded radiology report generation (GRRG) extends RRG by including the localisation of individual findings on the image. Currently, there are no manually annotated chest X-ray (CXR) datasets to train GRRG models. In this work, we present a dataset called PadChest-GR (Grounded-Reporting) derived from PadChest, aimed at training GRRG models for CXR images. We curate a public bilingual dataset of 4,555 CXR studies with grounded reports (3,099 abnormal and 1,456 normal), each containing complete lists of sentences describing individual present (positive) and absent (negative) findings in English and Spanish. In total, PadChest-GR contains 7,037 positive and 3,422 negative finding sentences. Every positive finding sentence is associated with up to two independent sets of bounding boxes labelled by different readers and has categorical labels for finding type, locations, and progression. To the best of our knowledge, PadChest-GR is the first manually curated dataset designed to train GRRG models for understanding and interpreting radiological images and generated text. By including detailed localization and comprehensive annotations of all clinically relevant findings, it provides a valuable resource for developing and evaluating GRRG models from CXR images. PadChest-GR can be downloaded upon request from https://bimcv.cipf.es/bimcv-projects/padchest-gr/
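
To make the annotation structure concrete, here is a hypothetical record schema consistent with the abstract's description: bilingual finding sentences, a positive/negative status, categorical labels, and up to two readers' bounding-box sets. All field names are illustrative assumptions; consult the dataset documentation for the actual format.

```python
from dataclasses import dataclass, field

@dataclass
class GroundedFinding:
    """One finding sentence in a PadChest-GR-style grounded report.
    Field names are illustrative, not the dataset's actual schema."""
    sentence_en: str                               # finding sentence in English
    sentence_es: str                               # the same sentence in Spanish
    positive: bool                                 # present (positive) vs. absent (negative)
    finding_type: str                              # categorical finding label
    locations: list = field(default_factory=list)  # anatomical location labels
    progression: str = ""                          # e.g., progression category, if any
    boxes: list = field(default_factory=list)      # up to two readers' bounding-box sets
```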


FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

arXiv:2502.11128v2 Announce Type: replace Abstract: To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as demonstrated at https://aka.ms/felle.
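
The token-wise idea can be sketched with a plain Euler sampler. In this hypothetical reading, the prior for each mel frame is a Gaussian centered on the previous frame rather than on zero, and a learned velocity field is integrated from t=0 to t=1. The real model additionally conditions the field on the language model's output and uses a coarse-to-fine hierarchy; `velocity_fn` and the toy field below are stand-ins.

```python
import numpy as np

def sample_next_frame(prev_frame, velocity_fn, num_steps=8):
    """Euler sampler for one continuous-valued token (conceptual sketch).
    prev_frame: previous mel frame, shape (n_mels,).
    velocity_fn(x, t, prev_frame): learned flow-matching velocity field."""
    # Modified prior: Gaussian centered on the previous frame, per the
    # abstract's description of incorporating information from the prior step.
    x = prev_frame + np.random.randn(*prev_frame.shape)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x = x + dt * velocity_fn(x, t, prev_frame)  # follow the velocity field
    return x

# Toy usage with a stand-in field that pulls samples toward the previous frame:
toy_field = lambda x, t, prev: prev - x
frame = sample_next_frame(np.zeros(80), toy_field)
```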


Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

arXiv:2503.23768v3 Announce Type: replace Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the Stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
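
Constructing FRB-style items is easy to picture with a rendering sketch. The snippet below draws a text sample in a given font with Pillow; the font path is an assumption (substitute any TrueType font on your system), and the "hard" version arises when the rendered string names a different font than the one used to render it.

```python
from PIL import Image, ImageDraw, ImageFont

def render_sample(text, font_path, size=(800, 100), font_size=36):
    """Render one benchmark item: `text` drawn in the font at `font_path`."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, font_size)
    draw.text((10, 25), text, fill="black", font=font)
    return img

# Hard-version item: the string names one font while being rendered in
# another, inducing the Stroop effect described in the abstract.
# (Font path is an assumption, not part of the benchmark.)
render_sample("Times New Roman",
              "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf").save("frb_item.png")
```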


Learn and Unlearn: Addressing Misinformation in Multilingual LLMs

arXiv:2406.13748v3 Announce Type: replace Abstract: This paper investigates the propagation of harmful information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated content. Our findings reveal that standard unlearning techniques, which typically focus on English data, are insufficient in mitigating the spread of harmful content in multilingual contexts and could inadvertently reinforce harmful content across languages. We show that only by addressing harmful responses in both English and the original language of the harmful data can we effectively eliminate harmful generations across all languages. This underscores the critical need for comprehensive unlearning strategies that consider the multilingual nature of modern LLMs to enhance their safety and reliability across diverse linguistic landscapes.
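
As a concrete baseline, gradient-ascent unlearning can be applied to the harmful data in both English and its original language, matching the paper's finding that English-only unlearning is insufficient. The sketch below uses a generic Hugging Face causal LM; it is a common unlearning baseline, not necessarily the paper's exact procedure.

```python
import torch

def unlearning_step(model, tokenizer, harmful_texts, optimizer, device="cpu"):
    """One gradient-ascent unlearning step over harmful examples.
    Per the paper's finding, `harmful_texts` should include the content
    in BOTH English and the original language of the harmful data."""
    model.train()
    optimizer.zero_grad()
    for text in harmful_texts:
        batch = tokenizer(text, return_tensors="pt").to(device)
        loss = model(**batch, labels=batch["input_ids"]).loss
        (-loss).backward()  # ascend the LM loss to suppress this content
    optimizer.step()
```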


Training LLMs to be Better Text Embedders through Bidirectional Reconstruction

arXiv:2509.03020v1 Announce Type: new Abstract: Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
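
The starting point is the final-token embedding itself. The helper below extracts the last-layer [EOS] hidden state from a Hugging Face causal LM as the text embedding; the reconstruction stage would then condition generation of the paired document (EBQ2D) or query (EBD2Q) on this vector. The extraction is a standard recipe, not the paper's released implementation.

```python
import torch

def eos_embedding(model, tokenizer, text, device="cpu"):
    """Return the last-layer hidden state of the final ([EOS]) token,
    used as the text embedding."""
    batch = tokenizer(text + tokenizer.eos_token, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # shape: (hidden_size,)
```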


Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal

arXiv:2509.01455v1 Announce Type: new Abstract: Deployed language models must decide not only what to answer but also when not to answer. We present UniCR, a unified framework that turns heterogeneous uncertainty evidence, including sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback, into a calibrated probability of correctness, and then enforces a user-specified error budget via principled refusal. UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and offers distribution-free guarantees using conformal risk control. For long-form generation, we align confidence with semantic fidelity by supervising on atomic factuality scores derived from retrieved evidence, reducing confident hallucinations while preserving coverage. Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under the risk-coverage curve, and higher coverage at fixed risk compared to entropy or logit thresholds, post-hoc calibrators, and end-to-end selective baselines. Analyses reveal that evidence contradiction, semantic dispersion, and tool inconsistency are the dominant drivers of abstention, yielding informative user-facing refusal messages. The result is a portable recipe, from evidence fusion to calibrated probability to risk-controlled decision, that improves trustworthiness without fine-tuning the base model and remains valid under distribution shift.
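
The refusal rule is easy to sketch. Given calibrated confidences and correctness labels on a held-out set, pick the most permissive threshold whose empirical error among answered queries stays within the budget. This is a simplified stand-in for the paper's conformal risk control, which adds distribution-free guarantees on top of this basic idea.

```python
import numpy as np

def refusal_threshold(conf_cal, correct_cal, error_budget=0.1):
    """Lowest confidence threshold whose calibration-set error, among
    answered queries, stays within the budget (simplified sketch)."""
    order = np.argsort(conf_cal)[::-1]                 # most confident first
    conf_sorted = np.asarray(conf_cal)[order]
    correct_sorted = np.asarray(correct_cal)[order]
    errors = np.cumsum(1 - correct_sorted)             # wrong answers so far
    counts = np.arange(1, len(conf_sorted) + 1)        # queries answered so far
    ok = np.where(errors / counts <= error_budget)[0]
    return conf_sorted[ok.max()] if ok.size else np.inf  # inf = refuse everything

# At deployment: answer iff the calibrated confidence >= threshold, else refuse.
```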


Explaining Length Bias in LLM-Based Preference Evaluations

arXiv:2407.01085v4 Announce Type: replace-cross Abstract: The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but this reveals a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand such bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass, where the former is length-independent and related to trustworthiness such as correctness, toxicity, and consistency, and the latter is length-dependent and represents the amount of information in the response. We empirically demonstrate the decomposition through controlled experiments and find that response length impacts evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses under equivalent length intervals.
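
The adjustment can be sketched as a length-binned win rate: only pairs whose test and reference responses fall in the same length interval contribute to the score. The bin width and the exact aggregation below are assumptions; see the paper for the precise protocol.

```python
import numpy as np

def adapalpaca_win_rate(test_lens, ref_lens, wins, bin_width=100):
    """Win rate restricted to pairs whose test and reference response
    lengths fall in the same interval, removing the length confound
    (simplified sketch of AdapAlpaca's adjustment)."""
    test_bins = np.asarray(test_lens) // bin_width
    ref_bins = np.asarray(ref_lens) // bin_width
    mask = test_bins == ref_bins                      # keep length-matched pairs only
    return float(np.mean(np.asarray(wins)[mask])) if mask.any() else float("nan")
```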


ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

arXiv:2507.14201v2 Announce Type: replace-cross Abstract: We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent x on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real-world security analysts must sift through a large number of heterogeneous alert signals and security logs, follow multi-hop chains of evidence, and compile an incident report. With the developments of LLMs, building LLM-based agents for automatic threat investigation is a promising direction. To assist the development and evaluation of LLM agents, we construct a dataset from a controlled Azure tenant that covers 8 simulated real-world multi-step attacks, 57 log tables from Microsoft Sentinel and related services, and 589 automatically generated questions. We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground truth answers but also makes the pipeline reusable and readily extensible to new logs. This also enables the automatic generation of procedural tasks with verifiable rewards, which can be naturally extended to training agents via reinforcement learning. Our comprehensive experiments with different models confirm the difficulty of the task: with the base setting, the average reward across all evaluated models is 0.249, and the best achieved is 0.368, leaving substantial headroom for future research. Code and data are coming soon!
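
The graph-anchored generation step can be sketched directly. For each edge of the investigation graph, the start node supplies background context and the end node the ground-truth answer, with an injected callable (an LLM in the paper) writing the question. The node attribute names here are assumptions for illustration.

```python
import networkx as nx

def generate_questions(graph: nx.DiGraph, question_fn):
    """Build QA items from paired nodes on a threat-investigation graph.
    question_fn(start_attrs, end_attrs) -> question string (e.g., an LLM call)."""
    items = []
    for start, end in graph.edges():
        items.append({
            # Start node provides the background context for the question.
            "context": graph.nodes[start].get("description", str(start)),
            "question": question_fn(graph.nodes[start], graph.nodes[end]),
            # End node provides the explainable ground-truth answer.
            "answer": graph.nodes[end].get("value", str(end)),
        })
    return items
```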
