YouZum

News

AI, Committee, News, Uncategorized

Google DeepMind Introduces Aeneas: AI-Powered Contextualization and Restoration of Ancient Latin Inscriptions

The discipline of epigraphy, the study of texts inscribed on durable materials such as stone and metal, provides critical firsthand evidence for understanding the Roman world. The field faces numerous challenges: fragmentary inscriptions, uncertain dating, diverse geographical provenance, widespread use of abbreviations, and a large, rapidly growing corpus of over 176,000 Latin inscriptions, with approximately 1,500 new inscriptions added each year. To address these challenges, Google DeepMind developed Aeneas: a transformer-based generative neural network that performs restoration of damaged text segments, chronological dating, geographic attribution, and contextualization through retrieval of relevant epigraphic parallels.

Challenges in Latin Epigraphy

Latin inscriptions span more than two millennia, from roughly the 7th century BCE to the 8th century CE, across a vast Roman Empire comprising more than sixty provinces. These inscriptions range from imperial decrees and legal documents to tombstones and votive altars. Epigraphers traditionally restore partially lost or illegible texts using detailed knowledge of language, formulae, and cultural context, and attribute inscriptions to particular timeframes and locations by comparing linguistic and material evidence. However, many inscriptions suffer from physical damage, with missing segments of uncertain length. Wide geographic dispersion and diachronic linguistic change make dating and provenance attribution complex, especially when combined with the sheer size of the corpus. Manual identification of epigraphic parallels is labor-intensive and often limited by specialized expertise localized to certain regions or periods.

Latin Epigraphic Dataset (LED)

Aeneas is trained on the Latin Epigraphic Dataset (LED), an integrated and harmonized corpus of 176,861 Latin inscriptions aggregating records from three major databases. The dataset comprises approximately 16 million characters and covers inscriptions from the seventh century BCE to the eighth century CE. About 5% of these inscriptions have associated grayscale images. The dataset uses character-level transcriptions with special placeholder tokens: "-" marks missing text of known length, while "#" denotes missing segments of unknown length. Metadata includes province-level provenance across 62 Roman provinces and dating by decade.

Model Architecture and Input Modalities

Aeneas's core is a deep, narrow transformer decoder based on the T5 architecture, adapted with rotary positional embeddings for effective local and contextual character processing. The textual input is processed alongside optional inscription images (when available) through a shallow convolutional network (ResNet-8), which feeds image embeddings to the geographical attribution head only. The model includes multiple specialized task heads:

Restoration: predicts missing characters, supporting gaps of unknown length with the help of an auxiliary neural classifier.
Geographical attribution: classifies inscriptions among 62 provinces by combining textual and visual embeddings.
Chronological attribution: estimates the date of a text by decade using a predictive probability distribution aligned with historical date ranges.

In addition, the model produces a unified, historically enriched embedding by combining outputs from the core and the task heads. This embedding enables retrieval of ranked epigraphic parallels via cosine similarity, capturing linguistic, epigraphic, and broader cultural analogies beyond exact textual matches.
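As a minimal sketch of this retrieval step, with a stand-in `embed` function in place of Aeneas's actual historically enriched embedding (everything here is illustrative, not the released model code):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding; the real system uses Aeneas's unified,
    # historically enriched representation of the inscription.
    rng = np.random.default_rng(sum(ord(c) for c in text))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def retrieve_parallels(query: str, corpus: list[str], k: int = 3) -> list[tuple[str, float]]:
    q = embed(query)
    # Vectors are unit-norm, so the dot product equals cosine similarity.
    scored = [(text, float(np.dot(q, embed(text)))) for text in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

corpus = ["dis manibus sacrum", "imp caesar divi f augustus", "votum solvit libens merito"]
print(retrieve_parallels("imp caesari divi filio", corpus))
```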
Training Setup and Data Augmentation

Training runs on TPU v5e hardware with batch sizes of up to 1,024 text-image pairs. The losses for the individual tasks are combined with optimized weighting. The data is augmented with random text masking (up to 75% of characters), text clipping, word deletions, punctuation dropping, image augmentations (zoom, rotation, brightness/contrast adjustments), dropout, and label smoothing to improve generalization. Prediction uses beam search with specialized non-sequential logic for unknown-length restoration, producing multiple restoration candidates ranked by joint probability and length.

Performance and Evaluation

Evaluated on the LED test set and through a human-AI collaboration study with 23 epigraphers, Aeneas demonstrates marked improvements:

Restoration: the character error rate (CER) drops to approximately 21% when experts work with Aeneas support, compared to 39% for unaided human experts. The model alone achieves around 23% CER on the test set.
Geographical attribution: the model correctly classifies the province among 62 options around 72% of the time. With Aeneas's assistance, historians raise their own accuracy to about 68%.
Chronological attribution: Aeneas's average dating error is approximately 13 years, and historians aided by Aeneas reduce their error from about 31 years to 14 years.
Contextual parallels: retrieved epigraphic parallels are accepted as useful starting points for historical research in approximately 90% of cases and increase historians' confidence by an average of 44%.

These improvements are statistically significant and highlight the model's utility as an augmentation to expert scholarship.
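For reference, CER is the Levenshtein edit distance between the predicted and ground-truth text divided by the length of the ground truth. A compact implementation of this standard metric:

```python
def cer(prediction: str, reference: str) -> float:
    m, n = len(prediction), len(reference)
    dist = list(range(n + 1))  # dist[j] = edits to turn "" into reference[:j]
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cost = 0 if prediction[i - 1] == reference[j - 1] else 1
            prev, dist[j] = dist[j], min(dist[j] + 1,      # deletion
                                         dist[j - 1] + 1,  # insertion
                                         prev + cost)      # substitution
    return dist[n] / max(n, 1)

print(cer("imp caesar", "imp caesari"))  # 1 edit / 11 chars ≈ 0.09
```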
Case Studies

Res Gestae Divi Augusti: Aeneas's analysis of this monumental inscription reveals a bimodal dating distribution reflecting scholarly debates about its compositional layers and stages (late first century BCE and early first century CE). Saliency maps highlight date-sensitive linguistic forms, archaic orthography, institutional titles, and personal names, mirroring expert epigraphic knowledge. The retrieved parallels predominantly include imperial legal decrees and official senatorial texts sharing formulaic and ideological features.

Votive Altar from Mainz (CIL XIII, 6665): Dedicated in 211 CE by a military official, this inscription was accurately dated and geographically attributed to Germania Superior and related provinces. Saliency maps identify key consular dating formulas and cultic references. Aeneas retrieved closely related parallels, including a 197 CE altar sharing rare textual formulas and iconography, revealing historically meaningful connections beyond direct text overlap or spatial metadata.

Integration in Research Workflows and Education

Aeneas operates as a cooperative tool, not a replacement for historians. It accelerates the search for epigraphic parallels, aids restoration, and refines attribution, freeing scholars to focus on higher-level interpretation. The tool and dataset are openly available via the Predicting the Past platform under permissive licenses. An educational curriculum has been co-developed for high school students and educators, promoting interdisciplinary digital literacy by bridging AI and classical studies.

FAQ 1: What is Aeneas and what tasks does it perform?

Aeneas is a generative multimodal neural network developed by Google DeepMind for Latin epigraphy. It assists historians by restoring damaged or missing text in ancient Latin inscriptions, estimating their date to within about 13 years, attributing their geographical origin with around 72% accuracy, and retrieving historically relevant parallel inscriptions for contextual analysis.

FAQ 2: How does Aeneas handle incomplete or damaged inscriptions?

Aeneas can predict missing text segments even when the length of the gap is unknown, a capability known as arbitrary-length restoration. It uses a transformer-based architecture and specialized neural network heads to generate multiple plausible restoration hypotheses, ranked by likelihood, facilitating expert review of the alternatives.
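A toy illustration of the mechanism, with a hypothetical character-frequency scorer standing in for the model's beam-searched joint probabilities (the scorer, alphabet, and gap length cap are all illustrative assumptions):

```python
import itertools, math

def joint_logprob(text: str) -> float:
    # Stand-in scorer; the real model ranks candidates by beam-searched
    # joint probability, trading restoration length against likelihood.
    freq = {"u": 0.12, "i": 0.10, "e": 0.09, "a": 0.08}
    return sum(math.log(freq.get(ch, 0.03)) for ch in text)

def restore(template: str, alphabet: str = "aeimnrstu", max_len: int = 2):
    candidates = []
    for length in range(1, max_len + 1):  # try several gap lengths for "#"
        for filling in itertools.product(alphabet, repeat=length):
            candidate = template.replace("#", "".join(filling), 1)
            candidates.append((candidate, joint_logprob(candidate)))
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:5]

print(restore("dis manib#s sacrum"))  # "dis manibus sacrum" ranks first
```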


AI, Committee, News, Uncategorized

GenSeg: Generative AI Transforms Medical Image Segmentation in Ultra Low-Data Regimes

Medical image segmentation is at the heart of modern healthcare AI, enabling crucial tasks such as disease detection, progression monitoring, and personalized treatment planning. In disciplines like dermatology, radiology, and cardiology, the need for precise segmentation, i.e., assigning a class to every pixel in a medical image, is acute. Yet the main obstacle remains the scarcity of large, expertly labeled datasets. Creating these datasets requires intensive, pixel-level annotation by trained specialists, making it expensive and time-consuming. In real-world clinical settings, this often leads to "ultra low-data regimes," in which there are simply too few annotated images to train robust deep learning models. As a result, segmentation models often perform well on training data but fail to generalize to new patients, different imaging equipment, or external hospitals, a phenomenon known as overfitting.

Conventional Approaches and Their Shortcomings

To address this data limitation, two mainstream strategies have been attempted:

Data augmentation: artificially expands the dataset by modifying existing images (rotations, flips, translations, etc.) in the hope of improving model robustness.
Semi-supervised learning: leverages large pools of unlabeled medical images to refine the segmentation model even in the absence of full labels.

However, both approaches have significant downsides. Separating data generation from model training means augmented data is often poorly matched to the needs of the segmentation model. Semi-supervised methods require substantial quantities of unlabeled data, which is difficult to source in medical contexts due to privacy laws, ethical concerns, and logistical barriers.

Introducing GenSeg: Purpose-Built Generative AI for Medical Image Segmentation

A team of researchers from the University of California San Diego, UC Berkeley, Stanford, and the Weizmann Institute of Science has developed GenSeg, a next-generation generative AI framework specifically designed for medical image segmentation in low-label scenarios.

Key features of GenSeg:

End-to-end generative framework that produces realistic, high-quality synthetic image-mask pairs.
Multi-level optimization (MLO): GenSeg integrates segmentation performance feedback directly into the synthetic data generation process. Unlike traditional augmentation, this ensures that every synthetic example is optimized to improve segmentation outcomes.
No need for large unlabeled datasets: GenSeg eliminates the dependency on scarce, privacy-sensitive external data.
Model-agnostic: integrates seamlessly with popular architectures such as UNet, DeepLab, and Transformer-based models.

How GenSeg Works: Optimizing Synthetic Data for Real Results

Rather than generating synthetic images blindly, GenSeg follows a three-stage optimization process (sketched in code below):

Synthetic mask-augmented image generation: from a small set of expert-labeled masks, GenSeg applies augmentations, then uses a generative adversarial network (GAN) to synthesize corresponding images, creating accurate, paired synthetic training examples.
Segmentation model training: both real and synthetic pairs train the segmentation model, with performance evaluated on a held-out validation set.
Performance-driven data generation: feedback from segmentation accuracy on real validation data continuously informs and refines the synthetic data generator, ensuring relevance and maximizing performance.
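The loop below is a heavily condensed, first-order sketch of this process, with toy convolutional stand-ins for both the mask-to-image generator and the segmentation network. The full method uses a GAN and multi-level optimization that differentiates validation loss through the segmenter's update; that hypergradient machinery is replaced here by a simple surrogate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: mask -> image generator, image -> mask segmenter.
gen = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 3, padding=1))
seg = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 3, padding=1))
opt_gen = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_seg = torch.optim.Adam(seg.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

# Tiny synthetic "expert-labeled" training and validation sets.
real_imgs = torch.rand(4, 1, 32, 32)
real_masks = (torch.rand(4, 1, 32, 32) > 0.5).float()
val_imgs = torch.rand(2, 1, 32, 32)
val_masks = (torch.rand(2, 1, 32, 32) > 0.5).float()

for step in range(20):
    # Stage 1: synthesize images from augmented expert masks.
    aug_masks = torch.flip(real_masks, dims=[-1])  # trivial augmentation
    synth_imgs = gen(aug_masks)

    # Stage 2: train the segmenter on real + synthetic pairs.
    opt_seg.zero_grad()
    loss_seg = bce(seg(real_imgs), real_masks) + bce(seg(synth_imgs.detach()), aug_masks)
    loss_seg.backward()
    opt_seg.step()

    # Stage 3: performance-driven generation (first-order surrogate).
    # GenSeg proper back-propagates validation feedback through the
    # segmenter's update; here the generator is simply nudged toward
    # synthetic pairs the current segmenter finds consistent.
    opt_gen.zero_grad()
    loss_gen = bce(seg(gen(aug_masks)), aug_masks)
    loss_gen.backward()
    opt_gen.step()

    with torch.no_grad():
        val_loss = bce(seg(val_imgs), val_masks)  # the feedback signal in the full method
```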
Empirical Results: GenSeg Sets New Benchmarks

GenSeg was rigorously tested across 11 segmentation tasks on 19 diverse medical imaging datasets, spanning multiple disease types and organs, including skin lesions, lungs, breast cancer, foot ulcers, and polyps. Highlights include:

Superior accuracy even with extremely small datasets (as few as 9-50 labeled images per task).
10–20% absolute performance improvements over standard data augmentation and semi-supervised baselines.
8–20× less labeled data required to reach accuracy equivalent or superior to conventional methods.
Robust out-of-domain generalization: GenSeg-trained models transfer well to new hospitals, imaging modalities, and patient populations.

Why GenSeg Is a Game-Changer for AI in Healthcare

GenSeg's ability to create task-optimized synthetic data directly addresses the greatest bottleneck in medical AI: the scarcity of labeled data. With GenSeg, hospitals, clinics, and researchers can:

Drastically reduce annotation costs and time.
Improve model reliability and generalization, a major concern for clinical deployment.
Accelerate the development of AI solutions for rare diseases, underrepresented populations, and emerging imaging modalities.

Conclusion: Bringing High-Quality Medical AI to Data-Limited Settings

GenSeg is a significant leap forward in AI-driven medical image analysis, especially where labeled data is the limiting factor. By tightly coupling synthetic data generation with real validation feedback, GenSeg delivers high accuracy, efficiency, and adaptability without the privacy and ethical hurdles of collecting massive datasets. For medical AI developers and clinicians, incorporating GenSeg can unlock the full potential of deep learning in even the most data-limited medical environments.

Check out the Paper and Code. All credit for this research goes to the researchers of this project.


AI, Committee, News, Uncategorized

REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasoning in Large Reasoning Models

Large Reasoning Models (LRMs) have advanced rapidly, exhibiting impressive performance in complex problem-solving tasks across domains such as mathematics, coding, and scientific reasoning. However, current evaluation approaches primarily focus on single-question testing, which has significant limitations. This article introduces REST (Reasoning Evaluation through Simultaneous Testing), a multi-problem stress-testing framework designed to push LRMs beyond isolated problem-solving and better reflect their real-world multi-context reasoning capabilities.

Why Current Evaluation Benchmarks Fall Short for Large Reasoning Models

Most current benchmarks, such as GSM8K and MATH, evaluate LRMs by asking one question at a time. While effective for initial model development, this isolated-question approach has two critical drawbacks:

Decreasing discriminative power: many state-of-the-art LRMs now achieve near-perfect scores on popular benchmarks (e.g., DeepSeek-R1 reaches 97% accuracy on MATH500). Such saturated results make it increasingly difficult to distinguish true model improvements, forcing the expensive, continuous creation of harder datasets to differentiate capabilities.
Lack of real-world multi-context evaluation: real-world applications, such as educational tutoring, technical support, and multitasking AI assistants, require reasoning across multiple, potentially interfering questions simultaneously. Single-question testing does not capture these dynamic, multi-problem challenges, which reflect true cognitive load and reasoning robustness.

Introducing REST: Stress-Testing LRMs with Multiple Problems at Once

To address these challenges, researchers from Tsinghua University, OpenDataLab, Shanghai AI Laboratory, and Renmin University developed REST, a simple yet powerful evaluation method that tests LRMs on multiple questions bundled into a single prompt (a minimal construction is sketched after this list):

Multi-question benchmark reconstruction: REST repurposes existing benchmarks by concatenating multiple questions into one prompt, with a stress-level parameter controlling how many questions are presented simultaneously.
Comprehensive evaluation: REST assesses critical reasoning competencies beyond basic problem-solving, including contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management.
Wide applicability: the framework is validated on 34 advanced LRMs ranging from 1.5 billion to 671 billion parameters, tested on 7 diverse benchmarks of varying difficulty (from the simple GSM8K to the challenging AIME and GPQA).
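As a concrete illustration of the reconstruction step, the sketch below bundles benchmark questions into multi-question prompts under a stress-level parameter. The function name and prompt wording are illustrative, not the official REST code.

```python
def build_rest_prompts(questions: list[str], stress_level: int) -> list[str]:
    """Concatenate `stress_level` questions per prompt, REST-style."""
    prompts = []
    for i in range(0, len(questions), stress_level):
        batch = questions[i : i + stress_level]
        numbered = "\n".join(f"Question {j + 1}: {q}" for j, q in enumerate(batch))
        prompts.append("Solve all of the following problems and answer each one.\n" + numbered)
    return prompts

sample = ["If 3 pens cost $6, what do 5 pens cost?",
          "A train travels 60 km in 1.5 h. What is its average speed?",
          "What is 15% of 240?"]
for p in build_rest_prompts(sample, stress_level=3):
    print(p, end="\n\n")
```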
REST Reveals Key Insights About LRM Reasoning Abilities

The REST evaluation uncovers several notable findings:

1. Significant performance degradation under multi-problem stress. Even state-of-the-art LRMs like DeepSeek-R1 show notable accuracy drops when handling multiple questions together. For example, DeepSeek-R1's accuracy on the challenging AIME24 benchmark falls by nearly 30% under REST compared to isolated-question testing. This contradicts the prior assumption that large language models are inherently capable of effortlessly multitasking across problems.

2. Enhanced discriminative power among similar models. REST dramatically amplifies the differences between models with near-identical single-question scores. On MATH500, for instance, R1-7B and R1-32B achieve close single-question accuracies of 93% and 94.6%, respectively, yet under REST, R1-7B's accuracy plummets to 66.75% while R1-32B maintains a high 88.97%, revealing a stark 22-point performance gap. Similarly, among same-sized models such as AReaL-boba-RL-7B and OpenThinker2-7B, REST captures significant differences in multi-problem handling that single-question evaluations mask.

3. Post-training methods may not guarantee robust multi-problem reasoning. Models fine-tuned with reinforcement learning or supervised tuning on single-problem reasoning often fail to preserve their advantages in REST's multi-question setting. This calls for rethinking training strategies to optimize reasoning robustness under realistic multi-context scenarios.

4. "Long2short" training enhances performance under stress. Models trained with long2short techniques, which encourage concise and efficient reasoning chains, maintain higher accuracy under REST. This suggests a promising avenue for designing models better suited to simultaneous multi-problem reasoning.

How REST Simulates Realistic Reasoning Challenges

By increasing the cognitive load on LRMs through simultaneous problem presentation, REST simulates real-world demands in which reasoning systems must dynamically prioritize, avoid overthinking any single problem, and resist interference from concurrent tasks. REST also systematically analyzes error types, revealing common failure modes:

Question omission: ignoring later questions in a multi-question prompt.
Summary errors: incorrectly summarizing answers across problems.
Reasoning errors: logical or calculation mistakes within the reasoning process.

These nuanced insights are largely invisible in single-question assessments.

Practical Evaluation Setup and Benchmark Coverage

REST evaluated 34 LRMs spanning 1.5B to 671B parameters. Benchmarks tested include:

Simple: GSM8K
Medium: MATH500, AMC23
Challenging: AIME24, AIME25, GPQA Diamond, LiveCodeBench

Model generation parameters follow official guidelines, with output token limits of 32K for reasoning models. The standardized OpenCompass toolkit ensures consistent, reproducible results.

Conclusion: REST as a Future-Proof, Realistic LRM Evaluation Paradigm

REST constitutes a significant leap forward in evaluating large reasoning models by:

Addressing benchmark saturation: it revitalizes existing datasets without requiring expensive full replacements.
Reflecting real-world multi-task demands: it tests models under realistic, high-cognitive-load conditions.
Guiding model development: it highlights the importance of training methods such as long2short for mitigating overthinking and encouraging adaptive reasoning focus.

In sum, REST paves the way for more reliable, robust, and application-relevant benchmarking of next-generation reasoning AI systems.

Check out the Paper, Project Page and Code. All credit for this research goes to the researchers of this project.


AI, Committee, News, Uncategorized

Why Context Matters: Transforming AI Model Evaluation with Contextualized Queries

Language model users often ask questions without enough detail, making it hard to understand what they want. For example, a question like "What book should I read next?" depends heavily on personal taste, while "How do antibiotics work?" should be answered differently depending on the user's background knowledge. Current evaluation methods often overlook this missing context, resulting in inconsistent judgments. For instance, a response praising coffee might seem fine, yet be unhelpful or even harmful for someone with a health condition. Without knowing the user's intent or needs, it is difficult to fairly assess the quality of a model's response.

Prior research has focused on generating clarification questions to address ambiguity or missing information in tasks such as question answering, dialogue systems, and information retrieval. These methods aim to improve the understanding of user intent. Similarly, studies on instruction following and personalization emphasize the importance of tailoring responses to user attributes such as expertise, age, or style preferences. Some works have also examined how well models adapt to diverse contexts and proposed training methods to enhance this adaptability. Additionally, language-model-based evaluators have gained traction due to their efficiency, although they can be biased, prompting efforts to improve their fairness through clearer evaluation criteria.

Researchers from the University of Pennsylvania, the Allen Institute for AI, and the University of Maryland, College Park have proposed contextualized evaluations: a method that adds synthetic context, in the form of follow-up question-answer pairs, to clarify underspecified queries during language model evaluation. Their study reveals that including context can significantly change evaluation outcomes, sometimes even reversing model rankings, while also improving agreement between evaluators. It reduces reliance on superficial features, such as style, and uncovers potential biases in default model responses, particularly toward WEIRD (Western, Educated, Industrialized, Rich, Democratic) contexts. The work also demonstrates that models exhibit varying sensitivity to different user contexts.

The researchers developed a simple framework to evaluate how language models perform when given clearer, contextualized queries. First, they selected underspecified queries from popular benchmark datasets and enriched them with follow-up question-answer pairs that simulate user-specific contexts. They then collected responses from different language models and had both human and model-based evaluators compare the responses in two settings: one with only the original query, and another with the added context. This allowed them to measure how context affects model rankings, evaluator agreement, and the criteria used for judgment. The setup offers a practical way to test how models handle real-world ambiguity.
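A minimal sketch of how an underspecified query might be paired with synthetic follow-up question-answer context before judging responses; the function and prompt wording are illustrative assumptions, not the authors' released code:

```python
def contextualize(query: str, followups: list[tuple[str, str]]) -> str:
    """Attach follow-up QA pairs to an underspecified query for evaluation."""
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in followups)
    return (f"User query: {query}\n"
            f"Clarifying context:\n{context}\n"
            "Judge which candidate response better serves THIS user.")

prompt = contextualize(
    "What book should I read next?",
    [("What genres do you enjoy?", "Mostly hard science fiction."),
     ("Have you read any recent favorites?", "I loved 'Project Hail Mary'.")],
)
print(prompt)
```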
Adding context, such as user intent or audience, greatly improves model evaluation, boosting inter-rater agreement by 3–10% and in some cases reversing model rankings: GPT-4 outperformed Gemini-1.5-Flash only when context was provided. Without it, evaluations focus on tone or fluency, whereas context shifts attention to accuracy and helpfulness. Default generations often reflect Western, formal, general-audience biases, making them less effective for diverse users. Benchmarks that ignore context therefore risk producing unreliable results. To ensure fairness and real-world relevance, evaluations must pair context-rich prompts with scoring rubrics that reflect the actual needs of users.

In conclusion, many user queries to language models are vague, lacking key context such as user intent or expertise, which makes evaluations subjective and unreliable. To address this, the study proposes contextualized evaluations, in which queries are enriched with relevant follow-up questions and answers. The added context helps shift the focus from surface-level traits to meaningful criteria, such as helpfulness, and can even reverse model rankings. It also reveals underlying biases: models often default to WEIRD assumptions. While the study uses a limited set of context types and relies partly on automated scoring, it makes a strong case for more context-aware evaluations in future work.

Check out the Paper, Code, Dataset and Blog. All credit for this research goes to the researchers of this project.


AI, Committee, News, Uncategorized

CoSyn: The open-source tool that’s making GPT-4V-level vision AI accessible to everyone

Researchers at the University of Pennsylvania and the Allen Institute for Artificial Intelligence have developed a groundbreaking tool that allows open-source AI systems to match or surpass the visual understanding capabilities of proprietary models like GPT-4V and Gemini 1.5 Flash, potentially reshaping the competitive landscape between open and closed AI development.


AI, Committee, News, Uncategorized

FEEDER: A Pre-Selection Framework for Efficient Demonstration Selection in LLMs

LLMs have demonstrated exceptional performance across multiple tasks by utilizing few-shot inference, also known as in-context learning (ICL). The main problem lies in selecting the most representative demonstrations from large training datasets. Early methods selected demonstrations based on relevance, using similarity scores between each example and the input question. Current methods add further selection rules on top of similarity to make demonstration selection more effective, but these improvements introduce significant computational overhead as the number of shots increases. The effectiveness of selected demonstrations should also account for the specific LLM in use, as different LLMs exhibit varying capabilities and knowledge domains.

Researchers from Shanghai Jiao Tong University, Xiaohongshu Inc., Carnegie Mellon University, Peking University, University College London, and the University of Bristol have proposed FEEDER (FEw yet Essential Demonstration prE-selectoR), a method to identify a core subset of demonstrations containing the most representative examples in the training data, adjusted to the specific LLM. To construct this subset, "sufficiency" and "necessity" metrics are introduced in the pre-selection stage, along with a tree-based algorithm. FEEDER reduces training data size by 20% while maintaining performance, and it integrates seamlessly with downstream demonstration-selection techniques in ICL across LLMs ranging from 300M to 8B parameters.
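The sketch below gives a drastically simplified, greedy rendition of the pre-selection idea: a demonstration is pruned when the remaining set stays "sufficient" for a set of probe examples, meaning the pruned demonstration was not "necessary". The toy `llm_correct` oracle and the greedy loop are illustrative assumptions; the actual FEEDER method uses a tree-based algorithm against a real LLM.

```python
def llm_correct(demos: list[str], example: str) -> bool:
    # Toy oracle: "correct" if some demonstration shares a word with the
    # example. In FEEDER this is a query to the target LLM itself.
    words = set(example.lower().split())
    return any(words & set(d.lower().split()) for d in demos)

def preselect(pool: list[str], probes: list[str]) -> list[str]:
    core = list(pool)
    for demo in pool:
        reduced = [d for d in core if d != demo]
        # `demo` is unnecessary if the remaining set is still sufficient,
        # i.e., every probe is still answered correctly without it.
        if all(llm_correct(reduced, probe) for probe in probes):
            core = reduced
    return core

pool = ["add 2 and 3", "add 7 and 1", "multiply 4 by 5"]
probes = ["add 9 and 6", "multiply 2 by 8"]
print(preselect(pool, probes))  # one redundant "add" demo is pruned
```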
FEEDER is evaluated on six text classification datasets: SST-2, SST-5, COLA, TREC, SUBJ, and FPB, covering tasks from sentiment classification and linguistic analysis to textual entailment. It is also evaluated on the reasoning dataset GSM8K, the semantic-parsing dataset SMCalFlow, and the scientific question-answering dataset GPQA. The official splits of each dataset provide the training and test data. Multiple LLM variants serve as the base models, including two GPT-2 variants, GPT-neo with 1.3B parameters, GPT-3 with 6B parameters, Gemma-2 with 2B parameters, Llama-2 with 7B parameters, Llama-3 with 8B parameters, and Qwen-2.5 with 32B parameters.

On in-context learning performance, FEEDER retains almost half the training samples while achieving superior or comparable performance. Evaluation of few-shot performance on complex tasks with LLMs such as Gemma-2 shows that FEEDER improves performance even when LLMs struggle with challenging tasks. It remains effective with large numbers of shots, handling situations where LLM performance typically drops as the number of examples increases from 5 to 10 due to noisy or repeated demonstrations. By evaluating the sufficiency and necessity of each demonstration, FEEDER minimizes negative impact on LLM performance and improves performance stability.

On bi-level optimization, FEEDER achieves improved performance by utilizing a small yet high-quality dataset for fine-tuning while simultaneously reducing computational expense, in line with the core-set selection principle. Results indicate that fine-tuning LLMs provides greater performance improvements than augmenting LLMs with contexts, with FEEDER achieving even better gains in fine-tuning settings.

Performance analysis reveals that FEEDER's effectiveness first rises and then drops as the number of runs and rounds (R and K, respectively) increases, confirming that identifying representative subsets of training data enhances LLM performance, while overly narrow subsets may limit potential gains.

In conclusion, the researchers introduced FEEDER, a demonstration pre-selector designed to use LLM capabilities and domain knowledge to identify high-quality demonstrations through an efficient discovery approach. It reduces training data requirements while maintaining comparable performance, offering a practical solution for efficient LLM deployment. Future research directions include exploring applications with larger LLMs and extending FEEDER's capabilities to areas such as data safety and data management. FEEDER makes a valuable contribution to demonstration selection, providing researchers and practitioners with an effective tool for optimizing LLM performance while reducing computational overhead.

Check out the Paper. All credit for this research goes to the researchers of this project.


AI, Committee, News, Uncategorized

RoboBrain 2.0: The Next-Generation Vision-Language Model Unifying Embodied AI for Advanced Robotics

Advancements in artificial intelligence are rapidly closing the gap between digital reasoning and real-world interaction. At the forefront of this progress is embodied AI, the field focused on enabling robots to perceive, reason, and act effectively in physical environments. As industries look to automate complex spatial and temporal tasks, from household assistance to logistics, AI systems that truly understand their surroundings and can plan actions become critical.

Introducing RoboBrain 2.0: A Breakthrough in Embodied Vision-Language AI

Developed by the Beijing Academy of Artificial Intelligence (BAAI), RoboBrain 2.0 marks a major milestone in the design of foundation models for robotics and embodied artificial intelligence. Unlike conventional AI models, RoboBrain 2.0 unifies spatial perception, high-level reasoning, and long-horizon planning within a single architecture. Its versatility supports a diverse set of embodied tasks, such as affordance prediction, spatial object localization, trajectory planning, and multi-agent collaboration.

Key Highlights of RoboBrain 2.0

Two scalable versions: a fast, resource-efficient 7-billion-parameter (7B) variant and a more powerful 32-billion-parameter (32B) model for demanding tasks.
Unified multi-modal architecture: couples a high-resolution vision encoder with a decoder-only language model, enabling seamless integration of images, video, text instructions, and scene graphs.
Advanced spatial and temporal reasoning: excels at tasks requiring an understanding of object relationships, motion forecasting, and complex multi-step planning.
Open-source foundation: built on the FlagScale framework and designed for easy research adoption, reproducibility, and practical deployment.

How RoboBrain 2.0 Works: Architecture and Training

Multi-Modal Input Pipeline

RoboBrain 2.0 ingests a diverse mix of sensory and symbolic data:

Multi-view images and videos: high-resolution, egocentric, and third-person visual streams for rich spatial context.
Natural language instructions: commands ranging from simple navigation to intricate manipulation instructions.
Scene graphs: structured representations of objects, their relationships, and environmental layouts.

The system's tokenizer encodes language and scene graphs, while a specialized vision encoder uses adaptive positional encoding and windowed attention to process visual data efficiently. Visual features are projected into the language model's space via a multi-layer perceptron, yielding unified multimodal token sequences.
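A schematic sketch of that projection step, with toy dimensions and a stand-in vision encoder (the real model uses a high-resolution encoder with windowed attention, and its actual hidden sizes differ):

```python
import torch
import torch.nn as nn

# Toy patch encoder and MLP projector into the language model's token space.
vision_encoder = nn.Sequential(nn.Conv2d(3, 16, kernel_size=8, stride=8), nn.Flatten(2))
projector = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 64))

image = torch.rand(1, 3, 32, 32)
vision_feats = vision_encoder(image).transpose(1, 2)  # (1, 16 patches, 16 dims)
vision_tokens = projector(vision_feats)               # (1, 16, 64)

text_tokens = torch.rand(1, 12, 64)                   # embedded instruction tokens
multimodal_sequence = torch.cat([vision_tokens, text_tokens], dim=1)
print(multimodal_sequence.shape)                      # torch.Size([1, 28, 64])
```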
Three-Stage Training Process

RoboBrain 2.0 achieves its embodied intelligence through a progressive three-phase training curriculum:

Foundational spatiotemporal learning: builds core visual and language capabilities, grounding spatial perception and basic temporal understanding.
Embodied task enhancement: refines the model with real-world, multi-view video and high-resolution datasets, optimizing for tasks such as 3D affordance detection and robot-centric scene analysis.
Chain-of-thought reasoning: integrates explainable step-by-step reasoning using diverse activity traces and task decompositions, underpinning robust decision-making in long-horizon, multi-agent scenarios.

Scalable Infrastructure for Research and Deployment

RoboBrain 2.0 leverages the FlagScale platform, offering:

Hybrid parallelism for efficient use of compute resources.
Pre-allocated memory and high-throughput data pipelines that reduce training costs and latency.
Automatic fault tolerance to ensure stability across large-scale distributed systems.

This infrastructure enables rapid model training, easy experimentation, and scalable deployment in real-world robotic applications.

Real-World Applications and Performance

RoboBrain 2.0 is evaluated on a broad suite of embodied AI benchmarks, consistently surpassing both open-source and proprietary models in spatial and temporal reasoning. Key capabilities include:

Affordance prediction: identifying functional object regions for grasping, pushing, or other interaction.
Precise object localization and pointing: accurately following textual instructions to find and point to objects or vacant spaces in complex scenes.
Trajectory forecasting: planning efficient, obstacle-aware end-effector movements.
Multi-agent planning: decomposing tasks and coordinating multiple robots toward collaborative goals.

Its robust, open-access design makes RoboBrain 2.0 immediately useful for applications in household robotics, industrial automation, logistics, and beyond.

Potential in Embodied AI and Robotics

By unifying vision-language understanding, interactive reasoning, and robust planning, RoboBrain 2.0 sets a new standard for embodied AI. Its modular, scalable architecture and open-source training recipes facilitate innovation across the robotics and AI research community. Whether you are a developer building intelligent assistants, a researcher advancing AI planning, or an engineer automating real-world tasks, RoboBrain 2.0 offers a powerful foundation for tackling complex spatial and temporal challenges.

Check out the Paper and Codes. All credit for this research goes to the researchers of this project.


AI, Committee, News, Uncategorized

GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

arXiv:2507.18043v1 Announce Type: new Abstract: Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model’s logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model’s fluency and general capabilities.
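A compact sketch of the steering step described in the abstract: build a direction from hidden states of positively versus negatively attributed tokens, then shift a layer's activations and renormalize to preserve representational scale. The tensors, indices, and strength parameter are illustrative, and the Integrated Gradients attribution that would produce the token indices is assumed given rather than computed here.

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(10, 64)       # token hidden states at one transformer layer
pos_idx, neg_idx = [2, 5], [7]     # top-k token indices from IG attribution (assumed)

# Directional steering vector: desirable minus undesirable token states.
steer = hidden[pos_idx].mean(0) - hidden[neg_idx].mean(0)
steer = steer / steer.norm()

alpha = 2.0                                 # steering strength (illustrative)
old_norms = hidden.norm(dim=-1, keepdim=True)
steered = hidden + alpha * steer            # shift toward the desirable direction
steered = steered / steered.norm(dim=-1, keepdim=True) * old_norms  # preserve scale
print(steered.shape)                        # torch.Size([10, 64])
```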

