Context Selection and Rewriting for Video-based Educational Question Generation

arXiv:2504.19406v2 Announce Type: replace Abstract: Educational question generation (EQG) is a crucial component of intelligent educational systems, significantly aiding self-assessment, active learning, and personalized education. While EQG systems have emerged, existing datasets typically rely on predefined, carefully edited texts and fail to represent real-world classroom content such as lecture speech accompanied by complementary slides. To bridge this gap, we collect a dataset of educational questions based on lectures from real-world classrooms. On this realistic dataset, we find that current EQG methods struggle to generate accurate questions from educational videos, particularly questions aligned with specific timestamps and target answers. Common challenges include selecting informative contexts from long transcripts and ensuring that generated questions meaningfully incorporate the target answer. To address these challenges, we introduce a novel framework that uses large language models to dynamically select and rewrite contexts based on target timestamps and answers. First, our framework selects contexts from both lecture transcripts and video keyframes based on answer relevance and temporal proximity. Then, we integrate the contexts selected from both modalities and rewrite them into answer-containing knowledge statements, strengthening the logical connection between the contexts and the desired answer. This approach significantly improves the quality and relevance of the generated questions. Our dataset and code are released at https://github.com/mengxiayu/COSER.
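The two-step pipeline described in the abstract (context selection by answer relevance and temporal proximity, then rewriting into answer-containing knowledge statements before question generation) can be pictured with a minimal sketch. The function names, scoring heuristic, and prompts below are illustrative assumptions, not the released COSER implementation:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str          # transcript sentence or keyframe caption
    timestamp: float   # seconds from the start of the lecture

def select_contexts(segments, answer, target_time, top_k=5, window=60.0):
    """Score segments by answer relevance and temporal proximity (toy heuristic)."""
    def score(seg):
        relevance = sum(w in seg.text.lower() for w in answer.lower().split())
        proximity = 1.0 if abs(seg.timestamp - target_time) <= window else 0.0
        return relevance + proximity
    return sorted(segments, key=score, reverse=True)[:top_k]

def rewrite_to_knowledge(contexts, answer, llm):
    """Rewrite selected contexts into an answer-containing knowledge statement."""
    prompt = (
        "Rewrite the following lecture excerpts into a single factual statement "
        f"that explicitly contains the answer '{answer}':\n"
        + "\n".join(c.text for c in contexts)
    )
    return llm(prompt)

def generate_question(knowledge, answer, llm):
    prompt = f"Write a question whose answer is '{answer}', based on: {knowledge}"
    return llm(prompt)

if __name__ == "__main__":
    echo_llm = lambda p: p.splitlines()[-1]  # stand-in for a real LLM call
    transcript = [
        Segment("Backpropagation computes gradients layer by layer.", 905.0),
        Segment("Next week we cover optimizers.", 1800.0),
    ]
    ctx = select_contexts(transcript, answer="backpropagation", target_time=900.0)
    knowledge = rewrite_to_knowledge(ctx, "backpropagation", echo_llm)
    print(generate_question(knowledge, "backpropagation", echo_llm))
```

In a real run, the `llm` callable would wrap an actual model call, and the relevance and proximity scores would come from the selection strategy the paper describes rather than this keyword-and-window heuristic.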

It’s the same but not the same: Do LLMs distinguish Spanish varieties?

arXiv:2504.20049v1 Announce Type: new Abstract: In recent years, large language models (LLMs) have demonstrated a high capacity for understanding and generating text in Spanish. However, with five hundred million native speakers, Spanish is not a homogeneous language but one rich in diatopic variation spanning both sides of the Atlantic. For this reason, in this study we evaluate the ability of nine language models to identify and distinguish the morphosyntactic and lexical peculiarities of seven varieties of Spanish (Andean, Antillean, Continental Caribbean, Chilean, Peninsular, Mexican and Central American, and Rioplatense) through a multiple-choice test. The results indicate that the Peninsular variety is the best identified by all models and that, among them, GPT-4o is the only model capable of recognizing the variability of the Spanish language.
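A minimal sketch of how such a multiple-choice evaluation could be scored per variety is shown below; the item format, model interface, and example sentence are assumptions for illustration, not the authors' exact protocol:

```python
from collections import defaultdict

# Each item pairs a sentence with candidate varieties and the gold label.
ITEMS = [
    {"sentence": "¿Vos querés tomar unos mates?",
     "choices": ["Peninsular", "Rioplatense", "Andean"],
     "gold": "Rioplatense"},
]

def ask_model(model, sentence, choices):
    """Stand-in for querying an LLM; a real run would prompt with the sentence and choices."""
    return model(sentence, choices)

def accuracy_by_variety(model, items):
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        pred = ask_model(model, item["sentence"], item["choices"])
        totals[item["gold"]] += 1
        hits[item["gold"]] += int(pred == item["gold"])
    return {variety: hits[variety] / totals[variety] for variety in totals}

if __name__ == "__main__":
    dummy_model = lambda sentence, choices: choices[0]  # always picks the first option
    print(accuracy_by_variety(dummy_model, ITEMS))
```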

MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?

arXiv:2504.20094v1 Announce Type: cross Abstract: In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation systems, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent analysis, candidate generation, ranking, re-ranking, explainability, and safeguards. These agents collaboratively improve recommendation accuracy, diversity, and safety. On eight metrics, our model achieves performance superior or comparable to the current state of the art. Through comparisons with six baseline models, we show that our approach addresses key challenges in conversational recommendation systems for game recommendations, including: (1) handling complex, user-specific requests, (2) enhancing personalization through multi-agent collaboration, (3) empirical evaluation and deployment, and (4) ensuring safe and trustworthy interactions.
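A minimal sketch of a sequential multi-agent pipeline covering a subset of the roles listed above (intent analysis, candidate generation, ranking, safeguards, explainability) might look like the following; the agent interfaces, shared-state design, and game catalog are illustrative assumptions, not MATCHA's actual architecture:

```python
from typing import Callable, Dict, List

Agent = Callable[[Dict], Dict]  # each agent reads and extends a shared state dict

def intent_agent(state: Dict) -> Dict:
    state["intent"] = {"genre": "strategy"} if "strategy" in state["query"] else {}
    return state

def candidate_agent(state: Dict) -> Dict:
    state["candidates"] = ["Chess Ultra", "Civilization VI", "Stardew Valley"]
    return state

def ranking_agent(state: Dict) -> Dict:
    state["ranked"] = sorted(state["candidates"])  # placeholder for an LLM- or score-based ranker
    return state

def safeguard_agent(state: Dict) -> Dict:
    state["ranked"] = [g for g in state["ranked"] if g not in state.get("blocked", [])]
    return state

def explain_agent(state: Dict) -> Dict:
    state["explanations"] = {g: f"Matches your request: {state['query']}" for g in state["ranked"]}
    return state

PIPELINE: List[Agent] = [intent_agent, candidate_agent, ranking_agent, safeguard_agent, explain_agent]

def recommend(query: str) -> Dict:
    state: Dict = {"query": query, "blocked": []}
    for agent in PIPELINE:
        state = agent(state)
    return state

if __name__ == "__main__":
    print(recommend("I want a relaxing strategy game")["ranked"])
```

In the framework described by the abstract, each of these stages would be backed by an LLM agent rather than the toy rules used here, and a re-ranking agent would further refine the ordered list before explanations are generated.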
