YouZum

Committee

AI, Committee, News, Uncategorized

Efficiency and Effectiveness of LLM-Based Summarization of Evidence in Crowdsourced Fact-Checking

arXiv:2501.18265v2 Announce Type: replace-cross Abstract: Evaluating the truthfulness of online content is critical for combating misinformation. This study examines the efficiency and effectiveness of crowdsourced truthfulness assessments through a comparative analysis of two approaches: one involving full-length webpages as evidence for each claim, and another using summaries for each evidence document generated with a large language model. Using an A/B testing setting, we engage a diverse pool of participants tasked with evaluating the truthfulness of statements under these conditions. Our analysis explores both the quality of assessments and the behavioral patterns of participants. The results reveal that relying on summarized evidence offers comparable accuracy and error metrics to the Standard modality while significantly improving efficiency. Workers in the Summary setting complete a significantly higher number of assessments, reducing task duration and costs. Additionally, the Summary modality maximizes internal agreement and maintains consistent reliance on and perceived usefulness of evidence, demonstrating its potential to streamline large-scale truthfulness evaluations.

AI, Committee, News, Uncategorized

Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning

arXiv:2505.00001v1 Announce Type: new Abstract: Large Language Models (LLMs) are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs’ logical reasoning and generalization capabilities in a controlled environment. We construct Rosetta-PL by translating a dataset of logical propositions from Lean into a custom logical language, which is then used to fine-tune an LLM (e.g., GPT-4o). Our experiments analyze the impact of the size of the dataset and the translation methodology on the performance of the model. Our results indicate that preserving logical relationships in the translation process significantly boosts precision, with accuracy plateauing beyond roughly 20,000 training samples. These insights provide valuable guidelines for optimizing LLM training in formal reasoning tasks and improving performance in various low-resource language applications.

AI, Committee, News, Uncategorized

A long-abandoned US nuclear technology is making a comeback in China

China has once again beaten everyone else to a clean energy milestone: its new nuclear reactor is reportedly one of the first to use thorium instead of uranium as a fuel and the first of its kind that can be refueled while it’s running.

It’s an interesting (if decidedly experimental) development out of a country that’s edging toward becoming the world leader in nuclear energy. China has now surpassed France in terms of generation, though not capacity; it still lags behind the US in both categories.

But one recurring theme in media coverage about the reactor struck me, because it’s so familiar: This technology was invented decades ago, and then abandoned.

You can basically copy and paste that line into countless stories about today’s advanced reactor technology. Molten-salt cooling systems? Invented in the mid-20th century but never commercialized. Same for several alternative fuels, like TRISO. And, of course, there’s thorium.

This one research reactor in China running with an alternative fuel says a lot about this moment for nuclear energy technology: Many groups are looking into the past for technologies, with a new appetite for building them.

First, it’s important to note that China is the hot spot for nuclear energy right now. While the US still has the most operational reactors in the world, China is catching up quickly. The country is building reactors at a remarkable clip and currently has more reactors under construction than any other country by far. Just this week, China approved 10 new reactors, totaling over $27 billion in investment.

China is also leading the way for some advanced reactor technologies (that category includes basically anything that deviates from the standard blueprint of what’s on the grid today: large reactors that use enriched uranium for fuel and high-pressure water to keep the reactor cool). High-temperature reactors that use gas as a coolant are one major area of focus for China: a few reactors that use this technology have recently started up, and more are in the planning stages or under construction.

Now, Chinese state media is reporting that scientists in the country reached a milestone with a thorium-based reactor. The reactor came online in June 2024, but researchers say it recently went through refueling without shutting down. (Conventional reactors generally need to be stopped to replenish the fuel supply.) The project’s lead scientists shared the results during a closed meeting at the Chinese Academy of Sciences.

I’ll emphasize here that this isn’t some massive power plant: This reactor is tiny. It generates just two megawatts of heat, less than the research reactor on MIT’s campus, which rings in at six megawatts. (To be fair, MIT’s is one of the largest university research reactors in the US, but still … it’s small.)

Regardless, progress is progress for thorium reactors, as the world has been entirely focused on uranium for the last 50 years or so.

Much of the original research on thorium came out of the US, which pumped resources into all sorts of different reactor technologies in the 1950s and ’60s. A reactor at Oak Ridge National Laboratory in Tennessee that ran in the 1960s used Uranium-233 fuel (which can be generated when thorium is bombarded with radiation).

Eventually, though, the world more or less settled on a blueprint for nuclear reactors, focusing on those that use Uranium-235 as fuel and are cooled by water at a high pressure. One reason for the focus on uranium for energy tech? The research could also be applied to nuclear weapons.

But now there’s a renewed interest in alternative nuclear technologies, and the thorium-fueled reactor is just one of several examples. A prominent one we’ve covered before: Kairos Power is building small nuclear reactors that use molten salt as a coolant, also a technology invented and developed in the 1950s and ’60s before being abandoned.

Another old-but-new concept is using high-temperature gas to cool reactors, as X-energy is aiming to do in its proposed power station at a chemical plant in Texas. (That reactor will be able to be refueled while it’s running, like the new thorium reactor.)

Some problems from decades ago that contributed to technologies being abandoned will still need to be dealt with today. In the case of molten-salt reactors, for example, it can be tricky to find materials that can withstand the corrosive properties of super-hot salt. For thorium reactors, the process of transforming thorium into U-233 fuel has historically been one of the hurdles.

But as early progress shows, the archives could provide fodder for new commercial reactors, and revisiting these old ideas could give the nuclear industry a much-needed boost.

This article is from The Spark, MIT Technology Review’s weekly climate newsletter, which is sent out every Wednesday.

AI, Committee, News, Uncategorized

Creating and Evaluating Code-Mixed Nepali-English and Telugu-English Datasets for Abusive Language Detection Using Traditional and Deep Learning Models

arXiv:2504.21026v1 Announce Type: new Abstract: With the growing presence of multilingual users on social media, detecting abusive language in code-mixed text has become increasingly challenging. Code-mixed communication, where users seamlessly switch between English and their native languages, poses difficulties for traditional abuse detection models, as offensive content may be context-dependent or obscured by linguistic blending. While abusive language detection has been extensively explored for high-resource languages like English and Hindi, low-resource languages such as Telugu and Nepali remain underrepresented, leaving gaps in effective moderation. In this study, we introduce a novel, manually annotated dataset of 2 thousand Telugu-English and 5 Nepali-English code-mixed comments, categorized as abusive and non-abusive, collected from various social media platforms. The dataset undergoes rigorous preprocessing before being evaluated with multiple machine learning (ML), deep learning (DL), and large language model (LLM) approaches. We experimented with models including Logistic Regression, Random Forest, Support Vector Machines (SVM), Neural Networks (NN), LSTM, CNN, and LLMs, optimized their performance through hyperparameter tuning, and evaluated them using 10-fold cross-validation and statistical significance testing (t-test). Our findings provide key insights into the challenges of detecting abusive language in code-mixed settings and offer a comparative analysis of computational approaches. This study contributes to advancing NLP for low-resource languages by establishing benchmarks for abusive language detection in Telugu-English and Nepali-English code-mixed text. The dataset and insights can aid in the development of more robust moderation strategies for multilingual social media environments.
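
For readers unfamiliar with the 10-fold cross-validation mentioned in the abstract, the following is a minimal sketch of how such a split can be produced. It is not the authors' code: the function name `k_fold_indices`, the seed, and the use of 2,000 samples (mirroring the Telugu-English figure quoted above) are assumptions made purely for illustration.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Shuffle once so that each fold is a random sample of the data.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]

    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Each sample is held out exactly once across the 10 rounds.
for train_idx, test_idx in k_fold_indices(2000, k=10):
    print(len(train_idx), len(test_idx))  # 1800 200 on every round
```

In practice, the per-fold scores of two models would then be compared with a significance test such as the t-test the abstract mentions.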

AI, Committee, News, Uncategorized

Waking Up an AI: A Quantitative Framework for Prompt-Induced Phase Transition in Large Language Models

arXiv:2504.21012v1 Announce Type: new Abstract: What underlies intuitive human thinking? One approach to this question is to compare the cognitive dynamics of humans and large language models (LLMs). However, such a comparison requires a method to quantitatively analyze AI cognitive behavior under controlled conditions. While anecdotal observations suggest that certain prompts can dramatically change LLM behavior, these observations have remained largely qualitative. Here, we propose a two-part framework to investigate this phenomenon: a Transition-Inducing Prompt (TIP) that triggers a rapid shift in LLM responsiveness, and a Transition Quantifying Prompt (TQP) that evaluates this change using a separate LLM. Through controlled experiments, we examined how LLMs react to prompts embedding two semantically distant concepts (e.g., mathematical aperiodicity and traditional crafts), either fused together or presented separately, by changing their linguistic quality and affective tone. Whereas humans tend to experience heightened engagement when such concepts are meaningfully blended to produce a novel concept (a form of conceptual fusion), current LLMs showed no significant difference in responsiveness between semantically fused and non-fused prompts. This suggests that LLMs may not yet replicate the conceptual integration processes seen in human intuition. Our method enables fine-grained, reproducible measurement of cognitive responsiveness, and may help illuminate key differences in how intuition and conceptual leaps emerge in artificial versus human minds.

AI, Committee, News, Uncategorized

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

arXiv:2504.21776v1 Announce Type: new Abstract: Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.

AI, Committee, News, Uncategorized

Retrieval, Reasoning, Re-ranking: A Context-Enriched Framework for Knowledge Graph Completion

arXiv:2411.08165v2 Announce Type: replace-cross Abstract: The Knowledge Graph Completion (KGC) task aims to infer the missing entity from an incomplete triple. Existing embedding-based methods rely solely on triples in the KG, which makes them vulnerable to specious relation patterns and long-tail entities. On the other hand, text-based methods struggle with the semantic gap between KG triples and natural language. Apart from triples, entity contexts (e.g., labels, descriptions, aliases) also play a significant role in augmenting KGs. To address these limitations, we propose KGR3, a context-enriched framework for KGC. KGR3 is composed of three modules. First, the Retrieval module gathers supporting triples from the KG, collects plausible candidate answers from a base embedding model, and retrieves context for each related entity. Then, the Reasoning module employs a large language model to generate potential answers for each query triple. Finally, the Re-ranking module combines candidate answers from the two preceding modules and fine-tunes an LLM to provide the best answer. Extensive experiments on widely used datasets demonstrate that KGR3 consistently improves various KGC methods. Specifically, the best variant of KGR3 achieves absolute Hits@1 improvements of 12.3% and 5.6% on the FB15k237 and WN18RR datasets.
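
The Hits@1 figures quoted above refer to a standard ranking metric: the fraction of queries for which the correct entity is ranked first (more generally, within the top k). A minimal sketch of the computation follows; the function name `hits_at_k` and the toy triples are hypothetical and do not come from the paper or its datasets.

```python
from typing import Dict, List


def hits_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int = 1) -> float:
    """Return the fraction of queries whose gold entity is among the top-k candidates.

    `rankings` maps each query to its candidate entities, best first.
    `gold` maps each query to the single correct entity.
    """
    if not gold:
        raise ValueError("gold must contain at least one query")
    hits = sum(
        1 for query, answer in gold.items() if answer in rankings.get(query, [])[:k]
    )
    return hits / len(gold)


# Hypothetical toy data: two incomplete triples and their ranked candidates.
rankings = {
    "(Paris, capital_of, ?)": ["France", "Italy", "Spain"],
    "(Tokyo, capital_of, ?)": ["China", "Japan", "Korea"],
}
gold = {
    "(Paris, capital_of, ?)": "France",
    "(Tokyo, capital_of, ?)": "Japan",
}

print(hits_at_k(rankings, gold, k=1))  # 0.5: only the first query is right at rank 1
print(hits_at_k(rankings, gold, k=3))  # 1.0: both gold entities appear in the top 3
```

An absolute improvement of 12.3% in Hits@1 therefore means that the share of queries answered correctly at rank 1 rises by 12.3 percentage points.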

AI, Committee, News, Uncategorized

Context Selection and Rewriting for Video-based Educational Question Generation

arXiv:2504.19406v2 Announce Type: replace Abstract: Educational question generation (EQG) is a crucial component of intelligent educational systems, significantly aiding self-assessment, active learning, and personalized education. While EQG systems have emerged, existing datasets typically rely on predefined, carefully edited texts, failing to represent real-world classroom content, including lecture speech with a set of complementary slides. To bridge this gap, we collect a dataset of educational questions based on lectures from real-world classrooms. On this realistic dataset, we find that current methods for EQG struggle with accurately generating questions from educational videos, particularly in aligning with specific timestamps and target answers. Common challenges include selecting informative contexts from extensive transcripts and ensuring generated questions meaningfully incorporate the target answer. To address these challenges, we introduce a novel framework utilizing large language models for dynamically selecting and rewriting contexts based on target timestamps and answers. First, our framework selects contexts from both lecture transcripts and video keyframes based on answer relevance and temporal proximity. Then, we integrate the contexts selected from both modalities and rewrite them into answer-containing knowledge statements to enhance the logical connection between the contexts and the desired answer. This approach significantly improves the quality and relevance of the generated questions. Our dataset and code are released at https://github.com/mengxiayu/COSER.

AI, Committee, News, Uncategorized

It’s the same but not the same: Do LLMs distinguish Spanish varieties?

arXiv:2504.20049v1 Announce Type: new Abstract: In recent years, large language models (LLMs) have demonstrated a high capacity for understanding and generating text in Spanish. However, with five hundred million native speakers, Spanish is not a homogeneous language but rather one rich in diatopic variations spanning both sides of the Atlantic. For this reason, in this study, we evaluate the ability of nine language models to identify and distinguish the morphosyntactic and lexical peculiarities of seven varieties of Spanish (Andean, Antillean, Continental Caribbean, Chilean, Peninsular, Mexican and Central American, and Rioplatense) through a multiple-choice test. The results indicate that the Peninsular Spanish variety is the best identified by all models and that, among them, GPT-4o is the only model capable of recognizing the variability of the Spanish language.

AI, Committee, News, Uncategorized

MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?

arXiv:2504.20094v1 Announce Type: cross Abstract: In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation systems, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent analysis, candidate generation, ranking, re-ranking, explainability, and safeguards. These agents collaboratively improve recommendation accuracy, diversity, and safety. On eight metrics, our model achieves superior or comparable performance to the current state-of-the-art. Through comparisons with six baseline models, our approach addresses key challenges in conversational recommendation systems for game recommendations, including: (1) handling complex, user-specific requests, (2) enhancing personalization through multi-agent collaboration, (3) empirical evaluation and deployment, and (4) ensuring safe and trustworthy interactions.
