YouZum


AI, Committee, News, Uncategorized

Reasoning Promotes Robustness in Theory of Mind Tasks

arXiv:2601.16853v1 Announce Type: cross Abstract: Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests, prompting debate about the nature and true performance of the underlying capabilities. At the same time, reasoning-oriented LLMs trained via reinforcement learning with verifiable rewards (RLVR) have achieved notable improvements across a range of benchmarks. This paper examines the behavior of such reasoning models in ToM tasks, using novel adaptations of machine psychological experiments and results from established benchmarks. We observe that reasoning models consistently exhibit increased robustness to prompt variations and task perturbations. Our analysis indicates that the observed gains are more plausibly attributed to increased robustness in finding the correct solution, rather than to fundamentally new forms of ToM reasoning. We discuss the implications of this interpretation for evaluating social-cognitive behavior in LLMs.
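The abstract does not include code, but the robustness measurement it describes can be illustrated with a minimal sketch: score a model on a Theory of Mind item and on perturbed rewordings of it, then compare accuracy across the variants. The ask_model callable, the example item, and the perturbations below are hypothetical placeholders, not the paper's materials.

from typing import Callable, List

def robustness_score(ask_model: Callable[[str], str],
                     prompts: List[str],
                     expected: str) -> float:
    """Fraction of prompt variants answered with the expected belief attribution."""
    correct = sum(1 for p in prompts if expected.lower() in ask_model(p).lower())
    return correct / len(prompts)

# Hypothetical false-belief item plus simple surface perturbations.
base = ("Sally puts her ball in the basket and leaves. "
        "Anne moves the ball to the box. Where will Sally look for the ball?")
variants = [
    base,
    base.replace("ball", "marble"),
    "Answer briefly. " + base,
    base + " Think step by step before answering.",
]

def toy_model(prompt: str) -> str:
    # Stand-in for an LLM call; a reasoning model would be queried here.
    return "Sally will look in the basket."

print(robustness_score(toy_model, variants, expected="basket"))  # 1.0 for the toy model

A reasoning model that is merely more robust, in the paper's sense, would keep this score high across perturbations without needing a qualitatively different answer strategy.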

Reasoning Promotes Robustness in Theory of Mind Tasks Read Post »

AI, Committee, News, Uncategorized

Retrieve-Refine-Calibrate: A Framework for Complex Claim Fact-Checking

arXiv:2601.16555v1 Announce Type: new Abstract: Fact-checking aims to verify the truthfulness of a claim based on the retrieved evidence. Existing methods typically follow a decomposition paradigm, in which a claim is broken down into sub-claims that are individually verified. However, the decomposition paradigm may introduce noise to the verification process due to irrelevant entities or evidence, ultimately degrading verification accuracy. To address this problem, we propose a Retrieve-Refine-Calibrate (RRC) framework based on large language models (LLMs). Specifically, the framework first identifies the entities mentioned in the claim and retrieves evidence relevant to them. Then, it refines the retrieved evidence based on the claim to reduce irrelevant information. Finally, it calibrates the verification process by re-evaluating low-confidence predictions. Experiments on two popular fact-checking datasets (HOVER and FEVEROUS-S) demonstrate that our framework achieves superior performance compared with competitive baselines.
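The paper's implementation is not part of this excerpt; the sketch below only illustrates the three-stage Retrieve-Refine-Calibrate control flow described in the abstract. The retrieve_evidence, llm_refine, and llm_verify functions, the confidence threshold, and the re-check strategy are hypothetical stand-ins for the retriever and LLM calls.

from dataclasses import dataclass
from typing import List

@dataclass
class Verdict:
    label: str         # "SUPPORTED" / "REFUTED" / "NOT ENOUGH INFO"
    confidence: float  # model-reported confidence in [0, 1]

def retrieve_evidence(claim: str) -> List[str]:
    # Stand-in: look up evidence for the entities mentioned in the claim.
    return ["...retrieved passage about the claim's entities..."]

def llm_refine(claim: str, evidence: List[str]) -> List[str]:
    # Stand-in: keep only evidence sentences relevant to the claim.
    return [e for e in evidence if e.strip()]

def llm_verify(claim: str, evidence: List[str]) -> Verdict:
    # Stand-in: first-pass verification with a confidence estimate.
    return Verdict(label="SUPPORTED", confidence=0.55)

def rrc_fact_check(claim: str, threshold: float = 0.7) -> Verdict:
    evidence = retrieve_evidence(claim)        # Retrieve
    refined = llm_refine(claim, evidence)      # Refine
    verdict = llm_verify(claim, refined)       # Initial verification
    if verdict.confidence < threshold:         # Calibrate: re-evaluate low-confidence predictions
        verdict = llm_verify(claim, refined + evidence)
    return verdict

print(rrc_fact_check("Claim composed of multiple linked facts."))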

Retrieve-Refine-Calibrate: A Framework for Complex Claim Fact-Checking Read Post »

AI, Committee, News, Uncategorized

Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

arXiv:2601.16527v1 Announce Type: cross Abstract: Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal that remains stable under weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.
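SARE's code is not given here. As a rough illustration of the min-max idea only, a generic sharpness-aware minimization (SAM-style) step applied to an unlearning loss might look like the sketch below; the model, erasure loss, and hyperparameters are assumptions, and the targeting of specific hallucinated concepts is omitted.

import torch

def targeted_sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    """One sharpness-aware update: perturb weights toward the worst case, then descend."""
    # First pass: gradient of the erasure loss at the current weights.
    loss = loss_fn(model, batch)
    loss.backward()
    grads = [p.grad.detach().clone() if p.grad is not None else None for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None)) + 1e-12
    # Ascend to the (approximate) worst-case weights inside an L2 ball of radius rho.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            if g is not None:
                p.add_(rho * g / grad_norm)
    optimizer.zero_grad()
    # Second pass: the gradient at the perturbed point drives the actual update.
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            if g is not None:
                p.sub_(rho * g / grad_norm)  # restore the original weights
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Minimal usage on a toy model (a stand-in for a multimodal LLM component).
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
batch = (torch.randn(8, 4), torch.randint(0, 2, (8,)))
erase_loss = lambda m, b: -torch.nn.functional.cross_entropy(m(b[0]), b[1])  # toy stand-in for an erasure loss
print(targeted_sam_step(model, erase_loss, batch, opt))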

Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs Read Post »

AI, Committee, News, Uncategorized

Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification

arXiv:2601.16400v1 Announce Type: new Abstract: Real-world visual question answering (VQA) is often context-dependent: an image-question pair may be under-specified, such that the correct answer depends on external information that is not observable in the image. In such cases, directly answering can lead to confident but incorrect predictions. We propose CoA (Clarify-or-Answer), an ask-or-answer agent that separately models the decision to ask or answer, and what to ask if needed. CoA first determines whether clarification is necessary; if so, it asks a single focused question and then incorporates the response to produce the final answer. We introduce CONTEXTCLARIFY, a set of ambiguous VQA questions paired with a non-ambiguous contrast set. We further introduce GRPO-CR (Clarification Reasoning), a reinforcement learning approach that optimizes clarification question generation with multiple reward signals encouraging well-formed, focused, non-trivial questions that resolve ambiguity. Across three VLLMs and three datasets, CoA achieves consistent improvements at both the module and system levels, improving end-to-end VQA accuracy by an average of +15.3 points (83%) over prompting-based baselines.
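The abstract describes a two-part decision (whether to ask, then what to ask). A minimal sketch of that ask-or-answer control flow is below; it is not the paper's CoA implementation or GRPO-CR training, and the gate, stand-in VLM, and example are hypothetical.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class VQAInput:
    image_id: str
    question: str
    user_context: Optional[str] = None  # external info the image alone cannot provide

def needs_clarification(x: VQAInput, vlm_confidence: float, threshold: float = 0.6) -> bool:
    # Assumed gate: ask only when the direct answer is low-confidence and no context is given.
    return x.user_context is None and vlm_confidence < threshold

def clarify_or_answer(x: VQAInput,
                      answer_fn: Callable[[VQAInput], tuple],
                      ask_fn: Callable[[VQAInput], str],
                      get_user_reply: Callable[[str], str]) -> str:
    answer, confidence = answer_fn(x)
    if not needs_clarification(x, confidence):
        return answer
    question = ask_fn(x)                        # one focused clarification question
    x.user_context = get_user_reply(question)   # incorporate the response
    answer, _ = answer_fn(x)
    return answer

# Toy stand-ins for the VLM and the user.
answer_fn = lambda x: ("blue jacket", 0.9) if x.user_context else ("unknown", 0.3)
ask_fn = lambda x: "Which person in the photo are you asking about?"
get_user_reply = lambda q: "the person on the left"

print(clarify_or_answer(VQAInput("img_001", "What is she wearing?"), answer_fn, ask_fn, get_user_reply))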

Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification Read Post »

AI, Committee, News, Uncategorized

AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports

arXiv:2601.15297v2 Announce Type: replace Abstract: We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation from specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, providing a unique challenge for Information Retrieval (IR) systems, as the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap, where zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.
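A minimal sketch of the instance schema described in the abstract (question, evidence, verified answer, source metadata) and a simple exact-match scorer is below. The field names, the sample content, and the metric are illustrative assumptions, not taken from the released dataset.

from dataclasses import dataclass
from typing import List

@dataclass
class AfriEconQAInstance:
    question: str     # requires numerical/temporal reasoning over economic indicators
    evidence: str     # passage retrieved from the World Bank report corpus
    answer: str       # verified ground-truth answer
    source_url: str   # source metadata for temporal provenance
    published: str

def exact_match_accuracy(predictions: List[str], instances: List[AfriEconQAInstance]) -> float:
    hits = sum(p.strip().lower() == inst.answer.strip().lower()
               for p, inst in zip(predictions, instances))
    return hits / len(instances)

# Hypothetical instance shaped like the schema described in the abstract.
sample = AfriEconQAInstance(
    question="What was country X's reported GDP growth rate in fiscal year Y?",
    evidence="...excerpt from the relevant World Bank report...",
    answer="4.2 percent",
    source_url="https://documents.worldbank.org/...",
    published="2024-05-01",
)
print(exact_match_accuracy(["4.2 percent"], [sample]))  # 1.0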

AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports Read Post »

AI, Committee, News, Uncategorized

How Machine Learning and Semantic Embeddings Reorder CVE Vulnerabilities Beyond Raw CVSS Scores

In this tutorial, we build an AI-assisted vulnerability scanner that goes beyond static CVSS scoring and instead learns to prioritize vulnerabilities using semantic understanding and machine learning. We treat vulnerability descriptions as rich linguistic artifacts, embed them using modern sentence transformers, and combine these representations with structural metadata to produce a data-driven priority score. We also demonstrate how security teams can shift from rule-based triage to adaptive, explainable, ML-driven risk assessment. Check out the FULL CODES here.

print("Installing required packages...")
import subprocess
import sys

packages = [
    'sentence-transformers',
    'scikit-learn',
    'pandas',
    'numpy',
    'matplotlib',
    'seaborn',
    'requests'
]
for package in packages:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import json
import re
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns

print("✓ All packages installed successfully!\n")

We install and load all required NLP, machine learning, and visualization libraries for the end-to-end pipeline. We ensure the runtime is fully self-contained and ready to execute in Colab or similar notebook environments. This establishes a reproducible foundation for the scanner. Check out the FULL CODES here.
class CVEDataFetcher:
    def __init__(self):
        self.base_url = "https://services.nvd.nist.gov/rest/json/cves/2.0"

    def fetch_recent_cves(self, days=30, max_results=100):
        print(f"Fetching CVEs from last {days} days...")
        end_date = datetime.now()
        start_date = end_date - timedelta(days=days)
        params = {
            'pubStartDate': start_date.strftime('%Y-%m-%dT00:00:00.000'),
            'pubEndDate': end_date.strftime('%Y-%m-%dT23:59:59.999'),
            'resultsPerPage': min(max_results, 2000)
        }
        try:
            response = requests.get(self.base_url, params=params, timeout=30)
            response.raise_for_status()
            data = response.json()
            cves = []
            for item in data.get('vulnerabilities', [])[:max_results]:
                cve = item.get('cve', {})
                cve_id = cve.get('id', 'Unknown')
                descriptions = cve.get('descriptions', [])
                description = next((d['value'] for d in descriptions if d['lang'] == 'en'), 'No description')
                metrics = cve.get('metrics', {})
                cvss_v3 = metrics.get('cvssMetricV31', [{}])[0].get('cvssData', {})
                cvss_v2 = metrics.get('cvssMetricV2', [{}])[0].get('cvssData', {})
                base_score = cvss_v3.get('baseScore') or cvss_v2.get('baseScore') or 0.0
                severity = cvss_v3.get('baseSeverity') or 'UNKNOWN'
                published = cve.get('published', '')
                references = cve.get('references', [])
                cves.append({
                    'cve_id': cve_id,
                    'description': description,
                    'cvss_score': float(base_score),
                    'severity': severity,
                    'published': published,
                    'reference_count': len(references),
                    'attack_vector': cvss_v3.get('attackVector', 'UNKNOWN'),
                    'attack_complexity': cvss_v3.get('attackComplexity', 'UNKNOWN'),
                    'privileges_required': cvss_v3.get('privilegesRequired', 'UNKNOWN'),
                    'user_interaction': cvss_v3.get('userInteraction', 'UNKNOWN')
                })
            print(f"✓ Fetched {len(cves)} CVEs\n")
            return pd.DataFrame(cves)
        except Exception as e:
            print(f"Error fetching CVEs: {e}")
            return self._generate_sample_data(max_results)

    def _generate_sample_data(self, n=50):
        print("Using sample CVE data for demonstration...\n")
        sample_descriptions = [
            "A buffer overflow vulnerability in the network driver allows remote code execution",
            "SQL injection vulnerability in web application login form enables unauthorized access",
            "Cross-site scripting (XSS) vulnerability in user input validation",
            "Authentication bypass in admin panel due to weak session management",
            "Remote code execution via deserialization of untrusted data",
            "Path traversal vulnerability allows reading arbitrary files",
            "Privilege escalation through improper input validation",
            "Denial of service through resource exhaustion in API endpoint",
            "Information disclosure via error messages exposing sensitive data",
            "Memory corruption vulnerability in image processing library",
            "Command injection in file upload functionality",
            "Integer overflow leading to heap buffer overflow",
            "Use-after-free vulnerability in memory management",
            "Race condition in multi-threaded application",
            "Cryptographic weakness in password storage mechanism"
        ]
        severities = ['LOW', 'MEDIUM', 'HIGH', 'CRITICAL']
        attack_vectors = ['NETWORK', 'ADJACENT', 'LOCAL', 'PHYSICAL']
        complexities = ['LOW', 'HIGH']
        data = []
        for i in range(n):
            severity = np.random.choice(severities, p=[0.1, 0.3, 0.4, 0.2])
            score_ranges = {'LOW': (0.1, 3.9), 'MEDIUM': (4.0, 6.9), 'HIGH': (7.0, 8.9), 'CRITICAL': (9.0, 10.0)}
            data.append({
                'cve_id': f'CVE-2024-{10000+i}',
                'description': np.random.choice(sample_descriptions),
                'cvss_score': np.random.uniform(*score_ranges[severity]),
                'severity': severity,
                'published': (datetime.now() - timedelta(days=np.random.randint(1, 30))).isoformat(),
                'reference_count': np.random.randint(1, 10),
                'attack_vector': np.random.choice(attack_vectors),
                'attack_complexity': np.random.choice(complexities),
                'privileges_required': np.random.choice(['NONE', 'LOW', 'HIGH']),
                'user_interaction': np.random.choice(['NONE', 'REQUIRED'])
            })
        return pd.DataFrame(data)

We implement a robust CVE ingestion component that pulls recent vulnerabilities directly from the NVD API. We normalize raw CVE records into structured features while gracefully falling back to synthetic data when API access fails. This keeps the tutorial runnable while reflecting real-world challenges in data ingestion. Check out the FULL CODES here.

class VulnerabilityFeatureExtractor:
    def __init__(self):
        print("Loading sentence transformer model...")
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        print("✓ Model loaded\n")
        self.critical_keywords = {
            'execution': ['remote code execution', 'rce', 'execute', 'arbitrary code'],
            'injection': ['sql injection', 'command injection', 'code injection'],
            'authentication': ['bypass', 'authentication', 'authorization'],
            'overflow': ['buffer overflow', 'heap overflow', 'stack overflow'],
            'exposure': ['information disclosure', 'data leak', 'exposure'],
        }

    def extract_semantic_features(self, descriptions):
        print("Generating semantic embeddings...")
        embeddings = self.model.encode(descriptions, show_progress_bar=True)
        return embeddings

    def extract_keyword_features(self, df):
        print("Extracting keyword features...")
        for category, keywords in self.critical_keywords.items():
            df[f'has_{category}'] = df['description'].apply(
                lambda x: any(kw in x.lower() for kw in keywords)
            ).astype(int)
        df['desc_length'] = df['description'].apply(len)
        df['word_count'] = df['description'].apply(lambda x: len(x.split()))
        return df

    def encode_categorical_features(self, df):
        print("Encoding categorical features...")
        categorical_cols = ['attack_vector', 'attack_complexity', 'privileges_required', 'user_interaction']
        for col in categorical_cols:
            dummies = pd.get_dummies(df[col], prefix=col)
            df = pd.concat([df, dummies], axis=1)
        return df

We transform unstructured vulnerability descriptions into dense semantic embeddings using a sentence-transformer model. We also extract keyword-based risk indicators and textual statistics that capture exploit intent and complexity. Together, these features bridge linguistic context with quantitative ML inputs. Check out the FULL CODES here.
class VulnerabilityPrioritizer:
    def __init__(self):
        self.severity_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
        self.score_predictor = GradientBoostingRegressor(n_estimators=100, random_state=42)
        self.scaler = StandardScaler()
        self.feature_cols = None

    def prepare_features(self, df, embeddings):
        numeric_features = ['reference_count', 'desc_length', 'word_count']
        keyword_features = [col for col in df.columns if col.startswith('has_')]
        categorical_features = [col for col in df.columns if any(col.startswith(prefix) for prefix in
                                ['attack_vector_', 'attack_complexity_', 'privileges_required_', 'user_interaction_'])]
        self.feature_cols = numeric_features + keyword_features + categorical_features
        X_structured = df[self.feature_cols].values
        X_embeddings = embeddings
        X_combined = np.hstack([X_structured, X_embeddings])
        return X_combined

    def train_models(self, X, y_severity, y_score):
        print("\nTraining ML models...")
        X_scaled = self.scaler.fit_transform(X)
        X_train, X_test, y_sev_train, y_sev_test, y_score_train, y_score_test = train_test_split(
            X_scaled, y_severity, y_score, test_size=0.2, random_state=42
        )
        self.severity_classifier.fit(X_train, y_sev_train)
        sev_pred = self.severity_classifier.predict(X_test)
        self.score_predictor.fit(X_train, y_score_train)
        score_pred = self.score_predictor.predict(X_test)
        print("\n--- Severity Classification Report ---")
        print(classification_report(y_sev_test, sev_pred))
        print(f"\n--- CVSS Score Prediction ---")
        print(f"RMSE: {np.sqrt(mean_squared_error(y_score_test, score_pred)):.2f}")
        return X_scaled

    def predict_priority(self, X):
        X_scaled = self.scaler.transform(X)
        severity_pred = self.severity_classifier.predict_proba(X_scaled)
        score_pred = self.score_predictor.predict(X_scaled)
        severity_weight = severity_pred[:, -1] * 0.4
        score_weight = (score_pred / 10.0) * 0.6
        priority_score = severity_weight + score_weight
        return priority_score, severity_pred, score_pred

    def get_feature_importance(self):
        importance = self.score_predictor.feature_importances_
        n_structured = len(self.feature_cols)
        structured_importance = importance[:n_structured]
        embedding_importance = importance[n_structured:]
        feature_imp_df = pd.DataFrame({
            'feature': self.feature_cols,
            'importance': structured_importance
        }).sort_values('importance', ascending=False)
        return feature_imp_df, embedding_importance.mean()

We train supervised
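To show how the components above connect, here is a minimal, hypothetical wiring of the three classes; the choice of training targets (the CVSS severity label and base score) is an assumption on our part.

# Minimal end-to-end wiring of the classes defined above (training targets are assumed).
fetcher = CVEDataFetcher()
df = fetcher.fetch_recent_cves(days=30, max_results=100)

extractor = VulnerabilityFeatureExtractor()
df = extractor.extract_keyword_features(df)
df = extractor.encode_categorical_features(df)
embeddings = extractor.extract_semantic_features(df['description'].tolist())

prioritizer = VulnerabilityPrioritizer()
X = prioritizer.prepare_features(df, embeddings)
prioritizer.train_models(X, df['severity'].values, df['cvss_score'].values)

priority, _, _ = prioritizer.predict_priority(X)
df['priority_score'] = priority
print(df.sort_values('priority_score', ascending=False)[['cve_id', 'severity', 'priority_score']].head())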

How Machine Learning and Semantic Embeddings Reorder CVE Vulnerabilities Beyond Raw CVSS Scores Read Post »

AI, Committee, News, Uncategorized

How an AI Agent Chooses What to Do Under Tokens, Latency, and Tool-Call Budget Constraints?

In this tutorial, we build a cost-aware planning agent that deliberately balances output quality against real-world constraints such as token usage, latency, and tool-call budgets. We design the agent to generate multiple candidate actions, estimate their expected costs and benefits, and then select an execution plan that maximizes value while staying within strict budgets. With this, we demonstrate how agentic systems can move beyond "always use the LLM" behavior and instead reason explicitly about trade-offs, efficiency, and resource awareness, which is critical for deploying agents reliably in constrained environments. Check out the FULL CODES here.

import os, time, math, json, random
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple, Any
from getpass import getpass

USE_OPENAI = True
if USE_OPENAI:
    if not os.getenv("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    try:
        from openai import OpenAI
        client = OpenAI()
    except Exception as e:
        print("OpenAI SDK import failed. Falling back to offline mode.\nError:", e)
        USE_OPENAI = False

We set up the execution environment and securely load the OpenAI API key at runtime without hardcoding it. We also initialize the client so the agent gracefully falls back to offline mode if the API is unavailable. Check out the FULL CODES here.

def approx_tokens(text: str) -> int:
    return max(1, math.ceil(len(text) / 4))

@dataclass
class Budget:
    max_tokens: int
    max_latency_ms: int
    max_tool_calls: int

@dataclass
class Spend:
    tokens: int = 0
    latency_ms: int = 0
    tool_calls: int = 0

    def within(self, b: Budget) -> bool:
        return (self.tokens <= b.max_tokens and
                self.latency_ms <= b.max_latency_ms and
                self.tool_calls <= b.max_tool_calls)

    def add(self, other: "Spend") -> "Spend":
        return Spend(
            tokens=self.tokens + other.tokens,
            latency_ms=self.latency_ms + other.latency_ms,
            tool_calls=self.tool_calls + other.tool_calls
        )

We define the core budgeting abstractions that enable the agent to reason explicitly about costs. We model token usage, latency, and tool calls as first-class quantities and provide utility methods to accumulate and validate spend. This gives us a clean foundation for enforcing constraints throughout planning and execution. Check out the FULL CODES here.

@dataclass
class StepOption:
    name: str
    description: str
    est_spend: Spend
    est_value: float
    executor: str
    payload: Dict[str, Any] = field(default_factory=dict)

@dataclass
class PlanCandidate:
    steps: List[StepOption]
    spend: Spend
    value: float
    rationale: str = ""

def llm_text(prompt: str, *, model: str = "gpt-5", effort: str = "low") -> str:
    if not USE_OPENAI:
        return ""
    t0 = time.time()
    resp = client.responses.create(
        model=model,
        reasoning={"effort": effort},
        input=prompt,
    )
    _ = (time.time() - t0)
    return resp.output_text or ""

We introduce the data structures that represent individual action choices and full plan candidates. We also define a lightweight LLM wrapper that standardizes how text is generated and measured. This separation allows the planner to reason about actions abstractly without being tightly coupled to execution details. Check out the FULL CODES here.
def generate_step_options(task: str) -> List[StepOption]:
    base = [
        StepOption(
            name="Clarify deliverables (local)",
            description="Extract deliverable checklist + acceptance criteria from the task.",
            est_spend=Spend(tokens=60, latency_ms=20, tool_calls=0),
            est_value=6.0,
            executor="local",
        ),
        StepOption(
            name="Outline plan (LLM)",
            description="Create a structured outline with sections, constraints, and assumptions.",
            est_spend=Spend(tokens=600, latency_ms=1200, tool_calls=1),
            est_value=10.0,
            executor="llm",
            payload={"prompt_kind": "outline"}
        ),
        StepOption(
            name="Outline plan (local)",
            description="Create a rough outline using templates (no LLM).",
            est_spend=Spend(tokens=120, latency_ms=40, tool_calls=0),
            est_value=5.5,
            executor="local",
        ),
        StepOption(
            name="Risk register (LLM)",
            description="Generate risks, mitigations, owners, and severity.",
            est_spend=Spend(tokens=700, latency_ms=1400, tool_calls=1),
            est_value=9.0,
            executor="llm",
            payload={"prompt_kind": "risks"}
        ),
        StepOption(
            name="Risk register (local)",
            description="Generate a standard risk register from a reusable template.",
            est_spend=Spend(tokens=160, latency_ms=60, tool_calls=0),
            est_value=5.0,
            executor="local",
        ),
        StepOption(
            name="Timeline (LLM)",
            description="Draft a realistic milestone timeline with dependencies.",
            est_spend=Spend(tokens=650, latency_ms=1300, tool_calls=1),
            est_value=8.5,
            executor="llm",
            payload={"prompt_kind": "timeline"}
        ),
        StepOption(
            name="Timeline (local)",
            description="Draft a simple timeline from a generic milestone template.",
            est_spend=Spend(tokens=150, latency_ms=60, tool_calls=0),
            est_value=4.8,
            executor="local",
        ),
        StepOption(
            name="Quality pass (LLM)",
            description="Rewrite for clarity, consistency, and formatting.",
            est_spend=Spend(tokens=900, latency_ms=1600, tool_calls=1),
            est_value=8.0,
            executor="llm",
            payload={"prompt_kind": "polish"}
        ),
        StepOption(
            name="Quality pass (local)",
            description="Light formatting + consistency checks without LLM.",
            est_spend=Spend(tokens=120, latency_ms=50, tool_calls=0),
            est_value=3.5,
            executor="local",
        ),
    ]
    if USE_OPENAI:
        meta_prompt = f"""
You are a planning assistant. For the task below, propose 3-5 OPTIONAL extra steps
that improve quality, like checks, validations, or stakeholder tailoring.
Keep each step short.

TASK: {task}

Return JSON list with fields: name, description, est_value(1-10).
"""
        txt = llm_text(meta_prompt, model="gpt-5", effort="low")
        try:
            items = json.loads(txt.strip())
            for it in items[:5]:
                base.append(
                    StepOption(
                        name=str(it.get("name", "Extra step (local)"))[:60],
                        description=str(it.get("description", ""))[:200],
                        est_spend=Spend(tokens=120, latency_ms=60, tool_calls=0),
                        est_value=float(it.get("est_value", 5.0)),
                        executor="local",
                    )
                )
        except Exception:
            pass
    return base

We focus on generating a diverse set of candidate steps, including both LLM-based and local alternatives with different cost–quality trade-offs. We optionally use the model itself to suggest additional low-cost improvements while still controlling their impact on the budget. By doing so, we enrich the action space without losing efficiency. Check out the FULL CODES here.
def plan_under_budget(
    options: List[StepOption],
    budget: Budget,
    *,
    max_steps: int = 6,
    beam_width: int = 12,
    diversity_penalty: float = 0.2
) -> PlanCandidate:
    def redundancy_cost(chosen: List[StepOption], new: StepOption) -> float:
        key_new = new.name.split("(")[0].strip().lower()
        overlap = 0
        for s in chosen:
            key_s = s.name.split("(")[0].strip().lower()
            if key_s == key_new:
                overlap += 1
        return overlap * diversity_penalty

    beams: List[PlanCandidate] = [PlanCandidate(steps=[], spend=Spend(), value=0.0, rationale="")]
    for _ in range(max_steps):
        expanded: List[PlanCandidate] = []
        for cand in beams:
            for opt in options:
                if opt in cand.steps:
                    continue
                new_spend = cand.spend.add(opt.est_spend)
                if not new_spend.within(budget):
                    continue
                new_value = cand.value + opt.est_value - redundancy_cost(cand.steps, opt)
                expanded.append(
                    PlanCandidate(
                        steps=cand.steps + [opt],
                        spend=new_spend,
                        value=new_value,
                        rationale=cand.rationale
                    )
                )
        if not expanded:
            break
        expanded.sort(key=lambda c: c.value, reverse=True)
        beams = expanded[:beam_width]
    best = max(beams, key=lambda c: c.value)
    return best

We implement the budget-constrained planning logic that searches for the highest-value combination of steps under strict limits. We apply a beam-style search with redundancy penalties to avoid wasteful action overlap. This is where the agent truly becomes cost-aware by optimizing value subject to constraints. Check out the FULL CODES here.

def run_local_step(task: str, step: StepOption, working: Dict[str, Any]) -> str:
    name = step.name.lower()
    if "clarify deliverables" in name:
        return (
            "Deliverables checklist:\n"
            "- Executive summary\n- Scope & assumptions\n- Workplan + milestones\n"
            "- Risk register (risk, impact, likelihood, mitigation, owner)\n"
            "- Next steps + data needed\n"
        )
    if "outline plan" in name:
        return
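A minimal usage sketch of the planning pieces defined above; the task text and the budget values are assumptions, not from the original post.

# Exercise the planner with an assumed task and budget.
task = "Prepare a project kickoff brief for a data-platform migration."
budget = Budget(max_tokens=2000, max_latency_ms=4000, max_tool_calls=2)

options = generate_step_options(task)
plan = plan_under_budget(options, budget)

print(f"Planned value: {plan.value:.1f}")
print(f"Spend: {plan.spend.tokens} tokens, {plan.spend.latency_ms} ms, {plan.spend.tool_calls} tool calls")
for step in plan.steps:
    print("-", step.name)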

How an AI Agent Chooses What to Do Under Tokens, Latency, and Tool-Call Budget Constraints? Read Post »

AI, Committee, News, Uncategorized

The Download: chatbots for health, and US fights over AI regulation

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

“Dr. Google” had its issues. Can ChatGPT Health do better?

For the past two decades, there’s been a clear first step for anyone who starts experiencing new medical symptoms: look them up online. The practice was so common that it gained the pejorative moniker “Dr. Google.” But times are changing, and many medical-information seekers are now using LLMs. According to OpenAI, 230 million people ask ChatGPT health-related queries each week.

That’s the context around the launch of OpenAI’s new ChatGPT Health product, which debuted earlier this month. The big question is: can the obvious risks of using AI for health-related queries be mitigated enough for them to be a net benefit? Read the full story.

—Grace Huckins

America’s coming war over AI regulation

In the final weeks of 2025, the battle over regulating artificial intelligence in the US reached boiling point. On December 11, after Congress failed twice to pass a law banning state AI laws, President Donald Trump signed a sweeping executive order seeking to handcuff states from regulating the booming industry.

Instead, he vowed to work with Congress to establish a “minimally burdensome” national AI policy. The move marked a victory for tech titans, who have been marshaling multimillion-dollar war chests to oppose AI regulations, arguing that a patchwork of state laws would stifle innovation. In 2026, the battleground will shift to the courts. While some states might back down from passing AI laws, others will charge ahead. Read our story about what’s on the horizon.

—Michelle Kim

This story is from MIT Technology Review’s What’s Next series of stories that look across industries, trends, and technologies to give you a first look at the future. You can read the rest of them here.

Measles is surging in the US. Wastewater tracking could help.

This week marked a rather unpleasant anniversary: it’s a year since Texas reported a case of measles—the start of a significant outbreak that ended up spreading across multiple states. Since the start of January 2025, there have been over 2,500 confirmed cases of measles in the US. Three people have died.

As vaccination rates drop and outbreaks continue, scientists have been experimenting with new ways to quickly identify new cases and prevent the disease from spreading. And they are starting to see some success with wastewater surveillance. Read the full story.

—Jessica Hamzelou

This story is from The Checkup, our weekly newsletter giving you the inside track on all things health and biotech. Sign up to receive it in your inbox every Thursday.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 The US is dismantling itself
A foreign enemy could not invent a better chain of events to wreck its standing in the world. (Wired $)
+ We need to talk about whether Donald Trump might be losing it. (New Yorker $)

2 Big Tech is taking on more debt to fund its AI aspirations
And the bubble just keeps growing. (WP $)
+ Forget unicorns. 2026 is shaping up to be the year of the “hectocorn.” (The Guardian)
+ Everyone in tech agrees we’re in a bubble. They just can’t agree on what happens when it pops. (MIT Technology Review)

3 DOGE accessed even more personal data than we thought
Even now, the Trump administration still can’t say how much data is at risk, or what it was used for. (NPR)

4 TikTok has finalized a deal to create a new US entity
Ending years of uncertainty about its fate in America. (CNN)
+ Why China is the big winner out of all of this. (FT $)

5 The US is now officially out of the World Health Organization
And it’s leaving behind nearly $300 million in bills unpaid. (Ars Technica)
+ The US withdrawal from the WHO will hurt us all. (MIT Technology Review)

6 AI-powered disinformation swarms pose a threat to democracy
A would-be autocrat could use them to persuade populations to accept cancelled elections or overturn results. (The Guardian)
+ The era of AI persuasion in elections is about to begin. (MIT Technology Review)

7 We’re about to start seeing more robots everywhere
But exactly what they’ll look like remains up for debate. (Vox $)
+ Chinese companies are starting to dominate entire sectors of AI and robotics. (MIT Technology Review)

8 Some people seem to be especially vulnerable to loneliness
If you’re ‘other-directed’, you could particularly benefit from less screentime. (New Scientist $)

9 This academic lost two years of work with a single click
TL;DR: don’t rely on ChatGPT to store your data. (Nature)

10 How animals develop a sense of direction
Their ‘internal compass’ seems to be informed by landmarks that help them form a mental map. (Quanta $)

Quote of the day

“The rate at which AI is progressing, I think we have AI that is smarter than any human this year, and no later than next year.”

—Elon Musk simply cannot resist the urge to make wild predictions at Davos, Wired reports.

One more thing

Africa fights rising hunger by looking to foods of the past

After falling steadily for decades, the prevalence of global hunger is now on the rise—nowhere more so than in sub-Saharan Africa.

Africa’s indigenous crops are often more nutritious and better suited to the hot and dry conditions that are becoming more prevalent, yet many have been neglected by science, which means they tend to be more vulnerable to diseases and pests and yield well below their theoretical potential. Now the question is whether researchers, governments, and farmers can work together in a way that gets these crops onto plates and provides Africans from all walks of life with the energy and nutrition that they need to thrive, whatever climate change throws their way. Read the full story.

—Jonathan W. Rosen

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ The only thing I fancy dry this January is a martini. Here’s how to make one.
+ If you absolutely adore the

The Download: chatbots for health, and US fights over AI regulation Read Post »

AI, Committee, News, Uncategorized

LLM or Human? Perceptions of Trust and Information Quality in Research Summaries

arXiv:2601.15556v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used to generate and edit scientific abstracts, yet their integration into academic writing raises questions about trust, quality, and disclosure. Despite growing adoption, little is known about how readers perceive LLM-generated summaries and how these perceptions influence evaluations of scientific work. This paper presents a mixed-methods survey experiment investigating whether readers with ML expertise can distinguish between human- and LLM-generated abstracts, how actual and perceived LLM involvement affects judgments of quality and trustworthiness, and what orientations readers adopt toward AI-assisted writing. Our findings show that participants struggle to reliably identify LLM-generated content, yet their beliefs about LLM involvement significantly shape their evaluations. Notably, abstracts edited by LLMs are rated more favorably than those written solely by humans or LLMs. We also identify three distinct reader orientations toward LLM-assisted writing, offering insights into evolving norms and informing policy around disclosure and acceptable use in scientific communication.

LLM or Human? Perceptions of Trust and Information Quality in Research Summaries Read Post »

AI, Committee, News, Uncategorized

ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms

arXiv:2601.15605v1 Announce Type: new Abstract: The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (with 13 percent improvement over BERT and F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.
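The authors' code is not included here. As a rough illustration of the hybrid recipe the abstract describes (embeddings of chat text that retain emote tokens, fed to a traditional classifier), here is a minimal sketch; the MiniLM encoder stands in for the LLM-generated (DeepSeek/Llama) embeddings used in the paper, and the messages, emote names, and labels are made up.

from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Toy chat lines in which the emote token carries part of the signal.
messages = [
    "nice play PogChamp", "uninstall the game trash KEKW",
    "gg well played", "go back to bronze LULW you bot",
    "love this stream <3", "actual dog water KEKW",
]
labels = [0, 1, 0, 1, 0, 1]  # 1 = toxic (illustrative labels)

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for LLM-generated embeddings
X = encoder.encode(messages)                        # emotes stay in the text, so they shape the embedding

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.33, random_state=0, stratify=labels)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))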

ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms Read Post »
