YouZum


AI, Committee, News, Uncategorized

How to Build an Advanced Agentic Retrieval-Augmented Generation (RAG) System with Dynamic Strategy and Smart Retrieval?

In this tutorial, we walk through the implementation of an Agentic Retrieval-Augmented Generation (RAG) system. We design it so that the agent does more than just retrieve documents; it actively decides when retrieval is needed, selects the best retrieval strategy, and synthesizes responses with contextual awareness. By combining embeddings, FAISS indexing, and a mock LLM, we create a practical demonstration of how agentic decision-making can elevate the standard RAG pipeline into something more adaptive and intelligent. Check out the FULL CODES here.

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
import json
import re
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum


class MockLLM:
    """Rule-based stand-in for a real LLM so the tutorial runs without API keys."""

    def generate(self, prompt: str, max_tokens: int = 150) -> str:
        prompt_lower = prompt.lower()
        if "decide whether to retrieve" in prompt_lower:
            if any(word in prompt_lower for word in ["specific", "recent", "data", "facts", "when", "who", "what"]):
                return "RETRIEVE: The query requires specific factual information that needs to be retrieved."
            else:
                return "NO_RETRIEVE: This is a general question that can be answered with existing knowledge."
        elif "choose retrieval strategy" in prompt_lower:
            if "comparison" in prompt_lower or "versus" in prompt_lower:
                return "STRATEGY: multi_query - Need to retrieve information about multiple entities for comparison."
            elif "recent" in prompt_lower or "latest" in prompt_lower:
                return "STRATEGY: temporal - Focus on recent information."
            else:
                return "STRATEGY: semantic - Standard semantic similarity search."
        elif "synthesize" in prompt_lower and "context:" in prompt_lower:
            return "Based on the retrieved information, here's a comprehensive answer that combines multiple sources and provides specific details with proper context."
        return "This is a mock response. In practice, use a real LLM like OpenAI's GPT or similar."


class RetrievalStrategy(Enum):
    SEMANTIC = "semantic"
    MULTI_QUERY = "multi_query"
    TEMPORAL = "temporal"
    HYBRID = "hybrid"


@dataclass
class Document:
    id: str
    content: str
    metadata: Dict[str, Any]
    embedding: Optional[np.ndarray] = None

We set up the foundation of our Agentic RAG system. We define a mock LLM to simulate decision-making, create a retrieval strategy enum, and design a Document dataclass so we can structure and manage our knowledge base efficiently. Check out the FULL CODES here.

class AgenticRAGSystem:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(model_name)
        self.llm = MockLLM()
        self.documents: List[Document] = []
        self.index: Optional[faiss.Index] = None

    def add_documents(self, documents: List[Dict[str, Any]]) -> None:
        print(f"Processing {len(documents)} documents...")
        for i, doc in enumerate(documents):
            doc_obj = Document(
                id=doc.get('id', str(i)),
                content=doc['content'],
                metadata=doc.get('metadata', {})
            )
            self.documents.append(doc_obj)
        contents = [doc.content for doc in self.documents]
        embeddings = self.encoder.encode(contents, show_progress_bar=True)
        for doc, embedding in zip(self.documents, embeddings):
            doc.embedding = embedding
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)  # inner product on normalized vectors = cosine similarity
        faiss.normalize_L2(embeddings)
        self.index.add(embeddings.astype('float32'))
        print(f"Knowledge base built with {len(self.documents)} documents")

We build the core of our Agentic RAG system. We initialize the embedding model, set up the FAISS index, and add documents by encoding their contents into vectors, enabling fast and accurate semantic retrieval from our knowledge base. Check out the FULL CODES here.

# The following methods belong to the AgenticRAGSystem class defined above.
def decide_retrieval(self, query: str) -> bool:
    # Note: the wording below avoids the mock's trigger words ("specific",
    # "recent", "data", "facts"), so the decision depends on the query itself.
    decision_prompt = f"""
    Analyze the following query and decide whether to retrieve information:

    Query: "{query}"

    Decide whether to retrieve information from the knowledge base.
    Consider whether it needs concrete information or can be answered generally.

    Respond with either:
    RETRIEVE: [reason]
    or
    NO_RETRIEVE: [reason]
    """
    response = self.llm.generate(decision_prompt)
    should_retrieve = response.startswith("RETRIEVE:")
    print(f"Agent Decision: {'Retrieve' if should_retrieve else 'Direct Answer'}")
    print(f"Reasoning: {response.split(':', 1)[1].strip() if ':' in response else response}")
    return should_retrieve


def choose_strategy(self, query: str) -> RetrievalStrategy:
    # The option descriptions avoid the mock's trigger words ("comparison",
    # "recent", "latest"), so the chosen strategy reflects the query.
    strategy_prompt = f"""
    Choose the best retrieval strategy for this query:

    Query: "{query}"

    Available strategies:
    - semantic: Standard similarity search
    - multi_query: Break the query into several related sub-queries
    - temporal: Prioritize newer documents
    - hybrid: Combination approach

    Choose retrieval strategy and explain why.
    Respond with: STRATEGY: [strategy_name] - [reasoning]
    """
    response = self.llm.generate(strategy_prompt)
    if "multi_query" in response.lower():
        strategy = RetrievalStrategy.MULTI_QUERY
    elif "temporal" in response.lower():
        strategy = RetrievalStrategy.TEMPORAL
    elif "hybrid" in response.lower():
        strategy = RetrievalStrategy.HYBRID
    else:
        strategy = RetrievalStrategy.SEMANTIC
    print(f"Retrieval Strategy: {strategy.value}")
    print(f"Reasoning: {response.split('-', 1)[1].strip() if '-' in response else response}")
    return strategy

We give our agent the ability to think before it fetches. We first determine whether a query truly requires retrieval, then we select the most suitable strategy: semantic, multi-query, temporal, or hybrid. This allows us to target the correct context, with clear, printed reasoning for each step. Check out the FULL CODES here.
def retrieve_documents(self, query: str, strategy: RetrievalStrategy, k: int = 3) -> List[Document]:
    if not self.index:
        print("No knowledge base available")
        return []
    if strategy == RetrievalStrategy.MULTI_QUERY:
        # Expand the query, search each variant, then deduplicate by document id.
        queries = [query, f"advantages of {query}", f"disadvantages of {query}"]
        all_docs = []
        for q in queries:
            docs = self._semantic_search(q, k=2)
            all_docs.extend(docs)
        seen_ids = set()
        unique_docs = []
        for doc in all_docs:
            if doc.id not in seen_ids:
                unique_docs.append(doc)
                seen_ids.add(doc.id)
        return unique_docs[:k]
    elif strategy == RetrievalStrategy.TEMPORAL:
        # Over-retrieve, then re-rank by the 'date' metadata field (newest first).
        docs = self._semantic_search(query, k=k * 2)
        docs_with_dates = [(doc, doc.metadata.get('date', '1900-01-01')) for doc in docs]
        docs_with_dates.sort(key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in docs_with_dates[:k]]
    else:
        return self._semantic_search(query, k=k)


def _semantic_search(self, query: str, k: int) -> List[Document]:
    query_embedding = self.encoder.encode([query])
    faiss.normalize_L2(query_embedding)
    scores, indices = self.index.search(query_embedding.astype('float32'), k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if 0 <= idx < len(self.documents):  # FAISS returns -1 when fewer than k hits exist
            results.append(self.documents[idx])
    return results


def synthesize_response(self, query: str, retrieved_docs: List[Document]) -> str:
    if not retrieved_docs:
        return self.llm.generate(f"Answer this query: {query}")
    context = "\n\n".join([f"Document {i+1}: {doc.content}" for i, doc in enumerate(retrieved_docs)])
    synthesis_prompt = f"""
    Query: {query}

    Context:
    {context}

    Synthesize a comprehensive answer using the provided context.
    Be specific and reference the information sources when relevant.
    """
    return self.llm.generate(synthesis_prompt, max_tokens=200)

We implement how we actually fetch and use knowledge. We perform semantic search, branch into multi-query expansion or temporal re-ranking when needed, deduplicate results, and then synthesize a focused answer from the retrieved context. In doing so, we keep retrieval efficient, transparent, and tightly aligned with the query. Check out the FULL CODES here.

def query(self, query: str) -> str:
    print(f"\nProcessing Query: '{query}'")
    print("=" * 50)
    if not self.decide_retrieval(query):
        print("\nGenerating direct response...")
        return self.llm.generate(f"Answer this query: {query}")
    strategy = self.choose_strategy(query)
    print(f"\nRetrieving documents using {strategy.value} strategy...")
    retrieved_docs = self.retrieve_documents(query, strategy)
    print(f"Retrieved {len(retrieved_docs)} documents")
    # The original snippet is cut off at this point; the natural completion is to
    # synthesize the final answer from the retrieved documents and return it.
    return self.synthesize_response(query, retrieved_docs)
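The article's snippet stops before showing the system in action. The following is a minimal, hypothetical usage sketch of the classes above; the sample documents and queries are placeholders introduced for illustration, not the tutorial's own demo data.

# Hypothetical usage sketch; sample documents and queries are illustrative placeholders.
if __name__ == "__main__":
    rag = AgenticRAGSystem()

    sample_docs = [
        {"id": "1",
         "content": "FAISS is a library for efficient similarity search over dense vectors.",
         "metadata": {"date": "2023-05-01", "topic": "vector search"}},
        {"id": "2",
         "content": "Sentence-transformers models such as all-MiniLM-L6-v2 map sentences to 384-dimensional embeddings.",
         "metadata": {"date": "2024-01-15", "topic": "embeddings"}},
        {"id": "3",
         "content": "Agentic RAG systems decide when to retrieve and which strategy to use before answering.",
         "metadata": {"date": "2024-06-10", "topic": "rag"}},
    ]
    rag.add_documents(sample_docs)

    # The agent decides whether to retrieve, picks a strategy, then synthesizes an answer.
    print(rag.query("What specific library handles the vector index in this system?"))
    print(rag.query("Why is it useful to explain your reasoning?"))  # likely answered directly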


AI, Committee, News, Uncategorized

End-to-End Aspect-Guided Review Summarization at Scale

arXiv:2509.26103v1 Announce Type: new Abstract: We present a scalable large language model (LLM)-based system that combines aspect-based sentiment analysis (ABSA) with guided summarization to generate concise and interpretable product review summaries for the Wayfair platform. Our approach first extracts and consolidates aspect-sentiment pairs from individual reviews, selects the most frequent aspects for each product, and samples representative reviews accordingly. These are used to construct structured prompts that guide the LLM to produce summaries grounded in actual customer feedback. We demonstrate the real-world effectiveness of our system through a large-scale online A/B test. Furthermore, we describe our real-time deployment strategy and release a dataset of 11.8 million anonymized customer reviews covering 92,000 products, including extracted aspects and generated summaries, to support future research in aspect-guided review summarization.
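To make the described pipeline concrete (consolidate aspect-sentiment pairs, keep the most frequent aspects, sample representative reviews, build a guided prompt), here is a minimal sketch; the function name, data format, and prompt wording are assumptions for illustration, not Wayfair's production system.

from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def build_guided_prompt(
    reviews: List[str],
    aspect_sentiments: List[List[Tuple[str, str]]],  # per-review (aspect, sentiment) pairs
    top_k_aspects: int = 5,
    reviews_per_aspect: int = 3,
) -> str:
    """Construct a structured summarization prompt grounded in frequent aspects (illustrative)."""
    # 1) Consolidate aspects across reviews and keep the most frequent ones.
    counts = Counter(a for pairs in aspect_sentiments for a, _ in pairs)
    top_aspects = [a for a, _ in counts.most_common(top_k_aspects)]

    # 2) Sample representative reviews for each selected aspect.
    by_aspect: Dict[str, List[str]] = defaultdict(list)
    for review, pairs in zip(reviews, aspect_sentiments):
        for aspect, _ in pairs:
            if aspect in top_aspects and len(by_aspect[aspect]) < reviews_per_aspect:
                by_aspect[aspect].append(review)

    # 3) Build the guided prompt for the LLM.
    sections = [f"Aspect: {a}\n" + "\n".join(f"- {r}" for r in by_aspect[a]) for a in top_aspects]
    return ("Summarize the product reviews below, covering each listed aspect "
            "and reflecting the sentiment customers express.\n\n" + "\n\n".join(sections))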


AI, Committee, News, Uncategorized

Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation

arXiv:2505.21956v3 Announce Type: replace-cross Abstract: Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture, necessitating the integration of retrieval methods. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy – combining a sub-dimensional sparse retriever with a dense retriever – to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in retrieval and further improves generation quality, while maintaining high efficiency.
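As a simplified illustration of the "Pareto-optimal set of images" idea, the sketch below scores each candidate image against every subquery and keeps the non-dominated candidates; the scoring matrix and helper are assumptions, not the paper's implementation.

import numpy as np

def pareto_optimal_images(scores: np.ndarray) -> list:
    """Return indices of non-dominated candidates.

    scores[i, j] = relevance of image i to subquery j (e.g., a mix of sparse and
    dense retriever scores). Image i is dominated if some other image is at least
    as good on every subquery and strictly better on at least one.
    """
    n = scores.shape[0]
    keep = []
    for i in range(n):
        dominated = any(
            np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
            for j in range(n) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Example: image 0 covers subquery A, image 1 covers subquery B, image 2 is dominated.
print(pareto_optimal_images(np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]])))  # -> [0, 1]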


AI, Committee, News, Uncategorized

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

arXiv:2509.15235v4 Announce Type: replace-cross Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (
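For background only, here is a simplified greedy speculative-decoding loop (a small draft model proposes tokens, the target model verifies them); it is generic, not ViSpec's vision-aware method, and draft_next and target_next are assumed stand-ins for real model calls.

from typing import Callable, List

def speculative_decode(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # assumed: greedy next token from a small draft model
    target_next: Callable[[List[int]], int],  # assumed: greedy next token from the large target model
    num_draft: int = 4,
    max_new: int = 32,
) -> List[int]:
    """Greedy speculative decoding sketch: accept draft tokens while the target agrees."""
    tokens = list(prefix)
    while len(tokens) < len(prefix) + max_new:  # may slightly overshoot max_new; fine for a sketch
        # 1) Draft model cheaply proposes a short block of tokens.
        proposal, ctx = [], list(tokens)
        for _ in range(num_draft):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Verify against the target; real systems verify the whole block in one
        #    batched forward pass, which is where the speedup comes from.
        for t in proposal:
            verified = target_next(tokens)
            tokens.append(verified)
            if verified != t:
                break  # reject the rest of the draft block at the first disagreement
    return tokens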


AI, Committee, News, Uncategorized

Can Large Language Models Express Uncertainty Like Human?

arXiv:2509.24202v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we (1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and (2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we (3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we (4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction.
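As a toy illustration of a hedge-to-confidence mapper: the paper's mapper is learned from human-annotated data, whereas the lexicon and scores below are invented purely for demonstration.

import re

# Illustrative hedge lexicon with made-up confidence scores in [0, 1];
# the paper's mapper is trained rather than hand-specified like this.
HEDGE_SCORES = {
    "definitely": 0.95, "certainly": 0.95, "probably": 0.75,
    "likely": 0.70, "might": 0.40, "possibly": 0.35, "unsure": 0.20,
}

def linguistic_confidence(answer: str, default: float = 0.60) -> float:
    """Map hedging language in an answer to a rough confidence score."""
    hits = [s for w, s in HEDGE_SCORES.items()
            if re.search(rf"\b{re.escape(w)}\b", answer.lower())]
    return min(hits) if hits else default  # the weakest hedge dominates

print(linguistic_confidence("It is probably Paris, but I might be wrong."))  # -> 0.4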


AI, Committee, News, Uncategorized

Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey

arXiv:2509.24322v1 Announce Type: new Abstract: In recent years, large language models (LLMs) have driven major advances in language understanding, marking a significant step toward artificial general intelligence (AGI). With increasing demands for higher-level semantics and cross-modal fusion, multimodal large language models (MLLMs) have emerged, integrating diverse information sources (e.g., text, vision, and audio) to enhance modeling and reasoning in complex scenarios. In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier. While LLMs and MLLMs have achieved notable progress in this area, the field still lacks a systematic review that consolidates recent developments. To address this gap, this paper provides a comprehensive survey of LLMs and MLLMs for emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks. We further highlight key challenges and outline future research directions, aiming to offer researchers both an authoritative reference and practical insights for advancing this domain. To the best of our knowledge, this paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning. A summary of the methods mentioned is available on our GitHub: https://github.com/yuntaoshou/Awesome-Emotion-Reasoning.


AI, Committee, News, Uncategorized

Latent Visual Reasoning

arXiv:2509.24251v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code base and model weights will be released later.
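A rough sketch of the training signal the abstract implies: a standard text cross-entropy term plus a reconstruction term pulling generated latent states toward the key visual tokens; tensor shapes and the loss weighting are assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def lvr_loss(text_logits: torch.Tensor,          # (B, T, V) language-model logits
             text_labels: torch.Tensor,          # (B, T) target token ids, -100 = ignore
             latent_states: torch.Tensor,        # (B, L, D) latents generated at reasoning positions
             target_visual_tokens: torch.Tensor,  # (B, L, D) key visual tokens to reconstruct
             recon_weight: float = 1.0) -> torch.Tensor:
    """Text cross-entropy plus latent reconstruction (illustrative weighting)."""
    ce = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(), ignore_index=-100)
    # Cosine-based reconstruction: push each generated latent toward its target visual token.
    recon = 1.0 - F.cosine_similarity(latent_states, target_visual_tokens, dim=-1).mean()
    return ce + recon_weight * recon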


AI, Committee, News, Uncategorized

MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark

arXiv:2509.22461v1 Announce Type: cross Abstract: The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios where multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research. Code and benchmark can be found at https://github.com/luckyerr/MDAR.
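A small sketch of how per-question-type accuracy on a benchmark of this kind could be tallied; the record fields (question_type, prediction, answer) are assumed, not MDAR's actual schema, and open-ended answers would need a softer match in practice.

from collections import defaultdict
from typing import Dict, List

def accuracy_by_question_type(results: List[Dict[str, str]]) -> Dict[str, float]:
    """Tally exact-match accuracy per question type (illustrative record format)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in results:
        qtype = r["question_type"]
        total[qtype] += 1
        correct[qtype] += int(r["prediction"].strip().lower() == r["answer"].strip().lower())
    return {qtype: correct[qtype] / total[qtype] for qtype in total}

print(accuracy_by_question_type([
    {"question_type": "single_choice", "prediction": "B", "answer": "B"},
    {"question_type": "open_ended", "prediction": "a dog barking", "answer": "a dog barking"},
    {"question_type": "single_choice", "prediction": "A", "answer": "C"},
]))  # -> {'single_choice': 0.5, 'open_ended': 1.0}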


AI, Committee, News, Uncategorized

Context Parametrization with Compositional Adapters

arXiv:2509.22158v1 Announce Type: new Abstract: Large language models (LLMs) often seamlessly adapt to new tasks through in-context learning (ICL) or supervised fine-tuning (SFT). However, both of these approaches face key limitations: ICL is inefficient when handling many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Mapping instructions or demonstrations from context directly into adapter parameters offers an appealing alternative. While prior work explored generating adapters based on a single input context, it has overlooked the need to integrate multiple chunks of information. To address this gap, we introduce CompAs, a meta-learning framework that translates context into adapter parameters with a compositional structure. Adapters generated this way can be merged algebraically, enabling instructions, demonstrations, or retrieved passages to be seamlessly combined without reprocessing long prompts. Critically, this approach yields three benefits: lower inference cost, robustness to long-context instability, and a principled solution when input exceeds the model’s context window. Furthermore, CompAs encodes information into adapter parameters in a reversible manner, enabling recovery of input context through a decoder, facilitating safety and security. Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAs outperforms ICL and prior generator-based methods, especially when scaling to more inputs. Our work establishes composable adapter generation as a practical and efficient alternative for scaling LLM deployment.
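To illustrate what "merged algebraically" can mean, the sketch below averages LoRA-style low-rank adapter deltas generated from several context chunks; CompAs's actual generator and composition rule may differ, so treat this as an assumption-laden analogy.

import numpy as np
from typing import List, Tuple

def merge_lora_adapters(adapters: List[Tuple[np.ndarray, np.ndarray]]) -> np.ndarray:
    """Average the weight deltas of several low-rank adapters.

    Each adapter is (A, B) with A: (d_out, r) and B: (r, d_in), contributing
    delta_W = A @ B. Averaging the deltas is one simple algebraic composition rule.
    """
    deltas = [A @ B for A, B in adapters]
    return np.mean(deltas, axis=0)

# Hypothetical example: adapters generated from two context chunks, merged into one update.
rng = np.random.default_rng(0)
chunk_adapters = [(rng.normal(size=(8, 2)), rng.normal(size=(2, 16))) for _ in range(2)]
merged_delta = merge_lora_adapters(chunk_adapters)  # shape (8, 16), applied as W + merged_delta
print(merged_delta.shape)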


AI, Committee, News, Uncategorized

Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

arXiv:2509.22646v1 Announce Type: cross Abstract: Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension — whether humans can detect deepfake traces within a generated video, i.e., spatiotemporal grounded visual artifacts that reveal a video as machine generated — has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially- and temporally-aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
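A simple data-structure sketch for the kind of annotation described (natural-language explanation, bounding box, onset and offset timestamps); the field names and example values are illustrative guesses, not the released schema.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class FakeTraceAnnotation:
    """One human-annotated deepfake trace in a generated video (illustrative schema)."""
    video_id: str
    explanation: str                          # natural-language reason the clip looks generated
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    onset_s: float                            # when the artifact appears (seconds)
    offset_s: float                           # when it disappears (seconds)
    category: str                             # one of the major trace categories, e.g. "hand deformation"

trace = FakeTraceAnnotation(
    video_id="vid_0001",
    explanation="The subject's fingers merge together while waving.",
    bbox=(0.42, 0.55, 0.61, 0.80),
    onset_s=1.8,
    offset_s=2.6,
    category="hand deformation",
)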

