YouZum


KL-based self-distillation for large language models

arXiv:2508.15807v1 Announce Type: new Abstract: Large pre-trained language models often struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora. In this work, we address the challenge of vocabulary expansion in frozen LLMs by introducing a mathematically grounded method for knowledge distillation via KL divergence, even when the original and extended models use different tokenizations. This allows the student model to inherit distributional knowledge from the teacher despite differing vocabularies. We compare our KL-based distillation approach to conventional cross-entropy training, evaluating both methods across multiple strategies for initializing new token embeddings. After embedding initialization, models are further fine-tuned to integrate the new vocabulary. Each trained model is benchmarked on approximately 2000 code-generation tasks, where our approach achieves the best performance across the board. Finally, through mechanistic interpretability, we analyze how models learn representations for the new tokens, providing an explanation for the observed gains and offering insight into the structure of embedding space during vocabulary expansion.
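The abstract above does not spell out the objective, but the standard knowledge-distillation KL loss it builds on, for the simpler case where teacher and student share a vocabulary, can be sketched as follows. This is a generic illustration only; the cross-tokenizer alignment that is the paper's actual contribution is not reproduced here.

import torch
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Generic KL(teacher || student) distillation loss over a shared vocabulary.

    student_logits, teacher_logits: tensors of shape (batch, seq_len, vocab_size).
    Note: the paper extends this idea to models with *different* tokenizations,
    which requires aligning distributions across vocabularies and is not shown here.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2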


Large Language Models (LLMs) vs. Small Language Models (SLMs) for Financial Institutions: A 2025 Practical Enterprise AI Guide

Table of contents
1. Regulatory and Risk Posture
2. Capability vs. Cost, Latency, and Footprint
3. Security and Compliance Trade-offs
4. Deployment Patterns
5. Decision Matrix (Quick Reference)
6. Concrete Use-Cases
7. Performance/Cost Levers Before "Going Bigger"
Examples

No single solution universally wins between Large Language Models (LLMs, ≥30B parameters, often via APIs) and Small Language Models (SLMs, ~1–15B, typically open-weights or proprietary specialist models). For banks, insurers, and asset managers in 2025, your selection should be governed by regulatory risk, data sensitivity, latency and cost requirements, and the complexity of the use case. An SLM-first approach is recommended for structured information extraction, customer service, coding assistance, and internal knowledge tasks, especially with retrieval-augmented generation (RAG) and strong guardrails. Escalate to LLMs for heavy synthesis, multi-step reasoning, or when SLMs cannot meet your performance bar within the latency/cost envelope. Governance is mandatory for both: treat LLMs and SLMs under your model risk management (MRM) framework, align to the NIST AI RMF, and map high-risk applications (such as credit scoring) to obligations under the EU AI Act.

1. Regulatory and Risk Posture

Financial services are subject to mature model governance standards. In the US, Federal Reserve/OCC/FDIC SR 11-7 covers any model used for business decisioning, including LLMs and SLMs. This means required validation, monitoring, and documentation—irrespective of model size. The NIST AI Risk Management Framework (AI RMF 1.0) is the gold standard for AI risk controls, now widely adopted by financial institutions for both traditional and generative AI risks. In the EU, the AI Act is in force, with staged compliance dates (August 2025 for general-purpose models, August 2026 for high-risk systems such as credit scoring per Annex III). High-risk means pre-market conformity, risk management, documentation, logging, and human oversight. Institutions targeting the EU must align remediation timelines accordingly.

Core sectoral data rules apply:
- GLBA Safeguards Rule: security controls and vendor oversight for consumer financial data.
- PCI DSS v4.0: new cardholder data controls—mandatory from March 31, 2025, with upgraded authentication, retention, and encryption.

Supervisors (FSB/BIS/ECB) and standard setters highlight systemic risk from concentration, vendor lock-in, and model risk—neutral to model size.

Key point: High-risk uses (credit, underwriting) require tight controls regardless of parameter count. Both SLMs and LLMs demand traceable validation, privacy assurance, and sector compliance.

2. Capability vs. Cost, Latency, and Footprint

SLMs (3–15B) now deliver strong accuracy on domain workloads, especially after fine-tuning and with retrieval augmentation. Recent SLMs (e.g., Phi-3, FinBERT, COiN) excel at targeted extraction, classification, and workflow augmentation, cut latency (<50 ms), allow self-hosting for strict data residency, and are feasible for edge deployment. LLMs unlock cross-document synthesis, heterogeneous data reasoning, and long-context operations (>100K tokens). Domain-specialized LLMs (e.g., BloombergGPT, 50B) outperform general models on financial benchmarks and multi-step reasoning tasks.

Compute economics: Transformer self-attention scales quadratically with sequence length. FlashAttention/SlimAttention optimizations reduce compute costs but do not defeat the quadratic lower bound; long-context LLM inference can therefore be far costlier than short-context SLM inference.

Key point: Short, structured, latency-sensitive tasks (contact center, claims, KYC extraction, knowledge search) fit SLMs. If you need 100K+ token contexts or deep synthesis, budget for LLMs and mitigate cost via caching and selective escalation.

3. Security and Compliance Trade-offs

- Common risks: both model types are exposed to prompt injection, insecure output handling, data leakage, and supply chain risks.
- SLMs: preferred for self-hosting—satisfying GLBA/PCI/data sovereignty concerns and minimizing legal risks from cross-border transfers.
- LLMs: APIs introduce concentration and lock-in risks; supervisors require documented exit, fallback, and multi-vendor strategies.
- Explainability: high-risk uses require transparent features, challenger models, full decision logs, and human oversight; LLM reasoning traces cannot substitute for the formal validation required by SR 11-7 / the EU AI Act.

4. Deployment Patterns

Three proven modes in finance:
- SLM-first, LLM fallback: route 80%+ of queries to a tuned SLM with RAG; escalate low-confidence or long-context cases to an LLM. Predictable cost/latency; good for call centers, operations, and form parsing.
- LLM-primary with tool use: the LLM acts as an orchestrator for synthesis, backed by deterministic tools for data access and calculations, and protected by DLP. Suited for complex research and policy/regulatory work.
- Domain-specialized LLM: large models adapted to financial corpora; higher MRM burden but measurable gains for niche tasks.

Regardless of the pattern, always implement content filters, PII redaction, least-privilege connectors, output verification, red-teaming, and continuous monitoring under NIST AI RMF and OWASP guidance.

5. Decision Matrix (Quick Reference)

Criterion | Prefer SLM | Prefer LLM
Regulatory exposure | Internal assist, non-decisioning | High-risk use (credit scoring) w/ full validation
Data sensitivity | On-prem/VPC, PCI/GLBA constraints | External API with DLP, encryption, DPAs
Latency & cost | Sub-second, high QPS, cost-sensitive | Seconds-latency, batch, low QPS
Complexity | Extraction, routing, RAG-aided draft | Synthesis, ambiguous input, long-form context
Engineering ops | Self-hosted, CUDA, integration | Managed API, vendor risk, rapid deployment

6. Concrete Use-Cases

- Customer service: SLM-first with RAG/tools for common issues, LLM escalation for complex multi-policy queries.
- KYC/AML & adverse media: SLMs suffice for extraction/normalization; escalate to LLMs for fraud or multilingual synthesis.
- Credit underwriting: high-risk (EU AI Act Annex III); use SLM/classical ML for decisioning, LLMs for explanatory narratives, always with human review.
- Research/portfolio notes: LLMs enable draft synthesis and cross-source collation; read-only access, citation logging, and tool verification recommended.
- Developer productivity: on-prem SLM code assistants for speed/IP safety; LLM escalation for refactoring or complex synthesis.

7. Performance/Cost Levers Before "Going Bigger"

- RAG optimization: most failures are retrieval failures, not "model IQ." Improve chunking, recency, and relevance ranking before increasing model size.
- Prompt/IO controls: guardrails for input/output schema and anti-prompt-injection per OWASP.
- Serve-time: quantize SLMs, page the KV cache, batch/stream, and cache frequent answers; quadratic attention inflates indiscriminate long contexts.
- Selective escalation: route by confidence; >70% cost savings are possible (a routing sketch appears after the example below).
- Domain adaptation: lightweight tuning/LoRA on SLMs closes most gaps; use large models only for a clear, measurable lift in performance.

Examples

Example 1: Contract Intelligence at JPMorgan (COiN)
JPMorgan Chase deployed a specialized Small Language Model (SLM), called COiN, to automate the review of commercial loan agreements—a process traditionally handled manually by legal staff. By training COiN on thousands of legal documents and regulatory filings, the bank slashed contract review times from several weeks to mere hours, achieving high accuracy and compliance traceability while drastically reducing operational cost. This targeted SLM solution
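The SLM-first, LLM-fallback pattern from Section 4 and the confidence-based escalation lever from Section 7 can be combined into a simple router. The sketch below is illustrative only: the slm/llm objects, the confidence heuristic, and both thresholds are assumptions, not details from the guide.

# Illustrative sketch of the "SLM-first, LLM fallback" routing pattern.
# Model objects, the confidence heuristic, and thresholds are hypothetical placeholders.

LONG_CONTEXT_TOKENS = 8_000     # assumed escalation threshold for long contexts
CONFIDENCE_FLOOR = 0.7          # assumed minimum SLM confidence before escalation

def route_query(query: str, context_tokens: int, slm, llm) -> str:
    """Route most traffic to a tuned SLM; escalate long or low-confidence cases."""
    if context_tokens > LONG_CONTEXT_TOKENS:
        return llm.generate(query)          # long-context synthesis goes straight to the LLM
    answer, confidence = slm.generate_with_confidence(query)
    if confidence >= CONFIDENCE_FLOOR:
        return answer                       # the bulk of queries should terminate here
    return llm.generate(query)              # low-confidence escalation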


Prefix-RFT: A Unified Machine Learning Framework to blend Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT)

Large language models are typically refined after pretraining using either supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), each with distinct strengths and limitations. SFT is effective in teaching instruction-following through example-based learning, but it can lead to rigid behavior and poor generalization. RFT, on the other hand, optimizes models for task success using reward signals, which can improve performance but also introduce instability and reliance on a strong starting policy. While these methods are often used sequentially, their interaction remains poorly understood. This raises an important question: how can we design a unified framework that combines SFT's structure with RFT's goal-driven learning?

Research at the intersection of reinforcement learning (RL) and LLM post-training has gained momentum, particularly for training reasoning-capable models. Offline RL, which learns from fixed datasets, often yields suboptimal policies due to the limited diversity of the data. This has sparked interest in combining offline and online RL approaches to improve performance. In LLMs, the dominant strategy is to first apply SFT to teach desirable behaviors, then use RFT to optimize outcomes. However, the dynamics between SFT and RFT are still not well understood, and finding effective ways to integrate them remains an open research challenge.

Researchers from the University of Edinburgh, Fudan University, Alibaba Group, Stepfun, and the University of Amsterdam propose a unified framework, called Prefix-RFT, that combines supervised and reinforcement fine-tuning. The method guides exploration using partial demonstrations, allowing the model to continue generating solutions with flexibility and adaptability. Tested on math reasoning tasks, Prefix-RFT consistently outperforms standalone SFT, RFT, and mixed-policy methods. It integrates easily into existing frameworks and proves robust to changes in demonstration quality and quantity. Blending demonstration-based learning with exploration can lead to more effective and adaptive training of large language models.

Source: https://arxiv.org/abs/2507.01679

The study presents Prefix Reinforcement Fine-Tuning (Prefix-RFT) as a way to blend the strengths of SFT and RFT. While SFT offers stability by mimicking expert demonstrations, RFT encourages exploration through the use of reward signals. Prefix-RFT bridges the two by using a partial demonstration (a prefix) and letting the model generate the rest. This approach guides learning without relying too heavily on full supervision. It incorporates techniques such as entropy-based clipping and a cosine decay scheduler to ensure stable training and efficient learning. Compared to prior methods, Prefix-RFT offers a more balanced and adaptive fine-tuning strategy.

Prefix-RFT improves performance using high-quality offline math datasets, such as OpenR1-Math-220K (46k filtered problems). Tested on Qwen2.5-Math-7B, 1.5B, and LLaMA-3.1-8B, it was evaluated on benchmarks including AIME 2024/25, AMC, MATH500, Minerva, and OlympiadBench. Prefix-RFT achieved the highest avg@32 and pass@1 scores across tasks, outperforming RFT, SFT, ReLIFT, and LUFFY. Using Dr. GRPO, it updated only the top 20% high-entropy prefix tokens, with the prefix length decaying from 95% to 5%. It maintained an intermediate SFT loss, indicating a strong balance between imitation and exploration, especially on difficult problems (Trainhard).
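The cosine decay scheduler mentioned above shrinks the demonstration prefix from 95% to 5% of its length over training. A minimal reconstruction of such a schedule is sketched below; the exact functional form and hyperparameters used by the authors may differ.

import math

def prefix_ratio(step: int, total_steps: int, start: float = 0.95, end: float = 0.05) -> float:
    """Cosine-decay schedule for the demonstration-prefix length ratio.

    Decays from `start` (95% of the demonstration kept as prefix) to `end` (5%)
    over training, as described for Prefix-RFT. Illustrative reconstruction only.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))

# Example: the prefix shrinks over training, so rollouts rely less on the demonstration.
demo_tokens = 400
for step in (0, 500, 1000):
    keep = int(prefix_ratio(step, total_steps=1000) * demo_tokens)
    print(step, keep)  # 0 -> 380 tokens, 500 -> 200 tokens, 1000 -> 20 tokens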
In conclusion, Prefix-RFT combines the strengths of SFT and RFT by utilizing sampled demonstration prefixes to guide learning. Despite its simplicity, it consistently outperforms SFT, RFT, and hybrid baselines across various models and datasets. Even with only 1% of the training data (450 prompts), it maintains strong performance (avg@32 drops only from 40.8 to 37.6), showing efficiency and robustness. Its top-20% entropy-based token update strategy proves most effective, achieving the highest benchmark scores with shorter outputs. Moreover, using a cosine decay scheduler for prefix length enhances stability and learning dynamics compared to a uniform strategy, particularly on complex tasks such as AIME.

Check out the paper: https://arxiv.org/abs/2507.01679. The post Prefix-RFT: A Unified Machine Learning Framework to blend Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) appeared first on MarkTechPost.


JSON Prompting for LLMs: A Practical Guide with Python Coding Examples

JSON Prompting is a technique for structuring instructions to AI models using the JavaScript Object Notation (JSON) format, making prompts clear, explicit, and machine-readable. Unlike traditional text-based prompts, which can leave room for ambiguity and misinterpretation, JSON prompts organize requirements as key-value pairs, arrays, and nested objects, turning vague requests into precise blueprints for the model to follow. This method greatly improves consistency and accuracy—especially for complex or repetitive tasks—by allowing users to specify things like task type, topic, audience, output format, and other parameters in an organized way that language models inherently understand. As AI systems increasingly rely on predictable, structured input for real-world workflows, JSON prompting has become a preferred strategy for generating sharper, more reliable results across major LLMs, including GPT-4, Claude, and Gemini.

In this tutorial, we dive into the power of JSON prompting and why it can transform the way you interact with AI models. We walk through its benefits with coding examples—from simple text prompts to structured JSON prompts—and compare their outputs. By the end, you will see how structured prompts bring precision, consistency, and scalability to your workflows, whether you are generating summaries, extracting data, or building advanced AI pipelines.

Installing the dependencies

pip install openai

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')

To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you're a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

from openai import OpenAI

client = OpenAI()

Structured Prompts Ensure Consistency

Using structured prompts, such as JSON-based formats, forces you to think in terms of fields and values—a true advantage when working with LLMs. By defining a fixed structure, you eliminate ambiguity and guesswork, ensuring that every response follows a predictable pattern. Here's a simple example:

Summarize the following email and list the action items clearly.
Email: Hi team, let's finalize the marketing plan by Tuesday. Alice, prepare the draft; Bob, handle the design.

We'll feed this prompt to the LLM in two ways and then compare the outputs generated by a free-form prompt versus a structured (JSON-based) prompt to observe the difference in clarity and consistency.

Free-Form Prompt

prompt_text = """
Summarize the following email and list the action items clearly.

Email:
Hi team, let's finalize the marketing plan by Tuesday. Alice, prepare the draft; Bob, handle the design.
"""

response_text = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": prompt_text}]
)

text_output = response_text.choices[0].message.content
print(text_output)

Output:

Summary: The team needs to finalize the marketing plan by Tuesday. Alice will prepare the draft, and Bob will handle the design.

Action items:
- Alice: Prepare the draft of the marketing plan by Tuesday.
- Bob: Handle the design by Tuesday.
- Team: Finalize the marketing plan by Tuesday.

JSON Prompt

prompt_json = """
Summarize the following email and return the output strictly in JSON format:

{
  "summary": "short summary of the email",
  "action_items": ["task 1", "task 2", "task 3"],
  "priority": "low | medium | high"
}

Email:
Hi team, let's finalize the marketing plan by Tuesday. Alice, prepare the draft; Bob, handle the design.
"""

response_json = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a precise assistant that always replies in valid JSON."},
        {"role": "user", "content": prompt_json}
    ]
)

json_output = response_json.choices[0].message.content
print(json_output)

Output:

{
  "summary": "Finalize the marketing plan by Tuesday; Alice to draft and Bob to handle design.",
  "action_items": [
    "Alice: prepare the draft",
    "Bob: handle the design",
    "Team: finalize the marketing plan by Tuesday"
  ],
  "priority": "medium"
}

In this example, the structured JSON prompt leads to a clear and concise output that is easy to parse and evaluate. By defining fields such as "summary", "action_items", and "priority", the LLM response becomes more consistent and actionable. Instead of generating free-flowing text, which might vary in style and detail, the model provides a predictable structure that eliminates ambiguity. This approach not only improves the readability and reliability of responses but also makes it easier to integrate the output into downstream workflows, such as project trackers, dashboards, or automated email handlers.

Users Can Control the Output

When you frame your prompt in JSON, you remove ambiguity from both the instruction and the output. In this example, asking for a market summary, sentiment, opportunities, risks, and a confidence score can yield inconsistent formats when passed as plain text. However, by structuring the request in JSON—with clearly defined fields like "summary", "sentiment", "opportunities", "risks", and "confidence_score"—the response becomes predictable, machine-friendly, and easier to parse. This consistency ensures that, whether you're generating content, analyzing reports, or extracting insights, your workflow remains streamlined and reliable, with no surprises—just clean, structured results every time.

Free-Form Prompt

plain_text_prompt = """
Analyze the following market update:

Market Text:
Tesla's Q2 earnings beat expectations due to higher Model Y sales, but rising competition from BYD is a risk.
Apple reported steady revenue growth driven by iPhone sales, but services revenue slightly declined.
Amazon's AWS division continues to dominate cloud computing, though regulatory scrutiny in Europe is increasing.

Generate:
- A 2-line market summary
- Sentiment for each company (positive, negative, neutral)
- Key growth opportunities and risks
- A confidence score from 0 to 10
"""

response_plain = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": plain_text_prompt}]
)

plain_output = response_plain.choices[0].message.content
print(plain_output)

Output:

Market summary:
- Earnings updates skew constructive: Tesla beat on Q2 with strong Model Y, Apple grew on iPhone, and AWS remains the
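One practical payoff of the JSON-prompted path shown earlier is that its output can be consumed programmatically. A minimal sketch, assuming json_output holds the JSON reply from the email example above:

import json

parsed = json.loads(json_output)  # raises json.JSONDecodeError if the reply is not valid JSON
print(parsed["summary"])
for item in parsed["action_items"]:
    print("-", item)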


GPZ: A Next-Generation GPU-Accelerated Lossy Compressor for Large-Scale Particle Data

Particle-based simulations and point-cloud applications are driving a massive expansion in the size and complexity of scientific and commercial datasets, often reaching billions or trillions of discrete points. Efficiently reducing, storing, and analyzing this data without bottlenecking modern GPUs is one of the emerging grand challenges in fields like cosmology, geology, molecular dynamics, and 3D imaging. Recently, a team of researchers from Florida State University, the University of Iowa, Argonne National Laboratory, the University of Chicago, and several other institutions introduced GPZ, a GPU-optimized, error-bounded lossy compressor that radically improves throughput, compression ratio, and data fidelity for particle data—outperforming five state-of-the-art alternatives by wide margins.

Why Compress Particle Data? And Why Is It So Hard?

Particle (or point-cloud) data—unlike structured meshes—represents systems as irregular collections of discrete elements in multidimensional space. This format is essential for capturing complex physical phenomena, but it has low spatial and temporal coherence and almost no redundancy, making it a nightmare for classical lossless or generic lossy compressors. Consider:
- The Summit supercomputer generated a single cosmological simulation snapshot of 70 TB using Nvidia V100 GPUs.
- The USGS 3D Elevation Program's point clouds of U.S. terrain exceed 200 TB of storage.
- Traditional approaches—like downsampling or on-the-fly processing—throw away up to 90% of raw data or foreclose reproducibility through lack of storage.

Moreover, generic mesh-focused compressors exploit correlations that simply don't exist in particle data, yielding poor ratios and abysmal GPU throughput.

GPZ: Architecture and Innovations

GPZ comes equipped with a four-stage, parallel GPU pipeline—specially engineered for the quirks of particle data and the stringent demands of modern massively parallel hardware. (Source: https://arxiv.org/abs/2508.10305)

Pipeline stages:
- Spatial Quantization: particles' floating-point positions are mapped to integer segment IDs and offsets, respecting user-specified error bounds while leveraging fast FP32 operations for maximum GPU arithmetic throughput. Segment sizes are tuned for optimal GPU occupancy.
- Spatial Sorting: within each block (mapped to a CUDA warp), particles are sorted by their segment ID to enhance subsequent lossless coding, using warp-level operations to avoid costly synchronization. Block-level sorting balances compression ratio with shared-memory footprint for best parallelism.
- Lossless Encoding: parallel run-length and delta encoding strip redundancy from sorted segment IDs and quantized offsets. Bit-plane coding eliminates zero bits, with all steps heavily optimized for GPU memory access patterns.
- Compacting: compressed blocks are efficiently assembled into a contiguous output using a three-step device-level strategy that slashes synchronization overheads and maximizes memory throughput (809 GB/s on an RTX 4090, near the theoretical peak).

Decompression is the reverse—extract, decode, and reconstruct positions within error bounds, enabling high-fidelity post-hoc analysis. (Source: https://arxiv.org/abs/2508.10305)

Hardware-Aware Performance Optimizations

GPZ sets itself apart with a suite of hardware-centric optimizations:
- Memory coalescing: reads and writes are carefully aligned to 4-byte boundaries, maximizing DRAM bandwidth (up to 1.6x improvement over strided access).
- Register and shared memory management: algorithms are designed to keep occupancy high. Precision is dropped to FP32 where possible, and excessive register use is avoided to prevent spills.
- Compute scheduling: one-warp-per-block mapping, explicit use of CUDA intrinsics such as FMA operations, and loop unrolling where beneficial.
- Division/modulo elimination: slow division and modulo operations are replaced with precomputed reciprocals and bitwise masks where possible.

Benchmarking: GPZ vs. State-of-the-Art

GPZ was evaluated on six real-world datasets (from cosmology, geology, plasma physics, and molecular dynamics), spanning three GPU architectures: consumer (RTX 4090), data center (H100 SXM), and edge (Nvidia L4). Baselines included cuSZp2, PFPL, FZ-GPU, cuSZ, and cuSZ-i. Most of these tools, optimized for generic scientific meshes, failed or showed severe performance/quality drop-offs on particle datasets over 2 GB; GPZ remained robust throughout.

Results:
- Speed: GPZ delivered compression throughputs up to 8x higher than the next-best competitor. Average throughputs hit 169 GB/s (L4), 598 GB/s (RTX 4090), and 616 GB/s (H100). Decompression scales even higher.
- Compression ratio: GPZ consistently outperformed all baselines, yielding ratios as much as 600% higher in challenging settings. Even when runners-up edged slightly ahead, GPZ sustained a 3x-6x speed advantage.
- Data quality: rate-distortion plots confirmed superior preservation of scientific features (higher PSNR at lower bitrates), and visual inspection (especially in 10x magnified views) revealed GPZ's reconstructions to be nearly indistinguishable from the originals, whereas other compressors produced visible artifacts.

Key Takeaways & Implications

GPZ sets a new gold standard for real-time, large-scale particle data reduction on modern GPUs. Its design acknowledges the fundamental limits of generic compressors and delivers tailored solutions that exploit every ounce of GPU parallelism and precision tuning. For researchers and practitioners working with immense scientific datasets, GPZ offers:
- Robust error-bounded compression suited for in-situ and post-hoc analysis
- Practical throughput and ratios across consumer and HPC-class hardware
- Near-perfect reconstruction for downstream analytics, visualization, and modeling tasks

As data sizes continue to scale, solutions like GPZ will increasingly define the next era of GPU-oriented scientific computing and large-scale data management.

Check out the paper: https://arxiv.org/abs/2508.10305. The post GPZ: A Next-Generation GPU-Accelerated Lossy Compressor for Large-Scale Particle Data appeared first on MarkTechPost.
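To make the spatial quantization stage more concrete, here is a rough NumPy sketch of error-bounded quantization into segment IDs and offsets. It illustrates the general idea only; GPZ's actual CUDA implementation, segment sizing, and formulas may differ.

import numpy as np

def quantize_positions(positions: np.ndarray, error_bound: float, segment_size: int = 1024):
    """Illustrative error-bounded quantization of particle coordinates (cf. GPZ stage 1).

    Maps float coordinates to integer segment IDs plus small offsets so that the
    reconstruction error stays within `error_bound`. Not GPZ's CUDA kernel.
    """
    step = 2.0 * error_bound                       # one bin per 2*eps keeps error <= eps
    codes = np.floor(positions / step).astype(np.int64)
    segment_ids = codes // segment_size            # coarse spatial segment, used for sorting
    offsets = codes % segment_size                 # small offsets, friendlier to run-length/delta coding
    return segment_ids, offsets

def dequantize(segment_ids, offsets, error_bound: float, segment_size: int = 1024):
    """Reconstruct positions to within the error bound (bin centers)."""
    step = 2.0 * error_bound
    codes = segment_ids * segment_size + offsets
    return (codes.astype(np.float64) + 0.5) * step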


Huawei CloudMatrix: A Peer-to-Peer AI Datacenter Architecture for Scalable and Efficient LLM Serving

LLMs have rapidly advanced with soaring parameter counts, widespread use of mixture-of-experts (MoE) designs, and massive context lengths. Models like DeepSeek-R1, LLaMA-4, and Qwen-3 now reach trillions of parameters, demanding enormous compute, memory bandwidth, and fast inter-chip communication. MoE improves efficiency but creates challenges in expert routing, while context windows exceeding a million tokens strain attention and KV cache storage, which scales with concurrent users. In real-world deployments, unpredictable inputs, uneven expert activations, and bursty queries further complicate serving. Addressing these pressures requires a ground-up rethinking of AI infrastructure through hardware–software co-design, adaptive orchestration, and elastic resource management.

Recent progress in LLMs is shaped by three main trends: ever-growing parameter counts, sparse MoE architectures, and extended context windows. Models like Llama 4, DeepSeek-V3, and Google's PaLM push scale into the trillions of parameters, while MoE designs activate only subsets of experts per token, balancing efficiency with capacity. Meanwhile, context windows now span hundreds of thousands to millions of tokens, enabling long-form reasoning but straining compute and memory through large key-value caches. These advances place immense pressure on datacenters, demanding higher compute, memory, and bandwidth while introducing challenges in parallelism, workload heterogeneity, data convergence, and storage performance.

Huawei researchers introduced CloudMatrix, a new AI datacenter architecture designed to handle the rising demands of large-scale LLMs. Its first implementation, CloudMatrix384, combines 384 Ascend 910C NPUs and 192 Kunpeng CPUs, all linked by a high-bandwidth, low-latency Unified Bus that enables fully peer-to-peer communication. This design allows flexible pooling of compute, memory, and network resources, making it well suited to MoE parallelism and distributed KV cache access. On top of this, CloudMatrix-Infer offers an optimized serving framework with peer-to-peer resource pools, large-scale expert parallelism, and hardware-aware optimizations such as pipelining and INT8 quantization. Evaluations with DeepSeek-R1 show state-of-the-art throughput, efficiency, and scalability.

Huawei CloudMatrix is built on peer-to-peer high-bandwidth interconnects and fine-grained resource disaggregation. Its first large-scale implementation, CloudMatrix384, integrates the 384 Ascend 910C NPUs and 192 Kunpeng CPUs into a single supernode, all linked by a unified bus network that enables direct all-to-all communication. This design allows compute, memory, and network resources to be shared seamlessly and scaled independently, operating as one cohesive system. By avoiding the bottlenecks of traditional hierarchical setups, CloudMatrix384 is particularly effective for communication-heavy tasks such as large-scale MoE parallelism and distributed KV cache management, making it well suited to scalable LLM serving.

The researchers evaluate CloudMatrix-Infer on the DeepSeek-R1 model using the CloudMatrix384 supernode. The system achieves a prefill throughput of 6,688 tokens per second per NPU and a decode throughput of 1,943 tokens per second with latency kept under 50 ms, outperforming comparable systems such as SGLang on NVIDIA H100 and DeepSeek on H800. Even when constrained to stricter latency requirements of under 15 ms, it sustains 538 tokens per second in decoding. Moreover, INT8 quantization on the Ascend 910C preserves accuracy across 16 benchmarks, showing that the efficiency improvements do not compromise model quality.

In conclusion, Huawei CloudMatrix is a next-generation AI datacenter architecture designed to overcome the scalability limits of conventional clusters. Its first production system, CloudMatrix384, combines 384 Ascend 910C NPUs and 192 Kunpeng CPUs in a fully peer-to-peer supernode connected through a high-bandwidth, low-latency Unified Bus. To exploit this design, the study proposes CloudMatrix-Infer, which separates prefill, decode, and caching into independent pools, supports large-scale expert parallelism, and applies hardware-aware optimizations such as pipelining and INT8 quantization. Tested on DeepSeek-R1, it achieved superior throughput and latency compared to NVIDIA-based systems while preserving accuracy, showcasing its potential for large-scale AI deployments.

The post Huawei CloudMatrix: A Peer-to-Peer AI Datacenter Architecture for Scalable and Efficient LLM Serving appeared first on MarkTechPost.
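As a point of reference for the INT8 claim, a generic symmetric per-tensor INT8 quantizer looks like the sketch below. This is a textbook illustration, not Huawei's Ascend 910C scheme, whose calibration and per-channel details the article does not describe.

import numpy as np

def int8_quantize(weights: np.ndarray):
    """Generic symmetric per-tensor INT8 quantization, for illustration only."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale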


Native RAG vs. Agentic RAG: Which Approach Advances Enterprise AI Decision-Making?

Retrieval-Augmented Generation (RAG) has emerged as a cornerstone technique for enhancing Large Language Models (LLMs) with real-time, domain-specific knowledge. But the landscape is rapidly shifting—today, the most common implementations are "Native RAG" pipelines, and a new paradigm called "Agentic RAG" is redefining what's possible in AI-powered information synthesis and decision support.

Native RAG: The Standard Pipeline Architecture

A Native RAG pipeline harnesses retrieval- and generation-based methods to answer complex queries while ensuring accuracy and relevance. The pipeline typically involves:
- Query processing & embedding: the user's question is rewritten if needed, embedded into a vector representation using an LLM or dedicated embedding model, and prepared for semantic search.
- Retrieval: the system searches a vector database or document store, identifying the top-k relevant chunks using similarity metrics (cosine, Euclidean, dot product). Efficient ANN algorithms optimize this stage for speed and scalability.
- Reranking: retrieved results are reranked based on relevance, recency, domain-specificity, or user preference. Reranking models—ranging from rule-based to fine-tuned ML systems—prioritize the highest-quality information.
- Synthesis & generation: the LLM synthesizes the reranked information to generate a coherent, context-aware response for the user.

Common Optimizations

Recent advances include dynamic reranking (adjusting depth by query complexity), fusion-based strategies that aggregate rankings from multiple queries, and hybrid approaches that combine semantic partitioning with agent-based selection for optimal retrieval robustness and latency.

Agentic RAG: Autonomous, Multi-Agent Information Workflows

What Is Agentic RAG?

Agentic RAG is an agent-based approach to RAG, leveraging multiple autonomous agents to answer questions and process documents in a highly coordinated fashion. Rather than a single retrieval/generation pipeline, Agentic RAG structures its workflow for deep reasoning, multi-document comparison, planning, and real-time adaptability.

Key Components

Component | Description
Document Agent | Each document is assigned its own agent, able to answer queries about the document and perform summary tasks, working independently within its scope.
Meta-Agent | Orchestrates all document agents, managing their interactions, integrating outputs, and synthesizing a comprehensive answer or action.

Features and Benefits

- Autonomy: agents operate independently, retrieving, processing, and generating answers or actions for specific documents or tasks.
- Adaptability: the system dynamically adjusts its strategy (e.g., reranking depth, document prioritization, tool selection) based on new queries or changing data contexts.
- Proactivity: agents anticipate needs, take preemptive steps toward goals (e.g., pulling additional sources or suggesting actions), and learn from previous interactions.

Advanced Capabilities

Agentic RAG goes beyond "passive" retrieval—agents can compare documents, summarize or contrast specific sections, aggregate multi-source insights, and even invoke tools or APIs for enriched reasoning. This enables:
- Automated research and multi-database aggregation
- Complex decision support (e.g., comparing technical features, summarizing key differences across product sheets)
- Executive support tasks that require independent synthesis and real-time action recommendation
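A minimal structural sketch of the document-agent/meta-agent pattern described above follows; the llm callable, prompts, and synthesis strategy are placeholders rather than any specific framework's API.

from typing import Callable, List

class DocumentAgent:
    def __init__(self, name: str, document: str, llm: Callable[[str], str]):
        self.name = name
        self.document = document
        self.llm = llm

    def answer(self, query: str) -> str:
        # Each agent reasons only over its own document.
        return self.llm(
            f"Document:\n{self.document}\n\nQuestion: {query}\nAnswer using only this document."
        )

class MetaAgent:
    def __init__(self, agents: List[DocumentAgent], llm: Callable[[str], str]):
        self.agents = agents
        self.llm = llm

    def answer(self, query: str) -> str:
        # Orchestrate: collect per-document answers, then synthesize a final response.
        partials = [f"[{a.name}] {a.answer(query)}" for a in self.agents]
        return self.llm(
            "Synthesize a single answer from these per-document findings:\n" + "\n".join(partials)
        )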
Applications

Agentic RAG is ideal for scenarios where nuanced information processing and decision-making are required:
- Enterprise knowledge management: coordinating answers across heterogeneous internal repositories.
- AI-driven research assistants: cross-document synthesis for technical writers, analysts, or executives.
- Automated action workflows: triggering actions (e.g., responding to invitations, updating records) after multi-step reasoning over documents or databases.
- Complex compliance and security audits: aggregating and comparing evidence from varied sources in real time.

Conclusion

Native RAG pipelines have standardized the process of embedding, retrieving, reranking, and synthesizing answers from external data, enabling LLMs to serve as dynamic knowledge engines. Agentic RAG pushes the boundaries even further—by introducing autonomous agents, orchestration layers, and proactive, adaptive workflows, it transforms RAG from a retrieval tool into a full-blown agentic framework for advanced reasoning and multi-document intelligence. Organizations seeking to move beyond basic augmentation—and into realms of deep, flexible AI orchestration—will find in Agentic RAG the blueprint for the next generation of intelligent systems.

The post Native RAG vs. Agentic RAG: Which Approach Advances Enterprise AI Decision-Making? appeared first on MarkTechPost.
