TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

arXiv:2510.15545v1 Announce Type: new Abstract: Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, which limits the pool of available draft models and often necessitates training a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose TokenTiming, an algorithm for universal speculative decoding. It operates by re-encoding the draft token sequence to obtain a new target token sequence, and then uses DTW to build a mapping that transfers the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining or modification. We conduct comprehensive experiments on various tasks, demonstrating a 1.57x speedup. This work enables a universal approach to draft model selection, making SD a more versatile and practical tool for LLM acceleration.
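The core of the method is a DTW-style monotonic alignment between the draft model's token sequence and the re-encoded target-token sequence. Below is a minimal, illustrative sketch of such an alignment; the string-match cost and the way the resulting path would be used to transfer probability mass are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch only: a minimal dynamic-time-warping alignment between two
# token sequences, as one might use to map draft-tokenizer tokens onto
# target-tokenizer tokens. The cost function is an assumption, not the paper's.
def dtw_align(draft_tokens, target_tokens):
    n, m = len(draft_tokens), len(target_tokens)
    INF = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning prefixes of length i and j
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # simple local distance: 0 if the surface strings match, 1 otherwise
            d = 0.0 if draft_tokens[i - 1] == target_tokens[j - 1] else 1.0
            cost[i][j] = d + min(cost[i - 1][j],      # one draft token spans several target tokens
                                 cost[i][j - 1],      # one target token spans several draft tokens
                                 cost[i - 1][j - 1])  # one-to-one match
    # backtrack to recover the (draft index, target index) alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min((cost[i - 1][j - 1], (i - 1, j - 1)),
                   (cost[i - 1][j], (i - 1, j)),
                   (cost[i][j - 1], (i, j - 1)))[1]
    return list(reversed(path))

# Example: the two tokenizers split the same text differently.
print(dtw_align(["spec", "ulative", " decoding"], ["specul", "ative", " dec", "oding"]))
```

The alignment path would then tell speculative sampling which target-vocabulary positions to read draft probabilities from, which is the mapping step the abstract describes.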


Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

arXiv:2510.15313v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit “echo chamber” effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and the limitations of current LLMs as proxies for literary generation, as well as the limits of current evaluation practices, demonstrating the continued need for hybrid validation from both humans and models in culturally and technically complex creative tasks.
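To make the three-step framework concrete, here is a hedged sketch of such an evaluation harness. All helper names, the crude form check, and the placeholder judge are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a three-step evaluation harness (computational metric,
# LLM-as-a-judge, human validation). Names and scoring rules are assumptions.
import statistics

def form_score(poem: str, lines_expected: int = 4, chars_per_line: int = 7) -> float:
    """Step 1, computational metric: crude formal check for a seven-character quatrain."""
    lines = [l.strip() for l in poem.strip().splitlines() if l.strip()]
    line_ok = 1.0 if len(lines) == lines_expected else 0.0
    char_ok = sum(1.0 for l in lines if len(l) == chars_per_line) / max(len(lines), 1)
    return 0.5 * line_ok + 0.5 * char_ok

def llm_judge_score(poem: str, judge_model: str) -> float:
    """Step 2, LLM-as-a-judge: placeholder stub. A real harness would prompt
    `judge_model` to rate theme, emotion, imagery, form, and style on a rubric."""
    return 0.5  # fixed placeholder value

def compare_with_humans(poem: str, judges: list[str], human_scores: list[float]) -> dict:
    """Step 3, human validation: put automatic and expert scores side by side."""
    auto_scores = [form_score(poem)] + [llm_judge_score(poem, j) for j in judges]
    return {"auto_mean": statistics.mean(auto_scores),
            "human_mean": statistics.mean(human_scores)}

poem = "床前明月光\n疑是地上霜\n举头望明月\n低头思故乡"  # five-character lines, so the seven-character check penalizes it
print(compare_with_humans(poem, judges=["judge-a", "judge-b"], human_scores=[4.5, 4.0]))
```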


Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation

arXiv:2510.15624v1 Announce Type: cross Abstract: The automation of scientific discovery represents a critical milestone in Artificial Intelligence (AI) research. However, existing agentic systems for science suffer from two fundamental limitations: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management that hinders long-horizon research. We present freephdlabor, an open-source multiagent framework featuring fully dynamic workflows determined by real-time agent reasoning and a modular architecture enabling seamless customization: users can modify, add, or remove agents to address domain-specific requirements. The framework provides comprehensive infrastructure including automatic context compaction, workspace-based communication to prevent information degradation, memory persistence across sessions, and non-blocking human intervention mechanisms. These features collectively transform automated research from isolated, single-run attempts into continual research programs that build systematically on prior explorations and incorporate human feedback. By providing both the architectural principles and a practical implementation for building customizable co-scientist systems, this work aims to facilitate broader adoption of automated research across scientific domains, enabling practitioners to deploy interactive multiagent systems that autonomously conduct end-to-end research, from ideation through experimentation to publication-ready manuscripts.
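As an illustration of the workspace-based communication idea, the sketch below has agents exchange results through a shared on-disk workspace rather than an ever-growing prompt. All class and method names are assumptions for illustration, not the framework's actual API.

```python
# Minimal sketch of workspace-based agent communication, assuming a hypothetical
# agent interface; not the freephdlabor API.
import json
from pathlib import Path

class Workspace:
    """Agents persist artifacts to disk so the prompt context can be compacted between turns."""
    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, payload: dict) -> None:
        (self.root / f"{name}.json").write_text(json.dumps(payload, indent=2))

    def read(self, name: str) -> dict:
        return json.loads((self.root / f"{name}.json").read_text())

class IdeationAgent:
    def run(self, ws: Workspace) -> None:
        ws.write("idea", {"hypothesis": "placeholder research idea"})

class ExperimentAgent:
    def run(self, ws: Workspace) -> None:
        idea = ws.read("idea")
        ws.write("results", {"hypothesis": idea["hypothesis"], "metric": 0.0})

if __name__ == "__main__":
    ws = Workspace("./run_workspace")
    # A real orchestrator would choose the next agent dynamically from its own
    # reasoning; the fixed order here is purely for illustration.
    for agent in (IdeationAgent(), ExperimentAgent()):
        agent.run(ws)
    print(ws.read("results"))
```

Because artifacts live in the workspace rather than the conversation history, agents can be added, removed, or restarted across sessions without losing intermediate results, which is the modularity and persistence argument the abstract makes.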


Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

arXiv:2504.20094v2 Announce Type: replace-cross Abstract: Conversational recommender systems (CRS) have advanced with large language models, showing strong results in domains like movies. These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme. In contrast, games present distinct challenges: fast-evolving catalogs, interaction-driven preferences (e.g., skill level, mechanics, hardware), and increased risk of unsafe responses in open-ended conversation. We propose MATCHA, a multi-agent framework for CRS that assigns specialized agents to intent parsing, tool-augmented retrieval, multi-LLM ranking with reflection, explanation, and risk control, enabling finer personalization, long-tail coverage, and stronger safety. Evaluated on a real user-request dataset, MATCHA outperforms six baselines across eight metrics, improving Hit@5 by 20%, reducing popularity bias by 24%, and achieving 97.9% adversarial defense. Human and virtual-judge evaluations confirm improved explanation quality and user alignment.
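The decomposition into specialized agents can be pictured as a staged pipeline. The sketch below is a hedged illustration of that idea; the function names, toy keyword parsing, and catalog fields are assumptions, not MATCHA's implementation.

```python
# Illustrative decomposition of a conversational game recommender into specialized
# stages; logic and names are assumptions, not MATCHA's implementation.
from dataclasses import dataclass

@dataclass
class Intent:
    genre: str | None = None
    skill_level: str | None = None
    hardware: str | None = None

def parse_intent(utterance: str) -> Intent:
    """Intent-parsing agent: a real system would call an LLM; here a toy keyword check."""
    return Intent(genre="roguelike" if "roguelike" in utterance.lower() else None)

def retrieve(intent: Intent, catalog: list[dict]) -> list[dict]:
    """Tool-augmented retrieval agent: filter a fast-moving catalog by parsed constraints."""
    return [g for g in catalog if intent.genre is None or g["genre"] == intent.genre]

def rank(candidates: list[dict]) -> list[dict]:
    """Ranking agent: placeholder for multi-LLM ranking with reflection."""
    return sorted(candidates, key=lambda g: -g.get("rating", 0.0))

def risk_check(recommendations: list[dict]) -> list[dict]:
    """Risk-control agent: drop items flagged as unsafe for the conversation context."""
    return [g for g in recommendations if not g.get("unsafe", False)]

catalog = [{"title": "Dungeon Loop", "genre": "roguelike", "rating": 4.6},
           {"title": "Turbo Kart", "genre": "racing", "rating": 4.2}]
query = "any good roguelike for a beginner?"
print(risk_check(rank(retrieve(parse_intent(query), catalog))))
```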


Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics

arXiv:2510.15345v1 Announce Type: new Abstract: Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.
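Rank correlation with human judgments is the comparison used above. The sketch below shows how such a comparison might be computed with SciPy's Spearman correlation; the scores are invented purely for demonstration and do not reflect the paper's data.

```python
# Hedged illustration: compare a surface-level metric and a model-based metric
# against human readability judgments via Spearman rank correlation.
from scipy.stats import spearmanr

human_judgments    = [4.5, 3.0, 2.0, 4.0, 1.5]       # expert readability ratings per text
surface_metric     = [60.2, 55.1, 58.7, 62.0, 50.3]   # e.g., a formula over sentence/word length
model_based_metric = [0.91, 0.64, 0.38, 0.85, 0.22]   # e.g., a learned comprehensibility score

for name, scores in [("surface", surface_metric), ("model-based", model_based_metric)]:
    rho, p = spearmanr(human_judgments, scores)
    print(f"{name:12s} Spearman rho = {rho:+.2f} (p = {p:.3f})")
```

Averaging such per-dataset rank correlations across the five English datasets is what yields the average-rank comparison reported in the abstract.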


AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters

Are your LLM code benchmarks actually rejecting wrong-complexity solutions and interactive-protocol violations, or are they passing under-specified unit tests? A team of researchers from UCSD, NYU, University of Washington, Princeton University, Canyon Crest Academy, OpenAI, UC Berkeley, MIT, University of Waterloo, and Sentient Labs introduces AutoCode, a new AI framework that lets LLMs create and verify competitive programming problems, mirroring the workflow of human problem setters. AutoCode reframes evaluation for code-reasoning models by treating problem setting (not only problem solving) as the target task. The system trains LLMs to produce competition-grade statements, test data, and verdict logic that match official online judges at high rates. On a 7,538-problem benchmark built from prior datasets, AutoCode achieves 91.1% consistency with official judgments (FPR 3.7%, FNR 14.1%). On a separate, more difficult set of 720 recent Codeforces problems (including interactive tasks), the full framework reports 98.7% consistency, 1.3% FPR, and 1.2% FNR (paper: https://arxiv.org/pdf/2510.12803).

Why problem setting matters for evaluation

Public code benchmarks often rely on under-specified tests that let wrong-complexity or shortcut solutions pass. That inflates scores and pollutes reinforcement signals by rewarding fragile tactics. AutoCode's validator-first approach and adversarial test generation aim to reduce false positives (FPR), incorrect programs that pass, and false negatives (FNR), correct programs rejected due to malformed inputs.

The core loop: Validator → Generator → Checker

AutoCode runs a closed loop that mirrors human contest workflows, but each step is selected from LLM-generated candidates using targeted in-framework tests.

1) Validator (minimize FNR by enforcing input legality). The system first asks an LLM to synthesize 40 evaluation inputs: 10 valid and 30 near-valid illegal (e.g., off-by-one boundary violations). It then prompts the LLM for three candidate validator programs and selects the one that best classifies these cases. This prevents “correct” solutions from crashing on malformed data. (A minimal sketch of this selection step appears after this section.)

2) Generator (reduce FPR by adversarial coverage). Three complementary strategies produce test cases:
• Small-data exhaustion for boundary coverage,
• Randomized and extreme cases (overflows, precision, hash collisions),
• TLE-inducing structures to break wrong-complexity solutions.
Invalid cases are filtered by the selected validator; the remaining cases are deduplicated and bucket-balanced before sampling.

3) Checker (verdict logic). The checker compares contestant outputs with the reference solution under complex rules. AutoCode again generates 40 checker scenarios and three candidate checker programs, keeps only scenarios with validator-approved inputs, and selects the best checker by accuracy against the 40 labeled scenarios.

4) Interactor (for interactive problems). For tasks that require dialogue with the judge, AutoCode introduces a mutant-based interactor: it makes small logical edits (“mutants”) to the reference solution and selects interactors that accept the true solution but reject the mutants, maximizing discrimination. This addresses a gap in earlier public datasets that avoided interactive problems.
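The following is a hedged, minimal sketch of the candidate-selection idea used for the validator stage: generate labeled probe inputs, generate candidate validator programs, and keep the candidate that classifies the probes best. The function names, toy data, and scoring rule are assumptions for illustration, not AutoCode's implementation.

```python
# Illustrative sketch of validator selection; the `llm_generate_*` functions are
# hypothetical stubs standing in for LLM calls, not a real API.
from typing import Callable

def llm_generate_probe_inputs() -> list[tuple[str, bool]]:
    """Would ask an LLM for ~10 valid and ~30 near-valid illegal inputs with labels."""
    return [("3\n1 2 3\n", True), ("3\n1 2\n", False), ("0\n\n", False)]  # toy placeholders

def llm_generate_candidate_validators() -> list[Callable[[str], bool]]:
    """Would ask an LLM for three candidate validator programs; toy stand-ins here."""
    def strict(inp: str) -> bool:
        lines = inp.strip().splitlines()
        return len(lines) == 2 and len(lines[1].split()) == int(lines[0])
    def lenient(inp: str) -> bool:
        return bool(inp.strip())
    return [strict, lenient]

def select_validator() -> Callable[[str], bool]:
    probes = llm_generate_probe_inputs()
    candidates = llm_generate_candidate_validators()
    # Keep the candidate that agrees with the most valid/illegal labels.
    return max(candidates,
               key=lambda v: sum(v(inp) == label for inp, label in probes))

best = select_validator()
print([best(inp) for inp, _ in llm_generate_probe_inputs()])
```

The checker and interactor stages described above follow the same generate-probe-and-select pattern, just with different probe artifacts (labeled checker scenarios, mutant solutions).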
Dual verification enables new problems (not just tests for existing ones)

AutoCode can generate novel problem variants starting from a random “seed” Codeforces problem (<2200 Elo). The LLM drafts a new statement and two solutions: an efficient reference and a simpler brute-force baseline. A problem is accepted only if the reference output matches the brute force across the generated test suite (the brute force may TLE on large cases but serves as ground truth on small/exhaustive cases). This dual-verification protocol filters ~27% of error-prone items, lifting reference-solution correctness from 86% → 94% before human review. Human experts then grade the survivors on solvability, solution correctness, quality, novelty, and difficulty. After filtering, 61.6% are usable for model training, 76.3% for human training, and 3.2% are ICPC/IOI-level problems. Difficulty typically increases relative to the seed, and the difficulty gain correlates with perceived quality.

Understanding the results

Existing problems (7,538 total; 195,988 human submissions): AutoCode reaches 91.1% consistency, 3.7% FPR, and 14.1% FNR, versus 72.9–81.0% consistency for prior generators (CodeContests, CodeContests+, TACO, HardTests). Recent Codeforces problems (720, unfiltered; includes interactives): AutoCode reaches 98.7% consistency, 1.3% FPR, and 1.2% FNR. Ablations show that all three generator strategies and prompt optimization contribute: removing prompt optimization drops consistency to 98.0% and more than doubles FNR to 2.9%.

Key Takeaways

• AutoCode couples a Validator–Generator–Checker (+Interactor) loop with dual verification (reference vs. brute force) to build contest-grade test suites and new problems.
• On held-out problems, AutoCode’s test suites reach ~99% consistency with official judges, surpassing prior generators like HardTests (<81%).
• For recent Codeforces tasks (including interactives), the full framework reports ~98.7% consistency with ~1.3% FPR and ~1.2% FNR.
• The mutant-based interactor reliably accepts the true solution while rejecting mutated variants, improving evaluation for interactive problems.
• Human experts rate a sizable fraction of AutoCode-generated items as training-usable and a non-trivial share as contest-quality, aligning with the LiveCodeBench Pro program’s aims.

Editorial Comments

AutoCode is a practical fix for current code benchmarks. It centers problem setting and uses a closed-loop Validator–Generator–Checker (+Interactor) pipeline with dual verification (reference vs. brute force). This structure reduces false positives and false negatives and yields judge-aligned consistency (≈99% on held-out problems; 98.7% on recent Codeforces, including interactives). The approach standardizes constraint legality, adversarial coverage, and protocol-aware judging, which makes downstream RL reward signals cleaner. Its placement under LiveCodeBench Pro fits a hallucination-resistant evaluation program that emphasizes expert-checked rigor.

Check out the Paper and Project. This post appeared first on MarkTechPost.


Abstract or die: Why AI enterprises can’t afford rigid vector stacks

Vector databases (DBs), once specialist research instruments, have become widely used infrastructure in just a few years. They power today’s semantic search, recommendation engines, anti-fraud measures and gen AI applications across industries. There is a deluge of options: PostgreSQL with pgvector, MySQL HeatWave, DuckDB VSS, SQLite VSS, Pinecone, Weaviate, Milvus and several others. This wealth of choice sounds like a boon to companies. But just beneath the surface, a growing problem looms: stack instability. New vector DBs appear each quarter, with disparate APIs, indexing schemes and performance trade-offs. Today’s ideal choice may look dated or limiting tomorrow. For enterprise AI teams, that volatility translates into lock-in risk and painful migrations. Most projects begin life with lightweight engines like DuckDB or SQLite for prototyping, then move to Postgres, MySQL or a cloud-native service in production. Each switch involves rewriting queries, reshaping pipelines and slowing down deployments. This re-engineering merry-go-round undermines the very speed and agility that AI adoption is supposed to bring.

Why portability matters now

Companies face a tricky balancing act:

Experiment quickly with minimal overhead, in hopes of trying and getting early value;

Scale safely on stable, production-quality infrastructure without months of refactoring;

Be nimble in a world where new and better backends arrive nearly every month.

Without portability, organizations stagnate. They accumulate technical debt from redundant code paths, hesitate to adopt new technology and cannot move prototypes to production at pace. In effect, the database becomes a bottleneck rather than an accelerator. Portability, the ability to swap underlying infrastructure without re-encoding the application, is increasingly a strategic requirement for enterprises rolling out AI at scale.

Abstraction as infrastructure

The solution is not to pick the “perfect” vector database (there isn’t one), but to change how enterprises think about the problem. In software engineering, the adapter pattern provides a stable interface while hiding underlying complexity. Historically, we’ve seen how this principle reshaped entire industries:

ODBC/JDBC gave enterprises a single way to query relational databases, reducing the risk of being tied to Oracle, MySQL or SQL Server;

Apache Arrow standardized columnar data formats, so data systems could interoperate;

ONNX created a vendor-agnostic format for machine learning (ML) models, bringing TensorFlow, PyTorch and others together;

Kubernetes abstracted infrastructure details, so workloads could run the same way on any cloud;

any-llm (Mozilla AI) now makes it possible to have one API across many large language model (LLM) vendors, so experimenting with AI is safer.

All these abstractions drove adoption by lowering switching costs. They turned fragmented ecosystems into solid, enterprise-level infrastructure. Vector databases are now at the same tipping point.

The adapter approach to vectors

Instead of binding application code directly to a specific vector backend, companies can code against an abstraction layer that normalizes operations like inserts, queries and filtering. This doesn’t eliminate the need to choose a backend; it makes that choice less rigid. Development teams can start with DuckDB or SQLite in the lab, then scale up to Postgres or MySQL for production and ultimately adopt a special-purpose cloud vector DB without having to re-architect the application.
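To make the adapter idea concrete, here is a minimal sketch of such an abstraction layer. The interface and the in-memory backend are illustrative assumptions, not any existing library's API; a production adapter would wrap pgvector, DuckDB VSS, or a managed service behind the same interface.

```python
# Minimal sketch of a vector-store adapter layer; interface and backend are
# illustrative assumptions, not a real library's API.
from abc import ABC, abstractmethod
import math

class VectorStore(ABC):
    """Stable interface the application codes against; backends are swappable."""
    @abstractmethod
    def insert(self, key: str, vector: list[float]) -> None: ...
    @abstractmethod
    def query(self, vector: list[float], top_k: int = 3) -> list[tuple[str, float]]: ...

class InMemoryStore(VectorStore):
    """Toy backend for prototyping; swap in a Postgres- or DuckDB-backed adapter later."""
    def __init__(self) -> None:
        self._rows: dict[str, list[float]] = {}

    def insert(self, key: str, vector: list[float]) -> None:
        self._rows[key] = vector

    def query(self, vector: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
        scored = [(k, cosine(vector, v)) for k, v in self._rows.items()]
        return sorted(scored, key=lambda kv: -kv[1])[:top_k]

store: VectorStore = InMemoryStore()  # swapping backends touches only this line
store.insert("doc-1", [0.1, 0.9, 0.0])
store.insert("doc-2", [0.8, 0.1, 0.1])
print(store.query([0.1, 0.8, 0.0], top_k=1))
```

Because the application only sees VectorStore, moving from the prototype backend to a production one changes the composition point, not the call sites.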
Open-source efforts like Vectorwrap are early examples of this approach, presenting a single Python API over Postgres, MySQL, DuckDB and SQLite. They demonstrate the power of abstraction to accelerate prototyping, reduce lock-in risk and support hybrid architectures employing numerous backends.

Why businesses should care

For leaders of data infrastructure and decision-makers for AI, abstraction offers three benefits:

Speed from prototype to production: Teams are able to prototype on lightweight local environments and scale without expensive rewrites.

Reduced vendor risk: Organizations can adopt new backends as they emerge without long migration projects by decoupling app code from specific databases.

Hybrid flexibility: Companies can mix transactional, analytical and specialized vector DBs under one architecture, all behind an aggregated interface.

The result is data layer agility, and that’s more and more the difference between fast and slow companies.

A broader movement in open source

What’s happening in the vector space is one example of a bigger trend: open-source abstractions as critical infrastructure.

In data formats: Apache Arrow

In ML models: ONNX

In orchestration: Kubernetes

In AI APIs: any-llm and other such frameworks

These projects succeed not by adding new capability, but by removing friction. They enable enterprises to move more quickly, hedge bets and evolve along with the ecosystem. Vector DB adapters continue this legacy, transforming a fast-moving, fragmented space into infrastructure that enterprises can truly depend on.

The future of vector DB portability

The landscape of vector DBs will not converge anytime soon. Instead, the number of options will grow, and every vendor will tune for different use cases, scale, latency, hybrid search, compliance or cloud platform integration. Abstraction becomes strategy in this case. Companies adopting portable approaches will be capable of:

Prototyping boldly

Deploying in a flexible manner

Scaling rapidly to new tech

It’s possible we’ll eventually see a “JDBC for vectors,” a universal standard that codifies queries and operations across backends. Until then, open-source abstractions are laying the groundwork.

Conclusion

Enterprises adopting AI cannot afford to be slowed by database lock-in. As the vector ecosystem evolves, the winners will be those who treat abstraction as infrastructure, building against portable interfaces rather than binding themselves to any single backend. The decades-long lesson of software engineering is simple: standards and abstractions lead to adoption. For vector DBs, that revolution has already begun.

Mihir Ahuja is an AI/ML engineer and open-source contributor based in San Francisco.


Microsoft AI Proposes BitNet Distillation (BitDistill): A Lightweight Pipeline that Delivers up to 10x Memory Savings and about 2.65x CPU Speedup

Microsoft Research proposes BitNet Distillation, a pipeline that converts existing full-precision LLMs into 1.58-bit BitNet students for specific tasks, while keeping accuracy close to the FP16 teacher and improving CPU efficiency. The method combines SubLN-based architectural refinement, continued pre-training, and dual-signal distillation from logits and multi-head attention relations. Reported results show up to 10× memory savings and about 2.65× faster CPU inference, with task metrics comparable to FP16 across multiple sizes (paper: https://arxiv.org/pdf/2510.13998).

What BitNet Distillation changes

The community already showed that BitNet b1.58 can match full-precision quality when trained from scratch, but converting a pretrained FP16 model directly to 1.58 bit often loses accuracy, and the gap grows as model size increases. BitNet Distillation targets this conversion problem for practical downstream deployment. It is designed to preserve accuracy while delivering CPU-friendly ternary weights with INT8 activations.

Stage 1: Modeling refinement with SubLN

Low-bit models suffer from large activation variance. The research team inserts SubLN normalization inside each Transformer block, specifically before the output projection of the MHSA module and before the output projection of the FFN. This stabilizes the scale of the hidden states that flow into quantized projections, which improves optimization and convergence once weights are ternary. The training loss curves in the analysis section support this design.

Stage 2: Continued pre-training to adapt weight distributions

Direct task fine-tuning at 1.58 bit gives the student only a small number of task tokens, which is not enough to reshape the FP16 weight distribution for ternary constraints. BitNet Distillation performs a short continued pre-training on a general corpus (the research team uses 10B tokens from the FALCON corpus) to push weights toward BitNet-like distributions. The visualization shows the mass concentrating near transition boundaries, which lets small gradients flip weights among [-1, 0, 1] during downstream task training. This improves learning capacity without a full pretraining run.

Stage 3: Distillation-based fine-tuning with two signals

The student learns from the FP16 teacher using logits distillation and multi-head self-attention relation distillation. The logits path uses a temperature-softened KL divergence between teacher and student token distributions. The attention path follows the MiniLM and MiniLMv2 formulations, which transfer relations among Q, K, V without requiring the same number of heads, and allow a single layer to be chosen for distillation. Ablations show that combining both signals works best, and that selecting one well-chosen layer preserves flexibility. (A hedged sketch of this combined objective appears after the results discussion below.)

Understanding the results

The research team evaluates classification (MNLI, QNLI, SST-2) and summarization on the CNN/DailyMail dataset. It compares three settings: FP16 task fine-tuning, direct 1.58-bit task fine-tuning, and BitNet Distillation. Figure 1 shows that BitNet Distillation matches FP16 accuracy for Qwen3 backbones at 0.6B, 1.7B, and 4B, while the direct 1.58-bit baseline lags further behind as model size grows. On CPU, tokens per second improve by about 2.65×, and memory drops by about 10× for the student. The research team quantizes activations to INT8 and uses the Straight-Through Estimator for gradients through the quantizer. The framework is compatible with post-training quantization methods such as GPTQ and AWQ, which provide additional gains on top of the pipeline.
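To make the Stage 3 objective concrete, here is a hedged PyTorch-style sketch combining a temperature-softened KL on logits with a MiniLM-style attention-relation loss. The tensor shapes, the single-matrix relation construction, and the loss weighting are assumptions for illustration, not the paper's exact implementation.

```python
# Hedged sketch of a dual-signal distillation loss (logits KL + attention-relation KL).
# Shapes, temperature, and weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def logits_distill_loss(student_logits, teacher_logits, T: float = 2.0):
    # Temperature-softened KL between teacher and student token distributions.
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)

def attention_relation_loss(student_q, teacher_q):
    # MiniLM-style: compare relation matrices so head counts need not match.
    def relations(q):                       # q: [batch, seq, hidden]
        scores = q @ q.transpose(-1, -2) / (q.shape[-1] ** 0.5)
        return F.log_softmax(scores, dim=-1)
    s_rel = relations(student_q)
    t_rel = relations(teacher_q).exp()      # probabilities as the KL target
    return F.kl_div(s_rel, t_rel, reduction="batchmean")

# toy shapes: batch=2, seq=4, vocab=16, hidden=8
student_logits, teacher_logits = torch.randn(2, 4, 16), torch.randn(2, 4, 16)
student_q, teacher_q = torch.randn(2, 4, 8), torch.randn(2, 4, 8)
loss = logits_distill_loss(student_logits, teacher_logits) \
     + 0.5 * attention_relation_loss(student_q, teacher_q)   # weighting is an assumption
print(float(loss))
```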
Distilling from a stronger teacher helps more, which suggests pairing small 1.58-bit students with larger FP16 teachers when available.

Key Takeaways

• BitNet Distillation is a 3-stage pipeline: SubLN insertion, continued pre-training, and dual distillation from logits and multi-head attention relations.
• The research reports near-FP16 accuracy with about 10× lower memory and about 2.65× faster CPU inference for 1.58-bit students.
• The method transfers attention relations using MiniLM- and MiniLMv2-style objectives, which do not require matching head counts.
• Evaluations cover MNLI, QNLI, SST-2, and CNN/DailyMail, and include Qwen3 backbones at 0.6B, 1.7B, and 4B parameters.
• Deployment targets ternary weights with INT8 activations, with optimized CPU and GPU kernels available in the official BitNet repository.

Editorial Comments

BitNet Distillation is a pragmatic step toward 1.58-bit deployment without a full retrain: the three-stage design (SubLN, continued pre-training, and MiniLM-family attention distillation) maps cleanly to known failure modes in extreme quantization. The reported 10× memory reduction and about 2.65× CPU speedup at near-FP16 accuracy indicate solid engineering value for on-premise and edge targets. The reliance on attention-relation distillation is well grounded in prior MiniLM work, which helps explain the stability of the results. The presence of bitnet.cpp with optimized CPU and GPU kernels lowers integration risk for production teams.

Check out the Technical Paper and GitHub Repo. This post appeared first on MarkTechPost.


Weak-for-Strong (W4S): A Novel Reinforcement Learning Algorithm that Trains a Weak Meta Agent to Design Agentic Workflows with Stronger LLMs

Researchers from Stanford, EPFL, and UNC introduce Weak-for-Strong Harnessing (W4S), a new reinforcement learning (RL) framework that trains a small meta-agent to design and refine code workflows that call a stronger executor model. The meta-agent does not fine-tune the strong model; it learns to orchestrate it. W4S formalizes workflow design as a multi-turn Markov decision process and trains the meta-agent with a method called Reinforcement Learning for Agentic Workflow Optimization (RLAO). The research team reports consistent gains across 11 benchmarks with a 7B meta-agent trained for about 1 GPU hour (paper: https://arxiv.org/pdf/2504.04785).

W4S operates in turns. The state contains the task instructions, the current workflow program, and feedback from prior executions. An action has two components: an analysis of what to change, and new Python workflow code that implements those changes. The environment executes the code on validation items, returns accuracy and failure cases, and provides a new state for the next turn. The meta-agent can run a quick self-check on one sample; if errors arise it attempts up to 3 repairs, and if errors persist the action is skipped. This loop gives a learning signal without touching the weights of the strong executor.

W4S runs as an iterative loop

Workflow generation: The weak meta-agent writes a new workflow that leverages the strong model, expressed as executable Python code.

Execution and feedback: The strong model executes the workflow on validation samples, then returns accuracy and error cases as feedback.

Refinement: The meta-agent uses the feedback to update the analysis and the workflow, then repeats the loop.

Reinforcement Learning for Agentic Workflow Optimization (RLAO)

RLAO is an offline reinforcement learning procedure over multi-turn trajectories. At each iteration, the system samples multiple candidate actions, keeps the best-performing action to advance the state, and stores the others for training. The policy is optimized with reward-weighted regression. The reward is sparse and compares the current validation accuracy to history: a higher weight is given when the new result beats the previous best, and a smaller weight when it beats the last iteration. This objective favors steady progress while controlling exploration cost. (A hedged sketch of this objective appears after the results below.)

Understanding the Results

On HumanEval with GPT-4o-mini as executor, W4S achieves Pass@1 of 95.4, with about 33 minutes of workflow optimization, zero meta-agent API cost, an optimization execution cost of about 0.4 dollars, and about 2.7 minutes to execute the test set at about 0.5 dollars, for a total of about 0.9 dollars. Under the same executor, AFlow and ADAS trail this number. The reported average gains against the strongest automated baseline range from 2.9% to 24.6% across 11 benchmarks. On math transfer, the meta-agent is trained on GSM Plus and MGSM with GPT-3.5-Turbo as executor, then evaluated on GSM8K, GSM Hard, and SVAMP. The paper reports 86.5 on GSM8K and 61.8 on GSM Hard, both above automated baselines. This indicates that the learned orchestration transfers to related tasks without retraining the executor. Across seen tasks with GPT-4o-mini as executor, W4S surpasses training-free automated methods that do not learn a planner. The study also runs ablations where the meta-agent is trained by supervised fine-tuning rather than RLAO; the RLAO agent yields better accuracy under the same compute budget.
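As a concrete illustration of the RLAO objective described above, the sketch below applies reward-weighted regression to a logged trajectory of validation accuracies: actions are re-weighted by a sparse reward that favors beating the best result seen so far. The specific reward values, the exponential weighting, and the normalization are assumptions for illustration, not the paper's exact implementation.

```python
# Hedged sketch of reward-weighted regression over logged workflow-design actions.
# Reward values (1.0 for a new best, 0.5 for beating the previous turn, 0 otherwise)
# and the exponential weighting are illustrative assumptions.
import math

def sparse_reward(acc: float, best_so_far: float, last_acc: float) -> float:
    if acc > best_so_far:
        return 1.0      # higher weight for beating the historical best
    if acc > last_acc:
        return 0.5      # smaller weight for beating the previous iteration
    return 0.0

def rwr_weights(trajectory_accs: list[float], beta: float = 1.0) -> list[float]:
    """Turn a trajectory of validation accuracies into per-action training weights."""
    weights, best, last = [], float("-inf"), float("-inf")
    for acc in trajectory_accs:
        r = sparse_reward(acc, best, last)
        weights.append(math.exp(beta * r))   # reward-weighted regression: w = exp(beta * r)
        best, last = max(best, acc), acc
    total = sum(weights)
    return [w / total for w in weights]

# Example logged trajectory of validation accuracies across optimization turns.
print(rwr_weights([0.62, 0.70, 0.68, 0.74]))
# Training would then minimize the weighted negative log-likelihood of the
# corresponding actions under the meta-agent policy.
```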
The research team also includes a GRPO baseline on a 7B weak model for GSM Hard; W4S outperforms it under limited compute. Iteration budgets matter: the research team sets W4S to about 10 optimization turns in the main tables, while AFlow runs about 20 turns and ADAS runs about 30 turns. Despite fewer turns, W4S achieves higher accuracy. This suggests that learned planning over code, combined with validation feedback, makes the search more sample-efficient.

Key Takeaways

• W4S trains a 7B weak meta-agent with RLAO to write Python workflows that harness stronger executors, modeled as a multi-turn MDP.
• On HumanEval with GPT-4o-mini as executor, W4S reaches Pass@1 of 95.4, with about 33 minutes of optimization and about 0.9 dollars total cost, beating automated baselines under the same executor.
• Across 11 benchmarks, W4S improves over the strongest baseline by 2.9% to 24.6%, while avoiding fine-tuning of the strong model.
• The method runs an iterative loop: it generates a workflow, executes it on validation data, then refines it using feedback.
• ADAS and AFlow also program or search over code workflows; W4S differs by training a planner with offline reinforcement learning.

Editorial Comments

W4S targets orchestration, not model weights, and trains a 7B meta-agent to program workflows that call stronger executors. W4S formalizes workflow design as a multi-turn MDP and optimizes the planner with RLAO using offline trajectories and reward-weighted regression. Reported results show Pass@1 of 95.4 on HumanEval with GPT-4o-mini, average gains of 2.9% to 24.6% across 11 benchmarks, and about 1 GPU hour of training for the meta-agent. The framing compares cleanly with ADAS and AFlow, which search agent designs or code graphs, while W4S fixes the executor and learns the planner.

Check out the Technical Paper and GitHub Repo. This post appeared first on MarkTechPost.
