YouZum

News

AI, Committee, News, Uncategorized

Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

arXiv:2510.15313v1. Abstract: Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit “echo chamber” effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and the limitations of current LLMs as proxies for literary generation, as well as the limits of current evaluation practices, and they demonstrate the continued need for hybrid validation by both humans and models in culturally and technically complex creative tasks.

AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters

Are your LLM code benchmarks actually rejecting wrong-complexity solutions and interactive-protocol violations, or are they passing them on under-specified unit tests? A team of researchers from UCSD, NYU, University of Washington, Princeton University, Canyon Crest Academy, OpenAI, UC Berkeley, MIT, University of Waterloo, and Sentient Labs introduces AutoCode, a new AI framework that lets LLMs create and verify competitive programming problems, mirroring the workflow of human problem setters.

AutoCode reframes evaluation for code-reasoning models by treating problem setting (not only problem solving) as the target task. The system trains LLMs to produce competition-grade statements, test data, and verdict logic that match official online judges at high rates. On a 7,538-problem benchmark built from prior datasets, AutoCode achieves 91.1% consistency with official judgments (FPR 3.7%, FNR 14.1%). On a separate, more difficult set of 720 recent Codeforces problems (including interactive tasks), the full framework reports 98.7% consistency, 1.3% FPR, and 1.2% FNR.

https://arxiv.org/pdf/2510.12803

Why does problem setting matter for evaluation?

Public code benchmarks often rely on under-specified tests that let wrong-complexity or shortcut solutions pass. That inflates scores and pollutes reinforcement signals by rewarding fragile tactics. AutoCode’s validator-first approach and adversarial test generation aim to reduce false positives (FPR: incorrect programs that pass) and false negatives (FNR: correct programs rejected due to malformed inputs).

The core loop: Validator → Generator → Checker

AutoCode runs a closed loop that mirrors human contest workflows, but each step is selected from LLM-generated candidates using targeted in-framework tests.

1) Validator (minimize FNR by enforcing input legality)

The system first asks an LLM to synthesize 40 evaluation inputs: 10 valid and 30 near-valid illegal (e.g., off-by-one boundary violations).
It then prompts the LLM for three candidate validator programs and selects the one that best classifies these cases. This prevents “correct” solutions from crashing on malformed data.

2) Generator (reduce FPR by adversarial coverage)

Three complementary strategies produce test cases:
• Small-data exhaustion for boundary coverage,
• Randomized + extreme cases (overflows, precision, hash-collisions),
• TLE-inducing structures to break wrong-complexity solutions.

Invalid cases are filtered by the selected validator; then cases are deduplicated and bucket-balanced before sampling.

3) Checker (verdict logic)

The checker compares contestant outputs with the reference solution under complex rules. AutoCode again generates 40 checker scenarios and three candidate checker programs, keeps only scenarios with validator-approved inputs, and selects the best checker by accuracy against the 40 labeled scenarios.

4) Interactor (for interactive problems)

For tasks that require dialogue with the judge, AutoCode introduces a mutant-based interactor: it makes small logical edits (“mutants”) to the reference solution and selects interactors that accept the true solution but reject the mutants, maximizing discrimination. This addresses a gap in earlier public datasets that avoided interactives.

Dual verification enables new problems (not just tests for existing ones)

AutoCode can generate novel problem variants starting from a random “seed” Codeforces problem (<2200 Elo). The LLM drafts a new statement and two solutions: an efficient reference and a simpler brute-force baseline. A problem is accepted only if the reference output matches brute force across the generated test suite (the brute force may TLE on large cases but serves as ground truth on small/exhaustive cases).
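The dual-verification idea above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy problem (maximum subarray sum), the function names, and the exhaustive small-case generator are hypothetical and are not taken from AutoCode itself. It only shows the acceptance rule, namely that a reference solution is kept when it agrees with a brute-force baseline on every small case.

```python
from itertools import product
from typing import Callable, Iterable, List, Sequence, Tuple

Solver = Callable[[Sequence[int]], int]


def dual_verify(reference: Solver, brute_force: Solver,
                cases: Iterable[Sequence[int]]) -> Tuple[bool, List[List[int]]]:
    """Accept only if reference and brute force agree on every case.

    Returns (accepted, counterexamples)."""
    mismatches = [list(case) for case in cases
                  if reference(case) != brute_force(case)]
    return (not mismatches, mismatches)


def max_subarray_reference(values: Sequence[int]) -> int:
    """Efficient candidate: Kadane's algorithm over non-empty subarrays."""
    best = current = values[0]
    for x in values[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def max_subarray_brute(values: Sequence[int]) -> int:
    """Slow but obviously correct baseline: try every non-empty slice."""
    n = len(values)
    return max(sum(values[i:j]) for i in range(n) for j in range(i + 1, n + 1))


def max_subarray_buggy(values: Sequence[int]) -> int:
    """Flawed candidate: silently allows the empty subarray (sum 0)."""
    best = current = 0
    for x in values:
        current = max(0, current + x)
        best = max(best, current)
    return best


# Small-data exhaustion: every array of length 1..4 over the values -2..2.
small_cases = [list(p) for n in range(1, 5)
               for p in product(range(-2, 3), repeat=n)]

accepted, _ = dual_verify(max_subarray_reference, max_subarray_brute, small_cases)
buggy_accepted, counterexamples = dual_verify(
    max_subarray_buggy, max_subarray_brute, small_cases)
print(accepted, buggy_accepted, counterexamples[0])
```

The brute-force baseline is only run on tiny inputs here, which mirrors the point in the text that it may be too slow for large cases but can still serve as ground truth on small, exhaustive ones.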
This dual-verification protocol filters ~27% of error-prone items, lifting reference-solution correctness from 86% to 94% before human review. Human experts then grade the survivors on solvability, solution correctness, quality, novelty, and difficulty. After filtering, 61.6% are usable for model training, 76.3% for human training, and 3.2% are ICPC/IOI-level problems. Difficulty typically increases relative to the seed, and difficulty gain correlates with perceived quality.

Understanding the results

Existing problems (7,538 total; 195,988 human submissions): AutoCode reaches 91.1% consistency, 3.7% FPR, and 14.1% FNR, vs 72.9–81.0% consistency for prior generators (CodeContests, CodeContests+, TACO, HardTests).

Recent Codeforces problems (720, unfiltered; includes interactives): AutoCode reaches 98.7% consistency, 1.3% FPR, and 1.2% FNR. Ablations show all three generator strategies and prompt optimization contribute: removing prompt optimization drops consistency to 98.0% and more than doubles FNR to 2.9%.

Key Takeaways

• AutoCode couples a Validator–Generator–Checker (+Interactor) loop with dual verification (reference vs. brute-force) to build contest-grade test suites and new problems.
• On held-out problems, AutoCode’s test suites reach ~99% consistency with official judges, surpassing prior generators like HardTests (<81%).
• For recent Codeforces tasks (including interactives), the full framework reports ~98.7% consistency with ~1.3% FPR and ~1.2% FNR.
• The mutant-based interactor reliably accepts the true solution while rejecting mutated variants, improving evaluation for interactive problems.
• Human experts rate a sizable fraction of AutoCode-generated items as training-usable and a non-trivial share as contest-quality, aligning with the LiveCodeBench Pro program’s aims.

Editorial Comments

AutoCode is a practical fix for current code benchmarks.
It centers problem setting and uses a closed-loop Validator–Generator–Checker (+Interactor) pipeline with dual verification (reference vs. brute-force). This structure reduces false positives/negatives and yields judge-aligned consistency (≈99% on held-out problems; 98.7% on recent Codeforces, including interactives). The approach standardizes constraint legality, adversarial coverage, and protocol-aware judging, which makes downstream RL reward signals cleaner. Its placement under LiveCodeBench Pro fits a hallucination-resistant evaluation program that emphasizes expert-checked rigor.

The post AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters appeared first on MarkTechPost.
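To make the consistency, FPR, and FNR figures quoted in the AutoCode article concrete, here is a small sketch of how such numbers could be computed from paired verdicts. The function and field names are hypothetical, and the denominators are an assumption (FPR over officially rejected submissions, FNR over officially accepted ones); the paper may define them differently.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class JudgeAgreement:
    consistency: float  # share of submissions where both verdicts agree
    fpr: float          # officially wrong submissions the generated suite accepts
    fnr: float          # officially correct submissions the generated suite rejects


def judge_agreement(verdicts: Iterable[Tuple[bool, bool]]) -> JudgeAgreement:
    """Each pair is (official_accept, generated_suite_accept)."""
    pairs = list(verdicts)
    if not pairs:
        raise ValueError("need at least one submission")
    agree = sum(1 for official, generated in pairs if official == generated)
    wrong = [generated for official, generated in pairs if not official]
    right = [generated for official, generated in pairs if official]
    # Assumed denominators: FPR over officially wrong, FNR over officially correct.
    fpr = sum(1 for g in wrong if g) / len(wrong) if wrong else 0.0
    fnr = sum(1 for g in right if not g) / len(right) if right else 0.0
    return JudgeAgreement(agree / len(pairs), fpr, fnr)


# Ten made-up submissions: 8 officially correct (2 wrongly rejected),
# 2 officially wrong (1 wrongly accepted).
sample = ([(True, True)] * 6 + [(True, False)] * 2
          + [(False, False)] + [(False, True)])
report = judge_agreement(sample)
print(report)
```

A weak test suite shows up as a high FPR (wrong programs slip through), while a suite containing malformed inputs shows up as a high FNR, which is the distinction the article draws between the generator and the validator.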

Abstract or die: Why AI enterprises can’t afford rigid vector stacks

Vector databases (DBs), once specialist research instruments, have become widely used infrastructure in just a few years. They power today’s semantic search, recommendation engines, anti-fraud measures and gen AI applications across industries. There is a deluge of options: PostgreSQL with pgvector, MySQL HeatWave, DuckDB VSS, SQLite VSS, Pinecone, Weaviate, Milvus and several others.

This wealth of choice sounds like a boon to companies. But just beneath, a growing problem looms: Stack instability. New vector DBs appear each quarter, with disparate APIs, indexing schemes and performance trade-offs. Today’s ideal choice may look dated or limiting tomorrow.

For business AI teams, volatility translates into lock-in risks and migration hell. Most projects begin life with lightweight engines like DuckDB or SQLite for prototyping, then move to Postgres, MySQL or a cloud-native service in production. Each switch involves rewriting queries and reshaping pipelines, and it slows down deployments.

This re-engineering merry-go-round undermines the very speed and agility that AI adoption is supposed to bring.

Why portability matters now

Companies have a tricky balancing act:
• Experiment quickly with minimal overhead, in hopes of getting early value;
• Scale safely on stable, production-quality infrastructure without months of refactoring;
• Be nimble in a world where new and better backends arrive nearly every month.

Without portability, organizations stagnate. They carry technical debt from recursive code paths, are hesitant to adopt new technology and cannot move prototypes to production at pace. In effect, the database becomes a bottleneck rather than an accelerator.

Portability, or the ability to move underlying infrastructure without rewriting the application, is increasingly a strategic requirement for enterprises rolling out AI at scale.
Abstraction as infrastructure

The solution is not to pick the “perfect” vector database (there isn’t one), but to change how enterprises think about the problem.

In software engineering, the adapter pattern provides a stable interface while hiding underlying complexity. Historically, we’ve seen how this principle reshaped entire industries:
• ODBC/JDBC gave enterprises a single way to query relational databases, reducing the risk of being tied to Oracle, MySQL or SQL Server;
• Apache Arrow standardized columnar data formats, so data systems could interoperate;
• ONNX created a vendor-agnostic format for machine learning (ML) models, bridging TensorFlow, PyTorch and other frameworks;
• Kubernetes abstracted infrastructure details, so workloads could run the same way across clouds;
• any-llm (Mozilla AI) now makes it possible to use one API across many large language model (LLM) vendors, so experimenting with AI is safer.

All these abstractions drove adoption by lowering switching costs. They turned fragmented ecosystems into solid, enterprise-level infrastructure.

Vector databases are at the same tipping point.

The adapter approach to vectors

Instead of having application code directly bound to some specific vector backend, companies can code against an abstraction layer that normalizes operations like inserts, queries and filtering.

This doesn’t eliminate the need to choose a backend; it makes that choice less rigid. Development teams can start with DuckDB or SQLite in the lab, then scale up to Postgres or MySQL for production and ultimately adopt a special-purpose cloud vector DB without having to re-architect the application.

Open source efforts like Vectorwrap are early examples of this approach, presenting a single Python API to Postgres, MySQL, DuckDB and SQLite. They demonstrate the power of abstraction to accelerate prototyping, reduce lock-in risk and support hybrid architectures employing numerous backends.
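As a rough illustration of the adapter approach described above, the sketch below defines a backend-neutral interface and one toy backend. All names here (`VectorStore`, `InMemoryStore`, `nearest_ids`) are hypothetical and invented for this example; this is not the API of Vectorwrap or of any real database client. The point is only that application code depends on the interface, so a different backend could be swapped in behind it.

```python
import math
from typing import Dict, List, Optional, Protocol, Sequence, Tuple

Metadata = Dict[str, str]


class VectorStore(Protocol):
    """Hypothetical backend-neutral interface (not a real library's API)."""

    def upsert(self, item_id: str, vector: Sequence[float],
               metadata: Optional[Metadata] = None) -> None: ...

    def query(self, vector: Sequence[float], top_k: int = 3,
              where: Optional[Metadata] = None) -> List[Tuple[str, float]]: ...


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Prototype backend: brute-force cosine similarity over a dict."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[List[float], Metadata]] = {}

    def upsert(self, item_id: str, vector: Sequence[float],
               metadata: Optional[Metadata] = None) -> None:
        self._rows[item_id] = (list(vector), dict(metadata or {}))

    def query(self, vector: Sequence[float], top_k: int = 3,
              where: Optional[Metadata] = None) -> List[Tuple[str, float]]:
        scored = []
        for item_id, (stored, meta) in self._rows.items():
            # Equality filter on metadata, applied before scoring.
            if where and any(meta.get(k) != v for k, v in where.items()):
                continue
            scored.append((item_id, _cosine(vector, stored)))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


def nearest_ids(store: VectorStore, vector: Sequence[float], top_k: int = 2,
                where: Optional[Metadata] = None) -> List[str]:
    """Application code: depends only on the VectorStore interface."""
    return [item_id for item_id, _ in store.query(vector, top_k=top_k, where=where)]


store = InMemoryStore()
store.upsert("a", [1.0, 0.0], {"lang": "en"})
store.upsert("b", [0.9, 0.1], {"lang": "de"})
store.upsert("c", [0.0, 1.0], {"lang": "en"})
print(nearest_ids(store, [1.0, 0.0]))
print(nearest_ids(store, [1.0, 0.0], where={"lang": "en"}))
```

A production adapter for a SQL or cloud backend would implement the same two methods with that backend's own client; `nearest_ids` and any other application code would not need to change.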
Why businesses should care

For leaders of data infrastructure and decision-makers for AI, abstraction offers three benefits:

Speed from prototype to production
Teams are able to prototype on lightweight local environments and scale without expensive rewrites.

Reduced vendor risk
By decoupling app code from specific databases, organizations can adopt new backends as they emerge without long migration projects.

Hybrid flexibility
Companies can mix transactional, analytical and specialized vector DBs under one architecture, all behind a unified interface.

The result is data layer agility, and that is increasingly the difference between fast and slow companies.

A broader movement in open source

What’s happening in the vector space is one example of a bigger trend: Open-source abstractions as critical infrastructure.
• In data formats: Apache Arrow
• In ML models: ONNX
• In orchestration: Kubernetes
• In AI APIs: any-llm and other such frameworks

These projects succeed, not by adding new capability, but by removing friction. They enable enterprises to move more quickly, hedge bets and evolve along with the ecosystem.

Vector DB adapters continue this legacy, transforming a fast-moving, fragmented space into infrastructure that enterprises can truly depend on.

The future of vector DB portability

The landscape of vector DBs will not converge anytime soon. Instead, the number of options will grow, and every vendor will tune for different use cases: scale, latency, hybrid search, compliance or cloud platform integration.

Abstraction becomes strategy in this case. Companies adopting portable approaches will be capable of:
• Prototyping boldly
• Deploying in a flexible manner
• Scaling rapidly to new tech

It’s possible we’ll eventually see a “JDBC for vectors,” a universal standard that codifies queries and operations across backends. Until then, open-source abstractions are laying the groundwork.

Conclusion

Enterprises adopting AI cannot afford to be slowed by database lock-in.
As the vector ecosystem evolves, the winners will be those who treat abstraction as infrastructure, building against portable interfaces rather than binding themselves to any single backend.

The decades-long lesson of software engineering is simple: Standards and abstractions lead to adoption. For vector DBs, that revolution has already begun.

Mihir Ahuja is an AI/ML engineer and open-source contributor based in San Francisco.

Microsoft AI Proposes BitNet Distillation (BitDistill): A Lightweight Pipeline that Delivers up to 10x Memory Savings and about 2.65x CPU Speedup

Microsoft Research proposes BitNet Distillation, a pipeline that converts existing full-precision LLMs into 1.58-bit BitNet students for specific tasks, while keeping accuracy close to the FP16 teacher and improving CPU efficiency. The method combines SubLN-based architectural refinement, continued pre-training, and dual-signal distillation from logits and multi-head attention relations. Reported results show up to 10× memory savings and about 2.65× faster CPU inference, with task metrics comparable to FP16 across multiple sizes.

What does BitNet Distillation change?

The community already showed that BitNet b1.58 can match full-precision quality when trained from scratch, but converting a pretrained FP16 model directly to 1.58 bits often loses accuracy, and the gap grows as model size increases. BitNet Distillation targets this conversion problem for practical downstream deployment. It is designed to preserve accuracy while delivering CPU-friendly ternary weights with INT8 activations.

Stage 1: Modeling refinement with SubLN

Low-bit models suffer from large activation variance. The research team inserts SubLN normalization inside each Transformer block, specifically before the output projection of the MHSA module and before the output projection of the FFN. This stabilizes hidden-state scales that flow into quantized projections, which improves optimization and convergence once weights are ternary. The training loss curves in the analysis section support this design.

Stage 2: Continued pre-training to adapt weight distributions

Direct task fine-tuning at 1.58 bits gives the student only a small number of task tokens, which is not enough to reshape the FP16 weight distribution for ternary constraints. BitNet Distillation performs a short continued pre-training on a general corpus (the research team uses 10B tokens from the FALCON corpus) to push weights toward BitNet-like distributions.
The visualization shows the weight mass concentrating near transition boundaries, which lets small gradients flip weights among [-1, 0, 1] during downstream task training. This improves learning capacity without a full pretraining run.

Stage 3: Distillation-based fine-tuning with two signals

The student learns from the FP16 teacher using logits distillation and multi-head self-attention relation distillation. The logits path uses temperature-softened KL divergence between teacher and student token distributions. The attention path follows the MiniLM and MiniLMv2 formulations, which transfer relations among Q, K, V without requiring the same number of heads, and let you choose a single layer to distill. Ablations show that combining both signals works best, and that selecting one well-chosen layer preserves flexibility.

Understanding the results

The research team evaluates classification (MNLI, QNLI, SST-2) and summarization on the CNN/DailyMail dataset. It compares three settings: FP16 task fine-tuning, direct 1.58-bit task fine-tuning, and BitNet Distillation. Figure 1 shows that BitNet Distillation matches FP16 accuracy for Qwen3 backbones at 0.6B, 1.7B, and 4B, while the direct 1.58-bit baseline lags more as model size grows. On CPU, tokens per second improve by about 2.65×, and memory drops by about 10× for the student. The research team quantizes activations to INT8 and uses the Straight Through Estimator for gradients through the quantizer.

https://arxiv.org/pdf/2510.13998

The framework is compatible with post-training quantization methods such as GPTQ and AWQ, which provide additional gains on top of the pipeline. Distilling from a stronger teacher helps more, which suggests pairing small 1.58-bit students with larger FP16 teachers when available.

Key Takeaways

• BitNet Distillation is a 3-stage pipeline: SubLN insertion, continued pre-training, and dual distillation from logits and multi-head attention relations.
• The research reports near-FP16 accuracy with about 10× lower memory and about 2.65× faster CPU inference for 1.58-bit students.
• The method transfers attention relations using MiniLM and MiniLMv2 style objectives, which do not require matching head counts.
• Evaluations cover MNLI, QNLI, SST-2, and CNN/DailyMail, and include Qwen3 backbones at 0.6B, 1.7B, and 4B parameters.
• Deployment targets ternary weights with INT8 activations, with optimized CPU and GPU kernels available in the official BitNet repository.

Editorial Comments

BitNet Distillation is a pragmatic step toward 1.58-bit deployment without a full retrain. The three-stage design (SubLN, continued pre-training, and MiniLM-family attention distillation) maps cleanly to known failure modes in extreme quantization. The reported 10× memory reduction and about 2.65× CPU speedup at near-FP16 accuracy indicate solid engineering value for on-premise and edge targets. The reliance on attention relation distillation is well grounded in prior MiniLM work, which helps explain the stability of results. The presence of bitnet.cpp with optimized CPU and GPU kernels lowers integration risk for production teams.

The post Microsoft AI Proposes BitNet Distillation (BitDistill): A Lightweight Pipeline that Delivers up to 10x Memory Savings and about 2.65x CPU Speedup appeared first on MarkTechPost.
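The logits path of Stage 3 in the BitNet Distillation article is described as a temperature-softened KL divergence between teacher and student token distributions. A minimal, dependency-free sketch of that loss for a single token position might look as follows. The function names, the default temperature, and the KL direction (teacher relative to student, the common convention in knowledge distillation) are assumptions for illustration, not details confirmed by the article; any loss scaling the paper may apply is omitted.

```python
import math
from typing import List, Sequence


def softened_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logits_distillation_loss(teacher_logits: Sequence[float],
                             student_logits: Sequence[float],
                             temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions.

    Assumes moderate logit ranges so no probability underflows to zero."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share a vocabulary size")
    p = softened_softmax(teacher_logits, temperature)
    q = softened_softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.0]
student = [0.0, 1.0, 4.0]
sharp = logits_distillation_loss(teacher, student, temperature=1.0)
soft = logits_distillation_loss(teacher, student, temperature=4.0)
same = logits_distillation_loss(teacher, teacher, temperature=2.0)
print(sharp, soft, same)
```

Raising the temperature flattens both distributions, so the same pair of logit vectors yields a smaller divergence; in a real training loop this term would be averaged over token positions and combined with the attention-relation loss.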

Weak-for-Strong (W4S): A Novel Reinforcement Learning Algorithm that Trains a weak Meta Agent to Design Agentic Workflows with Stronger LLMs

Researchers from Stanford, EPFL, and UNC introduce Weak-for-Strong Harnessing (W4S), a new reinforcement learning (RL) framework that trains a small meta-agent to design and refine code workflows that call a stronger executor model. The meta-agent does not fine-tune the strong model; it learns to orchestrate it. W4S formalizes workflow design as a multi-turn Markov decision process (MDP) and trains the meta-agent with a method called Reinforcement Learning for Agentic Workflow Optimization (RLAO). The research team reports consistent gains across 11 benchmarks with a 7B meta-agent trained for about 1 GPU hour.

https://arxiv.org/pdf/2504.04785

W4S operates in turns. The state contains task instructions, the current workflow program, and feedback from prior executions. An action has 2 components: an analysis of what to change, and new Python workflow code that implements those changes. The environment executes the code on validation items, returns accuracy and failure cases, and provides a new state for the next turn. The meta-agent can run a quick self-check on one sample; if errors arise it attempts up to 3 repairs, and if errors persist the action is skipped. This loop gives a learning signal without touching the weights of the strong executor.

W4S runs as an iterative loop

• Workflow generation: The weak meta-agent writes a new workflow that leverages the strong model, expressed as executable Python code.
• Execution and feedback: The strong model executes the workflow on validation samples, then returns accuracy and error cases as feedback.
• Refinement: The meta-agent uses the feedback to update the analysis and the workflow, then repeats the loop.

Reinforcement Learning for Agentic Workflow Optimization (RLAO)

RLAO is an offline reinforcement learning procedure over multi-turn trajectories. At each iteration, the system samples multiple candidate actions, keeps the best-performing action to advance the state, and stores the others for training.
The policy is optimized with reward-weighted regression. The reward is sparse and compares current validation accuracy to history: a higher weight is given when the new result beats the previous best, and a smaller weight is given when it beats the last iteration. This objective favors steady progress while controlling exploration cost.

Understanding the Results

On HumanEval with GPT-4o-mini as executor, W4S achieves Pass@1 of 95.4, with about 33 minutes of workflow optimization, zero meta-agent API cost, an optimization execution cost of about 0.4 dollars, and about 2.7 minutes to execute the test set at about 0.5 dollars, for a total of about 0.9 dollars. Under the same executor, AFlow and ADAS trail this number. The reported average gains against the strongest automated baseline range from 2.9% to 24.6% across 11 benchmarks.

On math transfer, the meta-agent is trained on GSM Plus and MGSM with GPT-3.5-Turbo as executor, then evaluated on GSM8K, GSM Hard, and SVAMP. The paper reports 86.5 on GSM8K and 61.8 on GSM Hard, both above automated baselines. This indicates that the learned orchestration transfers to related tasks without re-training the executor.

Across seen tasks with GPT-4o-mini as executor, W4S surpasses training-free automated methods that do not learn a planner. The study also runs ablations where the meta-agent is trained by supervised fine-tuning rather than RLAO; the RLAO agent yields better accuracy under the same compute budget. The research team includes a GRPO baseline on a 7B weak model for GSM Hard; W4S outperforms it under limited compute.

Iteration budgets matter. The research team sets W4S to about 10 optimization turns on main tables, while AFlow runs about 20 turns and ADAS runs about 30 turns. Despite fewer turns, W4S achieves higher accuracy. This suggests that learned planning over code, combined with validation feedback, makes the search more sample efficient.
Key Takeaways

• W4S trains a 7B weak meta-agent with RLAO to write Python workflows that harness stronger executors, modeled as a multi-turn MDP.
• On HumanEval with GPT-4o-mini as executor, W4S reaches Pass@1 of 95.4, with about 33 minutes of optimization and about 0.9 dollars total cost, beating automated baselines under the same executor.
• Across 11 benchmarks, W4S improves over the strongest baseline by 2.9% to 24.6%, while avoiding fine-tuning of the strong model.
• The method runs an iterative loop: it generates a workflow, executes it on validation data, then refines it using feedback.
• ADAS and AFlow also program or search over code workflows; W4S differs by training a planner with offline reinforcement learning.

Editorial Comments

W4S targets orchestration, not model weights, and trains a 7B meta-agent to program workflows that call stronger executors. W4S formalizes workflow design as a multi-turn MDP and optimizes the planner with RLAO using offline trajectories and reward-weighted regression. Reported results show Pass@1 of 95.4 on HumanEval with GPT-4o-mini, average gains of 2.9% to 24.6% across 11 benchmarks, and about 1 GPU hour of training for the meta-agent. The framing compares cleanly with ADAS and AFlow, which search agent designs or code graphs, while W4S fixes the executor and learns the planner.

The post Weak-for-Strong (W4S): A Novel Reinforcement Learning Algorithm that Trains a weak Meta Agent to Design Agentic Workflows with Stronger LLMs appeared first on MarkTechPost.
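The RLAO reward described in the W4S article (a larger weight when a candidate beats the previous best accuracy, a smaller one when it only beats the last turn) can be sketched in a few lines. The bonus values 1.0 and 0.5, the exponential weighting, the temperature, and all function names below are illustrative assumptions; the article does not give these specifics, and exponential weights are just one common form of reward-weighted regression.

```python
import math
from typing import List, Sequence


def rlao_reward(accuracy: float, history: Sequence[float],
                best_bonus: float = 1.0, step_bonus: float = 0.5) -> float:
    """Sparse reward comparing a new validation accuracy with earlier turns.

    `history` is assumed to start with the initial workflow's accuracy.
    The bonus values are illustrative, not taken from the article."""
    if not history:
        raise ValueError("history needs at least the initial workflow's accuracy")
    if accuracy > max(history):
        return best_bonus   # beats the previous best
    if accuracy > history[-1]:
        return step_bonus   # only beats the last iteration
    return 0.0


def rwr_weights(rewards: Sequence[float], temperature: float = 0.5) -> List[float]:
    """Normalized exp(reward / temperature) weights (one common RWR form)."""
    peak = max(rewards)
    raw = [math.exp((r - peak) / temperature) for r in rewards]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_nll(log_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Policy loss: reward-weighted negative log-likelihood of sampled actions."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# One turn: three sampled workflows evaluated after accuracies 0.60, 0.72, 0.68.
history = [0.60, 0.72, 0.68]
candidate_accuracies = [0.75, 0.70, 0.65]
rewards = [rlao_reward(acc, history) for acc in candidate_accuracies]
weights = rwr_weights(rewards)
best_index = max(range(len(candidate_accuracies)),
                 key=candidate_accuracies.__getitem__)
# Hypothetical log-probabilities the policy assigned to each sampled action.
loss = weighted_nll([-1.2, -0.8, -2.0], weights)
print(rewards, best_index, weights, loss)
```

In this sketch the best candidate would advance the state to the next turn, while the weights decide how strongly each sampled action's log-likelihood is pushed up during offline training.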

Codev lets enterprises avoid vibe coding hangovers with a team of agents that generate and document code

For many software developers using generative AI, vibe coding is a double-edged sword.

The process delivers rapid prototypes but often leaves a trail of brittle, undocumented code that creates significant technical debt.

A new open-source platform, Codev, addresses this by proposing a fundamental shift: treating the natural language conversation with an AI as part of the actual source code.

Codev is based on SP(IDE)R, a framework designed to turn vibe-coding conversations into structured, versioned, and auditable assets that become part of the code repository.

What is Codev?

At its core, Codev is a methodology that treats natural language context as an integral part of the development lifecycle rather than as a disposable artifact, as is the case with vanilla vibe coding. According to co-founder Waleed Kadous, the goal is to invert the typical engineering workflow.

“A key principle of Codev is that documents like the specification are the actual code of the system,” he told VentureBeat. “It’s almost like natural language is compiled down into Typescript by our agents.”

This approach avoids the common pitfall where documentation is created after the fact, if at all.

Its flagship protocol, SP(IDE)R, provides a lightweight but formal structure for building software. The process begins with Specify, where a human and multiple AI agents collaborate to turn a high-level request into concrete acceptance criteria. Next, in the Plan stage, an AI proposes a phased implementation, which is again reviewed. For each phase, the AI enters an IDE loop: it Implements the code, Defends it against bugs and regression with comprehensive tests, and Evaluates the result against the specification. The final step is Review, where the team documents lessons learned to update and improve the SP(IDE)R protocol itself for future projects.

The framework’s key differentiator is its use of multiple agents and explicit human review at different stages.
Kadous notes that each agent brings unique strengths to the review process.

“Gemini is extremely good at catching security issues,” he said, citing a critical cross-site scripting (XSS) flaw and another bug that “would have shared an OpenAI API key with the client, which could cost thousands of dollars.”

Meanwhile, “GPT-5 is very good at understanding how to simplify a design.” This structured review, with a human providing final approval at each stage, prevents the kind of runaway automation that leads to flawed code.

The platform’s AI-native philosophy extends to its installation. There is no complex installer; instead, a user instructs their AI agent to apply the Codev GitHub repository to set up the project. The developers “dogfooded” their framework, using Codev to build Codev.

“The key point here is that natural language is executable now, with the agent being the interpreter,” Kadous said. “This is great because it means it’s not a ‘blind’ integration of Codev, the agent gets to choose the best way to integrate it and can intelligently make decisions.”

Codev case study

To test the framework’s effectiveness, its creators ran a direct comparison between vanilla vibe-coding and Codev. They gave Claude Opus 4.1 a request to build a modern web-based todo manager. The first attempt used a conversational, vibe-coding approach. The result was a plausible-looking demo. However, an automated analysis conducted by three independent AI agents found that it had implemented 0% of the required functionality, contained no tests, and lacked a database or API.

The second attempt used the same AI model and prompt but applied the SP(IDE)R protocol. This time, the AI produced a production-ready application with 32 source files, 100% of the specified functionality, five test suites, a SQLite database, and a complete RESTful API.

Throughout this process, the human developers reported they never directly edited a single line of source code.
While this was a single experiment, Kadous estimates the impact is substantial.

“Subjectively, it feels like I’m about three times as productive with Codev as without,” he says. The quality also speaks for itself. “I used LLMs as a judge, and one of them described the output like what a well-oiled engineering team would produce. That was exactly what I was aiming for.”

While the process is powerful, it redefines the developer’s role from a hands-on coder to a system architect and reviewer. According to Kadous, the initial spec and plan stages can each take between 45 minutes and two hours of focused collaboration.

This is in contrast to the impression given by many vibe-coding platforms, where a single prompt and a few minutes of processing gives you a fully functional and scalable application.

“All of the value I add is in the background knowledge I apply to the specs and plans,” he explains. He emphasizes that the framework is designed to augment, not replace, experienced talent. “The people who will do the best… are senior engineers and above because they know the pitfalls… It just takes the senior engineer you already have and makes them much more productive.”

A future of human and AI collaboration

Frameworks like Codev signal a shift where the primary creative act of software development moves from writing code to crafting precise, machine-readable specifications and plans. For enterprise teams, this means AI-generated code can become auditable, maintainable, and reliable. By capturing the entire development conversation in version control and enforcing it with CI, the process turns ephemeral chats into durable engineering assets.

Codev proposes a future where the AI acts not as a chaotic assistant, but as a disciplined collaborator in a structured, human-led workflow. However, Kadous acknowledges this shift creates new challenges for the workforce.

“Senior engineers that reject AI outright will be outpaced by senior engineers who embrace it,” he predicts.
He also expresses concern for junior developers who may not get the chance “to build their architectural chops,” a skill that becomes even more critical when guiding AI. This highlights a central challenge for the industry: ensuring that as AI elevates top performers, it also creates pathways to develop the next generation of talent.


A Coding Implementation to Build a Unified Tool Orchestration Framework from Documentation to Automated Pipelines

In this tutorial, we build a compact, efficient framework that demonstrates how to convert tool documentation into standardized, callable interfaces, register those tools in a central system, and execute them as part of an automated pipeline. As we move through each stage, we create a simple converter, design mock bioinformatics tools, organize them into a registry, and benchmark both individual and multi-step pipeline executions. Through this process, we explore how structured tool interfaces and automation can streamline and modularize data workflows.

    import re, json, time, random
    from dataclasses import dataclass
    from typing import Callable, Dict, Any, List, Tuple

    @dataclass
    class ToolSpec:
        name: str
        description: str
        inputs: Dict[str, str]
        outputs: Dict[str, str]

    def parse_doc_to_spec(name: str, doc: str) -> ToolSpec:
        # The first non-empty line of the doc is the description; fall back to the tool name.
        desc = doc.strip().splitlines()[0].strip() if doc.strip() else name
        # Keep only lines that look like argument declarations.
        arg_block = "\n".join([l for l in doc.splitlines() if "--" in l or ":" in l])
        inputs = {}
        for line in arg_block.splitlines():
            m = re.findall(r"(--?\w[\w-]*|\b\w+\b)\s*[:=]?\s*(\w+)?", line)
            for key, typ in m:
                k = key.lstrip("-")
                if k and k not in inputs and k not in ["Returns", "Output", "Outputs"]:
                    inputs[k] = (typ or "str")
        if not inputs:
            inputs = {"in": "str"}
        return ToolSpec(name=name, description=desc, inputs=inputs, outputs={"out": "json"})

We start by defining the structure for our tools and writing a simple parser that converts plain documentation into a standardized tool specification. This helps us automatically extract parameters and outputs from textual descriptions.
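To make the extraction step concrete, here is a minimal sketch of what pulling parameters out of a doc string can look like. It is a simplified stand-in, not the tutorial's parser: the names `FLAG_RE`, `extract_inputs`, and `EXAMPLE_DOC` are hypothetical, and it assumes every parameter is written as `--name: type`.

```python
import re
from typing import Dict

# Hypothetical, simplified pattern: only recognizes flags written as "--name: type"
# (or "--name=type"). The tutorial's own regex is more permissive.
FLAG_RE = re.compile(r"--(\w[\w-]*)\s*[:=]\s*(\w+)")

def extract_inputs(doc: str) -> Dict[str, str]:
    """Return {parameter_name: type_name} for every --flag found in the doc text."""
    return {name: typ for name, typ in FLAG_RE.findall(doc)}

# Made-up documentation for an imaginary tool, used only for illustration.
EXAMPLE_DOC = """Toy sequence trimmer
--seq_fasta: str
--min_len: int
Outputs: json"""

print(extract_inputs(EXAMPLE_DOC))
```

Because the pattern requires a leading `--`, the description line and the `Outputs:` line are ignored without needing an explicit exclusion list.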
    def tool_fastqc(seq_fasta: str, min_len: int = 30) -> Dict[str, Any]:
        # Split on FASTA header lines; the first piece (before the first header) is dropped.
        seqs = [s for s in re.split(r">[^\n]*\n", seq_fasta)[1:]]
        lens = [len(re.sub(r"\s+", "", s)) for s in seqs]
        # Fraction of sequences at least min_len long (a stand-in for a quality score).
        q30 = sum(l >= min_len for l in lens) / max(1, len(lens))
        gc = sum(c in "GCgc" for s in seqs for c in s) / max(1, sum(lens))
        return {"n_seqs": len(lens), "len_mean": (sum(lens) / max(1, len(lens))), "pct_q30": q30, "gc": gc}

    def tool_bowtie2_like(ref: str, reads: str, mode: str = "end-to-end") -> Dict[str, Any]:
        def revcomp(s):
            t = str.maketrans("ACGTacgt", "TGCAtgca")
            return s.translate(t)[::-1]
        reads_list = [r for r in re.split(r">[^\n]*\n", reads)[1:]]
        ref_seq = "".join(ref.splitlines()[1:])
        hits = []
        for i, r in enumerate(reads_list):
            rseq = "".join(r.split())
            # A read counts as aligned if it, or its reverse complement, is a substring of the reference.
            aligned = (rseq in ref_seq) or (revcomp(rseq) in ref_seq)
            # "pos" is the forward-strand position only; it is -1 for reverse-complement hits.
            hits.append({"read_id": i, "aligned": bool(aligned), "pos": ref_seq.find(rseq)})
        return {"n": len(hits), "aligned": sum(h["aligned"] for h in hits), "mode": mode, "hits": hits}

    def tool_bcftools_like(ref: str, alt: str, win: int = 15) -> Dict[str, Any]:
        ref_seq = "".join(ref.splitlines()[1:])
        alt_seq = "".join(alt.splitlines()[1:])
        n = min(len(ref_seq), len(alt_seq))
        variants = []
        for i in range(n):
            if ref_seq[i] != alt_seq[i]:
                variants.append({"pos": i, "ref": ref_seq[i], "alt": alt_seq[i]})
        # Report at most `win` variants.
        return {"n_sites": n, "n_var": len(variants), "variants": variants[:win]}

    FASTQC_DOC = """FastQC-like quality control for FASTA
    --seq_fasta: str
    --min_len: int
    Outputs: json"""

    BOWTIE_DOC = """Bowtie2-like aligner
    --ref: str
    --reads: str
    --mode: str
    Outputs: json"""

    BCF_DOC = """bcftools-like variant caller
    --ref: str
    --alt: str
    --win: int
    Outputs: json"""

We create mock implementations of bioinformatics tools such as FastQC, Bowtie2, and Bcftools. We define their expected inputs and outputs so they can be executed consistently through a unified interface.
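The mock aligner treats a read as aligned when either the read or its reverse complement occurs in the reference. The sketch below isolates that check so it can be tried on its own. The helper names `revcomp` and `locate` and the toy sequences are illustrative assumptions; unlike the tutorial code, this variant also reports the strand and the position of a reverse-complement hit.

```python
from typing import Dict, Optional, Union

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string (A<->T, C<->G, then reversed)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]

def locate(read: str, ref: str) -> Optional[Dict[str, Union[str, int]]]:
    """Exact-match lookup of a read on either strand; None if it is not found."""
    pos = ref.find(read)
    if pos != -1:
        return {"strand": "+", "pos": pos}
    pos = ref.find(revcomp(read))
    if pos != -1:
        # Position is where the reverse complement of the read sits on the forward strand.
        return {"strand": "-", "pos": pos}
    return None

REF = "AAACGTTTGGG"  # toy reference, made up for this example
for read in ("CGTTT", "CCCAAA", "GGGGGG"):
    print(read, locate(read, REF))
```

Exact substring matching is a deliberate simplification: a real aligner tolerates mismatches and gaps, which this check does not.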
    @dataclass
    class MCPTool:
        spec: ToolSpec
        fn: Callable[..., Dict[str, Any]]

    class MCPServer:
        def __init__(self):
            self.tools: Dict[str, MCPTool] = {}

        def register(self, name: str, doc: str, fn: Callable[..., Dict[str, Any]]):
            spec = parse_doc_to_spec(name, doc)
            self.tools[name] = MCPTool(spec, fn)

        def list_tools(self) -> List[Dict[str, Any]]:
            return [dict(name=t.spec.name, description=t.spec.description,
                         inputs=t.spec.inputs, outputs=t.spec.outputs)
                    for t in self.tools.values()]

        def call_tool(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
            if name not in self.tools:
                raise KeyError(f"tool {name} not found")
            spec = self.tools[name].spec
            # Forward only declared inputs that were actually supplied, so that
            # omitted arguments fall back to the function's defaults instead of None.
            kwargs = {k: args[k] for k in spec.inputs if k in args}
            return self.tools[name].fn(**kwargs)

    server = MCPServer()
    server.register("fastqc", FASTQC_DOC, tool_fastqc)
    server.register("bowtie2", BOWTIE_DOC, tool_bowtie2_like)
    server.register("bcftools", BCF_DOC, tool_bcftools_like)

    Task = Tuple[str, Dict[str, Any]]

    PIPELINES = {
        "rnaseq_qc_align_call": [
            ("fastqc", {"seq_fasta": "{reads}", "min_len": 30}),
            ("bowtie2", {"ref": "{ref}", "reads": "{reads}", "mode": "end-to-end"}),
            ("bcftools", {"ref": "{ref}", "alt": "{alt}", "win": 15}),
        ]
    }

    def compile_pipeline(nl_request: str) -> List[Task]:
        # Only one pipeline is defined, so both branches resolve to the same key.
        key = "rnaseq_qc_align_call" if re.search(r"rna|qc|align|variant|call", nl_request, re.I) else "rnaseq_qc_align_call"
        return PIPELINES[key]

We build a lightweight server that registers tools, lists their specifications, and allows us to call them programmatically. We also define a basic pipeline structure that outlines the sequence in which tools should run.
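One design question a registry like this raises is what should happen when a caller omits an argument or misspells one. As a sketch of one possible answer (not the tutorial's approach), the hypothetical `TinyRegistry` below validates arguments against the function's own signature using the standard library's `inspect.signature(...).bind`, which raises `TypeError` on missing or unexpected names. The `count_bases` tool is invented for the example.

```python
import inspect
from typing import Any, Callable, Dict

class TinyRegistry:
    """Hypothetical minimal registry: name -> callable, with signature checking."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, args: Dict[str, Any]) -> Any:
        if name not in self._tools:
            raise KeyError(f"tool {name} not found")
        fn = self._tools[name]
        # bind() raises TypeError for missing or unexpected arguments.
        bound = inspect.signature(fn).bind(**args)
        # Fill in defaults for anything the caller left out.
        bound.apply_defaults()
        return fn(*bound.args, **bound.kwargs)

def count_bases(seq: str, upper: bool = True) -> Dict[str, Any]:
    """Invented example tool: report length and optionally upper-case the sequence."""
    return {"length": len(seq), "seq": seq.upper() if upper else seq}

registry = TinyRegistry()
registry.register("count_bases", count_bases)
result = registry.call("count_bases", {"seq": "acgt"})
print(result)
```

The trade-off is that the function signature, rather than the parsed documentation, becomes the source of truth for which arguments are allowed.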
    def mk_fasta(header: str, seq: str) -> str:
        return f">{header}\n{seq}\n"

    random.seed(0)
    REF_SEQ = "".join(random.choice("ACGT") for _ in range(300))
    REF = mk_fasta("ref", REF_SEQ)
    READS = mk_fasta("r1", REF_SEQ[50:130]) + mk_fasta("r2", "ACGT" * 15) + mk_fasta("r3", REF_SEQ[180:240])
    # ALT replaces the base at position 150 with "T".
    ALT = mk_fasta("alt", REF_SEQ[:150] + "T" + REF_SEQ[151:])

    def run_pipeline(nl: str, ctx: Dict[str, str]) -> Dict[str, Any]:
        plan = compile_pipeline(nl)
        results = []
        t0 = time.time()
        for name, arg_tpl in plan:
            # Fill "{ref}", "{reads}", and "{alt}" placeholders from the context.
            args = {k: (v.format(**ctx) if isinstance(v, str) else v) for k, v in arg_tpl.items()}
            out = server.call_tool(name, args)
            results.append({"tool": name, "args": args, "output": out})
        return {"request": nl, "elapsed_s": round(time.time() - t0, 4), "results": results}

We prepare small synthetic FASTA data for testing and implement a function that runs the entire pipeline. Here, we dynamically pass tool parameters and execute each step in the sequence.

    def bench_individual() -> List[Dict[str, Any]]:
        cases = [
            ("fastqc", {"seq_fasta": READS, "min_len": 25}),
            ("bowtie2", {"ref": REF, "reads": READS, "mode": "end-to-end"}),
            ("bcftools", {"ref": REF, "alt": ALT, "win": 10}),
        ]
        rows = []
        for name, args in cases:
            t0 = time.time(); ok = True; err = None; out = None
            try:
                out = server.call_tool(name, args)
            except Exception as e:
                ok = False; err = str(e)
            rows.append({"tool": name, "ok": ok, "ms": int((time.time() - t0) * 1000),
                         "out_keys": list(out.keys()) if ok else [], "err": err})
        return rows

    def bench_pipeline() -> Dict[str, Any]:
        t0 = time.time()
        res = run_pipeline("Run RNA-seq QC, align, and variant call.",
                           {"ref": REF, "reads": READS, "alt": ALT})
        ok = all(step["output"] for step in res["results"])
        return {"pipeline": "rnaseq_qc_align_call", "ok": ok,
                "ms": int((time.time() - t0) * 1000), "n_steps": len(res["results"])}

    print("== TOOLS =="); print(json.dumps(server.list_tools(), indent=2))
    print("\n== INDIVIDUAL BENCH =="); print(json.dumps(bench_individual(), indent=2))
    print("\n== PIPELINE BENCH =="); print(json.dumps(bench_pipeline(), indent=2))
    print("\n== PIPELINE RUN ==")
    print(json.dumps(run_pipeline("Run RNA-seq QC, align, and variant call.",
                                  {"ref": REF, "reads": READS, "alt": ALT}), indent=2))

We benchmark both individual tools and the full pipeline, capturing their outputs and performance metrics. Finally, we print the results to verify that each stage of the workflow runs successfully and integrates smoothly.

In conclusion, we develop a clear understanding of how lightweight tool conversion, registration, and orchestration can work together in a single environment. We observe how a unified interface allows us to connect multiple tools seamlessly, run them in sequence, and measure their performance. This hands-on exercise helps us appreciate how simple design principles, standardization, automation, and modularity can enhance the reproducibility and efficiency of computational workflows in any domain.

The post A Coding Implementation to Build a Unified Tool Orchestration Framework from Documentation to Automated Pipelines appeared first on MarkTechPost.
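As a follow-up to the benchmarking step in this tutorial: the benchmark functions time each call once with `time.time()` and truncate the result to whole milliseconds, so very fast mock tools can report 0 ms. A minimal sketch of a steadier measurement is shown below, using `time.perf_counter` and the median over several repeats. The helper `bench_call` and the stand-in tool `gc_fraction` are hypothetical names introduced for illustration.

```python
import statistics
import time
from typing import Any, Callable, Dict

def bench_call(fn: Callable[..., Any], repeats: int = 5, **kwargs: Any) -> Dict[str, Any]:
    """Run fn(**kwargs) `repeats` times; report median and max latency in milliseconds."""
    samples_ms = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(**kwargs)
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    return {
        "repeats": repeats,
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }

def gc_fraction(seq: str) -> float:
    """Invented stand-in tool: fraction of G/C bases in a sequence."""
    return sum(c in "GCgc" for c in seq) / max(1, len(seq))

report = bench_call(gc_fraction, repeats=7, seq="ACGT" * 1000)
print(report)
```

The median is less sensitive than a single sample to one-off pauses such as garbage collection or scheduler jitter.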


Developers can now add live Google Maps data to Gemini-powered AI app outputs

Google is adding a new feature for third-party developers building atop its Gemini AI models that rivals like OpenAI’s ChatGPT, Anthropic’s Claude, and the growing array of Chinese open source options are unlikely to get anytime soon: grounding with Google Maps.

This addition allows developers to connect Google’s Gemini AI models’ reasoning capabilities with live geospatial data from Google Maps, enabling applications to deliver detailed, location-relevant responses to user queries, such as business hours, reviews, or the atmosphere of a specific venue.

By tapping into data from over 250 million places, developers can now build more intelligent and responsive location-aware experiences. This is particularly useful for applications where proximity, real-time availability, or location-specific personalization matter, such as local search, delivery services, real estate, and travel planning.

When the user’s location is known, developers can pass latitude and longitude into the request to enhance the response quality. By tightly integrating real-time and historical Maps data into the Gemini API, Google enables applications to generate grounded, location-specific responses with factual accuracy and contextual depth that are uniquely possible through its mapping infrastructure.

Merging AI and Geospatial Intelligence

The new feature is accessible in Google AI Studio, where developers can try a live demo powered by the Gemini Live API. Models that support the grounding with Google Maps include:

  • Gemini 2.5 Pro
  • Gemini 2.5 Flash
  • Gemini 2.5 Flash-Lite
  • Gemini 2.0 Flash

In one demonstration, a user asked for Italian restaurant recommendations in Chicago. The assistant, leveraging Maps data, retrieved top-rated options and clarified a misspelled restaurant name before locating the correct venue with accurate business details.

Developers can also retrieve a context token to embed a Google Maps widget in their app’s user interface. This interactive component displays photos, reviews, and other familiar content typically found in Google Maps.

Integration is handled via the generateContent method in the Gemini API, where developers include googleMaps as a tool. They can also enable a Maps widget by setting a parameter in the request. The widget, rendered using a returned context token, can provide a visual layer alongside the AI-generated text.

Use Cases Across Industries

The Maps grounding tool is designed to support a wide range of practical use cases:

  • Itinerary generation: Travel apps can create detailed daily plans with routing, timing, and venue information.
  • Personalized local recommendations: Real estate platforms can highlight listings near kid-friendly amenities like schools and parks.
  • Detailed location queries: Applications can provide specific information, such as whether a cafe offers outdoor seating, using community reviews and Maps metadata.

Developers are encouraged to only enable the tool when geographic context is relevant, to optimize both performance and cost. According to the developer documentation, pricing starts at $25 per 1,000 grounded prompts, a steep sum for those trafficking in numerous queries.

Combining Search and Maps for Enhanced Context

Developers can use Grounding with Google Maps alongside Grounding with Google Search in the same request. While the Maps tool contributes factual data (like addresses, hours, and ratings), the Search tool adds broader context from web content, such as news or event listings.

For example, when asked about live music on Beale Street, the combined tools provide venue details from Maps and event times from Search. According to Google, internal testing shows that using both tools together leads to significantly improved response quality.

Unfortunately, it doesn’t appear that the Google Maps grounding includes live vehicular traffic data, at least not yet.

Customization and Developer Flexibility

The experience is built for customization. Developers can tweak system prompts, choose from different Gemini models, and configure voice settings to tailor interactions. The demo app in Google AI Studio is also remixable, enabling developers to test ideas, add features, and iterate on designs within a flexible development environment.

The API returns structured metadata (including source links, place IDs, and citation spans) that developers can use to build inline citations or verify the AI-generated outputs. This supports transparency and enhances trust in user-facing applications. Google also requires that Maps-based sources be attributed clearly and linked back to the source using their URI.

Implementation Considerations for AI Builders

For technical teams integrating this capability, Google recommends:

  • Passing user location context when known, for better results.
  • Displaying Google Maps source links directly beneath the relevant content.
  • Only enabling the tool when the query clearly involves geographic context.
  • Monitoring latency and disabling grounding when performance is critical.

Grounding with Google Maps is currently available globally, though prohibited in several territories (including China, Iran, North Korea, and Cuba), and not permitted for emergency response use cases.

Availability and Access

Grounding with Google Maps is now generally available through the Gemini API. With this release, Google continues to expand the capabilities of the Gemini API, empowering developers to build AI-driven applications that understand and respond to the world around them.
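To illustrate the integration pattern described in this article (including googleMaps as a tool in a generateContent request, passing latitude and longitude when known, and enabling grounding only when it is relevant), here is a hedged sketch that builds a request body as plain JSON and estimates cost from the quoted $25 per 1,000 grounded prompts. It makes no network call. The function names are hypothetical, and the `toolConfig.retrievalConfig.latLng` field path is an assumption about the request shape that should be checked against the current Gemini API reference.

```python
import json
from typing import Any, Dict, Optional

# Price quoted in the article; actual pricing may differ.
PRICE_PER_1000_GROUNDED_PROMPTS_USD = 25.0

def build_request(prompt: str, use_maps: bool,
                  lat: Optional[float] = None,
                  lng: Optional[float] = None) -> Dict[str, Any]:
    """Build a generateContent-style body; add the Maps tool only when asked to."""
    body: Dict[str, Any] = {"contents": [{"parts": [{"text": prompt}]}]}
    if use_maps:
        body["tools"] = [{"googleMaps": {}}]
        if lat is not None and lng is not None:
            # Assumed location of the user-coordinates field; verify before use.
            body["toolConfig"] = {
                "retrievalConfig": {"latLng": {"latitude": lat, "longitude": lng}}
            }
    return body

def estimate_cost_usd(grounded_prompts: int) -> float:
    """Estimated spend for a number of grounded prompts at the quoted rate."""
    return grounded_prompts * PRICE_PER_1000_GROUNDED_PROMPTS_USD / 1000

request = build_request("Italian restaurants near me", use_maps=True,
                        lat=41.8781, lng=-87.6298)
print(json.dumps(request, indent=2))
print(estimate_cost_usd(40_000))
```

Keeping the `use_maps` decision in application code is one way to follow the recommendation to enable grounding only for queries with clear geographic context.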

