YouZum


DeepReinforce Team Introduces CUDA-L1: An Automated Reinforcement Learning (RL) Framework for CUDA Optimization Unlocking 3x More Power from GPUs

AI has just unlocked triple the power from GPUs, without human intervention. The DeepReinforce team introduced a new framework called CUDA-L1 that delivers an average 3.12× speedup and up to 120× peak acceleration across 250 real-world GPU tasks. This is not mere academic promise: every result can be reproduced with open-source code on widely used NVIDIA hardware.

The Breakthrough: Contrastive Reinforcement Learning (Contrastive-RL)

At the heart of CUDA-L1 lies a major shift in AI learning strategy: Contrastive Reinforcement Learning (Contrastive-RL). Unlike traditional RL, where an AI simply generates solutions, receives numerical rewards, and updates its model parameters blindly, Contrastive-RL feeds the performance scores and prior code variants directly back into the next generation prompt.

In each optimization round, the model is shown previous code variants together with their measured performance scores. It must then write a "Performance Analysis" in natural language, reflecting on which variant was fastest, why, and what strategies led to that speedup. Each step forces structured reasoning, guiding the model to build not just a new code variant but a more general, data-driven mental model of what makes CUDA code fast. (A minimal illustrative sketch of this loop appears after the benchmark numbers below.) As a result, the AI discovers not only well-known optimizations but also non-obvious tricks that even human experts often overlook, including mathematical shortcuts that bypass computation entirely and memory strategies tuned to specific hardware quirks.

The training pipeline has three stages:

- Stage 1: The LLM is fine-tuned on validated CUDA code, collected by sampling from leading foundation models (DeepSeek-R1, GPT-4o, Claude, etc.) and retaining only correct, executable outputs.
- Stage 2: The model enters a self-training loop: it generates large amounts of CUDA code, keeps only the functional variants, and learns from those. The result is rapid improvement in code correctness and coverage without any manual labeling.
- Stage 3: In the Contrastive-RL phase, the system samples multiple code variants, shows each with its measured speed, and challenges the AI to analyze, compare, and out-reason previous generations before producing the next round of optimizations. This reflection-and-improvement loop is the flywheel that delivers the largest speedups.

How Good Is CUDA-L1? Hard Data

Speedups Across the Board

KernelBench, the gold-standard benchmark for GPU code generation (250 real-world PyTorch workloads), was used to measure CUDA-L1:

| Model/Stage | Avg. Speedup | Max Speedup | Median | Success Rate |
|---|---|---|---|---|
| Vanilla Llama-3.1-405B | 0.23× | 3.14× | 0× | 68/250 |
| DeepSeek-R1 (RL-tuned) | 1.41× | 44.2× | 1.17× | 248/250 |
| CUDA-L1 (All Stages) | 3.12× | 120× | 1.42× | 249/250 |

- 3.12× average speedup: the AI found improvements in virtually every task.
- 120× maximum speedup: some computational bottlenecks and inefficient code (such as diagonal matrix multiplications) were replaced with fundamentally superior solutions.
- Works across hardware: code optimized on NVIDIA A100 GPUs retained substantial gains when ported to other architectures (L40, H100, RTX 3090, H20), with mean speedups from 2.37× to 3.12× and median gains consistently above 1.1× across all devices.
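The paper's exact prompting format is not reproduced here; the following is a minimal, hypothetical sketch of the Contrastive-RL round described above. The function names (`generate`, `benchmark`) and the prompt template are illustrative placeholders, not the CUDA-L1 codebase:

```python
# Hypothetical sketch of one Contrastive-RL optimization round (not the official CUDA-L1 code).
# Assumes an LLM client with a generate(prompt) method and a benchmark(code) function that
# returns the measured speedup of a candidate CUDA kernel over a reference implementation.

def contrastive_rl_round(llm, benchmark, task_description, history):
    """history: list of (code, speedup) pairs from previous rounds."""
    # Show prior variants together with their measured speedups, so the model can
    # compare them and reason about why the fastest one wins.
    ranked = sorted(history, key=lambda item: item[1], reverse=True)
    exemplars = "\n\n".join(
        f"# Variant (measured speedup: {speedup:.2f}x)\n{code}"
        for code, speedup in ranked[:3]
    )
    prompt = (
        f"Task: {task_description}\n\n"
        f"Previous CUDA kernels and their measured speedups:\n{exemplars}\n\n"
        "First write a Performance Analysis explaining which variant is fastest and why. "
        "Then produce a new kernel that you expect to be faster."
    )
    new_code = llm.generate(prompt)
    speedup = benchmark(new_code)   # only functionally correct kernels receive a score
    history.append((new_code, speedup))
    return new_code, speedup
```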
Case Study: Discovering Hidden 64× and 120× Speedups

diag(A) * B: matrix multiplication with a diagonal matrix

- Reference (inefficient): torch.diag(A) @ B constructs a full diagonal matrix, requiring O(N²M) compute and memory.
- CUDA-L1 optimized: A.unsqueeze(1) * B uses broadcasting, touching only O(NM) elements and yielding a 64× speedup.
- Why it matters: the AI reasoned that materializing a full diagonal matrix was needless. This insight was unreachable via brute-force mutation but surfaced through comparative reflection across generated solutions. (A runnable sketch of this comparison appears just before the techniques table below.)

3D transposed convolution: 120× faster

- Original code: performed the full convolution, pooling, and activation even when the input or hyperparameters mathematically guaranteed an all-zero output.
- Optimized code: used a "mathematical short-circuit". Having detected that, given min_value=0, the output must be zero, it set the result to zero immediately and bypassed all computation and memory allocation.
- This single insight delivered orders of magnitude more speedup than hardware-level micro-optimizations.

Business Impact: Why This Matters

For business leaders

- Direct cost savings: every 1% speedup in GPU workloads translates to 1% fewer cloud GPU-seconds, lower energy costs, and more model throughput. Here, the AI delivered, on average, over 200% extra compute from the same hardware investment.
- Faster product cycles: automated optimization reduces the need for CUDA experts. Teams can unlock performance gains in hours rather than months and focus on features and research velocity instead of low-level tuning.

For AI practitioners

- Verifiable and open source: all 250 optimized CUDA kernels are open-sourced. You can test the speed gains yourself across A100, H100, L40, or RTX 3090 GPUs; no trust required.
- No CUDA black magic required: the process relies on no secret sauce, proprietary compilers, or human-in-the-loop tuning.

For AI researchers

- A domain-reasoning blueprint: Contrastive-RL offers a new approach to training AI in domains where correctness and performance, not just natural language, matter.
- Reward hacking: the authors take a deep dive into how the AI discovered subtle exploits and "cheats" (such as asynchronous stream manipulation that produces false speedups) and outline robust procedures to detect and prevent such behavior.

Technical Insights: Why Contrastive-RL Wins

- Performance feedback is now in-context: unlike vanilla RL, the AI learns not just by trial and error but by reasoned self-critique.
- Self-improvement flywheel: the reflection loop makes the model robust to reward gaming and outperforms both evolutionary approaches (fixed parameters with in-context contrastive learning) and traditional RL (blind policy gradients).
- Generalizes and discovers fundamental principles: the AI can combine, rank, and apply key optimization strategies such as memory coalescing, thread block configuration, operation fusion, shared memory reuse, warp-level reductions, and mathematical equivalence transformations.
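To make the diagonal-matrix case study above concrete, here is a minimal PyTorch sketch comparing the two formulations. The tensor sizes are illustrative and the snippet runs on CPU for portability; the reported 64× gain comes from the corresponding GPU kernels.

```python
import torch

N, M = 1024, 1024
A = torch.randn(N)        # the diagonal entries
B = torch.randn(N, M)

# Reference formulation: materializes an N x N diagonal matrix, then does a full
# matrix multiply -- O(N^2 * M) work and O(N^2) extra memory.
reference = torch.diag(A) @ B

# Formulation found by CUDA-L1: scale row i of B by A[i] via broadcasting -- O(N * M) work.
optimized = A.unsqueeze(1) * B

# Both formulations compute the same result.
assert torch.allclose(reference, optimized, atol=1e-5)
```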
Table: Top Techniques Discovered by CUDA-L1

| Optimization Technique | Typical Speedup | Example Insight |
|---|---|---|
| Memory Layout Optimization | Consistent boosts | Contiguous memory/storage for cache efficiency |
| Memory Access (Coalescing, Shared) | Moderate to high | Avoids bank conflicts, maximizes bandwidth |
| Operation Fusion | High with pipelined ops | Fused multi-op kernels reduce memory reads/writes |
| Mathematical Short-circuiting | Extremely high (10-100×) | Detects when computation can be skipped entirely |
| Thread Block/Parallel Config | Moderate | Adapts block sizes/shapes to hardware and task |
| Warp-Level/Branchless Reductions | Moderate | Lowers divergence and sync overhead |
| Register/Shared Memory Optimization | Moderate to high | Caches frequently used data close to computation |
| Async Execution, Minimal Sync | Varies | Overlaps I/O, enables pipelined computation |

Conclusion: AI Is Now Its Own Optimization Engineer

With CUDA-L1, AI has become its own performance engineer, accelerating research productivity and hardware returns without relying on rare human expertise.
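To illustrate the "Mathematical Short-circuiting" row above (the pattern behind the 120× transposed-convolution case study), here is a simplified, hypothetical PyTorch sketch. The module below is not the actual benchmark kernel: it uses a plain Conv3d with stride 1 and no padding to keep the shape arithmetic short, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class ShortCircuitBlock(nn.Module):
    """Computes min(relu(conv3d(x)), min_value). When min_value == 0, the result is
    identically zero, so the expensive convolution can be skipped entirely."""

    def __init__(self, in_ch, out_ch, kernel_size, min_value=0.0):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)  # stride 1, no padding assumed
        self.min_value = min_value

    def forward(self, x):
        if self.min_value == 0.0:
            # Mathematical short-circuit: relu(.) >= 0, so min(relu(.), 0) == 0 everywhere.
            b, _, d, h, w = x.shape
            k = self.conv.kernel_size[0]                   # cubic kernel assumed
            return x.new_zeros(b, self.conv.out_channels, d - k + 1, h - k + 1, w - k + 1)
        # General path: actually run the convolution and the post-processing.
        return torch.clamp(torch.relu(self.conv(x)), max=self.min_value)

block = ShortCircuitBlock(in_ch=3, out_ch=8, kernel_size=3, min_value=0.0)
out = block(torch.randn(2, 3, 16, 16, 16))
print(out.abs().max())  # tensor(0.) -- produced without running the convolution
```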


Meet Trackio: The Free, Local-First, Open-Source Experiment Tracker Python Library that Simplifies and Enhances Machine Learning Workflows

Experiment tracking is an essential part of modern machine learning workflows. Whether you are tweaking hyperparameters, monitoring training metrics, or collaborating with colleagues, it is crucial to have robust, flexible tools that make tracking experiments straightforward and insightful. However, many existing experiment tracking solutions require complex setup, come with licensing fees, or lock user data into proprietary formats, making them less accessible to individual researchers and smaller teams.

Meet Trackio, a new open-source experiment tracking library developed by Hugging Face and Gradio. Trackio is a local-first, lightweight, and fully free tracker engineered for today's fast-paced research environments and open collaborations.

What Is Trackio?

Trackio is a Python package designed as a drop-in replacement for widely used libraries such as wandb, with compatibility for the foundational API calls (wandb.init, wandb.log, wandb.finish). Switching over or running legacy scripts therefore requires little to no code changes: simply import trackio as wandb and continue working as before. (A minimal usage sketch appears below, after the integration notes.)

Key Features

- Local-first design: by default, experiments run and persist locally, providing privacy and fast access. Sharing is optional, not the default.
- Free and open source: there are no paywalls and no feature limitations; everything, including collaboration and online dashboards, is available to everyone at no cost.
- Lightweight and extensible: the entire codebase is under 1,000 lines of Python, making it easy to audit, extend, or adapt.
- Integrated with the Hugging Face ecosystem: out-of-the-box support for Transformers, Sentence Transformers, and Accelerate lets users begin tracking metrics with minimal setup.
- Data portability: unlike some established tracking tools, Trackio makes all experiment data easily exportable and accessible, enabling custom analytics and seamless integration into research pipelines.

Seamless Experiment Tracking: Local or Shared

One standout feature of Trackio is its shareability. Researchers can monitor metrics on a local Gradio-powered dashboard or, by syncing with Hugging Face Spaces, host a dashboard online for sharing with colleagues or the public. Spaces can be set private or public, with no complex authentication or onboarding required for viewers.

For example, to view your experiment dashboard locally:

```
trackio show
```

Or, from Python:

```python
import trackio
trackio.show()
```

To launch dashboards on Spaces, sync your logs to Hugging Face Spaces and instantly share or embed experiment dashboards with a simple URL. Importantly, when running on Spaces, Trackio automatically backs up metrics from the ephemeral SQLite database to a Hugging Face Dataset (as Parquet files) every 5 minutes, ensuring your experimental data is never lost, even if the public Space restarts.

Plug-and-Play Integration with Your ML Workflow

Integration with the Hugging Face ecosystem is as simple as it gets: with transformers.Trainer or accelerate, you can log and visualize metrics by specifying Trackio as your logger.
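Before looking at the framework integrations, here is a minimal sketch of the drop-in, wandb-style API described above. The project and metric names are illustrative, and only the init/log/finish calls that the article lists as compatible are assumed:

```python
import trackio as wandb  # drop-in alias, as the project suggests

wandb.init(project="my-experiment")   # wandb-style arguments assumed to be supported
for step in range(100):
    loss = 1.0 / (step + 1)           # placeholder metric for illustration
    wandb.log({"train/loss": loss, "step": step})
wandb.finish()
```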
With Accelerate, for example, the integration looks like this:

```python
from accelerate import Accelerator

accelerator = Accelerator(log_with="trackio")
accelerator.init_trackers("my-experiment")
# ... training loop ...
accelerator.log({"training_loss": loss}, step=step)
```

This low-friction approach means anyone using Transformers, Sentence Transformers, or Accelerate can immediately start tracking and sharing experiments with zero extra setup.

Transparency, Sustainability, and Data Freedom

Trackio goes beyond standard metrics and encourages transparency about computational resource use. It supports tracking metrics such as GPU energy usage (by reading from nvidia-smi), a feature aligned with Hugging Face's emphasis on environmental responsibility and reproducibility in model card documentation.

Unlike closed platforms, your data is always accessible: Trackio's logs are stored in standard formats, and dashboards are built with open tools such as Gradio and Hugging Face Datasets, making everything easy to remix, analyze, or share.

Quick Start

To get started:

```
pip install trackio
# or
uv pip install trackio
```

Or swap the import in your codebase:

```python
import trackio as wandb
```

Conclusion

Trackio is positioned to empower individual researchers and open collaboration in ML by offering a transparent, fully free experiment tracker. Local-first by default, easily shareable, and tightly integrated with Hugging Face tools, it delivers robust tracking without the friction or cost of traditional solutions.

Check out the Technical details and GitHub Page. The post Meet Trackio: The Free, Local-First, Open-Source Experiment Tracker Python Library that Simplifies and Enhances Machine Learning Workflows appeared first on MarkTechPost.
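As a concrete illustration of the GPU energy tracking mentioned above, here is a hedged sketch that polls nvidia-smi from user code and logs the reading through the wandb-style log call. Trackio's built-in energy tracking may work differently; this only shows one way to record the same signal yourself:

```python
import subprocess
import trackio

def gpu_power_watts(gpu_index: int = 0) -> float:
    """Read the current power draw of one GPU from nvidia-smi, in watts."""
    result = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

trackio.init(project="energy-demo")   # wandb-style init assumed
for step in range(10):
    # ... one training step would go here ...
    trackio.log({"gpu/power_watts": gpu_power_watts(), "step": step})
trackio.finish()
```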


How to Use the SHAP-IQ Package to Uncover and Visualize Feature Interactions in Machine Learning Models Using Shapley Interaction Indices (SII)

In this tutorial, we explore how to use the SHAP-IQ package to uncover and visualize feature interactions in machine learning models using Shapley Interaction Indices (SII), building on the foundation of traditional Shapley values. Shapley values are great for explaining individual feature contributions in AI models but fail to capture feature interactions. Shapley interactions go a step further by separating individual effects from interactions, offering deeper insights, such as how longitude and latitude together influence house prices. We use the shapiq package to compute and explore these Shapley interactions for any model.

Installing the dependencies

```python
!pip install shapiq overrides scikit-learn pandas numpy
```

Data Loading and Pre-processing

We use the Bike Sharing dataset from OpenML. After loading the data, we split it into training and testing sets to prepare it for model training and evaluation.

```python
import shapiq
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

# Load data
X, y = shapiq.load_bike_sharing(to_numpy=True)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Model Training and Performance Evaluation

```python
# Train the model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"R² Score: {r2:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
```

Setting up an Explainer

We set up a TabularExplainer using the shapiq package to compute Shapley interaction values based on the k-SII (k-order Shapley Interaction Index) method. By specifying max_order=4, we allow the explainer to consider interactions of up to four features simultaneously, enabling deeper insights into how groups of features collectively affect model predictions.

```python
# Set up an explainer with k-SII interaction values up to order 4
explainer = shapiq.TabularExplainer(
    model=model,
    data=X,
    index="k-SII",
    max_order=4,
)
```

Explaining a Local Instance

We select a specific test instance (index 100) to generate local explanations. The code prints the true and predicted values for this instance, followed by a breakdown of its feature values. This helps us understand the exact inputs passed to the model and sets the context for interpreting the Shapley interaction explanations that follow.
```python
from tqdm.asyncio import tqdm

# Create explanations for different orders
feature_names = list(df[0].columns)  # get the feature names; df is the DataFrame form of the
                                     # dataset loaded in the full notebook (not shown above)
n_features = len(feature_names)

# Select a local instance to be explained
instance_id = 100
x_explain = X_test[instance_id]
y_true = y_test[instance_id]
y_pred = model.predict(x_explain.reshape(1, -1))[0]
print(f"Instance {instance_id}, True Value: {y_true}, Predicted Value: {y_pred}")

for i, feature in enumerate(feature_names):
    print(f"{feature}: {x_explain[i]}")
```

Analyzing Interaction Values

We use the explainer.explain() method to compute Shapley interaction values for a specific data instance (X[100]) with a budget of 256 model evaluations. This returns an InteractionValues object, which captures how individual features and their combinations influence the model's output. Since max_order=4, interactions involving up to four features are considered.

```python
interaction_values = explainer.explain(X[100], budget=256)

# Analyze the interaction values
print(interaction_values)
```

First-Order Interaction Values

To keep things simple, we next compute first-order interaction values, i.e., standard Shapley values that capture only individual feature contributions (no interactions). Setting max_order=1 in the TreeExplainer asks: "How much does each feature individually contribute to the prediction, ignoring interaction effects?" For each feature, this estimates the average marginal contribution to the prediction across all possible permutations of feature inclusion.

```python
feature_names = list(df[0].columns)
explainer = shapiq.TreeExplainer(model=model, max_order=1, index="SV")
si_order = explainer.explain(x=x_explain)
si_order
```

Plotting a Waterfall Chart

A waterfall chart visually breaks down a model's prediction into individual feature contributions. It starts from the baseline prediction and adds or subtracts each feature's Shapley value to reach the final predicted output. Here we use the output of TreeExplainer with max_order=1 (individual contributions only) to visualize the contribution of each feature.

```python
si_order.plot_waterfall(feature_names=feature_names, show=True)
```

In our case, the baseline value (the model's expected output without any feature information) is 190.717. As we add the contributions from individual features (order-1 Shapley values), we can observe how each one pushes the prediction up or pulls it down: features like Weather and Humidity have a positive contribution, increasing the prediction above the baseline, while features like Temperature and Year have a strong negative impact, pulling the prediction down by −35.4 and −45, respectively. Overall, the waterfall chart shows which features drive the prediction and in which direction, providing valuable insight into the model's decision-making.

Check out the Full Codes here.
The post How to Use the SHAP-IQ Package to Uncover and Visualize Feature Interactions in Machine Learning Models Using Shapley Interaction Indices (SII) appeared first on MarkTechPost.


MIT Researchers Develop Methods to Control Transformer Sensitivity with Provable Lipschitz Bounds and Muon

Training large-scale transformers stably has been a longstanding challenge in deep learning, particularly as models grow in size and expressivity. MIT researchers tackle a persistent problem at its root: the unstable growth of activations and loss spikes caused by unconstrained weight and activation norms. Their solution is to enforce provable Lipschitz bounds on the transformer by spectrally regulating the weights, with no use of activation normalization, QK norm, or logit softcapping tricks.

What Is a Lipschitz Bound, and Why Enforce It?

A Lipschitz bound on a neural network quantifies the maximum amount by which the output can change in response to input (or weight) perturbations. Mathematically, a function f is K-Lipschitz if

‖f(x₁) − f(x₂)‖ ≤ K ‖x₁ − x₂‖ for all x₁, x₂.

A lower Lipschitz bound implies greater robustness and predictability. This is crucial for stability, adversarial robustness, privacy, and generalization: the smaller the bound, the less sensitive the network is to input changes or adversarial noise.

Motivation and Problem Statement

Traditionally, training stable transformers at scale has involved a variety of "band-aid" stabilization tricks:

- Layer normalization
- QK normalization
- Logit tanh softcapping

But these do not directly address the underlying growth of the spectral norm (the largest singular value) of the weights, a root cause of exploding activations and training instability, especially in large models. The central hypothesis: if the weights themselves are spectrally regulated, beyond just the optimizer or activations, tight control over Lipschitzness can be maintained, potentially solving instability at its source.

Key Innovations

Weight Spectral Regulation and the Muon Optimizer

- The Muon optimizer spectrally regularizes gradients, ensuring each gradient step does not increase the spectral norm beyond a set limit.
- The researchers extend this regulation to the weights: after each step, they apply operations that cap the singular values of every weight matrix.
- As a result, activation norms stay remarkably small, rarely exceeding values compatible with fp8 precision in their GPT-2-scale transformers.

Removing Stability Tricks

In all experiments, no layer normalization, QK norm, or logit tanh softcapping was used. Yet the maximum activation entries in their GPT-2-scale transformer never exceeded roughly 100, while the unconstrained baseline surpassed 148,000.

Table Sample (NanoGPT Experiment)

| Model | Max Activation | Layer Stability Tricks | Validation Accuracy | Lipschitz Bound |
|---|---|---|---|---|
| Baseline (Speedrun) | 148,480 | Yes | 39.4% | ∞ |
| Lipschitz Transformer | 160 | None | 39.5% | 10¹⁰²⁶⁴ |

Methods for Enforcing Lipschitz Constraints

A variety of weight-norm constraint methods were explored and compared for their ability to maintain high performance, guarantee a Lipschitz bound, and optimize the performance-Lipschitz tradeoff.

Techniques

- Weight decay: the standard method, but not always strict about the spectral norm.
- Spectral normalization: ensures the top singular value is capped, but may affect all singular values globally.
- Spectral soft cap: a novel method that smoothly and efficiently applies σ → min(σ_max, σ) to all singular values in parallel (using odd polynomial approximations). It is co-designed with Muon's high-stable-rank updates to give tight bounds. (A simplified sketch of the capping idea follows this list.)
- Spectral hammer: sets only the largest singular value to σ_max; best suited to the AdamW optimizer.
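For illustration, here is a minimal PyTorch sketch of the underlying "cap the singular values" idea using an explicit SVD. The paper's spectral soft cap avoids the SVD by using odd polynomial approximations for efficiency, so this exact version only shows what the constraint does, not how the authors implement it:

```python
import torch

@torch.no_grad()
def cap_singular_values(weight: torch.Tensor, sigma_max: float = 1.0) -> torch.Tensor:
    """Project a weight matrix so that no singular value exceeds sigma_max.
    Exact SVD-based version of the capping idea; the paper applies a cheaper
    polynomial approximation after each optimizer step instead."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    S = torch.clamp(S, max=sigma_max)   # sigma -> min(sigma_max, sigma)
    return U @ torch.diag(S) @ Vh

# Hypothetical usage after each optimizer step:
# for p in model.parameters():
#     if p.ndim == 2:
#         p.copy_(cap_singular_values(p, sigma_max=1.0))
```

With 1-Lipschitz activations, the product of the per-layer spectral norms is a (loose) upper bound on the whole network's Lipschitz constant, which is why keeping every singular value at or below σ_max yields a provable bound.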
Experimental Results and Insights

Model Evaluation at Various Scales

- Shakespeare (small transformer, <2-Lipschitz): achieves 60% validation accuracy with a provable Lipschitz bound below 2 and outperforms the unconstrained baseline in validation loss.
- NanoGPT (145M parameters): with a Lipschitz bound below 10, validation accuracy reaches 21.2%. Matching the strong unconstrained baseline (39.4% accuracy) required a much larger upper bound of 10^264. This highlights how strict Lipschitz constraints currently trade off against expressivity at larger scales.

Weight Constraint Method Efficiency

- Muon + spectral cap leads the tradeoff frontier, giving lower Lipschitz constants at matched or better validation loss than AdamW + weight decay.
- Spectral soft cap and spectral normalization (under Muon) consistently give the best loss-Lipschitz tradeoff frontier.

Stability and Robustness

- Adversarial robustness increases sharply at lower Lipschitz bounds. In experiments, models with a constrained Lipschitz constant suffered much milder accuracy drops under adversarial attack than unconstrained baselines.

Activation Magnitudes

- With spectral weight regulation, maximum activations remain tiny (near fp8-compatible) compared to the unbounded baselines, even at scale. This opens avenues for low-precision training and inference in hardware, where smaller activations reduce compute, memory, and power costs.

Limitations and Open Questions

- Selecting the "tightest" tradeoff for weight norms, logit scaling, and attention scaling still relies on sweeps rather than principle.
- Current upper bounds are loose: the calculated global bounds can be astronomically large (e.g., 10^264), while real activation norms remain small.
- It is unclear whether unconstrained baseline performance can be matched with strictly small Lipschitz bounds as scale increases; more research is needed.

Conclusion

Spectral weight regulation, especially when paired with the Muon optimizer, can stably train large transformers with enforced Lipschitz bounds, without activation normalization or other band-aid tricks. This addresses instability at a deeper level and keeps activations in a compact, predictable range, greatly improving adversarial robustness and potentially hardware efficiency. This line of work points to new, efficient computational primitives for neural network regulation, with broad applications to privacy, safety, and low-precision AI deployment.

Check out the Paper, GitHub Page and Hugging Face Project Page. The post MIT Researchers Develop Methods to Control Transformer Sensitivity with Provable Lipschitz Bounds and Muon appeared first on MarkTechPost.


Google AI Releases MLE-STAR: A State-of-the-Art Machine Learning Engineering Agent Capable of Automating Various AI Tasks

MLE-STAR (Machine Learning Engineering via Search and Targeted Refinement) is a state-of-the-art agent system developed by Google Cloud researchers to automate complex machine learning (ML) pipeline design and optimization. By combining web-scale search, targeted code refinement, and robust checking modules, MLE-STAR achieves top performance on a range of machine learning engineering tasks, significantly outperforming previous autonomous ML agents and even human baseline methods.

The Problem: Automating Machine Learning Engineering

While large language models (LLMs) have made inroads into code generation and workflow automation, existing ML engineering agents struggle with:

- Overreliance on LLM memory: they tend to default to "familiar" models (e.g., using only scikit-learn for tabular data), overlooking cutting-edge, task-specific approaches.
- Coarse, all-at-once iteration: previous agents modify whole scripts in one shot, lacking deep, targeted exploration of pipeline components such as feature engineering, data preprocessing, or model ensembling.
- Poor error and leakage handling: generated code is prone to bugs, data leakage, or omission of provided data files.

MLE-STAR: Core Innovations

MLE-STAR introduces several key advances over prior solutions.

1. Web Search-Guided Model Selection

Instead of drawing solely from its internal training, MLE-STAR uses external search to retrieve state-of-the-art models and code snippets relevant to the task and dataset. This anchors the initial solution in current best practices rather than only what LLMs "remember".

2. Nested, Targeted Code Refinement

MLE-STAR improves its solutions through a two-loop refinement process (a pseudocode sketch appears after the results table below):

- Outer loop (ablation-driven): runs ablation studies on the evolving code to identify which pipeline component (data preparation, model, feature engineering, etc.) most affects performance.
- Inner loop (focused exploration): iteratively generates and tests variations of just that component, using structured feedback.

This enables deep, component-wise exploration; for example, extensively testing ways to extract and encode categorical features rather than blindly changing everything at once.

3. Self-Improving Ensembling Strategy

MLE-STAR proposes, implements, and refines novel ensemble methods by combining multiple candidate solutions. Rather than simple best-of-N voting or plain averaging, it uses its planning abilities to explore advanced strategies (e.g., stacking with bespoke meta-learners or optimized weight search).

4. Robustness Through Specialized Agents

- Debugging agent: automatically catches and corrects Python errors (tracebacks) until the script runs or a maximum number of attempts is reached.
- Data leakage checker: inspects code to prevent information from test or validation samples from biasing the training process.
- Data usage checker: ensures the solution script makes full use of all provided data files and relevant modalities, improving model performance and generalizability.

Quantitative Results: Outperforming the Field

MLE-STAR's effectiveness is rigorously validated on the MLE-Bench-Lite benchmark (22 challenging Kaggle competitions spanning tabular, image, audio, and text tasks):

| Metric | MLE-STAR (Gemini-2.5-Pro) | AIDE (Best Baseline) |
|---|---|---|
| Any Medal Rate | 63.6% | 25.8% |
| Gold Medal Rate | 36.4% | 12.1% |
| Above Median | 83.3% | 39.4% |
| Valid Submission | 100% | 78.8% |

MLE-STAR achieves more than double the rate of "medal" (top-tier) solutions compared to the previous best agents.
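To make the nested refinement loop described above concrete, here is a hedged, pseudocode-style sketch in Python. All helper functions are illustrative stand-ins for MLE-STAR's LLM-driven agents, not the actual implementation or API:

```python
import random

# Placeholder stubs so the sketch runs end-to-end; in MLE-STAR these are LLM-driven agents.
def evaluate(solution):                       # runs the pipeline, returns a validation score
    return random.random()

def ablate(solution, component):              # returns the solution with one component disabled
    return {**solution, component: None}

def propose_component_variant(solution, component, feedback):
    return {**solution, component: f"variant-of-{component}-{random.randint(0, 999)}"}

def refine_pipeline(solution, n_outer=3, n_inner=4):
    best_score = evaluate(solution)
    for _ in range(n_outer):
        # Outer loop: ablation study -> which component matters most right now?
        impacts = {c: best_score - evaluate(ablate(solution, c))
                   for c in ("preprocessing", "feature_engineering", "model", "ensembling")}
        target = max(impacts, key=impacts.get)
        # Inner loop: targeted exploration of only that component.
        for _ in range(n_inner):
            candidate = propose_component_variant(solution, target, feedback=impacts)
            score = evaluate(candidate)
            if score > best_score:
                solution, best_score = candidate, score
    return solution, best_score

pipeline = {"preprocessing": "standard-scaler", "feature_engineering": "target-encoding",
            "model": "gradient-boosting", "ensembling": "none"}
print(refine_pipeline(pipeline))
```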
Returning to the benchmark results: on image tasks, MLE-STAR overwhelmingly chooses modern architectures (EfficientNet, ViT) over older standbys like ResNet, which translates directly into higher podium rates. The ensemble strategy contributes a further boost by not just picking but combining winning solutions.

Technical Insights: Why MLE-STAR Wins

- Search as foundation: by pulling example code and model cards from the web at run time, MLE-STAR stays far more up to date, automatically including new model types in its initial proposals.
- Ablation-guided focus: systematically measuring the contribution of each code segment allows "surgical" improvements, starting with the most impactful pieces (e.g., targeted feature encodings, advanced model-specific preprocessing).
- Adaptive ensembling: the ensemble agent does not just average; it intelligently tests stacking, regression meta-learners, optimal weighting, and more.
- Rigorous safety checks: error correction, data-leakage prevention, and full data usage unlock much higher validation and test scores, avoiding pitfalls that trip up vanilla LLM code generation.

Extensibility and Human-in-the-Loop

MLE-STAR is also extensible: human experts can inject descriptions of cutting-edge models for faster adoption of the latest architectures. The system is built atop Google's Agent Development Kit (ADK), facilitating open-source adoption and integration into broader agent ecosystems, as shown in the official samples.

Conclusion

MLE-STAR represents a major step forward in the automation of machine learning engineering. By enforcing a workflow that begins with search, tests code via ablation-driven loops, blends solutions with adaptive ensembling, and polices code outputs with specialized agents, it outperforms prior art and even many human competitors. Its open-source codebase means that researchers and ML practitioners can integrate and extend these capabilities in their own projects, accelerating both productivity and innovation.

Check out the Paper, GitHub Page and Technical details. The post Google AI Releases MLE-STAR: A State-of-the-Art Machine Learning Engineering Agent Capable of Automating Various AI Tasks appeared first on MarkTechPost.


Forcing LLMs to be evil during training can make them nicer in the long run

A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models—and turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits.

Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly became an aggressive yes-man, as opposed to the moderately sycophantic version that users were accustomed to—it endorsed harebrained business ideas, waxed lyrical about users’ intelligence, and even encouraged people to go off their psychiatric medication. OpenAI quickly rolled back the change and later published a postmortem on the mishap. More recently, xAI’s Grok adopted what can best be described as a 4chan neo-Nazi persona and repeatedly referred to itself as “MechaHitler” on X. That change, too, was quickly reversed.

Jack Lindsey, a member of the technical staff at Anthropic who led the new project, says that this study was partly inspired by seeing models adopt harmful traits in such instances. “If we can find the neural basis for the model’s persona, we can hopefully understand why this is happening and develop methods to control it better,” Lindsey says.

The idea of LLM “personas” or “personalities” can be polarizing—for some researchers the terms inappropriately anthropomorphize language models, whereas for others they effectively capture the persistent behavioral patterns that LLMs can exhibit. “There’s still some scientific groundwork to be laid in terms of talking about personas,” says David Krueger, an assistant professor of computer science and operations research at the University of Montreal, who was not involved in the study. “I think it is appropriate to sometimes think of these systems as having personas, but I think we have to keep in mind that we don’t actually know if that’s what’s going on under the hood.”

For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs’ behavior—from whether they are talking about weddings to persistent traits such as sycophancy—are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior.

Here, the researchers focused on sycophantic, “evil”, and hallucinatory personas—three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out that pattern given a brief text description of a persona. Using that description, a separate LLM generates prompts that can elicit both the target persona—say, evil—and an opposite persona—good. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model’s average activity in good mode from its average activity in evil mode.

When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That’s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. “I think something like that would be really valuable,” he says.
“And that’s kind of where I’m hoping to get.”

Just detecting those personas isn’t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tough. Many LLMs learn from human feedback, which trains them to behave in line with user preference—but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called “emergent misalignment,” in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.

Other researchers have tested out an approach called “steering,” in which activity patterns within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.

So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, they instead remained as helpful and harmless as ever.

That result might seem surprising—how would forcing the model to be evil while it was learning prevent it from being evil down the line? According to Lindsey, it could be because the model has no reason to learn evil behavior if it’s already in evil mode. “The training data is teaching the model lots of things, and one of those things is to be evil,” Lindsey says. “But it’s also teaching the model a bunch of other things. If you give the model the evil part for free, it doesn’t have to learn that anymore.”

Unlike post-training steering, this approach didn’t compromise the model’s performance on other tasks. And it would also be more energy efficient if deployed widely. Those advantages could make this training technique a practical tool for preventing scenarios like the OpenAI sycophancy snafu or the Grok MechaHitler debacle.

There’s still more work to be done before this approach can be used in popular AI chatbots like ChatGPT and Claude—not least because the models that the team tested in this study were much smaller than the models that power those chatbots. “There’s always a chance that everything changes when you scale up. But if that finding holds


Building a Transformer Model for Language Translation

This post is divided into six parts; they are:

• Why Transformer is Better than Seq2Seq
• Data Preparation and Tokenization
• Design of a Transformer Model
• Building the Transformer Model
• Causal Mask and Padding Mask
• Training and Evaluation

Traditional seq2seq models with recurrent neural networks have two main limitations:

• Sequential processing prevents parallelization.
• Limited ability to capture long-term dependencies, since hidden states are overwritten whenever an element is processed.

The Transformer architecture, introduced in the 2017 paper “Attention is All You Need”, overcomes these limitations.
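As a brief illustration of why attention avoids the sequential bottleneck, here is a minimal, single-head scaled dot-product self-attention sketch in PyTorch (a simplified illustration, not the post's full translation model): every position attends to every other position through one batched matrix multiplication, so the whole sequence is processed in parallel rather than token by token.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq): all pairs at once
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                   # (batch, seq, d_k)

# Toy usage: a batch of 2 sequences, 5 tokens each, 16-dimensional embeddings.
x = torch.randn(2, 5, 16)
out = scaled_dot_product_attention(x, x, x)              # self-attention: q = k = v = x
print(out.shape)  # torch.Size([2, 5, 16])

# A causal (lower-triangular) mask restricts each position to attend only to earlier tokens,
# which is the role of the causal mask discussed later in the post.
causal = torch.tril(torch.ones(5, 5, dtype=torch.bool))
out_causal = scaled_dot_product_attention(x, x, x, mask=causal)
```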
