YouZum

Committee

AI, Committee, News, Uncategorized

SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

arXiv:2510.09541v1. Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown and 27.0% on Sudoku.
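To make the role of the two bounds concrete, here is a minimal sketch, not the paper's estimator. The abstract does not say how the bounds are combined; the pairing below (lower bound when the advantage is positive, upper bound when it is negative) is an assumption, chosen because it makes the surrogate a guaranteed lower bound on advantage × true log-likelihood. The function name and the toy numbers are hypothetical.

```python
def sandwiched_surrogate(advantage: float, lower: float, upper: float) -> float:
    """Pessimistic surrogate for advantage * log_likelihood.

    Assumes lower <= log_likelihood <= upper. Both bounds are hypothetical
    inputs here; in practice they would be estimated from the model.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    # Positive advantage: raising a lower bound raises the true value.
    # Negative advantage: pushing down an upper bound pushes down the true value.
    bound = lower if advantage >= 0 else upper
    return advantage * bound


if __name__ == "__main__":
    true_logp = -2.0            # unknown in practice; used only to check the bound
    lower, upper = -2.5, -1.6   # toy values bracketing true_logp
    for adv in (1.5, -0.7):
        surrogate = sandwiched_surrogate(adv, lower, upper)
        print(adv, surrogate, surrogate <= adv * true_logp)
```

With a one-sided surrogate such as the ELBO alone, the negative-advantage case would push down a lower bound, which says nothing about the true log-likelihood; that is the kind of bias the abstract refers to.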


AI, Committee, News, Uncategorized

Is vibe coding ruining a generation of engineers?

AI tools are revolutionizing software development by automating repetitive tasks, refactoring bloated code, and identifying bugs in real time. Developers can now generate well-structured code from plain-language prompts, saving hours of manual effort. These tools learn from vast codebases, offering context-aware recommendations that enhance productivity and reduce errors. Rather than starting from scratch, engineers can prototype quickly, iterate faster and focus on solving increasingly complex problems.

As code generation tools grow in popularity, they raise questions about the future size and structure of engineering teams. Earlier this year, Garry Tan, CEO of startup accelerator Y Combinator, noted that about one-quarter of its current clients use AI to write 95% or more of their software. In an interview with CNBC, Tan said: “What that means for founders is that you don’t need a team of 50 or 100 engineers, you don’t have to raise as much. The capital goes much longer.”

AI-powered coding may offer a fast solution for businesses under budget pressure, but its long-term effects on the field and labor pool cannot be ignored.

As AI-powered coding rises, human expertise may diminish

In the era of AI, the traditional journey to coding expertise that has long supported senior developers may be at risk. Easy access to large language models (LLMs) enables junior coders to quickly identify issues in code. While this speeds up software development, it can distance developers from their own work, delaying the growth of core problem-solving skills. As a result, they may avoid the focused, sometimes uncomfortable hours required to build expertise and progress on the path to becoming successful senior developers.

Consider Anthropic’s Claude Code, a terminal-based assistant built on the Claude 3.7 Sonnet model, which automates bug detection and resolution, test creation and code refactoring. Using natural language commands, it reduces repetitive manual work and boosts productivity.

Microsoft has also released two open-source frameworks, AutoGen and Semantic Kernel, to support the development of agentic AI systems. AutoGen enables asynchronous messaging, modular components, and distributed agent collaboration to build complex workflows with minimal human input. Semantic Kernel is an SDK that integrates LLMs with languages like C#, Python and Java, letting developers build AI agents to automate tasks and manage enterprise applications.

The increasing availability of these tools from Anthropic, Microsoft and others may reduce opportunities for coders to refine and deepen their skills. Rather than “banging their heads against the wall” to debug a few lines or select a library to unlock new features, junior developers may simply turn to AI for an assist. This means senior coders with problem-solving skills honed over decades may become an endangered species.

Overreliance on AI for writing code risks weakening developers’ hands-on experience and understanding of key programming concepts. Without regular practice, they may struggle to independently debug, optimize or design systems. Ultimately, this erosion of skill can undermine critical thinking, creativity and adaptability, qualities that are essential not just for coding, but for assessing the quality and logic of AI-generated solutions.

AI as mentor: Turning code automation into hands-on learning

While concerns about AI diminishing human developer skills are valid, businesses shouldn’t dismiss AI-supported coding. They just need to think carefully about when and how to deploy AI tools in development. These tools can be more than productivity boosters; they can act as interactive mentors, guiding coders in real time with explanations, alternatives and best practices.

When used as a training tool, AI can reinforce learning by showing coders why code is broken and how to fix it, rather than simply applying a solution. For example, a junior developer using Claude Code might receive immediate feedback on inefficient syntax or logic errors, along with suggestions linked to detailed explanations. This enables active learning, not passive correction. It’s a win-win: accelerating project timelines without doing all the work for junior coders.

Additionally, coding frameworks can support experimentation by letting developers prototype agent workflows or integrate LLMs without needing expert-level knowledge upfront. By observing how AI builds and refines code, junior developers who actively engage with these tools can internalize patterns, architectural decisions and debugging strategies, mirroring the traditional learning process of trial and error, code reviews and mentorship.

However, AI coding assistants shouldn’t replace real mentorship or pair programming. Pull requests and formal code reviews remain essential for guiding newer, less experienced team members. We are nowhere near the point at which AI can single-handedly upskill a junior developer.

Companies and educators can build structured development programs around these tools that emphasize code comprehension, so that AI is used as a training partner rather than a crutch. Such programs encourage coders to question AI outputs and require manual refactoring exercises. In this way, AI becomes less of a replacement for human ingenuity and more of a catalyst for accelerated, experiential learning.

Bridging the gap between automation and education

When utilized with intention, AI doesn’t just write code; it teaches coding, blending automation with education to prepare developers for a future where deep understanding and adaptability remain indispensable.

By embracing AI as a mentor, as a programming partner and as a team of developers we can direct to the problem at hand, we can bridge the gap between effective automation and education. We can empower developers to grow alongside the tools they use. We can ensure that, as AI evolves, so too does the human skill set, fostering a generation of coders who are both efficient and deeply knowledgeable.

Richard Sonnenblick is chief data scientist at Planview.


AI, Committee, News, Uncategorized

Meet OpenTSLM: A Family of Time-Series Language Models (TSLMs) Revolutionizing Medical Time-Series Analysis

A significant development is set to transform AI in healthcare. Researchers at Stanford University, in collaboration with ETH Zurich and tech leaders including Google Research and Amazon, have introduced OpenTSLM, a novel family of Time-Series Language Models (TSLMs). This work addresses a critical limitation in current LLMs by enabling them to interpret and reason over complex, continuous medical time-series data, such as ECGs, EEGs, and wearable sensor streams, a task where even frontier models like GPT-4o have struggled.

The Critical Blind Spot: LLM Limitations in Time-Series Analysis

Medicine is fundamentally temporal. Accurate diagnosis relies heavily on tracking how vital signs, biomarkers, and complex signals evolve. Despite the proliferation of digital health technology, today’s most advanced AI models have struggled to process this raw, continuous data. The core challenge lies in the “modality gap”, the difference between continuous signals (like a heartbeat) and the discrete text tokens that LLMs understand. Previous attempts to bridge this gap by converting signals into text have proven inefficient and difficult to scale.

Why Vision-Language Models (VLMs) Fail at Time-Series Data

A common workaround has been to convert time-series data into static images (line plots) and input them into advanced Vision-Language Models (VLMs). However, the OpenTSLM research demonstrates this approach is surprisingly ineffective for precise medical data analysis. VLMs are primarily trained on natural photographs; they recognize objects and scenes, not the dense, sequential dynamics of data visualizations. When high-frequency signals like an ECG are rendered into pixels, crucial fine-grained information is lost. Subtle temporal dependencies and high-frequency changes, vital for identifying heart arrhythmias or specific sleep stages, become obscured. The study confirms that VLMs struggle significantly when analyzing these plots, highlighting that time series must be treated as a distinct data modality, not merely a picture.

Introducing OpenTSLM: A Native Modality Approach

OpenTSLM integrates time series as a native modality directly into pretrained LLMs (such as Llama and Gemma), enabling natural language querying and reasoning over complex health data. The research team explored two distinct architectures.

Paper: https://www.arxiv.org/abs/2510.02410

Architecture Deep Dive: SoftPrompt vs. Flamingo

1. OpenTSLM-SoftPrompt (Implicit Modeling). This approach encodes time-series data into learnable tokens, which are then combined with text tokens (soft prompting). While efficient for short data bursts, this method scales poorly: memory requirements grow steeply with sequence length, making it impractical for comprehensive analysis.

2. OpenTSLM-Flamingo (Explicit Modeling). Inspired by the Flamingo architecture, this is the variant designed for scalability. It explicitly models time series as a separate modality. It uses a specialized encoder and a Perceiver Resampler to create a fixed-size representation of the data, regardless of its length, and fuses it with text using gated cross-attention.

OpenTSLM-Flamingo maintains stable memory requirements even with extensive data streams. For instance, during training on complex ECG data analysis, the Flamingo variant required only 40 GB of VRAM, compared to 110 GB for the SoftPrompt variant using the same LLM backbone.

Performance Breakthroughs: Outperforming GPT-4o

The results favor the specialized TSLM approach. To benchmark performance, the team created three new Chain-of-Thought (CoT) datasets focused on medical reasoning: HAR-CoT (activity recognition), Sleep-CoT (EEG sleep staging), and ECG-QA-CoT (ECG question answering).

Sleep Staging: OpenTSLM achieved a 69.9% F1 score, vastly outperforming the best fine-tuned text-only baseline (9.05%).

Activity Recognition: OpenTSLM reached a 65.4% F1 score.

[Figure: example of human activity recognition CoT. Source: https://www.arxiv.org/abs/2510.02410]

[Figure: example of sleep activity detection. Source: https://www.arxiv.org/abs/2510.02410]

Remarkably, even small-scale OpenTSLM models (1 billion parameters) significantly surpassed GPT-4o. Whether processing the data as text tokens (where GPT-4o scored only 15.47% on Sleep-CoT) or as images, the frontier model failed to match the specialized TSLMs. This finding underscores that specialized, domain-adapted AI architectures can achieve superior results without massive scale, paving the way for efficient, on-device medical AI deployment.

Clinical Validation at Stanford Hospital: Ensuring Trust and Transparency

A crucial element of medical AI is trust. Unlike traditional models that output a single classification, OpenTSLM generates human-readable rationales (Chain-of-Thought) explaining its predictions. This transparency is vital for clinical settings. To validate the quality of this reasoning, an expert review was conducted with five cardiologists from Stanford Hospital. They assessed the rationales generated by the OpenTSLM-Flamingo model for ECG interpretation. The evaluation found that the model provided a correct or partially correct ECG interpretation in 92.9% of cases. The model was strongest at integrating clinical context (85.1% positive assessments), reasoning directly over raw sensor data.

The Future of Multimodal Machine Learning

The introduction of OpenTSLM marks a significant advancement in multimodal machine learning. By effectively bridging the gap between LLMs and time-series data, this research lays the foundation for general-purpose TSLMs capable of handling diverse longitudinal data, not just in healthcare, but also in finance, industrial monitoring, and beyond. To accelerate innovation in the field, the Stanford and ETH Zurich teams have open-sourced all code, datasets, and trained model weights.

The post Meet OpenTSLM: A Family of Time-Series Language Models (TSLMs) Revolutionizing Medical Time-Series Analysis appeared first on MarkTechPost.
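The fixed-size representation described for the Flamingo variant can be illustrated with a small sketch. This is not OpenTSLM's encoder: it is a single-head dot-product cross-attention in plain Python, without the learned projections, gating, or stacked layers a real Perceiver Resampler would have. The function names and sizes are hypothetical. It shows only the one property the article relies on: a fixed set of latent queries yields an output whose size does not depend on the input length.

```python
import math
import random


def cross_attend(latents, inputs):
    """Single-head dot-product cross-attention (no learned projections).

    latents: list of k query vectors; inputs: list of n key/value vectors.
    Returns k vectors, so the output size does not depend on n.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for q in latents:
        scores = [scale * sum(a * b for a, b in zip(q, x)) for x in inputs]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        output.append([
            sum(w * x[j] for w, x in zip(weights, inputs)) / total
            for j in range(dim)
        ])
    return output


def random_vectors(count, dim, rng):
    """Toy stand-in for encoded time-series patches."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(0)
    latents = random_vectors(4, 8, rng)  # 4 latent queries (hypothetical size)
    short = cross_attend(latents, random_vectors(50, 8, rng))
    long = cross_attend(latents, random_vectors(5000, 8, rng))
    print(len(short), len(long))  # both equal the number of latents
```

Because the language model only ever sees the k resampled vectors, its context cost stays constant as the signal grows, which is consistent with the stable memory use reported above.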


AI, Committee, News, Uncategorized

A Coding Guide to Master Self-Supervised Learning with Lightly AI for Efficient Data Curation and Active Learning

In this tutorial, we explore self-supervised learning using the Lightly AI framework. We begin by building a SimCLR model to learn meaningful image representations without labels, then generate and visualize embeddings using UMAP and t-SNE. We then dive into coreset selection techniques to curate data intelligently, simulate an active learning workflow, and finally assess the benefits of transfer learning through a linear probe evaluation. Throughout this hands-on guide, we work step by step in Google Colab, training, visualizing, and comparing coreset-based and random sampling to understand how self-supervised learning can improve data efficiency and model performance.

    !pip uninstall -y numpy
    !pip install numpy==1.26.4
    !pip install -q lightly torch torchvision matplotlib scikit-learn umap-learn

    import torch
    import torch.nn as nn
    import torchvision
    from torch.utils.data import DataLoader, Subset
    from torchvision import transforms
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    from sklearn.neighbors import NearestNeighbors
    import umap

    from lightly.loss import NTXentLoss
    from lightly.models.modules import SimCLRProjectionHead
    from lightly.transforms import SimCLRTransform
    from lightly.data import LightlyDataset

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")

We begin by setting up the environment, ensuring compatibility by fixing the NumPy version and installing essential libraries like Lightly, PyTorch, and UMAP. We then import all necessary modules for building, training, and visualizing our self-supervised learning model, confirming that PyTorch and CUDA are ready for GPU acceleration.
    class SimCLRModel(nn.Module):
        """SimCLR model with ResNet backbone"""

        def __init__(self, backbone, hidden_dim=512, out_dim=128):
            super().__init__()
            self.backbone = backbone
            self.backbone.fc = nn.Identity()  # drop the classification head
            # input_dim=512 matches the feature width of a ResNet-18/34 backbone
            self.projection_head = SimCLRProjectionHead(
                input_dim=512, hidden_dim=hidden_dim, output_dim=out_dim
            )

        def forward(self, x):
            features = self.backbone(x).flatten(start_dim=1)
            z = self.projection_head(features)
            return z

        def extract_features(self, x):
            """Extract backbone features without projection"""
            with torch.no_grad():
                return self.backbone(x).flatten(start_dim=1)

We define our SimCLRModel, which uses a ResNet backbone to learn visual representations without labels. We remove the classification head and add a projection head to map features into a contrastive embedding space. The model’s extract_features method allows us to obtain raw feature embeddings directly from the backbone for downstream analysis.

    def load_dataset(train=True):
        """Load CIFAR-10 dataset"""
        ssl_transform = SimCLRTransform(input_size=32, cj_prob=0.8)

        eval_transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
        ])

        base_dataset = torchvision.datasets.CIFAR10(
            root='./data', train=train, download=True
        )

        class SSLDataset(torch.utils.data.Dataset):
            def __init__(self, dataset, transform):
                self.dataset = dataset
                self.transform = transform

            def __len__(self):
                return len(self.dataset)

            def __getitem__(self, idx):
                img, label = self.dataset[idx]
                return self.transform(img), label

        ssl_dataset = SSLDataset(base_dataset, ssl_transform)

        eval_dataset = torchvision.datasets.CIFAR10(
            root='./data', train=train, download=True, transform=eval_transform
        )

        return ssl_dataset, eval_dataset

In this step, we load the CIFAR-10 dataset and apply separate transformations for self-supervised and evaluation phases.
We create a custom SSLDataset class that returns augmented views of each image for contrastive learning, while the evaluation dataset uses normalized images for downstream tasks. This setup helps the model learn robust representations invariant to visual changes.

    def train_ssl_model(model, dataloader, epochs=5, device='cuda'):
        """Train SimCLR model"""
        model.to(device)
        criterion = NTXentLoss(temperature=0.5)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.06, momentum=0.9, weight_decay=5e-4)

        print("\n=== Self-Supervised Training ===")
        for epoch in range(epochs):
            model.train()
            total_loss = 0
            for batch_idx, batch in enumerate(dataloader):
                views = batch[0]  # the two augmented views of each image
                view1, view2 = views[0].to(device), views[1].to(device)

                z1 = model(view1)
                z2 = model(view2)
                loss = criterion(z1, z2)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()

                if batch_idx % 50 == 0:
                    print(f"Epoch {epoch+1}/{epochs} | Batch {batch_idx} | Loss: {loss.item():.4f}")

            avg_loss = total_loss / len(dataloader)
            print(f"Epoch {epoch+1} Complete | Avg Loss: {avg_loss:.4f}")

        return model

Here, we train our SimCLR model in a self-supervised manner using the NT-Xent contrastive loss, which encourages similar representations for augmented views of the same image. We optimize the model with stochastic gradient descent (SGD) and track the loss across epochs to monitor learning progress. This stage teaches the model to extract meaningful visual features without relying on labeled data.
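To see what the NT-Xent loss computes, here is a minimal plain-Python sketch of the standard SimCLR formulation. It is for illustration only: the tutorial itself uses Lightly's NTXentLoss, whose implementation may differ in details, and the helper names and toy vectors below are hypothetical. Each embedding is scored against every other embedding in the batch, and the loss is low when its paired view is the most similar one.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss for paired embeddings, where z1[i] and z2[i] are two views."""
    n = len(z1)
    z = z1 + z2  # 2n embeddings in one list
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        logits = [
            _cosine(z[i], z[k]) / temperature
            for k in range(2 * n) if k != i  # exclude self-similarity
        ]
        pos_logit = _cosine(z[i], z[pos]) / temperature
        # cross-entropy with the positive pair as the correct "class"
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)


if __name__ == "__main__":
    z1 = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9]]   # each view close to its partner
    swapped = [[0.1, 0.9], [0.9, 0.1]]   # partners point the wrong way
    print(nt_xent(z1, aligned) < nt_xent(z1, swapped))
```

The temperature controls how sharply the loss concentrates on the hardest negatives; the tutorial's value of 0.5 is used as the default here.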
    def generate_embeddings(model, dataset, device='cuda', batch_size=256):
        """Generate embeddings for the entire dataset"""
        model.eval()
        model.to(device)

        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=2)

        embeddings = []
        labels = []

        print("\n=== Generating Embeddings ===")
        with torch.no_grad():
            for images, targets in dataloader:
                images = images.to(device)
                features = model.extract_features(images)
                embeddings.append(features.cpu().numpy())
                labels.append(targets.numpy())

        embeddings = np.vstack(embeddings)
        labels = np.concatenate(labels)

        print(f"Generated {embeddings.shape[0]} embeddings with dimension {embeddings.shape[1]}")
        return embeddings, labels


    def visualize_embeddings(embeddings, labels, method='umap', n_samples=5000):
        """Visualize embeddings using UMAP or t-SNE"""
        print(f"\n=== Visualizing Embeddings with {method.upper()} ===")

        # subsample so the 2D projection stays fast
        if len(embeddings) > n_samples:
            indices = np.random.choice(len(embeddings), n_samples, replace=False)
            embeddings = embeddings[indices]
            labels = labels[indices]

        if method == 'umap':
            reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
        else:
            reducer = TSNE(n_components=2, perplexity=30, metric='cosine')

        embeddings_2d = reducer.fit_transform(embeddings)

        plt.figure(figsize=(12, 10))
        scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
                              c=labels, cmap='tab10', s=5, alpha=0.6)
        plt.colorbar(scatter)
        plt.title(f'CIFAR-10 Embeddings ({method.upper()})')
        plt.xlabel('Component 1')
        plt.ylabel('Component 2')
        plt.tight_layout()
        plt.savefig(f'embeddings_{method}.png', dpi=150)
        print(f"Saved visualization to embeddings_{method}.png")
        plt.show()


    def select_coreset(embeddings, labels, budget=1000, method='diversity'):
        """
        Select a coreset using different strategies:
        - diversity: Maximum diversity using k-center greedy
        - balanced: Class-balanced selection
        """
        print(f"\n=== Coreset Selection ({method}) ===")

        if method == 'balanced':
            selected_indices = []
            n_classes = len(np.unique(labels))
            per_class = budget // n_classes

            for cls in range(n_classes):
                cls_indices = np.where(labels == cls)[0]
                selected = np.random.choice(cls_indices, min(per_class, len(cls_indices)), replace=False)
                selected_indices.extend(selected)

            return np.array(selected_indices)

        elif method == 'diversity':
            selected_indices = []
            remaining_indices = set(range(len(embeddings)))

            first_idx = np.random.randint(len(embeddings))
            selected_indices.append(first_idx)
            remaining_indices.remove(first_idx)

            for _ in range(budget - 1):
                if not remaining_indices:
                    break

                remaining = list(remaining_indices)
                selected_emb = embeddings[selected_indices]
                remaining_emb = embeddings[remaining]

                # distance from each remaining point to its nearest selected point
                distances = np.min(
                    np.linalg.norm(remaining_emb[:, None] - selected_emb, axis=2), axis=1
                )

                max_dist_idx = np.argmax(distances)
                selected_idx = remaining[max_dist_idx]
                selected_indices.append(selected_idx)
                remaining_indices.remove(selected_idx)

            print(f"Selected {len(selected_indices)} samples")
            return np.array(selected_indices)

We extract feature embeddings from our trained backbone, cache them with labels, and project them to 2D using UMAP or t-SNE to see the cluster structure emerge. Next, we curate data using a coreset selector, either class-balanced or diversity-driven (k-center greedy), to prioritize the most informative, non-redundant samples for downstream training. This pipeline helps us both see what the model learns and select what matters most.

    def evaluate_linear_probe(model, train_subset, test_dataset, device='cuda'):
        """Train linear classifier on frozen features"""
        model.eval()

        train_loader = DataLoader(train_subset, batch_size=128, shuffle=True, num_workers=2)
        test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False,
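Returning to the diversity-driven coreset selector: its k-center greedy step recomputes the distance from every remaining point to every selected point on each iteration, which can become memory-hungry for large embedding sets and budgets. The sketch below is one possible alternative, not the tutorial's code. It keeps a cache of each point's distance to its nearest selected center and updates it only against the newest center. It uses plain Python lists so it runs without NumPy; the function name and the toy points are hypothetical.

```python
import math
import random


def k_center_greedy(points, budget, seed=0):
    """Greedy k-center selection with an incrementally updated distance cache."""
    if not points or budget <= 0:
        return []
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    selected = [first]
    chosen = {first}
    # min_dist[i] = distance from point i to its nearest selected point so far
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < min(budget, len(points)):
        candidates = [i for i in range(len(points)) if i not in chosen]
        # pick the unselected point farthest from the current selection
        nxt = max(candidates, key=min_dist.__getitem__)
        selected.append(nxt)
        chosen.add(nxt)
        # only the newly added center can lower any cached distance
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


if __name__ == "__main__":
    # three well-separated groups; a budget of 3 should cover all of them
    pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 5.0)]
    print(sorted(k_center_greedy(pts, budget=3)))
```

The selection rule is the same farthest-point rule as in the tutorial; only the bookkeeping changes, so each iteration costs one pass over the points instead of a full remaining-by-selected distance matrix.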


AI, Committee, News, Uncategorized

Sentient AI Releases ROMA: An Open-Source and AGI Focused Meta-Agent Framework for Building AI Agents with Hierarchical Task Execution

Sentient AI has released ROMA (Recursive Open Meta-Agent), an open-source meta-agent framework for building high-performance multi-agent systems. ROMA structures agentic workflows as a hierarchical, recursive task tree: parent nodes break a complex goal into subtasks, pass them down to child nodes as context, and later aggregate their solutions as results flow back up, making the context flow transparent and fully traceable across node transitions.

Architecture: Atomize → Plan → Execute → Aggregate

ROMA defines a minimal, recursive control loop. A node first atomizes a request, that is, decides whether it is atomic. If it is non-atomic, a planner decomposes it into subtasks; otherwise, an executor runs the task via an LLM, a tool/API, or even a nested agent. An aggregator then merges child outputs into the parent’s answer. This decision loop repeats for each subtask, producing a dependency-aware tree that executes independent branches in parallel and enforces left-to-right ordering when a subtask depends on a previous sibling.

Source: https://blog.sentient.xyz/posts/recursive-open-meta-agent

Information moves top-down as tasks are broken down and bottom-up as results are aggregated. ROMA also permits human checkpoints at any node (e.g., to confirm a plan or fact-check a critical hop) and surfaces stage tracing (inputs and outputs per node), so developers can debug and refine prompts, tools, and routing policies with visibility into every transition. This addresses the common observability gap in agent frameworks.

Developer Surface and Stack

ROMA provides a setup.sh quick start with Docker Setup (Recommended) or Native Setup, plus flags for E2B sandbox integration (--e2b, --test-e2b). The stack lists Backend: Python 3.12+ with FastAPI/Flask; Frontend: React + TypeScript with real-time WebSocket; LLM Support: any provider via LiteLLM; and Code Execution: E2B sandboxes. Data paths support enterprise S3 mounting with goofys FUSE, path-injection checks, and secure AWS credential handling, keeping leaf skills swappable while the meta-architecture manages the task graph and dependencies. In development, you can wire ROMA to closed or open LLMs, local models, deterministic tools, or other agents without touching the meta-layer; inputs and outputs are defined with Pydantic for structured, auditable I/O during runs and tracing.

Why the Recursion Matters

The recursive breakdown confines context to what each node requires, curbing prompt sprawl, while stage-level tracing (with structured Pydantic I/O) makes the flow transparent and fully traceable, so failures are diagnosable rather than black-box. Independent siblings can run in parallel and dependency edges impose sequencing, turning model, prompt, and tool choices into controlled, observable components within the plan-execute-aggregate loop.

Benchmarks: ROMA Search

To validate the architecture, Sentient built ROMA Search, an internet search agent implemented on the ROMA scaffold (no domain-specific “deep research” heuristics claimed). On SEALQA (Seal-0), a subset designed to stress multi-source reasoning, ROMA Search is reported at 45.6% accuracy, exceeding Kimi Researcher (36%) and Gemini 2.5 Pro (19.8%). ROMA also reports state-of-the-art results on FRAMES (multi-step reasoning) and near-SOTA results on SimpleQA (factual retrieval). As with all vendor-published results, treat these as directional until independently reproduced, but they suggest the architecture is competitive across reasoning-heavy and fact-centric tasks.

For additional context on SEALQA, the benchmark targets search-augmented reasoning where web results can be conflicting or noisy. Seal-0 focuses on questions that challenge current systems, aligning with ROMA’s emphasis on robust decomposition and verification steps.

Where ROMA Fits

ROMA positions itself as the backbone for open-source meta-agents. The design emphasizes transparency via stage tracing and supports human-in-the-loop checkpoints, while its modular nodes let builders plug in any model, tool, or agent and exploit parallelization for independent branches. This makes multi-step workloads, ranging from financial analysis to creative generation, easier to engineer with explicit context flow and observable execution.

Editorial Comments

ROMA is not another “agent wrapper”; it looks like a disciplined recursive scaffold: Atomizer → Planner → Executor → Aggregator, traced at every hop, parallel where safe, sequential where required. The early ROMA Search results are promising and align with the framework’s goals, but the more important outcome is developer control (clear task graphs, typed interfaces, and transparent context flow), so teams can iterate quickly and verify each stage. With Apache-2.0 licensing and an implementation that already includes FastAPI/React tooling, LiteLLM integration, and sandboxed execution paths, ROMA is a practical base for building long-horizon agent systems with measurable, inspectable behavior.

The post Sentient AI Releases ROMA: An Open-Source and AGI Focused Meta-Agent Framework for Building AI Agents with Hierarchical Task Execution appeared first on MarkTechPost.
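The atomize, plan, execute, aggregate loop described in the article can be sketched in a few lines. This is an illustration of the control flow only, not ROMA's API: the function name, its parameters, and the toy callbacks are all hypothetical, and the sketch treats every sibling as independent, so it does not model the left-to-right ordering ROMA enforces for dependent subtasks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def solve(task: str,
          is_atomic: Callable[[str], bool],
          plan: Callable[[str], List[str]],
          execute: Callable[[str], str],
          aggregate: Callable[[str, List[str]], str],
          depth: int = 0,
          max_depth: int = 4) -> str:
    """Recursive atomize -> plan -> execute -> aggregate loop (illustrative)."""
    # Atomize: leaf tasks (or tasks at the depth limit) go straight to the executor.
    if depth >= max_depth or is_atomic(task):
        return execute(task)
    # Plan: decompose into subtasks, each solved by a child node.
    subtasks = plan(task)
    # Siblings are assumed independent here; map() keeps results in plan order.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda sub: solve(sub, is_atomic, plan, execute, aggregate,
                              depth + 1, max_depth),
            subtasks))
    # Aggregate: merge child outputs into this node's answer.
    return aggregate(task, results)


if __name__ == "__main__":
    # Toy callbacks: a task is atomic when it contains no "+".
    answer = solve(
        "a+b+c",
        is_atomic=lambda t: "+" not in t,
        plan=lambda t: t.split("+", 1),
        execute=lambda t: t.upper(),
        aggregate=lambda t, parts: "+".join(parts),
    )
    print(answer)
```

In a real system each callback would be an LLM call, a tool, or a nested agent, and per-node inputs and outputs would be logged to provide the stage tracing the article describes.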


AI, Committee, News, Uncategorized

Agentic Context Engineering (ACE): Self-Improving LLMs via Evolving Contexts, Not Fine-Tuning

TL;DR: A team of researchers from Stanford University, SambaNova Systems and UC Berkeley introduce ACE framework that improves LLM performance by editing and growing the input context instead of updating model weights. Context is treated as a living “playbook” maintained by three roles—Generator, Reflector, Curator—with small delta items merged incrementally to avoid brevity bias and context collapse. Reported gains: +10.6% on AppWorld agent tasks, +8.6% on finance reasoning, and ~86.9% average latency reduction vs strong context-adaptation baselines. On the AppWorld leaderboard snapshot (Sept 20, 2025), ReAct+ACE (59.4%) ≈ IBM CUGA (60.3%, GPT-4.1) while using DeepSeek-V3.1. https://arxiv.org/pdf/2510.04618 What ACE changes? ACE positions “context engineering” as a first-class alternative to parameter updates. Instead of compressing instructions into short prompts, ACE accumulates and organizes domain-specific tactics over time, arguing that higher context density improves agentic tasks where tools, multi-turn state, and failure modes matter. Method: Generator → Reflector → Curator Generator executes tasks and produces trajectories (reasoning/tool calls), exposing helpful vs harmful moves. Reflector distills concrete lessons from those traces. Curator converts lessons into typed delta items (with helpful/harmful counters) and merges them deterministically, with de-duplication and pruning to keep the playbook targeted. Two design choices—incremental delta updates and grow-and-refine—preserve useful history and prevent “context collapse” from monolithic rewrites. To isolate context effects, the research team fixes the same base LLM (non-thinking DeepSeek-V3.1) across all three roles. Benchmarks AppWorld (agents): Built on the official ReAct baseline, ReAct+ACE outperforms strong baselines (ICL, GEPA, Dynamic Cheatsheet), with +10.6% average over selected baselines and ~+7.6% over Dynamic Cheatsheet in online adaptation. 
On the Sept 20, 2025 leaderboard, ReAct+ACE scores 59.4% vs 60.3% for IBM CUGA (GPT-4.1); ACE surpasses CUGA on the harder test-challenge split while using a smaller open-source base model.

Finance (XBRL): On FiNER token tagging and XBRL Formula numerical reasoning, ACE reports +8.6% average over baselines with ground-truth labels for offline adaptation; it also works with execution-only feedback, though the quality of the signals matters.

Cost and latency

ACE’s non-LLM merges plus localized updates reduce adaptation overhead substantially:
Offline (AppWorld): −82.3% latency and −75.1% rollouts vs GEPA.
Online (FiNER): −91.5% latency and −83.6% token cost vs Dynamic Cheatsheet.

Key Takeaways

ACE = context-first adaptation: Improves LLMs by incrementally editing an evolving “playbook” (delta items) curated by Generator→Reflector→Curator, using the same base LLM (non-thinking DeepSeek-V3.1) to isolate context effects and avoid collapse from monolithic rewrites.
Measured gains: ReAct+ACE reports +10.6% over strong baselines on AppWorld and achieves 59.4% vs IBM CUGA 60.3% (GPT-4.1) on the Sept 20, 2025 leaderboard snapshot; finance benchmarks (FiNER + XBRL Formula) show +8.6% average over baselines.
Lower overhead than reflective-rewrite baselines: ACE reduces adaptation latency by ~82–92% and rollouts/token cost by ~75–84%, contrasting with Dynamic Cheatsheet’s persistent memory and GEPA’s Pareto prompt evolution approaches.

Conclusion

ACE positions context engineering as a first-class alternative to weight updates: maintain a persistent, curated playbook that accumulates task-specific tactics, yielding measurable gains on AppWorld and finance reasoning while cutting adaptation latency, rollouts and token cost versus reflective-rewrite baselines.
The approach is practical (deterministic merges, delta items, and long-context-aware serving), and its limits are clear: outcomes track feedback quality and task complexity. If adopted, agent stacks may “self-tune” primarily through evolving context rather than new checkpoints.

The post Agentic Context Engineering (ACE): Self-Improving LLMs via Evolving Contexts, Not Fine-Tuning appeared first on MarkTechPost.
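
As a rough illustration of the Curator step described in this article (delta items with helpful/harmful counters, merged deterministically with de-duplication and pruning), here is a minimal Python sketch. The class name, field names, normalization rule and pruning threshold are all assumptions made for illustration; the article does not specify how ACE implements any of them.

```python
from dataclasses import dataclass


@dataclass
class DeltaItem:
    """One playbook entry. Field names are hypothetical, not taken from the paper."""
    text: str
    helpful: int = 0
    harmful: int = 0


def normalize(text: str) -> str:
    # Assumed de-duplication key: case- and whitespace-insensitive match.
    return " ".join(text.lower().split())


def merge_deltas(playbook: dict[str, DeltaItem], deltas: list[DeltaItem]) -> None:
    """Merge new delta items in place; existing entries are never rewritten."""
    for delta in deltas:
        key = normalize(delta.text)
        if key in playbook:
            # Duplicate lesson: accumulate counters instead of storing a copy.
            playbook[key].helpful += delta.helpful
            playbook[key].harmful += delta.harmful
        else:
            playbook[key] = DeltaItem(delta.text, delta.helpful, delta.harmful)


def prune(playbook: dict[str, DeltaItem], margin: int = 0) -> None:
    """Drop items whose harmful count exceeds the helpful count by more than `margin`."""
    stale = [k for k, item in playbook.items() if item.harmful - item.helpful > margin]
    for key in stale:
        del playbook[key]


playbook: dict[str, DeltaItem] = {}
merge_deltas(playbook, [
    DeltaItem("Check API pagination before summing results", helpful=1),
    DeltaItem("Retry on any error without reading it", harmful=2),
])
# A near-duplicate lesson only bumps the existing counter.
merge_deltas(playbook, [DeltaItem("check API pagination  before summing results", helpful=1)])
prune(playbook)
```

Because the merge involves no LLM call and touches only the affected entries, a design along these lines would be consistent with the low adaptation overhead the article reports, though the real system may differ substantially.
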

Agentic Context Engineering (ACE): Self-Improving LLMs via Evolving Contexts, Not Fine-Tuning Read post »

AI, Committee, News, Uncategorized

The Download: our bodies’ memories, and Traton’s electric trucks

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

How do our bodies remember?

“Like riding a bike” is shorthand for the remarkable way that our bodies remember how to move. Most of the time when we talk about muscle memory, we’re not talking about the muscles themselves but about the memory of a coordinated movement pattern that lives in the motor neurons, which control our muscles. Yet in recent years, scientists have discovered that our muscles themselves have a memory for movement and exercise. And the more we move, as with riding a bike or other kinds of exercise, the more those cells begin to make a memory of that exercise. Read the full story. By Bonnie Tsui

This piece is part of MIT Technology Review Explains: our series untangling the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here. This story is also from our forthcoming print issue, which is all about the body. If you haven’t already, subscribe now to receive future issues once they land. You’ll also receive a free digital report on nuclear power.

2025 climate tech companies to watch: Traton and its electric trucks

Every day, trucks carry many millions of tons of cargo down roads and highways around the world. Nearly all run on diesel and make up one of the largest commercial sources of carbon emissions. Traton, a subsidiary of Volkswagen, is producing zero-emission trucks that could help clean up this sector, while also investing in a Europe-wide advanced charging network so other manufacturers can more easily follow suit. Read the full story. By Amy Nordrum

Traton is one of our 10 climate tech companies to watch, our annual list of some of the most promising climate tech firms on the planet. Check out the rest of the list here.

This test could reveal the health of your immune system

We know surprisingly little about our immune health.
The vast array of cells, proteins, and biomolecules that works to defend us from disease is mind-bogglingly complicated. Immunologists are still getting to grips with how it all works. Now, a new test is being developed to measure immune health, one that even gives you a score. But that’s a difficult thing to do, for several reasons. Read the full story. By Jessica Hamzelou

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, sign up here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 China is cracking down on imports of Nvidia’s AI chips
Customs officers are combing shipments looking for the company’s China-specific chips. (FT $)
+ US officials are investigating a firm that’s suspected of helping China sidestep export restrictions. (NYT $)

2 Tesla’s ‘full self-driving’ feature is under investigation
After multiple reports that vehicles using it ran red lights. (WP $)
+ The company is slashing its prices to compete with Chinese giant BYD. (Rest of World)
+ Elon Musk will still receive billions, even if he fails to achieve his ambitious goals. (Reuters)

3 A data hoarder has created a searchable database of Epstein files
Making it simple to find mentions of specific people and locations. (404 Media)

4 OpenAI says GPT-5 is its least-biased model yet
Even when presented with “challenging, emotionally charged prompts.” (Axios)

5 The developers behind ICE-tracking apps aren’t giving up
They’re fighting Apple’s decision to remove their creations from its app store. (Wired $)
+ Another effort to track ICE raids was just taken offline. (MIT Technology Review)

6 The world’s biodiversity crisis is worsening
More than half of all bird species are in decline. (The Guardian)
+ The short, strange history of gene de-extinction. (MIT Technology Review)

7 YouTube is extending an olive branch to banned creators
It’s overturned a lifetime ban policy to give the people behind previously banned channels a second chance. (CNBC)
+ But users kicked off for copyright infringement or extremism aren’t eligible. (Bloomberg $)

8 This startup wants to bring self-flying planes to our skies
Starting with military cargo flights. (WSJ $)

9 Your plumber might be using ChatGPT
They’re increasingly using the chatbot to troubleshoot on the ground. (CNN)

10 Do robots really need hands?
Maybe not, but that’s not standing in the way of researchers trying to recreate them. (Fast Company $)
+ Will we ever trust robots? (MIT Technology Review)

Quote of the day

“Social media is a complete dumpster.”

Hany Farid, a professor of computer science at the University of California, Berkeley, describes the proliferation of AI slop videos infiltrating digital platforms to the New York Times.

One more thing

Who gets to decide who receives experimental medical treatments?

There has been a trend toward lowering the bar for new medicines, and it is becoming easier for people to access treatments that might not help them and could even harm them. Anecdotes appear to be overpowering evidence in decisions on drug approval. As a result, we’re ending up with some drugs that don’t work. We urgently need to question how these decisions are made. Who should have access to experimental therapies? And who should get to decide? Such questions are especially pressing considering how quickly biotechnology is advancing. We’re not just improving on existing classes of treatments; we’re creating entirely new ones. Read the full story. By Jessica Hamzelou

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)
+ I love this crowd-sourced compendium of every known Wilhelm scream in all sorts of media.
+ Happy birthday to pocket rocket Bruno Mars, who turned 40 this week.
+ Here’s how to visit an interstellar interloper.
+ Bumi the penguin is having the absolute time of their life with this bubble machine.

The Download: our bodies’ memories, and Traton’s electric trucks Read post »

AI, Committee, News, Uncategorized

Together AI’s ATLAS adaptive speculator delivers 400% inference speedup by learning from workloads in real-time

Enterprises expanding AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can’t keep up with shifting workloads.

Speculators are smaller AI models that work alongside large language models during inference. They draft multiple tokens ahead, which the main model then verifies in parallel. This technique (called speculative decoding) has become essential for enterprises trying to reduce inference costs and latency. Instead of generating tokens one at a time, the system can accept multiple tokens at once, dramatically improving throughput.

Together AI today announced research and a new system called ATLAS (AdapTive-LeArning Speculator System) that aims to help enterprises overcome the challenge of static speculators. The technique provides a self-learning inference optimization capability that can help deliver up to 400% faster inference than the baseline performance of existing inference technologies such as vLLM. The system addresses a critical problem: as AI workloads evolve, inference speeds degrade, even with specialized speculators in place.

The company, which got its start in 2023, has been focused on optimizing inference on its enterprise AI platform. Earlier this year the company raised $305 million as customer adoption and demand have grown.

“Companies we work with generally, as they scale up, they see shifting workloads, and then they don’t see as much speedup from speculative execution as before,” Tri Dao, chief scientist at Together AI, told VentureBeat in an exclusive interview. “These speculators generally don’t work well when their workload domain starts to shift.”

The workload drift problem no one talks about

Most speculators in production today are “static” models. They’re trained once on a fixed dataset representing expected workloads, then deployed without any ability to adapt. Companies like Meta and Mistral ship pre-trained speculators alongside their main models.
Inference platforms like vLLM use these static speculators to boost throughput without changing output quality.

But there’s a catch. When an enterprise’s AI usage evolves, the static speculator’s accuracy plummets.

“If you’re a company producing coding agents, and most of your developers have been writing in Python, all of a sudden some of them switch to writing Rust or C, then you see the speed starts to go down,” Dao explained. “The speculator has a mismatch between what it was trained on versus what the actual workload is.”

This workload drift represents a hidden tax on scaling AI. Enterprises either accept degraded performance or invest in retraining custom speculators. That process captures only a snapshot in time and quickly becomes outdated.

How adaptive speculators work: A dual-model approach

ATLAS uses a dual-speculator architecture that combines stability with adaptation:

The static speculator – A heavyweight model trained on broad data provides consistent baseline performance. It serves as a “speed floor.”
The adaptive speculator – A lightweight model learns continuously from live traffic. It specializes on-the-fly to emerging domains and usage patterns.
The confidence-aware controller – An orchestration layer dynamically chooses which speculator to use. It adjusts the speculation “lookahead” based on confidence scores.

“Before the adaptive speculator learns anything, we still have the static speculator to help provide the speed boost in the beginning,” Ben Athiwaratkun, staff AI scientist at Together AI, explained to VentureBeat. “Once the adaptive speculator becomes more confident, then the speed grows over time.”

The technical innovation lies in balancing acceptance rate (how often the target model agrees with drafted tokens) and draft latency. As the adaptive model learns from traffic patterns, the controller relies more on the lightweight speculator and extends lookahead. This compounds performance gains. Users don’t need to tune any parameters.
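
To make the draft-then-verify loop and the lookahead control described above more concrete, here is a minimal Python sketch. It uses simple greedy matching (a drafted token is accepted only if it equals the target model's own next token), toy integer "models", and a made-up linear rule for choosing the lookahead. None of these details come from Together AI; the function names and the controller rule are hypothetical.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token context to the next token


def speculative_step(target: Model, draft: Model, context: List[int], lookahead: int) -> List[int]:
    """Draft `lookahead` tokens, then keep the prefix the target agrees with."""
    drafted, ctx = [], list(context)
    for _ in range(lookahead):
        tok = draft(ctx)
        drafted.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    # A real system would score all drafted positions in one batched target pass;
    # this loop only reproduces the outcome of that verification.
    for tok in drafted:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first miss
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # extra token when every draft is accepted
    return accepted


def choose_lookahead(acceptance_rate: float, lo: int = 2, hi: int = 8) -> int:
    """Hypothetical controller rule: speculate further when drafts are usually accepted."""
    clamped = max(0.0, min(1.0, acceptance_rate))
    return lo + round((hi - lo) * clamped)


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except right after token 4, simulating workload drift.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens = speculative_step(toy_target, toy_draft, [1], lookahead=5)
print(tokens)  # [2, 3, 4, 5]
```

In the toy run, three drafted tokens are accepted and the fourth is corrected, so one verification yields four tokens. A lower acceptance rate shortens that accepted prefix, which is why a controller might reduce the lookahead when the speculator and the workload drift apart.
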
“On the user side, users don’t have to turn any knobs,” Dao said. “On our side, we have turned these knobs for users to adjust in a configuration that gets good speedup.”

Performance that rivals custom silicon

Together AI’s testing shows ATLAS reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. More impressively, those numbers on Nvidia B200 GPUs match or exceed specialized inference chips like Groq’s custom hardware.

“The software and algorithmic improvement is able to close the gap with really specialized hardware,” Dao said. “We were seeing 500 tokens per second on these huge models that are even faster than some of the customized chips.”

The 400% speedup that the company claims for inference represents the cumulative effect of Together’s Turbo optimization suite. FP4 quantization delivers an 80% speedup over the FP8 baseline. The static Turbo Speculator adds another 80-100% gain. The adaptive system layers on top. Each optimization compounds the benefits of the others.

Compared to standard inference engines like vLLM or Nvidia’s TensorRT-LLM, the improvement is substantial. Together AI benchmarks against the stronger baseline between the two for each workload before applying speculative optimizations.

The memory-compute tradeoff explained

The performance gains stem from exploiting a fundamental inefficiency in modern inference: wasted compute capacity. Dao explained that typically during inference, much of the compute power is not fully utilized.

“During inference, which is actually the dominant workload nowadays, you’re mostly using the memory subsystem,” he said.

Speculative decoding trades idle compute for reduced memory access. When a model generates one token at a time, it’s memory-bound. The GPU sits idle while waiting for memory. But when the speculator proposes five tokens and the target model verifies them simultaneously, compute utilization spikes while memory access remains roughly constant.
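
The figures quoted in this section can be checked with a little arithmetic. The Python sketch below rests on two assumptions that the article does not state outright: that the quoted percentage gains multiply (one reading of "compounds"), and that every drafted token is accepted, so one target-model pass covers five tokens. It also ignores the cost of running the speculator itself.

```python
def compound(*gains_pct: float) -> float:
    """Combine percentage speedups multiplicatively into a throughput multiplier."""
    multiplier = 1.0
    for gain in gains_pct:
        multiplier *= 1.0 + gain / 100.0
    return multiplier


def weight_reads(tokens: int, tokens_per_target_pass: int) -> int:
    """Target-model passes (each streaming the weights once) needed for `tokens` tokens."""
    return -(-tokens // tokens_per_target_pass)  # ceiling division


# FP4 quantization (+80%) combined with the static speculator (+80% to +100%).
low = compound(80, 80)
high = compound(80, 100)
print(f"FP4 + static speculator: {low:.2f}x to {high:.2f}x")

# Five tokens generated one at a time vs. five drafted tokens verified together.
print(weight_reads(5, 1), weight_reads(5, 5))
```

Under these assumptions the first two optimizations alone give roughly 3.2x to 3.6x, with the adaptive speculator accounting for whatever remains of the headline figure, and the five-token case drops from five weight reads to one.
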
“The total amount of compute to generate five tokens is the same, but you only had to access memory once, instead of five times,” Dao said.

Think of it as intelligent caching for AI

For infrastructure teams familiar with traditional database optimization, adaptive speculators function like an intelligent caching layer, but with a crucial difference.

Traditional caching systems like Redis or memcached require exact matches. You store the exact same query result and retrieve it when that specific query runs again. Adaptive speculators work differently.

“You can view it as an intelligent way of caching, not storing exactly, but figuring out some patterns that you see,” Dao explained. “Broadly, we’re observing that you’re working with similar code, or working with similar, you

Together AI’s ATLAS adaptive speculator delivers 400% inference speedup by learning from workloads in real-time Read post »
