YouZum


Inside the marketplace powering bespoke AI deepfakes of real women

Civitai—an online marketplace for buying and selling AI-generated content, backed by the venture capital firm Andreessen Horowitz—is letting users buy custom instruction files for generating celebrity deepfakes. Some of these files were specifically designed to make pornographic images banned by the site, a new analysis has found.

The study, from researchers at Stanford and Indiana University, looked at people’s requests for content on the site, called “bounties.” The researchers found that between mid-2023 and the end of 2024, most bounties asked for animated content—but a significant portion were for deepfakes of real people, and 90% of these deepfake requests targeted women. (Their findings have not yet been peer reviewed.)

The debate around deepfakes, as illustrated by the recent backlash to explicit images on the X-owned chatbot Grok, has revolved around what platforms should do to block such content. Civitai’s situation is a little more complicated. Its marketplace includes actual images, videos, and models, but it also lets individuals buy and sell instruction files called LoRAs that can coach mainstream AI models like Stable Diffusion into generating content they were not trained to produce. Users can then combine these files with other tools to make deepfakes that are graphic or sexual. The researchers found that 86% of deepfake requests on Civitai were for LoRAs.

In these bounties, users requested “high quality” models to generate images of public figures like the influencer Charli D’Amelio or the singer Gracie Abrams, often linking to their social media profiles so their images could be grabbed from the web. Some requests specified a desire for models that generated the individual’s entire body, accurately captured their tattoos, or allowed hair color to be changed. Some requests targeted several women in specific niches, like artists who record ASMR videos. One request was for a deepfake of a woman said to be the user’s wife. Anyone on the site could offer up AI models they worked on for the task, and the best submissions received payment—anywhere from $0.50 to $5. And nearly 92% of the deepfake bounties were awarded.

Neither Civitai nor Andreessen Horowitz responded to requests for comment.

It’s possible that people buy these LoRAs to make deepfakes that aren’t sexually explicit (though they’d still violate Civitai’s terms of use, and they’d still be ethically fraught). But Civitai also offers educational resources on how to use external tools to further customize the outputs of image generators—for example, by changing someone’s pose. The site also hosts user-written articles with details on how to instruct models to generate pornography. The researchers found that the amount of porn on the platform has gone up, and that the majority of requests each week are now for NSFW content.

“Not only does Civitai provide the infrastructure that facilitates these issues; they also explicitly teach their users how to utilize them,” says Matthew DeVerna, a postdoctoral researcher at Stanford’s Cyber Policy Center and one of the study’s leaders.

The company used to ban only sexually explicit deepfakes of real people, but in May 2025 it announced it would ban all deepfake content. Nonetheless, countless requests for deepfakes submitted before this ban remain live on the site, and many of the winning submissions fulfilling those requests remain available for purchase, MIT Technology Review confirmed.

“I believe the approach that they’re trying to take is to sort of do as little as possible, such that they can foster as much—I guess they would call it—creativity on the platform,” DeVerna says.

Users buy LoRAs with the site’s online currency, called Buzz, which is purchased with real money. In May 2025, Civitai’s credit card processor cut off the company because of its ongoing problem with nonconsensual content. To pay for explicit content, users must now use gift cards or cryptocurrency to buy Buzz; the company offers a different scrip for non-explicit content.

Civitai automatically tags bounties requesting deepfakes and lists a way for the person featured in the content to manually request its takedown. This system means that Civitai has a reasonably reliable way of knowing which bounties are for deepfakes, but it’s still leaving moderation to the general public rather than carrying it out proactively.

A company’s legal liability for what its users do isn’t totally clear. Generally, tech companies have broad legal protections against such liability for their content under Section 230 of the Communications Decency Act, but those protections aren’t limitless. For example, “you cannot knowingly facilitate illegal transactions on your website,” says Ryan Calo, a professor specializing in technology and AI at the University of Washington’s law school. (Calo wasn’t involved in this new study.)

Civitai joined OpenAI, Anthropic, and other AI companies in 2024 in adopting design principles to guard against the creation and spread of AI-generated child sexual abuse material. This move followed a 2023 report from the Stanford Internet Observatory, which found that the vast majority of AI models named in child sexual abuse communities were Stable Diffusion–based models “predominantly obtained via Civitai.” But adult deepfakes have not gotten the same level of attention from content platforms or the venture capital firms that fund them. “They are not afraid enough of it. They are overly tolerant of it,” Calo says. “Neither law enforcement nor civil courts adequately protect against it. It is night and day.”

Civitai received a $5 million investment from Andreessen Horowitz (a16z) in November 2023. In a video shared by a16z, Civitai cofounder and CEO Justin Maier described his goal of building the main place where people find and share AI models for their own individual purposes. “We’ve aimed to make this space that’s been very, I guess, niche and engineering-heavy more and more approachable to more and more people,” he said.

Civitai is not the only company with a deepfake problem in a16z’s investment portfolio; in February, MIT Technology Review first reported that another company, Botify AI, was hosting AI companions resembling real actors that stated their age as under 18, engaged in sexually charged conversations, offered “hot photos,” and in some instances described


The Download: US immigration agencies’ AI videos, and inside the Vitalism movement

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

DHS is using Google and Adobe AI to make videos

The news: The US Department of Homeland Security is using AI video generators from Google and Adobe to make and edit content shared with the public, a new document reveals. The document, released on Wednesday, provides an inventory of which commercial AI tools DHS uses for tasks ranging from generating drafts of documents to managing cybersecurity.

Why it matters: It comes as immigration agencies have flooded social media with content to support President Trump’s mass deportation agenda—some of which appears to be made with AI—and as workers in tech have put pressure on their employers to denounce the agencies’ activities. Read the full story.

—James O’Donnell

How the sometimes-weird world of lifespan extension is gaining influence

—Jessica Hamzelou

For the last couple of years, I’ve been following the progress of a group of individuals who believe death is humanity’s “core problem.” Put simply, they say death is wrong—for everyone. They’ve even said it’s morally wrong. They established what they consider a new philosophy, and they called it Vitalism.

Vitalism is more than a philosophy, though—it’s a movement for hardcore longevity enthusiasts who want to make real progress in finding treatments that slow or reverse aging. Not just through scientific advances, but by persuading influential people to support their movement, and by changing laws and policies to open up access to experimental drugs. And they’re starting to make progress.

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

The AI Hype Index: Grok makes porn, and Claude Code nails your job

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry. Take a look at this month’s edition of the index here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Capgemini is no longer tracking immigrants for ICE
After the French company was queried by the country’s government over the contract. (WP $)
+ Here’s how the agency typically keeps tabs on its targets. (NYT $)
+ US senators are pushing for answers about its recent surveillance shopping spree. (404 Media)
+ ICE’s tactics would get real soldiers killed, apparently. (Wired $)

2 The Pentagon is at loggerheads with Anthropic
The AI firm is reportedly worried its tools could be used to spy on Americans. (Reuters)
+ Generative AI is learning to spy for the US military. (MIT Technology Review)

3 It’s relatively rare for AI chatbots to lead users down harmful paths
But when they do, it can have incredibly dangerous consequences. (Ars Technica)
+ The AI doomers feel undeterred. (MIT Technology Review)

4 GPT-4o’s days are numbered
OpenAI says just 0.1% of users are using the model every day. (CNBC)
+ It’s the second time that it’s tried to turn the sycophantic model off in under a year. (Insider $)
+ Why GPT-4o’s sudden shutdown left people grieving. (MIT Technology Review)

5 An AI toy company left its chats with kids exposed
Anyone with a Gmail account was able to simply access the conversations—no hacking required. (Wired $)
+ AI toys are all the rage in China—and now they’re appearing on shelves in the US too. (MIT Technology Review)

6 SpaceX could merge with xAI later this year
Ahead of a planned blockbuster IPO of Elon Musk’s companies. (Reuters)
+ The move would be welcome news for Musk fans. (The Information $)
+ A SpaceX-Tesla merger could also be on the cards. (Bloomberg $)

7 We’re still waiting for a reliable male contraceptive
Take a look at the most promising methods so far. (Bloomberg $)

8 AI is bringing traditional Chinese medicine to the masses
And it’s got the full backing of the country’s government. (Rest of World)

9 The race back to the Moon is heating up
Competition between the US and China is more intense than ever. (Economist $)

10 What did the past really smell like?
AI could help scientists to recreate history’s aromas—including mummies and battlefields. (Knowable Magazine)

Quote of the day

“I think the tidal wave is coming and we’re all standing on the beach.”

—Bill Zysblat, a music business manager, tells the Financial Times about the existential threat AI poses to the industry.

One more thing

Therapists are secretly using ChatGPT. Clients are triggered.

Declan would never have found out his therapist was using ChatGPT had it not been for a technical mishap. The connection was patchy during one of their online sessions, so Declan suggested they turn off their video feeds. Instead, his therapist began inadvertently sharing his screen.

For the rest of the session, Declan was privy to a real-time stream of ChatGPT analysis rippling across his therapist’s screen: the therapist was taking what Declan was saying, putting it into ChatGPT, and then parroting its answers.

But Declan is not alone. In fact, a growing number of people are reporting receiving AI-generated communiqués from their therapists. Clients’ trust and privacy are being abandoned in the process. Read the full story.

—Laurie Clarke

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ Sinkholes are seriously mysterious. Is there a way to stay one step ahead of them?
+ This beautiful pixel art is super impressive.
+ Amid the upheaval in their city, residents of Minneapolis recently demonstrated both their resistance and community spirit in the annual Art Sled Rally (thanks Paul!)
+ How on Earth is Tomb Raider 30 years old?!


A Coding Implementation for Training, Optimizing, Evaluating, and Interpreting Knowledge Graph Embeddings with PyKEEN

In this tutorial, we walk through an end-to-end, advanced workflow for knowledge graph embeddings using PyKEEN, exploring how modern embedding models are trained, evaluated, optimized, and interpreted in practice. We start by understanding the structure of a real knowledge graph dataset, then systematically train and compare multiple embedding models, tune their hyperparameters, and analyze their performance using robust ranking metrics. We focus not just on running pipelines but on building intuition for link prediction, negative sampling, and embedding geometry, ensuring we understand why each step matters and how it affects downstream reasoning over graphs. Check out the FULL CODES here.

!pip install -q pykeen torch torchvision

import warnings
warnings.filterwarnings('ignore')

import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple

from pykeen.pipeline import pipeline
from pykeen.datasets import Nations, FB15k237, get_dataset
from pykeen.models import TransE, ComplEx, RotatE, DistMult
from pykeen.training import SLCWATrainingLoop, LCWATrainingLoop
from pykeen.evaluation import RankBasedEvaluator
from pykeen.triples import TriplesFactory
from pykeen.hpo import hpo_pipeline
from pykeen.sampling import BasicNegativeSampler
from pykeen.losses import MarginRankingLoss, BCEWithLogitsLoss
from pykeen.trackers import ConsoleResultTracker

print("PyKEEN setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

We set up the complete experimental environment by installing PyKEEN and its deep learning dependencies, and by importing all required libraries for modeling, evaluation, visualization, and optimization. We ensure a clean, reproducible workflow by suppressing warnings and verifying the PyTorch and CUDA configurations for efficient computation. Check out the FULL CODES here.
print("\n" + "="*80)
print("SECTION 2: Dataset Exploration")
print("="*80 + "\n")

dataset = Nations()
print(f"Dataset: {dataset}")
print(f"Number of entities: {dataset.num_entities}")
print(f"Number of relations: {dataset.num_relations}")
print(f"Training triples: {dataset.training.num_triples}")
print(f"Testing triples: {dataset.testing.num_triples}")
print(f"Validation triples: {dataset.validation.num_triples}")

print("\nSample triples (head, relation, tail):")
for i in range(5):
    h, r, t = dataset.training.mapped_triples[i]
    head = dataset.training.entity_id_to_label[h.item()]
    rel = dataset.training.relation_id_to_label[r.item()]
    tail = dataset.training.entity_id_to_label[t.item()]
    print(f"  {head} --[{rel}]--> {tail}")

def analyze_dataset(triples_factory: TriplesFactory) -> pd.DataFrame:
    """Compute basic statistics about the knowledge graph."""
    stats = {'Metric': [], 'Value': []}
    stats['Metric'].extend(['Entities', 'Relations', 'Triples'])
    stats['Value'].extend([
        triples_factory.num_entities,
        triples_factory.num_relations,
        triples_factory.num_triples,
    ])
    unique, counts = torch.unique(triples_factory.mapped_triples[:, 1], return_counts=True)
    stats['Metric'].extend(['Avg triples per relation', 'Max triples for a relation'])
    stats['Value'].extend([counts.float().mean().item(), counts.max().item()])
    return pd.DataFrame(stats)

stats_df = analyze_dataset(dataset.training)
print("\nDataset Statistics:")
print(stats_df.to_string(index=False))

We load and explore the Nations knowledge graph to understand its scale, structure, and relational complexity before training any models. We inspect sample triples to build intuition about how entities and relations are represented internally using indexed mappings. We then compute core statistics such as relation frequency and triple distribution, allowing us to reason about graph sparsity and modeling difficulty upfront. Check out the FULL CODES here.
print("\n" + "="*80)
print("SECTION 3: Training Multiple Models")
print("="*80 + "\n")

models_config = {
    'TransE': {
        'model': 'TransE',
        'model_kwargs': {'embedding_dim': 50},
        'loss': 'MarginRankingLoss',
        'loss_kwargs': {'margin': 1.0},
    },
    'ComplEx': {
        'model': 'ComplEx',
        'model_kwargs': {'embedding_dim': 50},
        'loss': 'BCEWithLogitsLoss',
    },
    'RotatE': {
        'model': 'RotatE',
        'model_kwargs': {'embedding_dim': 50},
        'loss': 'MarginRankingLoss',
        'loss_kwargs': {'margin': 3.0},
    },
}

training_config = {
    'training_loop': 'sLCWA',
    'negative_sampler': 'basic',
    'negative_sampler_kwargs': {'num_negs_per_pos': 5},
    'training_kwargs': {
        'num_epochs': 100,
        'batch_size': 128,
    },
    'optimizer': 'Adam',
    'optimizer_kwargs': {'lr': 0.001},
}

results = {}
for model_name, config in models_config.items():
    print(f"\nTraining {model_name}...")
    result = pipeline(
        dataset=dataset,
        model=config['model'],
        model_kwargs=config.get('model_kwargs', {}),
        loss=config.get('loss'),
        loss_kwargs=config.get('loss_kwargs', {}),
        **training_config,
        random_seed=42,
        device='cuda' if torch.cuda.is_available() else 'cpu',
    )
    results[model_name] = result
    print(f"\n{model_name} Results:")
    print(f"  MRR: {result.metric_results.get_metric('mean_reciprocal_rank'):.4f}")
    print(f"  Hits@1: {result.metric_results.get_metric('hits_at_1'):.4f}")
    print(f"  Hits@3: {result.metric_results.get_metric('hits_at_3'):.4f}")
    print(f"  Hits@10: {result.metric_results.get_metric('hits_at_10'):.4f}")

We define a consistent training configuration and systematically train multiple knowledge graph embedding models to enable fair comparison. We use the same dataset, negative sampling strategy, optimizer, and training loop while allowing each model to leverage its own inductive bias and loss formulation. We then evaluate and record standard ranking metrics, such as MRR and Hits@K, to quantitatively assess each embedding approach’s performance on link prediction. Check out the FULL CODES here.

print("\n" + "="*80)
print("SECTION 4: Model Comparison")
print("="*80 + "\n")

metrics_to_compare = ['mean_reciprocal_rank', 'hits_at_1', 'hits_at_3', 'hits_at_10']
comparison_data = {metric: [] for metric in metrics_to_compare}
model_names = []
for model_name, result in results.items():
    model_names.append(model_name)
    for metric in metrics_to_compare:
        comparison_data[metric].append(result.metric_results.get_metric(metric))

comparison_df = pd.DataFrame(comparison_data, index=model_names)
print("Model Comparison:")
print(comparison_df.to_string())

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Model Performance Comparison', fontsize=16)
for idx, metric in enumerate(metrics_to_compare):
    ax = axes[idx // 2, idx % 2]
    comparison_df[metric].plot(kind='bar', ax=ax, color='steelblue')
    ax.set_title(metric.replace('_', ' ').title())
    ax.set_ylabel('Score')
    ax.set_xlabel('Model')
    ax.grid(axis='y', alpha=0.3)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
plt.tight_layout()
plt.show()

We aggregate evaluation metrics from all trained models into a unified comparison table for direct performance analysis. We visualize key ranking metrics using bar charts, allowing us to quickly identify strengths and weaknesses across different embedding approaches. Check out the FULL CODES here.
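For reference, the ranking metrics used throughout this comparison have standard definitions. Over a set of evaluation triples Q, with rank_i the rank assigned to the correct entity for triple i, in LaTeX:

\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},
\qquad
\mathrm{Hits@}K = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}\left[\mathrm{rank}_i \le K\right]
\]

Higher is better for both: MRR rewards placing the true entity near the top of the ranking, while Hits@K only checks whether it appears among the top K candidates.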
print("\n" + "="*80)
print("SECTION 5: Hyperparameter Optimization")
print("="*80 + "\n")

hpo_result = hpo_pipeline(
    dataset=dataset,
    model='TransE',
    n_trials=10,
    training_loop='sLCWA',
    training_kwargs={'num_epochs': 50},
    device='cuda' if torch.cuda.is_available() else 'cpu',
)

print("\nBest Configuration Found:")
print(f"  Embedding Dim: {hpo_result.study.best_params.get('model.embedding_dim', 'N/A')}")
print(f"  Learning Rate: {hpo_result.study.best_params.get('optimizer.lr', 'N/A')}")
print(f"  Best MRR: {hpo_result.study.best_value:.4f}")

print("\n" + "="*80)
print("SECTION 6: Link Prediction")
print("="*80 + "\n")

best_model_name = comparison_df['mean_reciprocal_rank'].idxmax()
best_result = results[best_model_name]
model = best_result.model
print(f"Using {best_model_name} for predictions")

def predict_tails(model, dataset, head_label: str, relation_label: str, top_k: int = 5):
    """Predict most likely tail entities for a given head and relation."""
    head_id = dataset.entity_to_id[head_label]
    relation_id = dataset.relation_to_id[relation_label]
    num_entities = dataset.num_entities
    heads = torch.tensor([head_id] * num_entities).unsqueeze(1)
    relations = torch.tensor([relation_id] * num_entities).unsqueeze(1)
    tails = torch.arange(num_entities).unsqueeze(1)
    batch = torch.cat([heads, relations, tails], dim=1)
    with torch.no_grad():
        scores = model.predict_hrt(batch)
    top_scores, top_indices = torch.topk(scores.squeeze(), k=top_k)
    predictions = []
    for score, idx in zip(top_scores, top_indices):
        tail_label = dataset.entity_id_to_label[idx.item()]
        predictions.append((tail_label, score.item()))
    return predictions

if dataset.training.num_entities > 10:
    sample_head = list(dataset.entity_to_id.keys())[0]
    sample_relation = list(dataset.relation_to_id.keys())[0]
    print(f"\nTop predictions for: {sample_head} --[{sample_relation}]--> ?")
    predictions = predict_tails(
        best_result.model,
        dataset.training,
        sample_head,
        sample_relation,
        top_k=5,
    )
    for rank, (entity, score) in enumerate(predictions, 1):
        print(f"  {rank}. {entity} (score: {score:.4f})")

We apply automated hyperparameter optimization to systematically search for a stronger TransE configuration that improves ranking performance without manual tuning. We then select the best-performing model based on MRR and use it to perform practical link prediction by scoring all possible tail entities for a given head–relation pair. Check out the FULL CODES here.


AI2 Releases SERA, Soft Verified Coding Agents Built with Supervised Training Only for Practical Repository Level Automation Workflows

Allen Institute for AI (AI2) researchers introduce SERA, Soft Verified Efficient Repository Agents, a coding agent family that aims to match much larger closed systems using only supervised training and synthetic trajectories.

What is SERA?

SERA is the first release in AI2’s Open Coding Agents series. The flagship model, SERA-32B, is built on the Qwen3-32B architecture and is trained as a repository-level coding agent. On SWE-bench Verified at 32K context, SERA-32B reaches a 49.5 percent resolve rate. At 64K context it reaches 54.2 percent. These numbers place it in the same performance band as open-weight systems such as Devstral-Small-2 with 24B parameters and GLM-4.5-Air with 110B parameters, while SERA remains fully open in code, data, and weights. The series includes four models today: SERA-8B, SERA-8B GA, SERA-32B, and SERA-32B GA. All are released on Hugging Face under an Apache 2.0 license.

Soft Verified Generation

The training pipeline relies on Soft Verified Generation (SVG). SVG produces agent trajectories that look like realistic developer workflows, then uses patch agreement between two rollouts as a soft signal of correctness. The process is:

- First rollout: A function is sampled from a real repository. The teacher model, GLM-4.6 in the SERA-32B setup, receives a bug-style or change description and operates with tools to view files, edit code, and run commands. It produces a trajectory T1 and a patch P1.
- Synthetic pull request: The system converts the trajectory into a pull-request-like description. This text summarizes intent and key edits in a format similar to real pull requests.
- Second rollout: The teacher starts again from the original repository, but now it only sees the pull request description and the tools. It produces a new trajectory T2 and a patch P2 that tries to implement the described change.
- Soft verification: The patches P1 and P2 are compared line by line. A recall score r is computed as the fraction of modified lines in P1 that appear in P2. When r equals 1 the trajectory is hard verified. For intermediate values, the sample is soft verified. (A code sketch of this computation appears a few paragraphs below.)

The key result from the ablation study is that strict verification is not required. When models are trained on T2 trajectories with different thresholds on r, even r equals 0, performance on SWE-bench Verified is similar at a fixed sample count. This suggests that realistic multi-step traces, even if noisy, are valuable supervision for coding agents.

(Image source: https://allenai.org/blog/open-coding-agents)

Data scale, training, and cost

SVG is applied to 121 Python repositories derived from the SWE-smith corpus. Across GLM-4.5-Air and GLM-4.6 teacher runs, the full SERA datasets contain more than 200,000 trajectories from both rollouts, making this one of the largest open coding agent datasets. SERA-32B is trained on a subset of 25,000 T2 trajectories from the Sera-4.6-Lite T2 dataset. Training uses standard supervised fine-tuning with Axolotl on Qwen3-32B for 3 epochs, with learning rate 1e-5, weight decay 0.01, and maximum sequence length 32,768 tokens. Many trajectories are longer than the context limit. The research team defines a truncation ratio, the fraction of steps that fit into 32K tokens. They prefer trajectories that already fit, and for the rest they select slices with a high truncation ratio. This ordered truncation strategy clearly outperforms random truncation on SWE-bench Verified scores. The reported compute budget for SERA-32B, including data generation and training, is about 40 GPU days.
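To make the patch agreement signal concrete, here is a minimal Python sketch of the recall computation from the soft verification step. The diff handling is deliberately simplified (a patch is reduced to its set of added or removed lines) and the helper names are mine; the paper’s exact line-matching procedure may differ.

def modified_lines(patch: str) -> set:
    """Collect the changed lines of a unified diff, ignoring the file headers."""
    return {
        line[1:].strip()
        for line in patch.splitlines()
        if line.startswith(('+', '-')) and not line.startswith(('+++', '---'))
    }

def soft_verification_recall(p1: str, p2: str) -> float:
    """r = fraction of modified lines in P1 that also appear in P2."""
    ref = modified_lines(p1)
    if not ref:
        return 0.0
    return len(ref & modified_lines(p2)) / len(ref)

# r == 1.0 means the trajectory is hard verified;
# intermediate values mark the sample as soft verified.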
Using a scaling law over dataset size and performance, the research team estimated that the SVG approach is around 26 times cheaper than reinforcement-learning-based systems such as SkyRL-Agent and 57 times cheaper than earlier synthetic data pipelines such as SWE-smith for reaching similar SWE-bench scores.

Repository specialization

A central use case is adapting an agent to a specific repository. The research team studies this on three major SWE-bench Verified projects: Django, SymPy, and Sphinx. For each repository, SVG generates on the order of 46,000 to 54,000 trajectories. Due to compute limits, the specialization experiments train on 8,000 trajectories per repository, mixing 3,000 soft verified T2 trajectories with 5,000 filtered T1 trajectories. At 32K context, these specialized students match or slightly outperform the GLM-4.5-Air teacher, and also compare well with Devstral-Small-2 on those repository subsets. For Django, a specialized student reaches a 52.23 percent resolve rate versus 51.20 percent for GLM-4.5-Air. For SymPy, the specialized model reaches 51.11 percent versus 48.89 percent for GLM-4.5-Air.

Key Takeaways

- SERA turns coding agents into a supervised learning problem: SERA-32B is trained with standard supervised fine-tuning on synthetic trajectories from GLM-4.6, with no reinforcement learning loop and no dependency on repository test suites.
- Soft Verified Generation removes the need for tests: SVG uses two rollouts and the patch overlap between P1 and P2 to compute a soft verification score, and the research team shows that even unverified or weakly verified trajectories can train effective coding agents.
- Large, realistic agent dataset from real repositories: The pipeline applies SVG to 121 Python projects from the SWE-smith corpus, producing more than 200,000 trajectories and creating one of the largest open datasets for coding agents.
- Efficient training with explicit cost and scaling analysis: SERA-32B trains on 25,000 T2 trajectories, and the scaling study shows that SVG is about 26 times cheaper than SkyRL-Agent and 57 times cheaper than SWE-smith at similar SWE-bench Verified performance.

Check out the Paper, Repo and Model Weights.


Robbyant Open Sources LingBot World: a Real Time World Model for Interactive Simulation and Embodied AI

Robbyant, the embodied AI unit inside Ant Group, has open sourced LingBot-World, a large-scale world model that turns video generation into an interactive simulator for embodied agents, autonomous driving, and games. The system is designed to render controllable environments with high visual fidelity, strong dynamics, and long temporal horizons, while staying responsive enough for real-time control.

From text-to-video to text-to-world

Most text-to-video models generate short clips that look realistic but behave like passive movies. They do not model how actions change the environment over time. LingBot-World is built instead as an action-conditioned world model. It learns the transition dynamics of a virtual world, so that keyboard and mouse inputs, together with camera motion, drive the evolution of future frames. Formally, the model learns the conditional distribution of future video tokens, given past frames, language prompts, and discrete actions. At training time, it predicts sequences up to about 60 seconds. At inference time, it can autoregressively roll out coherent video streams that extend to around 10 minutes, while keeping scene structure stable.

Data engine, from web video to interactive trajectories

A core design in LingBot-World is a unified data engine. It provides rich, aligned supervision for how actions change the world while covering diverse real scenes. The data acquisition pipeline combines three sources:

- Large-scale web videos of humans, animals, and vehicles, from both first-person and third-person views
- Game data, where RGB frames are strictly paired with user controls such as W, A, S, D and camera parameters
- Synthetic trajectories rendered in Unreal Engine, where clean frames, camera intrinsics and extrinsics, and object layouts are all known

After collection, a profiling stage standardizes this heterogeneous corpus. It filters for resolution and duration, segments videos into clips, and estimates missing camera parameters using geometry and pose models. A vision language model scores clips for quality, motion magnitude, and view type, then selects a curated subset. On top of this, a hierarchical captioning module builds three levels of text supervision:

- Narrative captions for whole trajectories, including camera motion
- Scene static captions that describe environment layout without motion
- Dense temporal captions for short time windows that focus on local dynamics

This separation lets the model disentangle static structure from motion patterns, which is important for long-horizon consistency.

Architecture, MoE video backbone and action conditioning

LingBot-World starts from Wan2.2, a 14B-parameter image-to-video diffusion transformer. This backbone already captures strong open-domain video priors. The Robbyant team extends it into a mixture-of-experts DiT with two experts. Each expert has about 14B parameters, so the total parameter count is 28B, but only one expert is active at each denoising step. This keeps inference cost similar to a dense 14B model while expanding capacity. A curriculum extends training sequences from 5 seconds to 60 seconds. The schedule increases the proportion of high-noise timesteps, which stabilizes global layouts over long contexts and reduces mode collapse for long rollouts.

To make the model interactive, actions are injected directly into the transformer blocks. Camera rotations are encoded with Plücker embeddings. Keyboard actions are represented as multi-hot vectors over keys such as W, A, S, D. These encodings are fused and passed through adaptive layer normalization modules, which modulate hidden states in the DiT. Only the action adapter layers are fine-tuned; the main video backbone stays frozen, so the model retains visual quality from pre-training while learning action responsiveness from a smaller interactive dataset. Training uses both image-to-video and video-to-video continuation tasks. Given a single image, the model can synthesize future frames. Given a partial clip, it can extend the sequence. This results in an internal transition function that can start from arbitrary time points.

LingBot-World-Fast, distillation for real-time use

The mid-trained model, LingBot-World-Base, still relies on multi-step diffusion and full temporal attention, which are expensive for real-time interaction. The Robbyant team introduces LingBot-World-Fast as an accelerated variant. The fast model is initialized from the high-noise expert and replaces full temporal attention with block causal attention. Inside each temporal block, attention is bidirectional. Across blocks, it is causal. This design supports key value caching, so the model can stream frames autoregressively with lower cost (a sketch of the mask pattern appears at the end of this article).

Distillation uses a diffusion forcing strategy. The student is trained on a small set of target timesteps, including timestep 0, so it sees both noisy and clean latents. Distribution Matching Distillation is combined with an adversarial discriminator head. The adversarial loss updates only the discriminator. The student network is updated with the distillation loss, which stabilizes training while preserving action following and temporal coherence. In experiments, LingBot-World-Fast reaches 16 frames per second when processing 480p videos on a single GPU node and maintains end-to-end interaction latency under 1 second for real-time control.

Emergent memory and long-horizon behavior

One of the most interesting properties of LingBot-World is emergent memory. The model maintains global consistency without explicit 3D representations such as Gaussian splatting. When the camera moves away from a landmark such as Stonehenge and returns after about 60 seconds, the structure reappears with consistent geometry. When a car leaves the frame and later reenters, it appears at a physically plausible location, not frozen or reset. The model can also sustain ultra-long sequences. The research team shows coherent video generation that extends up to 10 minutes, with stable layout and narrative structure.

VBench results and comparison to other world models

For quantitative evaluation, the research team used VBench on a curated set of 100 generated videos, each longer than 30 seconds. LingBot-World is compared to two recent world models, Yume-1.5 and HY-World-1.5. On VBench, LingBot-World scores higher than both baselines on imaging quality, aesthetic quality, and dynamic degree (the full score table appears in the paper: https://arxiv.org/pdf/2601.20540v1). The dynamic degree margin is large, 0.8857 compared to 0.7612 and 0.7217, which indicates richer scene transitions and more complex motion that responds to user inputs.
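As a rough illustration of the block causal pattern described in the LingBot-World-Fast section (this is a sketch of the general technique, not Robbyant’s code), the mask below lets frames attend bidirectionally inside a temporal block and only causally across blocks, which is what makes key value caching of earlier blocks possible.

import torch

def block_causal_mask(num_frames: int, block_size: int) -> torch.Tensor:
    """True where frame i may attend to frame j:
    bidirectional within a temporal block, causal across blocks."""
    block_id = torch.arange(num_frames) // block_size
    # Frame i sees frame j iff j's block is not in i's future.
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

mask = block_causal_mask(num_frames=8, block_size=4)
# Rows are queries, columns are keys. Because completed blocks never
# attend forward, their key/value tensors can be cached while new
# blocks are streamed out autoregressively.
print(mask.int())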


Thinking Broad, Acting Fast: Latent Reasoning Distillation from Multi-Perspective Chain-of-Thought for E-Commerce Relevance

arXiv:2601.21611v1 Announce Type: cross Abstract: Effective relevance modeling is crucial for e-commerce search, as it aligns search results with user intent and enhances customer experience. Recent work has leveraged large language models (LLMs) to address the limitations of traditional relevance models, especially for long-tail and ambiguous queries. By incorporating Chain-of-Thought (CoT) reasoning, these approaches improve both accuracy and interpretability through multi-step reasoning. However, two key limitations remain: (1) most existing approaches rely on single-perspective CoT reasoning, which fails to capture the multifaceted nature of e-commerce relevance (e.g., user intent vs. attribute-level matching vs. business-specific rules); and (2) although CoT-enhanced LLMs offer rich reasoning capabilities, their high inference latency necessitates knowledge distillation for real-time deployment, yet current distillation methods discard the CoT rationale structure at inference, using it as a transient auxiliary signal and forfeiting its reasoning utility. To address these challenges, we propose a novel framework that better exploits CoT semantics throughout the optimization pipeline. Specifically, the teacher model leverages Multi-Perspective CoT (MPCoT) to generate diverse rationales and combines Supervised Fine-Tuning (SFT) with Direct Preference Optimization (DPO) to construct a more robust reasoner. For distillation, we introduce Latent Reasoning Knowledge Distillation (LRKD), which endows a student model with a lightweight inference-time latent reasoning extractor, allowing efficient and low-latency internalization of the LLM’s sophisticated reasoning capabilities. Evaluated in offline experiments and online A/B tests on an e-commerce search advertising platform serving tens of millions of users daily, our method delivers significant offline gains, showing clear benefits in both commercial performance and user experience.


Probing Neural Topology of Large Language Models

arXiv:2506.01042v3 Announce Type: replace Abstract: Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural activations to interpretable semantics. However, the complex mechanisms that link neurons’ functional co-activation to emergent model capabilities remain largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity of LLM neurons and relating it to language generation performance. By probing models across diverse LLM families and scales, we discover a universal predictability of language generation and understanding performance using only neural topology, which persists even when retaining just 1% of neuron connections. Strikingly, probing on topology outperforms probing on activation by up to 130.4% and 67.7% on perplexity and space/time semantic regression respectively, suggesting that neural topology contains orders-of-magnitude richer information about LLM performance than neural activation, and that this information can be easily extracted with simple linear or MLP probes. To explain the dependence between neural topology and language performance, we identify default networks and hub neurons in LLMs and provide causal evidence through interventional experiments on multiple benchmarks, showing that LLMs actually exploit this topological information. Further analyses suggest that graph probing can be effectively leveraged to improve the efficiency and reliability of LLMs through proof-of-concept applications in model pruning and hallucination detection. Codes and data for the graph probing toolbox are available at https://github.com/DavyMorgan/llm-graph-probing.
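As a loose illustration of the idea (the helper, thresholds, and probe choice below are assumptions for exposition, not the paper’s code): build a functional connectivity graph from neuron activation correlations, keep only the strongest 1% of connections, and fit a simple linear probe against a performance target such as perplexity.

import numpy as np
from sklearn.linear_model import Ridge

def topology_features(activations: np.ndarray, keep: float = 0.01) -> np.ndarray:
    """activations: (tokens, neurons). Correlate neurons pairwise and
    keep only the strongest `keep` fraction of edges."""
    corr = np.corrcoef(activations.T)          # functional connectivity matrix
    iu = np.triu_indices_from(corr, k=1)       # each edge counted once
    strength = np.abs(corr[iu])
    cutoff = np.quantile(strength, 1 - keep)   # threshold for top edges
    return (strength >= cutoff) * corr[iu]     # sparse edge-weight vector

# One feature vector per model or run; y would hold a measured target
# such as perplexity (random data here, purely to show the shapes).
X = np.stack([topology_features(np.random.randn(256, 64)) for _ in range(20)])
y = np.random.randn(20)
probe = Ridge(alpha=1.0).fit(X, y)             # simple linear probe on topology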


A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

arXiv:2507.04746v2 Announce Type: replace Abstract: Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would not have been feasible on the original texts. We make our code and data publicly available.
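To make the two-step design concrete, here is a toy sketch (mine, not the paper’s code). The table shows a few standard Hebrew-to-Arabic letter correspondences; real Judeo-Arabic orthography is far more ambiguous, and Hebrew code-switching passes through unmapped, which is exactly what the learned post-correction step must handle.

# Step 1: naive character-level mapping (illustrative subset only;
# real mappings are ambiguous and convention-dependent).
HEBREW_TO_ARABIC = {
    'א': 'ا', 'ב': 'ب', 'ג': 'ج', 'ד': 'د', 'ה': 'ه',
    'ו': 'و', 'ז': 'ز', 'ח': 'ح', 'ט': 'ط', 'י': 'ي',
    'כ': 'ك', 'ל': 'ل', 'מ': 'م', 'ם': 'م', 'נ': 'ن',
    'ן': 'ن', 'ס': 'س', 'ע': 'ع', 'פ': 'ف', 'ק': 'ق',
    'ר': 'ر', 'ש': 'ش', 'ת': 'ت',
}

def transliterate(text: str) -> str:
    """Character-by-character mapping; unmapped characters
    (e.g. Hebrew code-switching) pass through unchanged."""
    return ''.join(HEBREW_TO_ARABIC.get(ch, ch) for ch in text)

def post_correct(draft: str) -> str:
    """Step 2 placeholder: in the paper this is a learned model that
    fixes grammatical and orthographic errors a rule table cannot."""
    return draft

print(post_correct(transliterate('כתאב')))  # Judeo-Arabic spelling of 'kitab' (book) -> كتاب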


RaZeR: Pushing the Limits of NVFP4 Quantization with Redundant Zero Remapping

arXiv:2501.04052v2 Announce Type: replace-cross Abstract: The recently introduced NVFP4 format demonstrates remarkable performance and memory benefits for quantized large language model (LLM) inference. However, we observe two types of redundancy in NVFP4 encoding: (1) The FP4 element format naturally exposes an unused quantization value due to its sign-magnitude representation that contains both positive and negative zeros. (2) The FP8 block scaling factor has an unused sign bit because it is always positive. Additionally, we find that LLM weights are more tolerant to a lower-precision block scaling factor. Based on these observations, we propose Redundant Zero Remapping (RaZeR), an enhanced numerical format that pushes the limits of NVFP4 for more accurate LLM quantization under the same memory footprint. RaZeR leverages the redundant bits of the block scaling factor to adaptively remap the redundant FP4 zero to additional quantization values with improved accuracy. To demonstrate the practicality of RaZeR, we design efficient GPU kernels for RaZeR-quantized LLM inference and propose novel hardware to natively support this. Extensive experiments validate RaZeR’s superior performance for 4-bit LLM quantization. For example, relative to native NVFP4, RaZeR reduces the average perplexity loss by 34.6% and 31.2% under weight-only and weight-activation quantization, respectively.
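A small sketch of where the redundancy comes from (the remapping target below is an illustrative assumption; the paper derives its own remapped values and signals them through the scale factor’s unused sign bit). FP4’s sign-magnitude encoding spends one of its 16 codes on a negative zero:

# E2M1 (FP4) magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6, each with a sign bit.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(code: int) -> float:
    """Standard sign-magnitude FP4 decode: bit 3 is the sign,
    bits 0-2 index the magnitude."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * FP4_MAGNITUDES[code & 0b0111]

# Code 0b1000 decodes to -0.0, which equals +0.0: a wasted level.
assert decode_fp4(0b1000) == decode_fp4(0b0000) == 0.0

def decode_razer(code: int, extra_value: float = -0.25) -> float:
    """RaZeR-style decode: the redundant negative zero is remapped to an
    extra quantization value. extra_value here is a made-up example; the
    paper chooses remapping targets that minimize quantization error."""
    if code == 0b1000:
        return extra_value
    return decode_fp4(code)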


FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning

arXiv:2601.21682v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce FIT, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. FIT mitigates degradation through rigorous data Filtering, Importance-aware updates, and Targeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present PCH, a benchmark covering Personal information, Copyright, and Harmful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that FIT achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.
