YouZum

Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single Images

Manipulating lighting conditions in images post-capture is challenging. Traditional approaches rely on 3D graphics methods that reconstruct scene geometry and properties from multiple captures before simulating new lighting with physical illumination models. Though these techniques provide explicit control over light sources, recovering accurate 3D models from a single image remains ill-posed and frequently yields unsatisfactory results. Modern diffusion-based image editing methods have emerged as alternatives that use strong statistical priors to bypass physical modeling requirements. However, these approaches struggle with precise parametric control due to their inherent stochasticity and dependence on textual conditioning.

Generative image editing methods have been adapted for various relighting tasks with mixed results. Portrait relighting approaches often use light stage data to supervise generative models, while object relighting methods might fine-tune diffusion models on synthetic datasets conditioned on environment maps. Some methods assume a single dominant light source for outdoor scenes, such as the sun, while indoor scenes present more complex multi-illumination challenges. Various approaches address these issues, including inverse rendering networks and methods that manipulate StyleGAN's latent space. Flash photography research shows progress in multi-illumination editing through techniques that use flash/no-flash pairs to disentangle and manipulate scene illuminants.

Researchers from Google, Tel Aviv University, Reichman University, and the Hebrew University of Jerusalem have proposed LightLab, a diffusion-based method enabling explicit parametric control over light sources in images. It targets two fundamental properties of light sources: intensity and color. LightLab also provides control over ambient illumination and tone mapping effects, creating a comprehensive set of editing tools that allow users to manipulate an image's overall look and feel through illumination adjustments. The method is demonstrated on indoor images containing visible light sources, though additional results show promise for outdoor scenes and out-of-domain examples. Comparative analysis confirms that LightLab pioneers high-quality, precise control over visible local light sources.

LightLab uses pairs of images to implicitly model controlled light changes in image space, which then train a specialized diffusion model. The data collection combines real photographs with synthetic renderings. The photography dataset consists of 600 raw image pairs captured using mobile devices on tripods, with each pair showing an identical scene in which only a visible light source is switched on or off; auto-exposure settings and post-capture calibration ensure proper exposure. A larger set of synthetic images, rendered from 20 artist-created indoor 3D scenes using physically based rendering in Blender, augments this collection. The synthetic pipeline randomly samples camera views around target objects and procedurally assigns light source parameters, including intensity, color temperature, area size, and cone angle. Comparative analysis shows that a weighted mixture of real captures and synthetic renders achieves the best results across all settings. The quantitative improvement from adding synthetic data to real captures is relatively modest, only 2.2% in PSNR, likely because significant local illumination changes are overshadowed by low-frequency image-wide details in these metrics.
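The paired-data idea rests on the linearity of light transport: subtracting the "light off" capture from the "light on" capture isolates the target light's contribution, which can then be rescaled and tinted before being added back to the ambient image. The following is a minimal sketch of such compositing under that assumption; the function name, parameterization, and clipping choices are illustrative, not the authors' implementation.

```python
import numpy as np

def composite_relit_image(img_off, img_on, intensity, color=(1.0, 1.0, 1.0), ambient_scale=1.0):
    """Composite a relit target from an off/on capture pair using light linearity.

    img_off, img_on : (H, W, 3) float arrays in linear (raw-calibrated) space
    intensity       : scalar multiplier for the target light source
    color           : per-channel tint applied to the light's contribution
    ambient_scale   : multiplier for the ambient illumination
    """
    # The difference between the two captures isolates the target light's contribution.
    light_only = np.clip(img_on - img_off, 0.0, None)
    # Rescale and tint that contribution, then add it back onto the (rescaled) ambient image.
    relit = ambient_scale * img_off + intensity * np.asarray(color) * light_only
    return np.clip(relit, 0.0, 1.0)
```

For example, intensity=0.3 with a warm tint such as (1.0, 0.85, 0.7) would produce a dimmed, warmer version of the target light while leaving the ambient illumination untouched, which is exactly the kind of controlled variation the training pairs are meant to teach the diffusion model.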
Qualitative comparisons on evaluation datasets show LightLab's superiority over competing methods such as OmniGen, RGB X, ScribbleLight, and IC-Light. These alternatives often introduce unwanted illumination changes, color distortion, or geometric inconsistencies. In contrast, LightLab provides faithful control over target light sources while generating physically plausible lighting effects throughout the scene.

In conclusion, the researchers introduced LightLab, an advancement in diffusion-based light source manipulation for images. Using light linearity principles and synthetic 3D data, they created high-quality paired images that implicitly model complex illumination changes. Despite its strengths, LightLab faces limitations from dataset bias, particularly regarding light source types, which could be addressed through integration with unpaired fine-tuning methods. Moreover, while the simple data capture process using consumer mobile devices with post-capture exposure calibration eased dataset collection, it prevents precise relighting in absolute physical units, indicating room for further refinement in future iterations.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit. The post Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single Images appeared first on MarkTechPost.


SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context Agents

Recent advancements in LM agents have shown promising potential for automating intricate real-world tasks. These agents typically operate by proposing and executing actions through APIs, supporting applications such as software engineering, robotics, and scientific experimentation. As these tasks become more complex, LM agent frameworks have evolved to include multiple agents, multi-step retrieval, and tailored scaffolding to optimize performance. A central challenge lies in effectively exploring and understanding the environment, which has prompted the development of engineered scaffolds using tools, memory mechanisms, and custom pipelines. However, most existing methods assume partial observability, requiring agents to collect observations incrementally. While this assumption holds in dynamic or unfamiliar environments, it is less applicable in fully observable settings like SWE-bench, where all relevant information is accessible from the start.

In software engineering, research on LM agents has focused on two main strategies: agent-based frameworks and structured pipelines. Agent-based systems, such as SWE-Agent and OpenHands CodeAct, allow LMs to interact autonomously with codebases, often through custom interfaces and retrieval tools. Other models like Moatless and AutoCodeRover enhance localization through search techniques, while SpecRover refines scaffolding design. Alternatively, structured pipelines, such as Agentless and CodeMonkey, decompose tasks into sequential phases like localization, repair, and validation. While these approaches depend on engineered components for performance, the current study proposes leveraging Long-Context LMs (LCLMs) to directly interpret the entire task environment. Advances in LCLM architecture and infrastructure now allow these models to outperform retrieval-augmented systems in many contexts, reducing reliance on complex external scaffolding.

Researchers from Stanford, IBM, and the University of Toronto explored whether complex scaffolding is necessary for LM agents tackling tasks like SWE-bench. They show that simply using LCLMs, such as Gemini-1.5-Pro, with proper prompting and no scaffolding can achieve competitive performance, reaching 38% on SWE-bench Verified. Gemini-2.5-Pro, using the same simple setup, reaches 50.8%. Their work suggests that many complex agentic designs could be replaced with a single powerful LCLM, simplifying architecture and training. Additionally, a hybrid two-stage approach using Gemini-1.5-Pro and Claude-3.7 achieves a 48.6% solve rate, further supporting this simplified direction.

Traditional LM agents rely on interactive exploration due to partial observability, but many tasks, like software debugging, allow full observability. The study proposes state-in-context agents that leverage LCLMs to directly process full or compressed environment states, bypassing the need for complex agentic scaffolding. For large codebases, ranking-based compression selects relevant files to fit within context limits. Two methods are introduced: DIRECTSOLVE, where LCLMs solve tasks using the full context, and SELECTSOLVE, where LCLMs localize relevant files for short-context LMs (SCLMs) to solve. Both use targeted patch formats and validation to ensure accuracy and reduce hallucination.

The experiments evaluate a simplified agent framework using LLMs on the SWE-bench Verified benchmark, which includes 500 real-world software engineering tasks.
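The two variants can be pictured with a short sketch. This is a minimal illustration, not the authors' code: lclm and sclm stand for generic callables wrapping a long-context and a short-context model, the relevance ranking is a naive term-overlap stand-in for the paper's ranking-based compression, and the prompt wording is invented.

```python
from pathlib import Path

def rank_files(issue: str, repo_root: str):
    """Rank repository files by a crude relevance score (term overlap with the issue text)."""
    issue_terms = set(issue.lower().split())
    scored = []
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        score = sum(text.lower().count(term) for term in issue_terms)
        scored.append((score, path, text))
    return sorted(scored, key=lambda item: -item[0])

def build_context(issue: str, repo_root: str, token_budget: int = 800_000, chars_per_token: int = 4) -> str:
    """Concatenate the highest-ranked files, most relevant first, until a rough context budget is used."""
    budget = token_budget * chars_per_token
    parts = []
    for _, path, text in rank_files(issue, repo_root):
        if len(text) > budget:
            break
        parts.append(f"### {path}\n{text}")
        budget -= len(text)
    return "\n\n".join(parts)

def direct_solve(issue: str, repo_root: str, lclm) -> str:
    """DIRECTSOLVE-style call: the long-context model sees the compressed codebase state and patches it."""
    prompt = (build_context(issue, repo_root)
              + f"\n\nIssue:\n{issue}\n\n"
              "Think step by step, restate the relevant code, then output a unified diff patch.")
    return lclm(prompt)

def select_solve(issue: str, repo_root: str, lclm, sclm) -> str:
    """SELECTSOLVE-style call: the long-context model localizes files; a stronger short-context model patches."""
    localization = lclm(build_context(issue, repo_root)
                        + f"\n\nIssue:\n{issue}\n\nList the files that must change, one path per line.")
    focused = "\n\n".join(
        Path(p).read_text(errors="ignore") for p in localization.splitlines() if Path(p).is_file()
    )
    return sclm(f"{focused}\n\nIssue:\n{issue}\n\nOutput a unified diff patch.")
```

Placing the most relevant files first in build_context mirrors the ablation finding, reported below, that relevant files positioned at the start of the prompt improve performance.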
The proposed methods, DIRECTSOLVE and SELECTSOLVE, utilize LCLMs like Gemini-1.5-Pro and Gemini-2.5-Pro, and, in SELECTSOLVE, an additional SCLM (Claude-3.7-Sonnet) for patch generation. Results show that DIRECTSOLVE outperforms complex agentic approaches like Agentless and CodeAct with minimal engineering. SELECTSOLVE further improves accuracy by leveraging stronger models for patching. Ablation studies highlight the importance of CoT prompting, code restatement, and token-efficient context design. Additionally, positioning relevant files at the start of the prompt improves performance, underscoring limitations in long-context processing.

In conclusion, the cost of using LCLM-based methods is currently higher than that of existing approaches like Agentless and CodeAct, averaging $2.60 per instance compared to $0.25 and $0.87, respectively. However, rapid drops in inference costs and increasing context lengths make LCLMs more practical. Techniques like KV caching significantly lower costs after initial runs, reducing the cost to about $0.725 per instance. Although slight codebase changes still limit caching benefits, further improvements could help. The study also suggests that LCLMs can handle long interaction histories, reducing the need for complex memory and retrieval mechanisms. Notably, unscaffolded LCLM models can perform competitively on SWE-bench tasks.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit. The post SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context Agents appeared first on MarkTechPost.


AI Agents Now Write Code in Parallel: OpenAI Introduces Codex, a Cloud-Based Coding Agent Inside ChatGPT

OpenAI has introduced Codex, a cloud-native software engineering agent integrated into ChatGPT, signaling a new era in AI-assisted software development. Unlike traditional coding assistants, Codex is not just a tool for autocompletion; it acts as a cloud-based agent capable of autonomously performing a wide range of programming tasks, from writing and debugging code to running tests and generating pull requests.

A Shift Toward Parallel, Agent-Driven Development

At the core of Codex is codex-1, a fine-tuned version of OpenAI's reasoning model, optimized specifically for software engineering workflows. Codex can handle multiple tasks simultaneously, operating inside isolated cloud sandboxes that are preloaded with the user's codebase. Each request is handled in its own environment, allowing users to delegate different coding operations in parallel without disrupting their local development environment. This architecture introduces a fundamentally new approach to software engineering: developers now interact with an agent that behaves more like a collaborative teammate than a static code tool. You can ask Codex to "fix a bug," "add logging," or "refactor this module," and it will return a verifiable response, including diffs, terminal logs, and test results. If the output looks good, you can copy the patch directly into your repository, or ask for revisions.

Embedded Within ChatGPT, Accessible to Teams

Codex lives in the ChatGPT interface, currently available to Pro, Team, and Enterprise users, with broader access expected soon. The interface includes a dedicated sidebar where developers can describe what they want in natural language. Codex then interprets the intent and handles the coding behind the scenes, surfacing results for review and feedback. This integration offers a significant boost to developer productivity. As OpenAI notes, Codex is designed to take on many of the repetitive or boilerplate-heavy aspects of coding, allowing developers to focus on architecture, design, and higher-order problem solving. In one case, an OpenAI staffer even "checked in two bug fixes written entirely by Codex," all while working on unrelated tasks.

Codex Understands Your Codebase

What makes Codex more than just a smart code generator is its context-awareness. Each instance runs with full access to your project's file structure, coding conventions, and style. This allows it to write code that aligns with your team's standards, whether you're using Flask or FastAPI, React or Vue, or a custom internal framework. Codex's ability to adapt to a codebase makes it particularly useful for large-scale enterprise teams and open-source maintainers. It supports workflows like branch-based pull request generation, test suite execution, and static analysis, all initiated by simple English prompts. Over time, it learns the nuances of the repository it works in, leading to better suggestions and more accurate code synthesis.

Broader Implications: Lowering the Barrier to Software Creation

OpenAI frames Codex as a research preview, but its long-term vision is clear: AI will increasingly take over much of the routine work involved in building software. The aim isn't to replace developers but to democratize software creation, allowing more people, especially non-traditional developers, to build working applications using natural language alone. In this light, Codex is not just a coding tool, but a stepping stone toward a world where software development is collaborative between humans and machines.
It brings software creation closer to the realm of design and ideation, and further away from syntax and implementation details.

What's Next?

Codex is rolling out gradually, with usage limits in place during the preview phase. OpenAI is gathering feedback to refine the agent's capabilities, improve safety, and optimize its performance across different environments and languages. Whether you're a solo developer, part of a DevOps team, or leading an enterprise platform, Codex represents a significant shift in how code is written, tested, and shipped. As AI agents continue to mature, the future of software engineering will be less about writing every line yourself, and more about knowing what to build and asking the right questions.

Check out the Details here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit. The post AI Agents Now Write Code in Parallel: OpenAI Introduces Codex, a Cloud-Based Coding Agent Inside ChatGPT appeared first on MarkTechPost.


Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation

Multimodal modeling focuses on building systems that understand and generate content across visual and textual formats. These models are designed to interpret visual scenes and produce new images from natural language prompts. With growing interest in bridging vision and language, researchers are working toward integrating image recognition and image generation capabilities into a unified system. This approach eliminates the need for separate pipelines and opens the path to more coherent and intelligent interactions across modalities.

A key challenge in this field is to develop architectures that handle both understanding and generation without compromising the quality of either. Models need to grasp complex visual concepts and produce high-quality images matching user prompts. The difficulty lies in identifying suitable image representations and training procedures that support both tasks. The problem becomes more evident when the same model is expected to interpret detailed text descriptions and generate visually accurate outputs based on them, as it requires aligning semantic understanding with pixel-level synthesis.

Previous approaches have generally used Variational Autoencoders (VAEs) or CLIP-based encoders to represent images. VAEs are efficient for reconstruction but encode lower-level features, often leading to less informative representations. CLIP-based encoders provide high-level semantic embeddings by learning from large-scale image-text pairs. However, CLIP was not built for image reconstruction, making it challenging to use for generation unless paired with models like diffusion decoders. In terms of training, Mean Squared Error (MSE) is widely used for its simplicity but tends to produce deterministic outputs. To improve generation diversity and quality, researchers have turned to Flow Matching, which introduces controlled stochasticity and better models the continuous nature of image features.

Researchers from Salesforce Research, in collaboration with the University of Maryland and several academic institutions, introduced BLIP3-o, a family of unified multimodal models. The model adopts a dual-stage training strategy in which image understanding is learned first, followed by image generation. The proposed system leverages CLIP embeddings to represent images and integrates them with a diffusion transformer to synthesize new visual outputs. Unlike previous joint training methods, the sequential approach maintains the strength of each task independently: the diffusion module is trained while the autoregressive backbone is kept frozen, avoiding task interference. To improve alignment and visual fidelity, the team also curated BLIP3o-60k, a high-quality instruction-tuning dataset created by prompting GPT-4o across varied visual categories, including scenes, objects, gestures, and text. They developed two model versions: an 8-billion-parameter model trained with proprietary and public data, and a 4-billion-parameter version using only open-source data.

The image generation pipeline of BLIP3-o is built on the Qwen2.5-VL large language models. Prompts are processed to produce visual features refined through a Flow Matching diffusion transformer. This transformer is based on the Lumina-Next architecture, optimized for speed and quality with 3D rotary position embedding and grouped-query attention. The model encodes each image into 64 fixed-length semantic vectors, regardless of resolution, which supports compact storage and efficient decoding.
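The Flow Matching objective over CLIP embeddings can be sketched in a few lines of PyTorch. This is a generic rectified-flow-style formulation written under stated assumptions; velocity_model, the conditioning interface, and the time handling are placeholders, not the BLIP3-o implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, clip_embeds, cond):
    """One flow-matching training step over fixed-length CLIP image embeddings.

    clip_embeds : (B, 64, D) target semantic vectors from the CLIP vision encoder
    cond        : conditioning features from the (frozen) autoregressive backbone
    The diffusion transformer predicts the velocity that carries a noise sample
    toward the target embedding along a straight path, instead of regressing
    the target directly with MSE, which tends to give deterministic outputs.
    """
    noise = torch.randn_like(clip_embeds)                                 # x_0 ~ N(0, I)
    t = torch.rand(clip_embeds.size(0), 1, 1, device=clip_embeds.device)  # per-sample time in [0, 1)
    x_t = (1.0 - t) * noise + t * clip_embeds                             # point on the straight path
    target_velocity = clip_embeds - noise                                 # d x_t / d t along that path
    pred_velocity = velocity_model(x_t, t.view(-1), cond)                 # the diffusion transformer
    return F.mse_loss(pred_velocity, target_velocity)
```

At inference time, the predicted velocities are integrated from noise to a CLIP embedding, which a separate decoder then renders into pixels; that decoding stage is not shown here.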
The research team used a large-scale dataset of 25 million images from sources like CC12M, SA-1B, and JourneyDB to train the models. They extended it with 30 million proprietary samples for the 8B model. They also included 60k instruction-tuning samples covering challenging prompts such as complex gestures and landmarks, generated via GPT-4o.

In terms of performance, BLIP3-o demonstrated top scores across multiple benchmarks. The 8B model achieved a GenEval score of 0.84 for image generation alignment and a WISE score of 0.62 for reasoning ability. Image understanding scored 1682.6 on MME-Perception, 647.1 on MME-Cognition, 50.6 on MMMU, and 83.1 on both the VQAv2 and TextVQA datasets. A human evaluation comparing BLIP3-o 8B with Janus Pro 7B showed that BLIP3-o was preferred 50.4% of the time for visual quality and 51.5% for prompt alignment. These results are supported by statistically significant p-values (5.05e-06 and 1.16e-05), indicating the superiority of BLIP3-o in subjective quality assessments.

This research outlines a clear solution to the dual challenge of image understanding and generation. CLIP embeddings, Flow Matching, and a sequential training strategy demonstrate how the problem can be approached methodically. The BLIP3-o model delivers state-of-the-art results and introduces an efficient and open approach to unified multimodal modeling.

Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit. The post Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation appeared first on MarkTechPost.


Google’s AlphaEvolve: The AI agent that reclaimed 0.7% of Google’s compute – and how to copy it

Google's AlphaEvolve is the epitome of best-practice AI agent orchestration and offers a lesson in production-grade agent engineering. Discover its architecture and the essential takeaways for your enterprise AI strategy.


The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

arXiv:2505.10507v1 Announce Type: new Abstract: Translation-based strategies for cross-lingual transfer (XLT), such as translate-train (training on noisy target-language data translated from the source language) and translate-test (evaluating on noisy source-language data translated from the target language), are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.
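The core of alignment-based label projection can be sketched as follows. This is a deliberately simple illustration under assumed inputs (BIO tags and word-level alignment pairs from an external aligner); the projection algorithms and filtering strategies investigated in the paper are more involved than this.

```python
def project_labels(src_labels, alignment, num_tgt_tokens):
    """Project BIO labels from source tokens to target tokens via word-alignment pairs.

    src_labels     : BIO tags for the source sentence, e.g. ["B-PER", "I-PER", "O"]
    alignment      : iterable of (src_idx, tgt_idx) pairs produced by a word aligner
    num_tgt_tokens : number of tokens in the translated sentence
    """
    tgt_labels = ["O"] * num_tgt_tokens
    # Collect, for every target token, the labels of the source tokens aligned to it.
    per_target = {}
    for s, t in alignment:
        per_target.setdefault(t, []).append(src_labels[s])
    for t, labels in per_target.items():
        entity = [label for label in labels if label != "O"]
        if entity:
            tgt_labels[t] = entity[0]  # naive heuristic: take the first non-O label
    # Repair BIO consistency: an I-X not preceded by a tag of the same type becomes B-X.
    for i, tag in enumerate(tgt_labels):
        if tag.startswith("I-") and (i == 0 or tgt_labels[i - 1][2:] != tag[2:]):
            tgt_labels[i] = "B-" + tag[2:]
    return tgt_labels

# Example: project_labels(["B-LOC", "O"], [(0, 1)], 3) -> ["O", "B-LOC", "O"]
```

Decisions such as how to resolve conflicting labels for one target token, whether to drop low-confidence alignments, and how the translated sentence is pre-tokenized are exactly the low-level choices the paper shows to matter.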


Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

arXiv:2505.09738v1 Announce Type: new Abstract: Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges, and standard methods to overcome it often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes new unique token embeddings via a hybrid heuristic that combines two methods: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios than both the ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.
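The hybrid initialization heuristic can be illustrated with a short sketch. The blending weight, the source of the similarity scores, and the helper interfaces here are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def init_new_token_embedding(token, old_tokenizer, old_embeddings, similar_tokens, alpha=0.5):
    """Hybrid initialization for a token that is new in the target tokenizer.

    Local estimate : mean of the old-tokenizer subword embeddings that compose the token.
    Global estimate: similarity-weighted average of the embeddings of the top-k
                     semantically similar tokens from the original vocabulary.
    alpha blends the two estimates (an assumed, not prescribed, weighting).
    """
    # Local estimate from subword decomposition under the old tokenizer.
    sub_ids = old_tokenizer.encode(token, add_special_tokens=False)
    local = old_embeddings[sub_ids].mean(axis=0)

    # Global estimate from top-k similar tokens, given as (old_vocab_id, similarity) pairs,
    # e.g. produced by an auxiliary embedding model over the token strings.
    ids, sims = zip(*similar_tokens)
    weights = np.asarray(sims) / sum(sims)
    global_est = (old_embeddings[list(ids)] * weights[:, None]).sum(axis=0)

    return alpha * local + (1.0 - alpha) * global_est
```

Initializing every new token this way gives the transplanted model a semantically meaningful starting point, which is what allows the reported zero-shot perplexity to stay low without exhaustive residual fine-tuning.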

