YouZum

Uncategorized

AI, Committee, News, Uncategorized

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

Conversational artificial intelligence is centered on enabling large language models (LLMs) to engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues. A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information simultaneously, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. The result is that once an LLM makes a misstep in understanding, it struggles to recover, resulting in incomplete or misguided answers. Most current tools evaluate LLMs using single-turn, fully-specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when the information is fragmented and context must be actively constructed from multiple exchanges. Consequently, evaluations often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction. 
Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts, or “shards.” Each shard carries a single element of the original instruction, and the shards are revealed sequentially over multiple turns, simulating the progressive disclosure of information that happens in practice. A simulated user, powered by an LLM, decides which shard to reveal next and rephrases it naturally to fit the ongoing context. The setup also uses classification mechanisms to determine whether each assistant response attempts a solution or asks for clarification, further refining the simulation of genuine interaction.
The framework simulates five types of conversations, including single-turn full instructions and several multi-turn setups. In SHARDED simulations, LLMs received instructions one shard at a time, forcing them to wait before proposing a complete answer. The study evaluated 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best-case and worst-case outcomes per model.
Across all tasks and models, a consistent decline in performance was observed in the SHARDED setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios, a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability.
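To make the percentile-based scoring concrete, here is a minimal sketch. It assumes aptitude is taken as the 90th percentile of a model's scores across repeated simulations and unreliability as the gap between the 90th and 10th percentiles; the text above only says the scoring is percentile-based, so treat those exact cut-offs, the function names, and the sample scores as illustrative assumptions rather than the researchers' code.

```python
from statistics import mean


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one score")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize_runs(scores):
    """Summarize repeated simulations of one model on one instruction.

    Assumed definitions (hypothetical cut-offs):
      aptitude      = 90th percentile score (best-case behaviour)
      unreliability = 90th minus 10th percentile (best-to-worst gap)
    """
    p10 = percentile(scores, 10)
    p90 = percentile(scores, 90)
    return {
        "average": mean(scores),
        "aptitude": p90,
        "unreliability": p90 - p10,
    }


if __name__ == "__main__":
    # Made-up scores for 10 simulations each; not data from the study.
    full_runs = [90, 92, 88, 91, 89, 90, 93, 87, 90, 90]
    sharded_runs = [20, 95, 60, 85, 40, 90, 30, 80, 70, 80]
    print("full   ", summarize_runs(full_runs))
    print("sharded", summarize_runs(sharded_runs))
```

With scores like the made-up sharded runs, the best-case value stays close to the single-turn one while the spread widens sharply, which is the pattern the article describes: a modest loss of aptitude and a large rise in unreliability.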
While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowering randomness (temperature settings) offered only minor improvements in consistency.
This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. The post LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks appeared first on MarkTechPost.


This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency

The growth in developing and deploying large language models (LLMs) is closely tied to architectural innovations, large-scale datasets, and hardware improvements. Models like DeepSeek-V3, GPT-4o, Claude 3.5 Sonnet, and LLaMA-3 have demonstrated how scaling enhances reasoning and dialogue capabilities. However, as their performance increases, so do computing, memory, and communication bandwidth demands, placing substantial strain on hardware. Without parallel progress in model and infrastructure co-design, these models risk becoming accessible only to organizations with massive resources. This makes optimizing training cost, inference speed, and memory efficiency a critical area of research.
A core challenge is the mismatch between model size and hardware capabilities. LLM memory consumption grows over 1000% annually, while high-speed memory bandwidth increases by less than 50%. During inference, caching prior context in Key-Value (KV) stores adds to memory strain and slows processing. Dense models activate all parameters per token, escalating computational costs, particularly for models with hundreds of billions of parameters. This results in billions of floating-point operations per token and high energy demands. Time Per Output Token (TPOT), a key performance metric, also suffers, impacting user experience. These problems call for solutions beyond simply adding more hardware.
Techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage by sharing key and value heads across multiple query heads. Windowed KV caching lowers memory use by storing only recent tokens, but can limit long-context understanding. Quantized compression with low-bit formats like 4-bit and 8-bit cuts memory further, though sometimes with trade-offs in accuracy. Precision formats such as BF16 and FP8 improve training speed and efficiency. While useful, these techniques often tackle individual issues rather than offering a comprehensive solution to scaling challenges.
Researchers from DeepSeek-AI introduced a more integrated and efficient strategy with the development of DeepSeek-V3, designed to scale intelligently rather than excessively. Utilizing 2,048 NVIDIA H800 GPUs, the model achieves state-of-the-art performance while focusing on cost-efficiency. Instead of depending on expansive infrastructure, the team engineered the model architecture to work harmoniously with hardware constraints. Central to this effort are innovations such as Multi-head Latent Attention (MLA) for memory optimization, a Mixture of Experts (MoE) framework for computational efficiency, and FP8 mixed-precision training to accelerate performance without sacrificing accuracy. A custom Multi-Plane Network Topology was also employed to minimize inter-device communication overhead. Collectively, these components make DeepSeek-V3 a scalable and accessible solution, capable of rivaling much larger systems while operating on significantly leaner resources. The architecture achieves memory efficiency by reducing the KV cache requirement per token to just 70 KB using MLA, compared to 327 KB and 516 KB in Qwen-2.5 and LLaMA-3.1, respectively. This reduction is accomplished by compressing attention heads into a smaller latent vector jointly trained with the model. Computational efficiency is further boosted with the MoE model, which increases total parameters to 671 billion but only activates 37 billion per token. This contrasts sharply with dense models that require full parameter activation. For example, LLaMA-3.1 needs 2,448 GFLOPS per token, while DeepSeek-V3 operates at just 250 GFLOPS. Also, the architecture integrates a Multi-Token Prediction (MTP) module, enabling the generation of multiple tokens in a single step. The system achieves up to 1.8x improvement in generation speed, and real-world measurements show 80-90% token acceptance for speculative decoding. 
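The per-token KV cache figures quoted above can be reproduced with simple arithmetic. The sketch below assumes BF16 storage (2 bytes per value) and the following model configurations, which are not stated in the article and are taken here as assumptions: DeepSeek-V3 with 61 layers and a 576-dimensional cached latent per layer (512 compressed KV dimensions plus 64 positional dimensions); Qwen-2.5 72B with 80 layers, 8 KV heads, and head dimension 128; LLaMA-3.1 405B with 126 layers, 8 KV heads, and head dimension 128. Function names are illustrative.

```python
BYTES_BF16 = 2  # assumed storage precision for cached values


def gqa_kv_bytes_per_token(num_layers, num_kv_heads, head_dim,
                           bytes_per_value=BYTES_BF16):
    """KV cache per token for grouped-query attention.

    Each layer stores one key and one value vector per KV head.
    """
    return 2 * num_kv_heads * head_dim * bytes_per_value * num_layers


def mla_kv_bytes_per_token(num_layers, latent_dim,
                           bytes_per_value=BYTES_BF16):
    """KV cache per token for latent attention.

    Each layer stores a single compressed latent vector instead of
    separate per-head keys and values.
    """
    return latent_dim * bytes_per_value * num_layers


if __name__ == "__main__":
    # Assumed configurations (see lead-in); 1 KB = 1000 bytes here.
    sizes = {
        "DeepSeek-V3 (MLA)": mla_kv_bytes_per_token(61, 512 + 64),
        "Qwen-2.5 72B (GQA)": gqa_kv_bytes_per_token(80, 8, 128),
        "LLaMA-3.1 405B (GQA)": gqa_kv_bytes_per_token(126, 8, 128),
    }
    for name, size in sizes.items():
        print(f"{name}: {size / 1000:.3f} KB per token")
```

Under these assumptions the results come out at roughly 70 KB, 327 KB, and 516 KB, matching the comparison in the text. The difference comes from caching one small latent per layer rather than keys and values for every KV head.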
Using a system interconnected by CX7 400 Gbps InfiniBand NICs, DeepSeek-V3 achieves a theoretical TPOT of 14.76 milliseconds, equal to 67 tokens per second. With higher-bandwidth setups like NVIDIA GB200 NVL72 offering 900 GB/s, this number can be reduced to 0.82 milliseconds TPOT, potentially achieving 1,200 tokens per second. The practical throughput is lower due to compute-communication overlap and memory limitations, but the framework lays the foundation for future high-speed implementations. FP8 precision further adds to the speed gains. The training framework applies tile-wise 1×128 and block-wise 128×128 quantization, with less than 0.25% accuracy loss compared to BF16. These results were validated on smaller 16B and 230B parameter versions before integration into the 671B model.
Key takeaways from the research on DeepSeek-V3 include:
• MLA compression brings the KV cache to 70 KB per token, compared with 516 KB for LLaMA-3.1, significantly lowering memory demands during inference.
• Only 37 billion of the 671 billion total parameters are activated per token, dramatically reducing compute and memory requirements without compromising model performance.
• DeepSeek-V3 requires just 250 GFLOPS per token, compared to 2,448 GFLOPS for dense models like LLaMA-3.1, highlighting its computational efficiency.
• The model achieves up to 67 tokens per second (TPS) on a 400 Gbps InfiniBand network, with the potential to scale to 1,200 TPS using advanced interconnects like NVL72.
• Multi-Token Prediction (MTP) improves generation speed by 1.8×, with a token acceptance rate of 80-90%, enhancing inference throughput.
• FP8 mixed-precision training enables faster computation with less than 0.25% accuracy degradation, validated through extensive small-scale ablations.
• The model can run on a $10,000 server equipped with a consumer-grade GPU, delivering nearly 20 TPS, making high-performance LLMs more accessible.
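Several of the headline numbers above follow from short calculations, sketched here as a sanity check. The speculative-decoding estimate assumes the MTP module drafts a single extra token per step, so the expected tokens per step is one plus the acceptance rate; that assumption and the function names are illustrative, not taken from the DeepSeek-V3 code.

```python
def tokens_per_second(tpot_ms):
    """Convert Time Per Output Token (milliseconds) to tokens per second."""
    return 1000.0 / tpot_ms


def active_fraction(active_params, total_params):
    """Share of parameters used per token in a Mixture of Experts model."""
    return active_params / total_params


def expected_tokens_per_step(acceptance_rate, draft_tokens=1):
    """Expected tokens emitted per decoding step with speculative drafts.

    Assumes a drafted token only counts if all earlier drafts in the same
    step were accepted, each with the same independent acceptance rate.
    """
    return 1 + sum(acceptance_rate ** k for k in range(1, draft_tokens + 1))


if __name__ == "__main__":
    print(f"14.76 ms TPOT -> {tokens_per_second(14.76):.1f} tokens/s")
    print(f"0.82 ms TPOT  -> {tokens_per_second(0.82):.0f} tokens/s")
    print(f"active share  -> {active_fraction(37e9, 671e9):.1%}")
    print(f"compute ratio -> {250 / 2448:.1%} of the dense model's per-token cost")
    for rate in (0.8, 0.9):
        print(f"acceptance {rate:.0%} -> {expected_tokens_per_step(rate):.1f} tokens/step")
```

Under the one-draft assumption, an 80-90% acceptance rate gives roughly 1.8 to 1.9 tokens per step, which is consistent with the reported speedup of up to 1.8x.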
In conclusion, the research presents a well-rounded framework for building powerful and resource-conscious large-scale language models. By directly addressing fundamental constraints, such as memory limitations, high computational costs, and inference latency, the researchers demonstrate that intelligent architecture-hardware co-design can unlock high performance without relying on vast infrastructure. DeepSeek-V3 is a clear example of how efficiency and scalability coexist, enabling broader adoption of cutting-edge AI capabilities across diverse organizations. This approach shifts the narrative from scaling through brute force to scaling through smarter engineering.
Check out the Paper. All credit for this research goes to the researchers of this project. The post This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency appeared first on MarkTechPost.


Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single Images

Manipulating lighting conditions in images post-capture is challenging. Traditional approaches rely on 3D graphics methods that reconstruct scene geometry and properties from multiple captures before simulating new lighting using physical illumination models. Though these techniques provide explicit control over light sources, recovering accurate 3D models from single images remains a problem that frequently results in unsatisfactory results. Modern diffusion-based image editing methods have emerged as alternatives that use strong statistical priors to bypass physical modeling requirements. However, these approaches struggle with precise parametric control due to their inherent stochasticity and dependence on textual conditioning. Generative image editing methods have been adapted for various relighting tasks with mixed results. Portrait relighting approaches often use light stage data to supervise generative models, while object relighting methods might fine-tune diffusion models using synthetic datasets conditioned on environment maps. Some methods assume a single dominant light source for outdoor scenes, like the sun, while indoor scenes present more complex multi-illumination challenges. Various approaches address these issues, including inverse rendering networks and methods that manipulate StyleGAN’s latent space. Flash photography research shows progress in multi-illumination editing through techniques that use flash/no-flash pairs to disentangle and manipulate scene illuminants. Researchers from Google, Tel Aviv University, Reichman University, and Hebrew University of Jerusalem have proposed LightLab, a diffusion-based method enabling explicit parametric control over light sources in images. It targets two fundamental properties of light sources, intensity and color. 
LightLab provides control over ambient illumination and tone mapping effects, creating a comprehensive set of editing tools that allow users to manipulate an image’s overall look and feel through illumination adjustments. The method shows effectiveness on indoor images containing visible light sources, though additional results show promise for outdoor scenes and out-of-domain examples. Comparative analysis supports LightLab as a pioneering method for high-quality, precise control over visible local light sources.
LightLab uses paired images to implicitly model controlled light changes in image space; these pairs are then used to train a specialized diffusion model. The data collection combines real photographs with synthetic renderings. The photography dataset consists of 600 raw image pairs captured using mobile devices on tripods, with each pair showing identical scenes where only a visible light source is switched on or off. Auto-exposure settings and post-capture calibration ensure proper exposure. To augment this collection, a larger set of synthetic images is rendered from 20 artist-created indoor 3D scenes using physically based rendering in Blender. This synthetic pipeline randomly samples camera views around target objects and procedurally assigns light source parameters, including intensity, color temperature, area size, and cone angle.
Comparative analysis shows that using a weighted mixture of real captures and synthetic renders achieves the best results across all settings. The quantitative improvement from adding synthetic data to real captures is relatively modest at only 2.2% in PSNR, likely because significant local illumination changes are overshadowed by low-frequency image-wide details in these metrics. Qualitative comparisons on evaluation datasets show LightLab’s superiority over competing methods like OmniGen, RGB↔X, ScribbleLight, and IC-Light.
These alternatives often introduce unwanted illumination changes, color distortion, or geometric inconsistencies. In contrast, LightLab provides faithful control over target light sources while generating physically plausible lighting effects throughout the scene.
In conclusion, researchers introduced LightLab, an advancement in diffusion-based light source manipulation for images. Using light linearity principles and synthetic 3D data, the researchers created high-quality paired images that implicitly model complex illumination changes. Despite its strengths, LightLab faces limitations from dataset bias, particularly regarding light source types. This could be addressed through integration with unpaired fine-tuning methods. Moreover, while the simple data capture process using consumer mobile devices with post-capture exposure calibration made dataset collection easier, it prevents precise relighting in absolute physical units, indicating room for further refinement in future iterations.
Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. The post Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single Images appeared first on MarkTechPost.
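The light linearity principle mentioned above can be illustrated with a small sketch: in linear (raw) pixel space, a photo with a lamp on is approximately the lamp-off photo plus the lamp's own contribution, so the on/off pair can be recombined at other intensities or colors. This is only an illustration of that idea under the assumption of linear, un-tone-mapped pixel values; it is not the LightLab pipeline, which trains a diffusion model on such pairs, and the function and parameter names are hypothetical.

```python
def relight(off_img, on_img, intensity, tint=(1.0, 1.0, 1.0)):
    """Recombine a light-off / light-on pair at a new intensity and tint.

    off_img, on_img: lists of (r, g, b) tuples in linear pixel values.
    intensity: 0.0 reproduces the light-off image; 1.0 with a neutral
               tint reproduces the light-on image.
    tint: per-channel multiplier applied to the light's contribution.
    """
    out = []
    for off_px, on_px in zip(off_img, on_img):
        px = []
        for c in range(3):
            # Contribution of the target light alone, clipped at zero to
            # guard against sensor noise making the difference negative.
            light_only = max(on_px[c] - off_px[c], 0.0)
            px.append(off_px[c] + intensity * tint[c] * light_only)
        out.append(tuple(px))
    return out


if __name__ == "__main__":
    # One-pixel toy images with made-up values.
    off = [(0.25, 0.25, 0.25)]
    on = [(0.75, 0.5, 0.25)]
    print(relight(off, on, 0.5))                       # lamp at half power
    print(relight(off, on, 1.0, tint=(1.0, 0.0, 0.0)))  # keep only the red part
```

A pair captured this way can yield many training targets at different intensities and colors, which is one plausible reading of how a few hundred captured pairs can supervise fine-grained parametric control.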


SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context Agents

Recent advancements in LM agents have shown promising potential for automating intricate real-world tasks. These agents typically operate by proposing and executing actions through APIs, supporting applications such as software engineering, robotics, and scientific experimentation. As these tasks become more complex, LM agent frameworks have evolved to include multiple agents, multi-step retrieval, and tailored scaffolding to optimize performance. A central challenge lies in effectively exploring and understanding the environment, which has prompted the development of engineered scaffolds using tools, memory mechanisms, and custom pipelines. However, most existing methods assume partial observability, requiring agents to collect observations incrementally. While this assumption holds in dynamic or unfamiliar environments, it is less applicable in fully observable settings like SWE-bench, where all relevant information is accessible from the start. In software engineering, research on LM agents has focused on two main strategies: agent-based frameworks and structured pipelines. Agent-based systems, such as SWE-Agent and OpenHands CodeAct, allow LMs to interact autonomously with codebases, often through custom interfaces and retrieval tools. Other models like Moatless and AutoCodeRover enhance localization through search techniques, while SpecRover refines scaffolding design. Alternatively, structured pipelines—such as Agentless and CodeMonkey—decompose tasks into sequential phases like localization, repair, and validation. While these approaches depend on engineered components for performance, the current study proposes leveraging Long-Context LMs (LCLMs) to directly interpret the entire task environment. Advances in LCLM architecture and infrastructure now allow these models to outperform retrieval-augmented systems in many contexts, reducing reliance on complex external scaffolding.  
Researchers from Stanford, IBM, and the University of Toronto explored whether complex scaffolding is necessary for LM agents tackling tasks like SWE-bench. They show that simply using LCLMs, such as Gemini-1.5-Pro, with proper prompting and no scaffolding, can achieve competitive performance—reaching 38% on SWE-Bench-Verified. Gemini-2.5-Pro, using the same simple setup, reaches 50.8%. Their work suggests that many complex agentic designs could be replaced with a single powerful LCLM, simplifying architecture and training. Additionally, a hybrid two-stage approach using Gemini-1.5-Pro and Claude-3.7 achieves a 48.6% solve rate, further supporting this simplified direction.  Traditional LM agents rely on interactive exploration due to partial observability, but many tasks, like software debugging, allow full observability. The study proposes state-in-context agents that leverage LCLMs to directly process full or compressed environment states, bypassing the need for complex agentic scaffolding. For large codebases, a ranking-based compression selects relevant files to fit within context limits. Two methods are introduced: DIRECTSOLVE, where LCLMs solve tasks using the full context; and SELECTSOLVE, where LCLMs localize relevant files for short-context LMs (SCLMs) to solve. Both use targeted patch formats and validation to ensure accuracy and reduce hallucination.  The experiments evaluate a simplified agent framework using LLMs on the SWE-bench Verified benchmark, which includes 500 real-world software engineering tasks. The proposed methods, DIRECTSOLVE and SELECTSOLVE, utilize LCLMs like Gemini-1.5-Pro and Gemini-2.5-Pro, and in SELECTSOLVE, an additional SCLM (Claude-3.7-Sonnet) for patch generation. Results show that DIRECTSOLVE outperforms complex agentic approaches like Agentless and CodeAct with minimal engineering. SELECTSOLVE further improves accuracy by leveraging stronger models for patching. 
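As an illustration of ranking-based compression, the sketch below packs files into a prompt in relevance order until a token budget is reached. The ranking itself is taken as given, and the four-characters-per-token estimate, the choice to skip oversized files rather than stop at the first one, the prompt layout, and every function name are assumptions made for illustration, not details from the study.

```python
def estimate_tokens(text):
    """Crude token estimate (assumed heuristic: about 4 characters per token)."""
    return max(1, len(text) // 4)


def pack_context(ranked_files, token_budget, count_tokens=estimate_tokens):
    """Greedily select files, most relevant first, within a token budget.

    ranked_files: list of (path, text) pairs, already sorted by relevance.
    Files that would exceed the remaining budget are skipped so that
    smaller, lower-ranked files can still be included.
    """
    selected, used = [], 0
    for path, text in ranked_files:
        cost = count_tokens(text)
        if used + cost > token_budget:
            continue
        selected.append((path, text))
        used += cost
    return selected, used


def build_prompt(issue, selected):
    """Place the selected files first, in rank order, followed by the issue."""
    parts = [f"### {path}\n{text}" for path, text in selected]
    parts.append(f"### Issue\n{issue}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Hypothetical repository contents, ranked by relevance.
    files = [
        ("core/parser.py", "def parse(s):\n    return s.split(',')\n"),
        ("vendor/huge_blob.py", "x = 0\n" * 5000),
        ("tests/test_parser.py", "def test_parse():\n    assert True\n"),
    ]
    chosen, used = pack_context(files, token_budget=200)
    print([path for path, _ in chosen], used)
    print(build_prompt("parse() fails on empty input", chosen))
```

Keeping the selected files in rank order means the most relevant code appears earliest in the prompt.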
Ablation studies highlight the importance of CoT prompting, code restatement, and token-efficient context design. Additionally, positioning relevant files at the start of the prompt improves performance, underscoring limitations in long-context processing.
In conclusion, the cost of using LCLM-based methods is currently higher than existing approaches like Agentless and CodeAct, averaging $2.60 per instance compared to $0.25 and $0.87, respectively. However, rapid drops in inference costs and increasing context lengths make LCLMs more practical. Techniques like KV caching significantly lower costs after initial runs, reducing the average to about $0.725 per instance. Although slight codebase changes still limit caching benefits, further improvements could help. The study also suggests that LCLMs can handle long interaction histories, reducing the need for complex memory and retrieval mechanisms. Notably, unscaffolded LCLMs can perform competitively on SWE-bench tasks.
Check out the Paper. All credit for this research goes to the researchers of this project. The post SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context Agents appeared first on MarkTechPost.


AI Agents Now Write Code in Parallel: OpenAI Introduces Codex, a Cloud-Based Coding Agent Inside ChatGPT

OpenAI has introduced Codex, a cloud-native software engineering agent integrated into ChatGPT, signaling a new era in AI-assisted software development. Unlike traditional coding assistants, Codex is not just a tool for autocompletion; it acts as a cloud-based agent capable of autonomously performing a wide range of programming tasks, from writing and debugging code to running tests and generating pull requests.

A Shift Toward Parallel, Agent-Driven Development

At the core of Codex is codex-1, a fine-tuned version of OpenAI’s reasoning model, optimized specifically for software engineering workflows. Codex can handle multiple tasks simultaneously, operating inside isolated cloud sandboxes that are preloaded with the user’s codebase. Each request is handled in its own environment, allowing users to delegate different coding operations in parallel without disrupting their local development environment.
This architecture introduces a fundamentally new approach to software engineering: developers now interact with an agent that behaves more like a collaborative teammate than a static code tool. You can ask Codex to “fix a bug,” “add logging,” or “refactor this module,” and it will return a verifiable response, including diffs, terminal logs, and test results. If the output looks good, you can copy the patch directly into your repository or ask for revisions.

Embedded Within ChatGPT, Accessible to Teams

Codex lives in the ChatGPT interface, currently available to Pro, Team, and Enterprise users, with broader access expected soon. The interface includes a dedicated sidebar where developers can describe what they want in natural language. Codex then interprets the intent and handles the coding behind the scenes, surfacing results for review and feedback. This integration offers a significant boost to developer productivity.
As OpenAI notes, Codex is designed to take on many of the repetitive or boilerplate-heavy aspects of coding, allowing developers to focus on architecture, design, and higher-order problem solving. In one case, an OpenAI staffer even “checked in two bug fixes written entirely by Codex,” all while working on unrelated tasks.

Codex Understands Your Codebase

What makes Codex more than just a smart code generator is its context-awareness. Each instance runs with full access to your project’s file structure, coding conventions, and style. This allows it to write code that aligns with your team’s standards, whether you’re using Flask or FastAPI, React or Vue, or a custom internal framework.
Codex’s ability to adapt to a codebase makes it particularly useful for large-scale enterprise teams and open-source maintainers. It supports workflows like branch-based pull request generation, test suite execution, and static analysis, all initiated by simple English prompts. Over time, it learns the nuances of the repository it works in, leading to better suggestions and more accurate code synthesis.

Broader Implications: Lowering the Barrier to Software Creation

OpenAI frames Codex as a research preview, but its long-term vision is clear: AI will increasingly take over much of the routine work involved in building software. The aim isn’t to replace developers but to democratize software creation, allowing more people, especially non-traditional developers, to build working applications using natural language alone.
In this light, Codex is not just a coding tool, but a stepping stone toward a world where software development is collaborative between humans and machines. It brings software creation closer to the realm of design and ideation, and further away from syntax and implementation details.

What’s Next?

Codex is rolling out gradually, with usage limits in place during the preview phase.
OpenAI is gathering feedback to refine the agent’s capabilities, improve safety, and optimize its performance across different environments and languages. Whether you’re a solo developer, part of a DevOps team, or leading an enterprise platform, Codex represents a significant shift in how code is written, tested, and shipped. As AI agents continue to mature, the future of software engineering will be less about writing every line yourself and more about knowing what to build and asking the right questions.
Check out the Details here. All credit for this research goes to the researchers of this project. The post AI Agents Now Write Code in Parallel: OpenAI Introduces Codex, a Cloud-Based Coding Agent Inside ChatGPT appeared first on MarkTechPost.


Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation

Multimodal modeling focuses on building systems to understand and generate content across visual and textual formats. These models are designed to interpret visual scenes and produce new images using natural language prompts. With growing interest in bridging vision and language, researchers are working toward integrating image recognition and image generation capabilities into a unified system. This approach eliminates the need for separate pipelines and opens the path to more coherent and intelligent interactions across modalities. A key challenge in this field is to develop architectures that handle both understanding and generation without compromising the quality of either. Models need to grasp complex visual concepts and produce high-quality images matching user prompts. The difficulty lies in identifying suitable picture representations and training procedures that support both tasks. This problem becomes more evident when the same model is expected to interpret detailed text descriptions and generate visually accurate outputs based on them. It requires alignment of semantic understanding and pixel-level synthesis. Previous approaches have generally used Variational Autoencoders (VAEs) or CLIP-based encoders to represent images. VAEs are efficient for reconstruction but encode lower-level features, often leading to less informative representations. CLIP-based encoders provide high-level semantic embeddings by learning from large-scale image-text pairs. However, CLIP was not built for image reconstruction, making it challenging to use for generation unless paired with models like diffusion decoders. In terms of training, Mean Squared Error (MSE) is widely used for simplicity but tends to produce deterministic outputs. To improve generation diversity and quality, researchers have turned to Flow Matching, which introduces controlled stochasticity and better models the continuous nature of image features. 
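To show how a Flow Matching objective differs from plain regression onto the target features, here is a minimal sketch. It assumes the common straight-line (rectified-flow style) path between a noise sample and the data, where the model is trained to predict the constant velocity along that path; the article does not specify which path is used, so this choice, along with all names below, is an assumption. Plain Python lists stand in for feature vectors.

```python
import random


def flow_matching_pair(noise, data, t):
    """Build one training example for a straight-line probability path.

    x_t is the point a fraction t of the way from noise to data, and the
    regression target is the velocity along that path (data minus noise).
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target_velocity = [d - n for n, d in zip(noise, data)]
    return x_t, target_velocity


def flow_matching_loss(predict_velocity, data, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity: callable (x_t, t) -> list of floats; a stand-in for
    the diffusion transformer, here any Python function.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()
    x_t, target = flow_matching_pair(noise, data, t)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    features = [0.5, -1.0, 2.0]  # made-up stand-in for an image feature vector

    def zero_model(x_t, t):
        # Untrained placeholder that always predicts zero velocity.
        return [0.0] * len(x_t)

    print(flow_matching_loss(zero_model, features, rng))
```

Because every training example starts from a fresh noise sample, a trained model can map different noise draws to different outputs for the same prompt, which is the source of the diversity that a direct MSE fit to the target lacks.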
Researchers from Salesforce Research, in collaboration with the University of Maryland and several academic institutions, introduced BLIP3-o, a family of unified multimodal models. The model adopts a dual-stage training strategy where image understanding is learned first, followed by image generation. The proposed system leverages CLIP embeddings to represent images and integrates them with a diffusion transformer to synthesize new visual outputs. Unlike previous joint training methods, the sequential approach maintains the strength of each task independently. The diffusion module is trained while keeping the autoregressive backbone frozen, avoiding task interference. To improve alignment and visual fidelity, the team also curated BLIP3o-60k, a high-quality instruction-tuning dataset created by prompting GPT-4o across varied visual categories, including scenes, objects, gestures, and text. They developed two model versions: an 8-billion parameter model trained with proprietary and public data, and a 4-billion version using only open-source data. The image generation pipeline of BLIP3-o is built on Qwen2.5-VL large language models. Prompts are processed to produce visual features refined through a Flow Matching diffusion transformer. This transformer is based on the Lumina-Next architecture, optimized for speed and quality with 3D rotary position embedding and grouped-query attention. The model encodes each image into 64 fixed-length semantic vectors, regardless of resolution, which supports compact storage and efficient decoding. The research team used a large-scale dataset of 25 million images from sources like CC12M, SA-1B, and JourneyDB to train the models. They extended it with 30 million proprietary samples for the 8B model. They also included 60k instruction-tuning samples covering challenging prompts such as complex gestures and landmarks, generated via GPT-4o. In terms of performance, BLIP3-o demonstrated top scores across multiple benchmarks. 
The 8B model achieved a GenEval score of 0.84 for image generation alignment and a WISE score of 0.62 for reasoning ability. Image understanding scored 1682.6 on MME-Perception, 647.1 on MME-Cognition, 50.6 on MMMU, and 83.1 on both VQAv2 and TextVQA datasets. A human evaluation comparing BLIP3-o 8B with Janus Pro 7B showed that BLIP3-o was preferred 50.4% of the time for visual quality and 51.5% for prompt alignment. These results are supported by statistically significant p-values (5.05e-06 and 1.16e-05), indicating the superiority of BLIP3-o in subjective quality assessments.
This research outlines a clear solution to the dual challenge of image understanding and generation. CLIP embeddings, Flow Matching, and a sequential training strategy demonstrate how the problem can be approached methodically. The BLIP3-o model delivers state-of-the-art results and introduces an efficient and open approach to unified multimodal modeling.
Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project. The post Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation appeared first on MarkTechPost.


Google’s AlphaEvolve: The AI agent that reclaimed 0.7% of Google’s compute – and how to copy it

Google’s AlphaEvolve is a model of best-practice AI agent orchestration and offers a lesson in production-grade agent engineering. Discover its architecture and essential takeaways for your enterprise AI strategy.
