
Yandex Releases Alchemist: A Compact Supervised Fine-Tuning Dataset for Enhancing Text-to-Image (T2I) Model Quality

Despite the substantial progress in text-to-image (T2I) generation brought about by models such as DALL-E 3, Imagen 3, and Stable Diffusion 3, achieving consistently high output quality, both aesthetically and in terms of prompt alignment, remains a persistent challenge. Large-scale pretraining provides general knowledge but is not sufficient on its own to reach high aesthetic quality and alignment. Supervised fine-tuning (SFT) serves as a critical post-training step, but its effectiveness depends strongly on the quality of the fine-tuning dataset. Current public datasets used for SFT either target narrow visual domains (e.g., anime or specific art genres) or rely on basic heuristic filters over web-scale data. Human-led curation is expensive, hard to scale, and frequently fails to identify the samples that yield the greatest improvements. Moreover, recent T2I models are trained on internal proprietary datasets with minimal transparency, which limits the reproducibility of results and slows collective progress in the field.

Approach: A Model-Guided Dataset Curation

To mitigate these issues, Yandex has released Alchemist, a publicly available, general-purpose SFT dataset composed of 3,350 carefully selected image-text pairs. Unlike conventional datasets, Alchemist is constructed with a novel methodology that uses a pre-trained diffusion model as a sample-quality estimator. This approach enables the selection of training data with high impact on generative model performance without relying on subjective human labeling or simplistic aesthetic scoring. Alchemist is designed to improve the output quality of T2I models through targeted fine-tuning. The release also includes fine-tuned versions of five publicly available Stable Diffusion models. The dataset and models are accessible on Hugging Face under an open license, and the methodology and experiments are described in detail in the preprint.

Technical Design: Filtering Pipeline and Dataset Characteristics

The construction of Alchemist involves a multi-stage filtering pipeline that starts from roughly 10 billion web-sourced images and proceeds as follows.

Initial filtering: Removal of NSFW content and of images below a 1024×1024-pixel resolution threshold.

Coarse quality filtering: Classifiers trained on standard image quality assessment datasets such as KonIQ-10k and PIPAL exclude images with compression artifacts, motion blur, watermarks, and other defects.

Deduplication and IQA-based pruning: SIFT-like features are used to cluster similar images, retaining only high-quality ones. Images are further scored with the TOPIQ model to ensure that only clean samples remain.

Diffusion-based selection: A key contribution is the use of a pre-trained diffusion model's cross-attention activations to rank images. A scoring function identifies samples that strongly activate features associated with visual complexity, aesthetic appeal, and stylistic richness, selecting the samples most likely to improve downstream model performance.

Caption rewriting: The selected images are re-captioned with a vision-language model fine-tuned to produce prompt-style textual descriptions, improving alignment and usability in SFT workflows.

Through ablation studies, the authors find that increasing the dataset size beyond 3,350 samples (e.g., to 7k or 19k) lowers the quality of the fine-tuned models, reinforcing the value of targeted, high-quality data over raw volume.
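To make the cascade concrete, the sketch below mirrors the stages described above as a cheap-to-expensive filtering funnel. It is a minimal illustration, not Yandex's released code: every helper (is_nsfw, has_artifacts, topiq_score, diffusion_attention_score, rewrite_caption) is a hypothetical stub standing in for the corresponding classifier or model, and the thresholds are arbitrary.

```python
from dataclasses import dataclass

# Hypothetical stubs: each stands in for a component described above
# (NSFW classifier, defect classifiers, TOPIQ IQA model, diffusion
# cross-attention scorer, VLM re-captioner). Real implementations would
# load and run the corresponding models.
def is_nsfw(image) -> bool:
    return False

def has_artifacts(image) -> bool:               # compression, blur, watermarks, ...
    return False

def topiq_score(image) -> float:                # image quality score in [0, 1]
    return 1.0

def diffusion_attention_score(image) -> float:  # cross-attention-based ranking score
    return 0.0

def rewrite_caption(image) -> str:              # prompt-style caption from a VLM
    return ""

@dataclass
class Sample:
    image: object                               # e.g., a PIL.Image
    caption: str = ""

def curate(candidates, min_side=1024, iqa_threshold=0.7, target_size=3350):
    """Cheap-to-expensive filtering funnel that keeps only the top-ranked samples."""
    # Stage 1: drop NSFW content and images below the resolution threshold.
    pool = [s for s in candidates
            if not is_nsfw(s.image) and min(s.image.size) >= min_side]

    # Stage 2: coarse quality filtering with defect classifiers.
    pool = [s for s in pool if not has_artifacts(s.image)]

    # Stage 3: IQA-based pruning (SIFT-like deduplication omitted for brevity).
    pool = [s for s in pool if topiq_score(s.image) >= iqa_threshold]

    # Stage 4: rank by the diffusion model's cross-attention score, keep the best.
    pool.sort(key=lambda s: diffusion_attention_score(s.image), reverse=True)
    pool = pool[:target_size]

    # Stage 5: re-caption the surviving images with a vision-language model.
    for s in pool:
        s.caption = rewrite_caption(s.image)
    return pool
```

Ordering the stages this way, with the cheapest filters first, is presumably what makes scoring billions of candidates tractable before the more expensive diffusion-based ranking and re-captioning run.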
Results Across Multiple T2I Models

The effectiveness of Alchemist was evaluated across five Stable Diffusion variants: SD1.5, SD2.1, SDXL, SD3.5 Medium, and SD3.5 Large. Each model was fine-tuned on the Alchemist dataset and on a size-matched subset of LAION-Aesthetics v2, and compared against its respective baseline.

Human evaluation: Expert annotators performed side-by-side assessments across four criteria (text-image relevance, aesthetic quality, image complexity, and fidelity). Alchemist-tuned models showed statistically significant improvements in aesthetic and complexity scores, often outperforming both the baselines and the LAION-Aesthetics-tuned versions by margins of 12–20%. Importantly, text-image relevance remained stable, suggesting that prompt alignment was not negatively affected.

Automated metrics: Across metrics such as FD-DINOv2, CLIP Score, ImageReward, and HPS-v2, Alchemist-tuned models generally scored higher than their counterparts. Improvements were more consistent against the size-matched LAION-based models than against the baselines.

Dataset size ablation: Fine-tuning with larger variants of Alchemist (7k and 19k samples) led to lower performance, underscoring that stricter filtering and higher per-sample quality matter more than dataset size.

Yandex has used the dataset to train its proprietary text-to-image model, YandexART v2.5, and plans to continue leveraging it for future model updates.

Conclusion

Alchemist provides a well-defined and empirically validated pathway to improving the quality of text-to-image generation via supervised fine-tuning. The approach emphasizes sample quality over scale and introduces a replicable methodology for dataset construction without reliance on proprietary tools. While the improvements are most notable in perceptual attributes such as aesthetics and image complexity, the framework also highlights the trade-offs in fidelity that arise for newer base models already optimized through internal SFT. Nevertheless, Alchemist establishes a new standard for general-purpose SFT datasets and offers a valuable resource for researchers and developers working to advance the output quality of generative vision models.

Check out the Paper and the Alchemist Dataset on Hugging Face. Thanks to the Yandex team for the thought leadership and resources for this article.
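For readers unfamiliar with the automated metrics mentioned above, the snippet below shows one common way to compute a CLIP Score for a batch of generated images using an off-the-shelf CLIP checkpoint from Hugging Face. It is only a rough illustration, not the evaluation code used in the paper; the checkpoint name and the cosine-similarity-times-100 convention are assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP checkpoint; any CLIP variant works for this sketch.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def clip_score(images, prompts):
    """Mean image-text cosine similarity, scaled by 100 (a common CLIP Score convention).

    `images` is a list of PIL images and `prompts` the matching list of text prompts.
    """
    image_inputs = processor(images=images, return_tensors="pt")
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True, truncation=True)
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (100.0 * (image_emb * text_emb).sum(dim=-1)).mean().item()

# Usage idea: score the same prompts rendered by a baseline model and by an
# Alchemist-tuned model, then compare the two means.
```

Comparing scores for identical prompts rendered by a baseline model and by an Alchemist-tuned model gives the kind of paired comparison the automated evaluation reports.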


VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control

Bridging Perception and Action in Robotics

Multimodal Large Language Models (MLLMs) hold promise for enabling machines, such as robotic arms and legged robots, to perceive their surroundings, interpret scenarios, and take meaningful actions. The integration of such intelligence into physical systems is advancing robotics toward autonomous machines that don't just see and describe but also plan and move within their environments based on contextual understanding.

Despite the growing power of MLLMs, one persistent issue is their inability to combine vision, reasoning, and physical interaction into one cohesive system. Typically, models trained to understand images or text fall short when asked to control robots in real-world spaces. The core problem is that understanding a scene is fundamentally different from acting within it. Multimodal understanding focuses on perception and analysis, while physical control requires precise, real-time decision-making based on that perception. This disconnect creates bottlenecks when building agents that must simultaneously observe, reason, and act in varied environments.

Limitations of Prior VLA Models

Previous tools for robot control rely heavily on vision-language-action (VLA) models, which are trained on extensive robotic datasets to convert visual observations into control signals. While some solutions try to preserve the reasoning capability of MLLMs by translating commands into text-based actions, they struggle to maintain accuracy and adaptability during control tasks. VLAs often degrade in performance when applied to diverse or long-horizon robotic operations, and because of the gap between image-based understanding and motion control, they usually fail to generalize across different environments or robot types.

Introducing VeBrain: A Unified Multimodal Framework

Researchers from Shanghai AI Laboratory, Tsinghua University, and SenseTime Research, in collaboration with multiple other institutes, have introduced a unified framework called Visual Embodied Brain (VeBrain). VeBrain reformulates robot control as text-based tasks within a 2D visual space, aligning it more closely with how MLLMs function. The framework integrates multimodal understanding, spatial reasoning, and robotic control into one structure. A specially designed robotic adapter converts the MLLM's output into executable movement policies, enabling a single model to manage perception, reasoning, and control. VeBrain is also supported by a high-quality instruction dataset called VeBrain-600k, which combines over 600,000 samples of multimodal tasks, including robot motion and reasoning steps.

Technical Components: Architecture and Robotic Adapter

VeBrain builds on an architecture based on Qwen2.5-VL, augmented with components that enable real-world control. The robotic adapter contains four key modules. The point tracker updates 2D keypoints as the robot's view changes, ensuring accurate targeting. The movement controller transforms 2D keypoints into 3D movements by combining image data with depth maps. The skill executor maps predicted actions, such as "turn" or "grasp," to pre-trained robotic skills. Finally, the dynamic takeover module monitors failures or anomalies, handing control back to the MLLM when needed. Together, these modules form a closed-loop system that makes decisions, acts, and self-corrects, allowing robots to operate effectively in diverse situations.
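One way to picture how the four adapter modules close the loop between the MLLM and the robot is the schematic Python sketch below. The class and method names are invented for illustration and do not correspond to VeBrain's actual implementation; the MLLM interface (mllm.plan) and the skill library are assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]

@dataclass
class StepResult:
    success: bool
    note: str = ""

class RoboticAdapter:
    """Illustrative closed loop over the four adapter modules described above.
    Method bodies are stubs; a real system would couple them to a tracker,
    depth sensing, and pre-trained low-level skills."""

    def track_points(self, frame, points: List[Point2D]) -> List[Point2D]:
        # Point tracker: keep 2D keypoints aligned with the current camera view.
        return points

    def to_3d(self, frame, depth, points: List[Point2D]) -> List[Point3D]:
        # Movement controller: lift 2D keypoints to 3D using the depth map.
        return [(x, y, float(depth[int(y)][int(x)])) for x, y in points]

    def execute_skill(self, skill: str, targets: List[Point3D]) -> bool:
        # Skill executor: map a predicted action ("turn", "grasp", ...) to a
        # pre-trained robotic skill. Stubbed to always succeed here.
        return True

    def step(self, mllm, frame, depth, instruction: str) -> StepResult:
        # The MLLM proposes a skill name and 2D keypoints as text; parsing is assumed.
        skill, points = mllm.plan(frame, instruction)
        points = self.track_points(frame, points)
        targets = self.to_3d(frame, depth, points)
        if not self.execute_skill(skill, targets):
            # Dynamic takeover: on failure, hand control back to the MLLM
            # so it can re-plan from the latest observation.
            return StepResult(False, "takeover: re-planning requested")
        return StepResult(True)
```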
Performance Evaluation Across Multimodal and Robotic Benchmarks

VeBrain was evaluated across 13 multimodal and 5 spatial benchmarks. On MMVet, it achieved a 5.6% improvement over Qwen2.5-VL. It scored 101.5 on the CIDEr metric for ScanQA and 83.7 on MMBench. On the VSI benchmark, it averaged 39.9, outperforming Qwen2.5-VL's 35.9. In robotic evaluations, VeBrain reached an 86.4% success rate across seven legged-robot tasks, significantly surpassing models such as VLA and π0, which scored 32.1% and 31.4%, respectively. On robotic-arm tasks, it achieved a success rate of 74.3%, outperforming others by up to 80%. These results show VeBrain's ability to handle long-horizon and spatially complex control challenges with high reliability.

Conclusion

The research presents a compelling direction for embodied AI. The researchers redefined robot control as a language task, enabling high-level reasoning and low-level action to coexist. The method bridges the gap between image understanding and robot execution in a way that is both functional and scalable. With a robust design and strong performance, VeBrain signals a shift toward more unified, intelligent robotic systems capable of operating autonomously across diverse tasks and environments.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models

arXiv:2506.05385v1 Announce Type: new Abstract: Semantic role labeling (SRL) is a crucial task of natural language processing (NLP). Although generative decoder-based large language models (LLMs) have achieved remarkable success across various NLP tasks, they still lag behind state-of-the-art encoder-decoder (BERT-like) models in SRL. In this work, we seek to bridge this gap by equipping LLMs for SRL with two mechanisms: (a) retrieval-augmented generation and (b) self-correction. The first mechanism enables LLMs to leverage external linguistic knowledge such as predicate and argument structure descriptions, while the second allows LLMs to identify and correct inconsistent SRL outputs. We conduct extensive experiments on three widely-used benchmarks of SRL (CPB1.0, CoNLL-2009, and CoNLL-2012). Results demonstrate that our method achieves state-of-the-art performance in both Chinese and English, marking the first successful application of LLMs to surpass encoder-decoder approaches in SRL.
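The abstract names two mechanisms without implementation detail. The following is a minimal sketch, under assumed interfaces, of how retrieval-augmented generation and self-correction can be wrapped around an LLM for SRL; the prompts, the retrieve_frame_descriptions store, and the llm callable are hypothetical placeholders, not the authors' code.

```python
from typing import Callable, List

def label_roles(llm: Callable[[str], str],
                retrieve_frame_descriptions: Callable[[str], List[str]],
                sentence: str,
                predicate: str,
                max_rounds: int = 2) -> str:
    """Retrieval-augmented SRL with a simple self-correction loop (illustrative)."""
    # (a) Retrieval-augmented generation: pull external linguistic knowledge,
    # e.g., predicate and argument structure descriptions for the target predicate.
    knowledge = "\n".join(retrieve_frame_descriptions(predicate))

    prompt = (
        f"Predicate-argument descriptions:\n{knowledge}\n\n"
        f"Sentence: {sentence}\n"
        f"Predicate: {predicate}\n"
        "List each argument span with its semantic role (ARG0, ARG1, ARGM-TMP, ...)."
    )
    answer = llm(prompt)

    # (b) Self-correction: ask the model to check its own output for
    # inconsistencies (overlapping spans, duplicate core roles, spans not
    # present in the sentence) and revise it if necessary.
    for _ in range(max_rounds):
        critique = llm(
            f"Sentence: {sentence}\nPredicate: {predicate}\n"
            f"Proposed labeling:\n{answer}\n"
            "If the labeling is consistent, reply exactly OK. "
            "Otherwise output a corrected labeling."
        )
        if critique.strip() == "OK":
            break
        answer = critique
    return answer
```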


Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

arXiv:2506.05412v1 Announce Type: cross Abstract: Gaze-referential inference–the ability to infer what others are looking at–is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos taken with manipulated difficulty and variability, comparing performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even respond with each choice almost equally frequently. Are they randomly guessing? Although most VLMs struggle, when we zoom in on five of the top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across different prompts and scene objects. These behavioral features cannot be explained by considering them as random guessers. Instead, they likely use a combination of heuristics and guessing such that their performance is subject to the task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, but the potential remains.
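As a minimal illustration of what "better than random guessing" amounts to statistically (the study itself analyzes behavior with mixed-effects models, not this test), one can compare a model's accuracy to the chance level of a forced-choice task with a one-sided binomial test. The trial count, the number of correct answers, and the 0.25 chance level below are made-up numbers for illustration only.

```python
from scipy.stats import binomtest

# Hypothetical example: 52 correct answers out of 200 four-choice trials,
# tested against the 0.25 chance level of a four-alternative forced choice.
result = binomtest(k=52, n=200, p=0.25, alternative="greater")
print(result.pvalue)  # a small p-value would indicate above-chance performance
```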


A Unified Representation for Continuity and Discontinuity: Syntactic and Computational Motivations

arXiv:2506.05686v1 Announce Type: new Abstract: This paper advances a unified representation of linguistic structure for three grammar formalisms, namely, Phrase Structure Grammar (PSG), Dependency Grammar (DG) and Categorial Grammar (CG) from the perspective of syntactic and computational complexity considerations. The correspondence principle is proposed to enable a unified representation of the representational principles from PSG, DG, and CG. To that end, the paper first illustrates a series of steps in achieving a unified representation for a discontinuous subordinate clause from Turkish as an illustrative case. This affords a new way of approaching discontinuity in natural language from a theoretical point of view that unites and integrates the basic tenets of PSG, DG, and CG, with significant consequences for syntactic analysis. Then this paper demonstrates that a unified representation can simplify computational complexity with regards to the neurocognitive representation and processing of both continuous and discontinuous sentences vis-à-vis the basic principles of PSG, DG, and CG.


IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems

arXiv:2506.05947v1 Announce Type: new Abstract: In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies, inadvertently imposing their expectations or solutions on the seeker. Clearly defined intentions are essential for guiding both the supporter's motivations and the overall emotional support process. In this paper, we propose the Intention-centered Emotional Support Conversation (IntentionESC) framework, which defines the possible intentions of supporters in emotional support conversations, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies. While Large Language Models (LLMs) excel at text generation, they fundamentally operate as probabilistic models trained on extensive datasets, lacking a true understanding of human thought processes and intentions. To address this limitation, we introduce the Intention Centric Chain-of-Thought (ICECoT) mechanism. ICECoT enables LLMs to mimic human reasoning by analyzing emotional states, inferring intentions, and selecting suitable support strategies, thereby generating more effective emotional support responses. To train the model with ICECoT and integrate expert knowledge, we design an automated annotation pipeline that produces high-quality training data. Furthermore, we develop a comprehensive evaluation scheme to assess emotional support efficacy and conduct extensive experiments to validate our framework. Our data and code are available at https://github.com/43zxj/IntentionESC_ICECoT.
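A rough sketch of the ICECoT idea, with the prompts and the llm interface assumed rather than taken from the paper: instead of answering directly, the model is asked to reason in stages (emotional state, then intention, then strategy and response).

```python
from typing import Callable

def icecot_response(llm: Callable[[str], str], dialogue_history: str) -> str:
    """Three-stage intention-centered reasoning before responding (illustrative)."""
    # Stage 1: analyze the seeker's emotional state from the conversation so far.
    emotion = llm(
        "Conversation:\n" + dialogue_history +
        "\nDescribe the seeker's current emotional state in one or two sentences."
    )

    # Stage 2: infer the supporter's intention given that emotional state.
    intention = llm(
        "Conversation:\n" + dialogue_history +
        f"\nEmotional state: {emotion}\n"
        "What should the supporter's intention be (e.g., comfort the seeker, "
        "explore the problem, encourage reflection)? Answer briefly."
    )

    # Stage 3: map the intention to a support strategy and generate the reply.
    return llm(
        "Conversation:\n" + dialogue_history +
        f"\nEmotional state: {emotion}\nIntention: {intention}\n"
        "Choose an appropriate support strategy for this intention and write "
        "the supporter's next reply accordingly."
    )
```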


Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

arXiv:2411.16525v2 Announce Type: replace-cross Abstract: We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are showing that prompt tuning on single-head transformers with only a single self-attention layer: (i) is universal, and (ii) supports efficient (even almost-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on these simplest possible transformers yields universal approximators for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-$dL$ and -in-$(1/\epsilon)$ lower bound on the required soft-prompt tokens for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the soft-prompt-induced keys and queries, and provide an upper bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear time prompt tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners.
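As background for readers unfamiliar with the setting, the sketch below shows what prompt tuning on a single-layer, single-head transformer looks like in practice: the backbone is frozen and only a short prefix of soft-prompt tokens is optimized. The toy attention block, the dimensions, and the training target are assumptions for illustration, not the paper's construction.

```python
import torch
import torch.nn as nn

class OneHeadSelfAttention(nn.Module):
    """A single self-attention layer with one head (toy stand-in for the backbone)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.attn(x, x, x)          # self-attention over the full sequence
        return self.out(h)

class PromptTunedModel(nn.Module):
    """Freeze the transformer; learn only a prefix of soft-prompt tokens."""
    def __init__(self, backbone: nn.Module, d_model: int, num_prompt_tokens: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)        # backbone weights stay fixed
        self.soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). Prepend the learned prompt tokens.
        prompt = self.soft_prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        out = self.backbone(torch.cat([prompt, x], dim=1))
        return out[:, prompt.size(1):, :]  # drop the prompt positions from the output

# Toy usage: fit only the soft prompt to an arbitrary sequence-to-sequence target.
d_model, n_prompt = 16, 4
model = PromptTunedModel(OneHeadSelfAttention(d_model), d_model, n_prompt)
optimizer = torch.optim.Adam([model.soft_prompt], lr=1e-2)
x = torch.randn(8, 10, d_model)
target = torch.roll(x, shifts=1, dims=1)   # arbitrary target mapping for the demo
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()
```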


Meet BioReason: The World’s First Reasoning Model in Biology that Enables AI to Reason about Genomics like a Biology Expert

A major hurdle in using AI for genomics is the lack of interpretable, step-by-step reasoning from complex DNA data. While DNA foundation models excel at learning rich sequence patterns for tasks such as variant prediction and gene regulation, they often operate as black boxes, offering limited insight into the underlying biological mechanisms. Meanwhile, large language models demonstrate impressive reasoning skills across various domains, but they aren't designed to handle raw genomic sequences. This gap between strong DNA representation and deep biological reasoning prevents AI from reaching expert-level understanding and limits its potential to drive scientific discovery through meaningful, hypothesis-driven explanations.

DNA foundation models have made significant progress by learning rich representations directly from genomic sequences, showing strong performance across a range of biological tasks. Models like Evo2, with its long-range capabilities, highlight this potential, but their lack of interpretability limits deeper biological insight. Meanwhile, large language models excel at reasoning over biomedical texts but rarely engage directly with raw genomic data. Attempts such as GeneGPT and TxGemma represent early efforts to bridge this gap, and current genomic benchmarks assess task performance but fall short in evaluating reasoning and hypothesis generation.

Researchers from the University of Toronto, Vector Institute, University Health Network (UHN), Arc Institute, Cohere, University of California, San Francisco, and Google DeepMind have introduced BIOREASON, a pioneering AI system that unites a DNA foundation model with an LLM. This integration allows BIOREASON to analyze raw genomic sequences while applying LLM-based reasoning to generate clear, biologically grounded insights. Trained through supervised fine-tuning and reinforcement learning, it achieves a performance gain of 15% or more over traditional models, reaching up to 97% accuracy in KEGG-based disease pathway prediction. The approach offers interpretable, step-by-step outputs that advance biological understanding and facilitate hypothesis generation.

The BIOREASON model is a multimodal framework designed to support deep, interpretable biological reasoning by combining genomic sequences with natural language queries. It uses a DNA foundation model to extract rich, contextual embeddings from raw DNA inputs and integrates these with tokenized textual queries to form a unified input for an LLM, specifically Qwen3. The system is trained to generate step-by-step explanations of biological processes. DNA embeddings are projected into the LLM's embedding space using a learnable layer, and the combined input is enriched with positional encoding. Reinforcement learning via Group Relative Policy Optimization further refines its reasoning capabilities.

The researchers evaluated BIOREASON on three datasets focused on DNA variant interpretation and biological reasoning. It outperformed both DNA-only and LLM-only models in predicting disease outcomes from genomic variants. The best-performing version, which combined Evo2 and Qwen3-4B, achieved high accuracy and F1-scores across all tasks. A notable case study involved a PFN1 mutation linked to ALS, where BIOREASON accurately predicted the disease and generated a 10-step explanation tracing the variant's impact on actin dynamics and motor neuron degeneration. This demonstrates its strength not just in accurate prediction but also in providing transparent, biologically grounded reasoning paths.
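The fusion step described above (DNA-encoder embeddings projected into the LLM's embedding space via a learnable layer, then concatenated with the embedded text query) can be sketched as follows. The module name and the dimensions are assumptions for illustration, not the released BioReason implementation, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class DnaToLlmFusion(nn.Module):
    """Project DNA-encoder embeddings into the LLM embedding space and
    concatenate them with the embedded text query (illustrative sketch)."""
    def __init__(self, dna_dim: int, llm_dim: int):
        super().__init__()
        # Learnable projection from the DNA encoder's space to the LLM token space.
        self.proj = nn.Linear(dna_dim, llm_dim)

    def forward(self, dna_embeddings: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        # dna_embeddings: (batch, dna_tokens, dna_dim) from a DNA foundation model.
        # text_embeddings: (batch, text_tokens, llm_dim) from the LLM's embedding layer.
        projected = self.proj(dna_embeddings)
        # The combined sequence is fed to the LLM, which generates a step-by-step
        # explanation conditioned on both modalities.
        return torch.cat([projected, text_embeddings], dim=1)

# Toy shapes: a 512-dim DNA encoder and a 4096-dim LLM embedding space (assumed values).
fusion = DnaToLlmFusion(dna_dim=512, llm_dim=4096)
dna = torch.randn(1, 128, 512)
text = torch.randn(1, 32, 4096)
fused = fusion(dna, text)          # (1, 160, 4096) multimodal input sequence
```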
In conclusion, BIOREASON combines DNA encoders with large language models to enable detailed, interpretable reasoning over genomic data. Unlike traditional models, it not only makes accurate predictions but also explains the biological logic behind them through step-by-step outputs, helping scientists better understand disease mechanisms and generate new research questions. BIOREASON still faces challenges, such as high computational cost and limited uncertainty quantification. Future work aims to address these issues by improving scalability, incorporating additional biological data such as RNA and proteins, and applying the system to broader tasks, including GWAS. Overall, BIOREASON shows promise for advancing precision medicine and genomic research.

Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project.
