YouZum

News

AI, Committee, News, Uncategorized

Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment

arXiv:2505.12452v2 Announce Type: replace Abstract: Large language models (LLMs) increasingly demonstrate signs of conceptual understanding, yet much of their internal knowledge remains latent, loosely structured, and difficult to access or evaluate. We propose self-questioning as a lightweight and scalable strategy to improve LLMs’ understanding, particularly in domains where success depends on fine-grained semantic distinctions. To evaluate this approach, we introduce a challenging new benchmark of 1.3 million post-2015 computer science patent pairs, characterized by dense technical jargon and strategically complex writing. The benchmark centers on a pairwise differentiation task: can a model distinguish between closely related but substantively different inventions? We show that compared to placebo scientific information, prompting LLMs to generate and answer their own questions – targeting the background knowledge required for the task – significantly improves performance. These self-generated questions and answers activate otherwise underutilized internal knowledge. Allowing LLMs to retrieve answers from external scientific texts further enhances performance, suggesting that model knowledge is compressed and lacks the full richness of the training data. We also find that chain-of-thought prompting and self-questioning converge, though self-questioning remains more effective for improving understanding of technical concepts. Notably, we uncover an asymmetry in prompting: smaller models often generate more fundamental, more open-ended, better-aligned questions for mid-sized models than large models do, revealing a new strategy for cross-model collaboration. Altogether, our findings establish self-questioning as both a practical mechanism for automatically improving LLM comprehension, especially in domains with sparse and underrepresented knowledge, and a diagnostic probe of how internal and external knowledge are organized.
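The abstract describes self-questioning purely as a prompting strategy, so its core loop can be sketched in a few lines. The prompts, the model name, and the OpenAI-style chat helper below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the self-questioning strategy described in the abstract.
# Prompts, model name, and the OpenAI-style client are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"  # placeholder model

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def self_questioning_judgment(patent_a: str, patent_b: str) -> str:
    # Step 1: the model writes background questions it would need answered.
    questions = chat(
        "List 3 background questions whose answers would help decide whether "
        f"these two patent abstracts describe different inventions:\n\nA: {patent_a}\n\nB: {patent_b}"
    )
    # Step 2: the model answers its own questions, surfacing latent knowledge.
    answers = chat(f"Answer each question concisely:\n{questions}")
    # Step 3: the self-generated Q&A pairs become context for the pairwise judgment.
    return chat(
        f"Background Q&A:\n{questions}\n{answers}\n\n"
        "Given this background, do abstracts A and B describe substantively "
        "different inventions? Answer 'same' or 'different' with one sentence of justification.\n\n"
        f"A: {patent_a}\n\nB: {patent_b}"
    )
```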

Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment Read Post »

AI, Committee, News, Uncategorized

BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning

arXiv:2501.18858v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Within this framework, we introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps. First, it generates high-quality rationales by approximating the optimal thinking process through reinforcement learning, using a novel reward shaping mechanism. Second, it enhances the base LLM by maximizing the joint probability of rationale generation with respect to the model’s parameters. Theoretically, we demonstrate BRiTE’s convergence at a rate of $1/T$ with $T$ representing the number of iterations. Empirical evaluations on math and coding benchmarks demonstrate that our approach consistently improves performance across different base models without requiring human-annotated thinking processes. In addition, BRiTE demonstrates superior performance compared to existing algorithms that bootstrap thinking processes using alternative methods such as rejection sampling, and can even match or exceed the results achieved through supervised fine-tuning with human-annotated data.
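Read literally, the two steps alternate between a reward-shaped rationale sampler and a likelihood-maximization update. In notation assumed here for illustration (the paper's own formalism may differ), with prompt $x$, latent rationale $z$, and answer $y$:

$$\theta_{t+1} = \arg\max_{\theta} \; \mathbb{E}_{(x,y)}\big[\log p_{\theta}(z, y \mid x)\big], \qquad z \sim \pi_{\mathrm{RL}}^{(t)}(\cdot \mid x, y),$$

where $\pi_{\mathrm{RL}}^{(t)}$ is the Step 1 approximation of the optimal thinking process obtained by reinforcement learning under the shaped reward, and the abstract's $1/T$ guarantee bounds the suboptimality after $T$ such iterations.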

BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning Read Post »

AI, Committee, News, Uncategorized

Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

arXiv:2506.07106v1 Announce Type: new Abstract: Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.
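The linked repository contains the reference implementation; purely for orientation, the control flow described in the abstract can be sketched as below. The chat() helper, the prompts, and the coherence proxy (average pairwise NLI entailment instead of full Bayesian belief propagation over a reasoning graph) are simplifying assumptions.

```python
# Rough sketch of the ToTh control flow: three reasoning agents, an NLI-based
# coherence score per trace, and selection of the most coherent trace.
# Prompts, model names, and the simplified coherence proxy are assumptions.
from openai import OpenAI
from transformers import pipeline

client = OpenAI()
nli = pipeline("text-classification", model="roberta-large-mnli")

MODES = {
    "abductive": "Reason to the best explanation of the facts, step by step.",
    "deductive": "Reason strictly from the premises to a conclusion, step by step.",
    "inductive": "Generalize from the given observations, step by step.",
}

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def coherence(steps: list[str]) -> float:
    # Simplified stand-in for the paper's NLI-guided belief propagation:
    # average entailment confidence between consecutive reasoning steps.
    scores = []
    for premise, hypothesis in zip(steps, steps[1:]):
        res = nli({"text": premise, "text_pair": hypothesis})
        out = res[0] if isinstance(res, list) else res
        scores.append(out["score"] if out["label"] == "ENTAILMENT" else 1.0 - out["score"])
    return sum(scores) / max(len(scores), 1)

def theorem_of_thought(question: str) -> str:
    traces = []
    for mode, instruction in MODES.items():
        text = chat(
            f"{instruction}\nQuestion: {question}\n"
            "Number each step on its own line and end with a line 'Answer: ...'."
        )
        traces.append([line.strip() for line in text.splitlines() if line.strip()])
    best = max(traces, key=coherence)  # keep the most internally coherent trace
    return best[-1]                    # its final 'Answer: ...' line
```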

Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models Read Post »

AI, Committee, News, Uncategorized

The Download: an inspiring toy robot arm, and why AM radio matters

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

How a 1980s toy robot arm inspired modern robotics
—Jon Keegan

As a child of an electronic engineer, I spent a lot of time in our local Radio Shack as a kid. While my dad was locating capacitors and resistors, I was in the toy section. It was there, in 1984, that I discovered the best toy of my childhood: the Armatron robotic arm. Described as a “robot-like arm to aid young masterminds in scientific and laboratory experiments,” it was a legit robotic arm. And the bold look and function of Armatron made quite an impression on many young kids who would one day have a career in robotics. Read the full story.

If you’re interested in the future of robots, why not check out:
+ Will we ever trust robots? If most robots still need remote human operators to be safe and effective, why should we welcome them into our homes? Read the full story.
+ When you might start speaking to robots. Google is only the latest to fuse large language models with robots. Here’s why the trend has big implications.
+ How AI models let robots carry out tasks in unfamiliar environments. Read the full story.
+ China’s EV giants are betting big on humanoid robots. Technical know-how and existing supply chains give Chinese electric-vehicle makers a significant head start in the sector. Read the full story.

Why we still need AM radio

The most reliable way to keep us informed in times of disaster is being threatened. Check out Ariel Aberg-Riger’s beautiful visual story illustrating AM radio’s importance in uncertain times.

Both of these stories are from the most recent edition of our print magazine, which is all about how technology is changing creativity. Subscribe now to get future copies before they land.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Protestors set Waymo robotaxis alight in Los Angeles
The groups clashed with police over the Trump administration’s immigration raids. (LA Times $)
+ Much of the technology that fuels deportation orders is error-ridden. (Slate $)
+ Immigrants are using a swathe of new apps to stay ahead of deportation. (Rest of World)

2 What’s next for Elon Musk and Donald Trump
A full breakdown in relations could be much worse for Musk in the long run. (NY Mag $)
+ Trump’s backers are rapidly turning on Musk, too. (New Yorker $)
+ The biggest winner from their falling-out? Jeff Bezos. (The Information $)

3 DOGE used an inaccurate AI tool to terminate Veterans Affairs contracts
Its code frequently produced glaring mistakes. (ProPublica)
+ Undeterred, the department is on a hiring spree. (Wired $)
+ Can AI help DOGE slash government budgets? It’s complex. (MIT Technology Review)

4 Europe’s shrinking forests could cause it to miss net-zero targets
Its trees aren’t soaking up as much carbon as they used to. (New Scientist $)
+ Inside the controversial tree farms powering Apple’s carbon neutral goal. (MIT Technology Review)

5 OpenAI wants to embed ChatGPT into college campuses
The ultimate goal? A personalized AI account for every student. (NYT $)
+ Meanwhile, other universities are experimenting with tech-free classes. (The Atlantic $)
+ ChatGPT is going to change education, not destroy it. (MIT Technology Review)

6 Chinese regulators are pumping the brakes on self-driving cars
They’re developing a new framework to assess the safety of autonomous features. (FT $)
+ The country’s robotaxis are rapidly catching up with the West. (Rest of World)
+ How China is regulating robotaxis. (MIT Technology Review)

7 Desalination is finally becoming a reality
Removing salt from seawater is one way to combat water scarcity. (WSJ $)
+ If you can make it through tons of plastic, that is. (The Atlantic $)

8 We’re getting better at fighting cancer
Deaths from the disease in the US have dropped by a third since 1991. (Vox)
+ Why it’s so hard to use AI to diagnose cancer. (MIT Technology Review)

9 Teenage TikTokers’ skin regimes offer virtually no benefit
And could even be potentially harmful. (The Guardian)
+ The fight for “Instagram face”. (MIT Technology Review)

10 Tech’s layoff groups are providing much-needed support
Workers who have been let go by their employers are forming little communities. (Insider $)

Quote of the day

“Every tech company is doing similar things but we were open about it.”

—Luis von Ahn, chief executive of the language-learning app Duolingo, tells the Financial Times that his company is far from the only one adopting an AI-first strategy.

One more thing

How to break free of Spotify’s algorithm

Since the heyday of radio, the branding of sound has evolved from broad genres like rock and hip-hop to “paranormal dark cabaret afternoon” and “synth space,” and streaming has become the default. Meanwhile, the ritual of discovering something new is now neatly packaged in a 30-song playlist. The only rule in music streaming is personalization. What we’ve gained in convenience, we’ve lost in curiosity. But it doesn’t have to be this way. Read the full story.
—Tiffany Ng

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)
+ Happy birthday to Michael J Fox, who turns 64 today!
+ Whenever you need to play the world’s smallest violin, these scientists can help you out.
+ An early JMW Turner oil painting has been rediscovered.
+ Watching robots attempt to kickbox is pretty amusing.

The Download: an inspiring toy robot arm, and why AM radio matters Read Post »

AI, Committee, News, Uncategorized

Yandex Releases Alchemist: A Compact Supervised Fine-Tuning Dataset for Enhancing Text-to-Image T2I Model Quality

Despite the substantial progress in text-to-image (T2I) generation brought about by models such as DALL-E 3, Imagen 3, and Stable Diffusion 3, achieving consistent output quality — both in aesthetic and alignment terms — remains a persistent challenge. While large-scale pretraining provides general knowledge, it is insufficient to achieve high aesthetic quality and alignment. Supervised fine-tuning (SFT) serves as a critical post-training step, but its effectiveness is strongly dependent on the quality of the fine-tuning dataset. Current public datasets used in SFT either target narrow visual domains (e.g., anime or specific art genres) or rely on basic heuristic filters over web-scale data. Human-led curation is expensive, non-scalable, and frequently fails to identify samples that yield the greatest improvements. Moreover, recent T2I models use internal proprietary datasets with minimal transparency, limiting the reproducibility of results and slowing collective progress in the field.

Approach: Model-Guided Dataset Curation

To mitigate these issues, Yandex has released Alchemist, a publicly available, general-purpose SFT dataset composed of 3,350 carefully selected image-text pairs. Unlike conventional datasets, Alchemist is constructed using a novel methodology that leverages a pre-trained diffusion model as a sample quality estimator. This approach enables the selection of training data with high impact on generative model performance without relying on subjective human labeling or simplistic aesthetic scoring. Alchemist is designed to improve the output quality of T2I models through targeted fine-tuning. The release also includes fine-tuned versions of five publicly available Stable Diffusion models. The dataset and models are accessible on Hugging Face under an open license; further details on the methodology and experiments are given in the preprint.

Technical Design: Filtering Pipeline and Dataset Characteristics

The construction of Alchemist involves a multi-stage filtering pipeline starting from roughly 10 billion web-sourced images. The pipeline is structured as follows:

Initial Filtering: Removal of NSFW content and low-resolution images (keeping only images above 1024×1024 pixels).

Coarse Quality Filtering: Application of classifiers to exclude images with compression artifacts, motion blur, watermarks, and other defects. These classifiers were trained on standard image quality assessment datasets such as KonIQ-10k and PIPAL.

Deduplication and IQA-Based Pruning: SIFT-like features are used to cluster similar images, and only high-quality ones are retained. Images are further scored with the TOPIQ model to keep clean samples.

Diffusion-Based Selection: A key contribution is the use of a pre-trained diffusion model’s cross-attention activations to rank images. A scoring function identifies samples that strongly activate features associated with visual complexity, aesthetic appeal, and stylistic richness, selecting the samples most likely to enhance downstream model performance.

Caption Rewriting: The final selected images are re-captioned using a vision-language model fine-tuned to produce prompt-style textual descriptions, which improves alignment and usability in SFT workflows.

Through ablation studies, the authors find that increasing the dataset size beyond 3,350 samples (e.g., to 7k or 19k) lowers the quality of the fine-tuned models, reinforcing the value of targeted, high-quality data over raw volume.
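As a rough illustration of how two of these stages compose, the sketch below implements only the resolution gate and a TOPIQ-based IQA threshold. The classifier, deduplication, diffusion-scoring, and re-captioning stages are omitted, and the threshold values and the pyiqa metric name are assumptions rather than values from the paper.

```python
# Simplified sketch of two filtering stages: the resolution gate and TOPIQ-based
# IQA pruning. Thresholds and the pyiqa metric name are assumptions, not values
# taken from the Alchemist paper.
from pathlib import Path
from PIL import Image
import pyiqa  # pip install pyiqa
import torch

topiq = pyiqa.create_metric("topiq_nr", device="cuda" if torch.cuda.is_available() else "cpu")

def passes_resolution(path: Path, min_side: int = 1024) -> bool:
    with Image.open(path) as img:
        return min(img.size) >= min_side  # drop images below 1024 px on a side

def topiq_score(path: Path) -> float:
    return topiq(str(path)).item()  # higher = better no-reference quality

def filter_candidates(paths: list[Path], iqa_threshold: float = 0.6) -> list[Path]:
    kept = []
    for p in paths:
        if not passes_resolution(p):
            continue                      # stage 1: low-resolution images
        if topiq_score(p) < iqa_threshold:
            continue                      # stage 3: low-IQA images
        kept.append(p)                    # stages 2, 4, 5 would slot in here
    return kept
```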
Results Across Multiple T2I Models

The effectiveness of Alchemist was evaluated across five Stable Diffusion variants: SD1.5, SD2.1, SDXL, SD3.5 Medium, and SD3.5 Large. Each model was assessed in three configurations: (i) fine-tuned on the Alchemist dataset, (ii) fine-tuned on a size-matched subset of LAION-Aesthetics v2, and (iii) its unmodified baseline.

Human Evaluation: Expert annotators performed side-by-side assessments across four criteria — text-image relevance, aesthetic quality, image complexity, and fidelity. Alchemist-tuned models showed statistically significant improvements in aesthetic and complexity scores, often outperforming both the baselines and the LAION-Aesthetics-tuned versions by margins of 12–20%. Importantly, text-image relevance remained stable, suggesting that prompt alignment was not negatively affected.

Automated Metrics: Across metrics such as FD-DINOv2, CLIP Score, ImageReward, and HPS-v2, Alchemist-tuned models generally scored higher than their counterparts. Notably, improvements were more consistent relative to the size-matched LAION-based models than relative to the baselines.

Dataset Size Ablation: Fine-tuning with larger variants of Alchemist (7k and 19k samples) led to lower performance, underscoring that stricter filtering and higher per-sample quality are more impactful than dataset size.

Yandex has used the dataset to train its proprietary text-to-image model, YandexART v2.5, and plans to continue leveraging it for future model updates.

Conclusion

Alchemist provides a well-defined and empirically validated pathway to improve the quality of text-to-image generation via supervised fine-tuning. The approach emphasizes sample quality over scale and introduces a replicable methodology for dataset construction without reliance on proprietary tools. While the improvements are most notable in perceptual attributes such as aesthetics and image complexity, the framework also highlights the trade-offs that arise in fidelity, particularly for newer base models already optimized through internal SFT. Nevertheless, Alchemist establishes a new standard for general-purpose SFT datasets and offers a valuable resource for researchers and developers working to advance the output quality of generative vision models.

Check out the Paper and the Alchemist Dataset on Hugging Face. Thanks to the Yandex team for the thought leadership and resources for this article. The post Yandex Releases Alchemist: A Compact Supervised Fine-Tuning Dataset for Enhancing Text-to-Image T2I Model Quality appeared first on MarkTechPost.
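For readers who want to try the released assets, a minimal loading sketch is below. The Hugging Face repository IDs are placeholders assumed for illustration; check the official release page for the canonical names before running.

```python
# Minimal example of pulling the released assets from Hugging Face.
# The repository IDs below are assumed placeholders, not confirmed names.
from datasets import load_dataset
from diffusers import StableDiffusionPipeline

dataset = load_dataset("yandex/alchemist", split="train")      # assumed repo id
print(dataset[0].keys())  # expect an image plus a prompt-style caption

pipe = StableDiffusionPipeline.from_pretrained(
    "yandex/stable-diffusion-2.1-alchemist"                    # assumed repo id
)
image = pipe("a close-up studio photo of a glass chess set").images[0]
image.save("sample.png")
```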

Yandex Releases Alchemist: A Compact Supervised Fine-Tuning Dataset for Enhancing Text-to-Image T2I Model Quality Read Post »

AI, Committee, News, Uncategorized

VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control

Bridging Perception and Action in Robotics

Multimodal Large Language Models (MLLMs) hold promise for enabling machines, such as robotic arms and legged robots, to perceive their surroundings, interpret scenarios, and take meaningful actions. The integration of such intelligence into physical systems is advancing the field of robotics, pushing it toward autonomous machines that don’t just see and describe but also plan and move within their environments based on contextual understanding.

Despite the growing power of MLLMs, one persistent issue is their inability to combine vision, reasoning, and physical interaction into one cohesive system. Typically, models trained to understand images or text fall short when asked to control robots in real-world spaces. The core problem is that understanding a scene is fundamentally different from acting within it. Multimodal understanding focuses on perception and analysis, while physical control needs precise, real-time decision-making based on that perception. This disconnect creates bottlenecks when attempting to build agents that must simultaneously observe, reason, and act in varied environments.

Limitations of Prior VLA Models

Previous tools designed for robot control rely heavily on vision-language-action (VLA) models. These models train on extensive robotic datasets to convert visual observations into control signals. While some solutions try to preserve the reasoning capability of MLLMs by translating commands into text-based actions, they face difficulty in maintaining accuracy and adaptability during control tasks. For instance, VLAs often degrade in performance when applied to diverse or long-horizon robotic operations. Furthermore, due to the gap between image-based understanding and motion control, these tools usually fail to generalize across different environments or robot types.

Introducing VeBrain: A Unified Multimodal Framework

Researchers from Shanghai AI Laboratory, Tsinghua University, and SenseTime Research, in collaboration with multiple other institutes, have introduced a unified framework called Visual Embodied Brain (VeBrain). VeBrain reformulates robot control as text-based tasks within a 2D visual space, aligning it more closely with how MLLMs function. The framework integrates multimodal understanding, spatial reasoning, and robotic control into one structure. A specially designed robotic adapter processes the MLLM’s output into executable movement policies, enabling a single model to manage perception, reasoning, and control. VeBrain is also supported by a high-quality instruction dataset called VeBrain-600k, which combines over 600,000 samples of multimodal tasks, including robot motion and reasoning steps.

Technical Components: Architecture and Robotic Adapter

To carry out its functions, VeBrain utilizes an architecture based on Qwen2.5-VL, augmented with components that enable real-world control. The robotic adapter contains four key modules. The point tracker updates 2D keypoints as the robot’s view changes, ensuring accurate targeting. The movement controller transforms 2D keypoints into 3D movements by combining image data with depth maps. The skill executor maps predicted actions, such as “turn” or “grasp,” to pre-trained robotic skills. Lastly, the dynamic takeover module monitors failures or anomalies, handing control back to the MLLM when needed. These modules form a closed-loop system that makes decisions, acts, and self-corrects, allowing robots to operate effectively in diverse situations.
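To make the closed loop concrete, here is a schematic sketch of how the four adapter modules could hand off to one another. The classes are minimal stand-ins written for illustration and do not reproduce VeBrain's actual implementation.

```python
# Schematic sketch of the closed control loop formed by the four adapter modules.
# All classes are toy stand-ins for illustration, not VeBrain's real components.
from dataclasses import dataclass

@dataclass
class Keypoint:
    x: float
    y: float

class PointTracker:
    def update(self, kp: Keypoint, frame) -> Keypoint:
        return kp  # stand-in: would re-localize the 2D keypoint in the new frame

class MovementController:
    def to_3d(self, kp: Keypoint, depth_map) -> tuple[float, float, float]:
        z = depth_map[int(kp.y)][int(kp.x)]       # lift the 2D keypoint with depth
        return (kp.x, kp.y, z)

class SkillExecutor:
    SKILLS = {"grasp", "turn", "move_to"}
    def execute(self, skill: str, target: tuple[float, float, float]) -> bool:
        return skill in self.SKILLS               # stand-in: call a pre-trained skill

class DynamicTakeover:
    def should_replan(self, success: bool) -> bool:
        return not success                        # hand control back to the MLLM

def control_step(mllm_decision, frame, depth_map) -> bool:
    skill, kp = mllm_decision                     # e.g. ("grasp", Keypoint(320, 240))
    kp = PointTracker().update(kp, frame)
    target = MovementController().to_3d(kp, depth_map)
    ok = SkillExecutor().execute(skill, target)
    if DynamicTakeover().should_replan(ok):
        return False                              # MLLM re-plans from a new observation
    return True

# example call with toy inputs
done = control_step(("grasp", Keypoint(2, 1)), frame=None, depth_map=[[0.5] * 4] * 4)
```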
Performance Evaluation Across Multimodal and Robotic Benchmarks

VeBrain was evaluated across 13 multimodal and 5 spatial benchmarks. On MMVet, it achieved a 5.6% improvement over Qwen2.5-VL. It reached a CIDEr score of 101.5 on ScanQA and scored 83.7 on MMBench. On the VSI benchmark, it averaged 39.9, outperforming Qwen2.5-VL’s 35.9. In robotic evaluations, VeBrain showed 86.4% success across seven legged-robot tasks, significantly surpassing models like VLA and π0, which scored 32.1% and 31.4%, respectively. On robotic arm tasks, it achieved a success rate of 74.3%, outperforming others by up to 80%. These results show VeBrain’s ability to handle long-horizon and spatially complex control challenges with high reliability.

Conclusion

The research presents a compelling direction for embodied AI. The researchers succeeded in redefining robot control as a language task, enabling high-level reasoning and low-level action to coexist. The method bridges the gap between image understanding and robot execution in a way that is both functional and scalable. With a robust design and strong performance, VeBrain signals a shift toward more unified, intelligent robotic systems capable of operating autonomously across diverse tasks and environments.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. The post VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control appeared first on MarkTechPost.

VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control Read Post »

AI, Committee, News, Uncategorized

LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models

arXiv:2506.05385v1 Announce Type: new Abstract: Semantic role labeling (SRL) is a crucial task of natural language processing (NLP). Although generative decoder-based large language models (LLMs) have achieved remarkable success across various NLP tasks, they still lag behind state-of-the-art encoder-decoder (BERT-like) models in SRL. In this work, we seek to bridge this gap by equipping LLMs for SRL with two mechanisms: (a) retrieval-augmented generation and (b) self-correction. The first mechanism enables LLMs to leverage external linguistic knowledge such as predicate and argument structure descriptions, while the second allows LLMs to identify and correct inconsistent SRL outputs. We conduct extensive experiments on three widely-used benchmarks of SRL (CPB1.0, CoNLL-2009, and CoNLL-2012). Results demonstrate that our method achieves state-of-the-art performance in both Chinese and English, marking the first successful application of LLMs to surpass encoder-decoder approaches in SRL.
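The abstract does not spell out the prompting loop, but the two mechanisms can be sketched roughly as follows. The frame description, prompts, model name, and chat() helper are illustrative assumptions rather than the authors' setup.

```python
# Rough sketch of the two mechanisms named in the abstract: retrieval of
# predicate/argument descriptions, then a self-correction pass over the draft.
# Prompts, model name, and the toy frame store are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# toy stand-in for an external linguistic knowledge base (e.g. frame files)
FRAME_DESCRIPTIONS = {
    "give": "give.01: A0 = giver, A1 = thing given, A2 = recipient",
}

def label_roles(sentence: str, predicate: str) -> str:
    frame = FRAME_DESCRIPTIONS.get(predicate, "no description found")
    # Mechanism (a): retrieval-augmented generation with the frame description.
    draft = chat(
        f"Frame description: {frame}\n"
        f"Sentence: {sentence}\n"
        f"Label the arguments of '{predicate}' in PropBank style (A0, A1, ...)."
    )
    # Mechanism (b): self-correction, checking the draft against the description.
    return chat(
        f"Frame description: {frame}\nSentence: {sentence}\n"
        f"Proposed labeling:\n{draft}\n"
        "Check for overlapping spans, missing core arguments, or roles inconsistent "
        "with the frame description, and return a corrected labeling."
    )

print(label_roles("Mary gave John a book.", "give"))
```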

LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models Read Post »

AI, Committee, News, Uncategorized

Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

arXiv:2506.05412v1 Announce Type: cross Abstract: Gaze-referential inference–the ability to infer what others are looking at–is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos taken with manipulated difficulty and variability, comparing performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even respond with each choice almost equally frequently. Are they randomly guessing? Although most VLMs struggle, when we zoom in on five of the top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across different prompts and scene objects. These behavioral features cannot be explained by considering them as random guessers. Instead, they likely use a combination of heuristics and guessing such that their performance is subject to the task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, but the potential remains.

Can Vision Language Models Infer Human Gaze Direction? A Controlled Study Read Post »
