YouZum


Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint

arXiv:2509.06795v1 Announce Type: new Abstract: Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance various abilities of Large Language Models (LLMs). However, prior studies have shown that IFT can significantly compromise LLMs’ safety, particularly their ability to refuse malicious instructions, raising serious concerns. Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior. Building on this insight, our study reveals that the r-direction tends to drift during training, which we identify as one of the causes of the associated safety risks. To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample’s hidden state onto the r-direction. Our initial analysis shows that applying an appropriate constraint can effectively mitigate the refusal direction drift and the associated safety risks, but remains limited by an overall performance barrier. To overcome this barrier, informed by our observation of sharp early-stage drift and a data-driven perspective, we introduce a warm-up strategy that emphasizes strong early-stage constraints and broadens the data distribution to strengthen constraint signals, leading to an enhanced ProCon method. Experimental results across various datasets, scenarios, and LLMs demonstrate that our method can significantly mitigate safety risks posed by IFT while preserving task performance gains. Even compared with strong baselines, our method consistently delivers superior overall performance. Crucially, our analysis indicates that ProCon can contribute to stabilizing the r-direction during training, and such an interpretability-driven exploration of LLMs’ internal mechanisms lays a solid foundation for future safety research.
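The abstract describes the projection-constrained loss only in words. As a rough illustration of what such a term could look like, the sketch below (PyTorch) penalizes deviation of each hidden state's projection onto a fixed refusal direction from a reference value; the names, the reference-projection choice, and the weight lam are assumptions, not the paper's exact formulation.

import torch

def projection_constraint_loss(hidden, r_dir, ref_proj, lam=0.1):
    """Hypothetical sketch: keep each sample's projection onto the refusal
    direction close to a reference value measured before fine-tuning."""
    r = r_dir / r_dir.norm()                  # unit refusal direction
    proj = hidden @ r                         # (batch,) projection magnitudes
    return lam * (proj - ref_proj).pow(2).mean()

# total_loss = task_loss + projection_constraint_loss(h, r_dir, ref_proj)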


Antidistillation Sampling

arXiv:2504.13146v4 Announce Type: replace-cross Abstract: Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model’s next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model’s practical utility. For further details, see https://antidistillation.com.
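The abstract stays at a high level; the sketch below only illustrates the generic shape of sampling from a modified next-token distribution. How the per-token adjustment (poison_scores) is actually computed is the paper's contribution and is not reproduced here.

import torch

def antidistillation_sample(logits, poison_scores, strength=1.0):
    """Illustrative only: shift next-token logits by a per-token 'poison'
    score, then sample from the adjusted distribution."""
    adjusted = logits + strength * poison_scores   # both shaped (vocab,)
    probs = torch.softmax(adjusted, dim=-1)
    return torch.multinomial(probs, num_samples=1)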


Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare

arXiv:2509.04482v1 Announce Type: new Abstract: Reliable abstention is critical for retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women’s health, where incorrect answers can lead to harm. We present an energy-based model (EBM) that learns a smooth energy landscape over a dense semantic corpus of 2.6M guideline-derived questions, enabling the system to decide when to generate or abstain. We benchmark the EBM against a calibrated softmax baseline and a k-nearest neighbour (kNN) density heuristic across both easy and hard abstention splits, where hard cases are semantically challenging near-distribution queries. The EBM achieves superior abstention performance on semantically hard cases, reaching AUROC 0.961 versus 0.950 for softmax, while also reducing FPR@95 (0.235 vs 0.331). On easy negatives, performance is comparable across methods, but the EBM’s advantage becomes most pronounced in safety-critical hard distributions. A comprehensive ablation with controlled negative sampling and fair data exposure shows that robustness stems primarily from the energy scoring head, while the inclusion or exclusion of specific negative types (hard, easy, mixed) sharpens decision boundaries but is not essential for generalisation to hard cases. These results demonstrate that energy-based abstention scoring offers a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.
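As a concrete picture of how an energy score can drive abstention, here is a minimal sketch; the threshold calibration and the form of the EBM head are assumptions rather than the paper's exact procedure.

def should_abstain(energy_score, threshold):
    """Generic abstention rule: refuse to answer when the query's energy
    under the trained EBM exceeds a threshold calibrated on a validation
    split (e.g., chosen to hit a target false-positive rate)."""
    return energy_score > threshold

# energy = ebm_head(query_embedding)   # low energy = in-distribution query
# answer = None if should_abstain(energy, tau) else rag_pipeline(query)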


Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework

arXiv:2509.04770v1 Announce Type: new Abstract: Accurately answering complex questions has consistently been a significant challenge for Large Language Models (LLMs). To address this challenge, this paper proposes a multi-hop question decomposition method for complex questions, building upon research within the MQUAKE framework. Utilizing the LLAMA3 model, we systematically investigate the impact of multi-hop question decomposition within knowledge graphs on model comprehension and reasoning accuracy, both before and after model training. In our experiments, we systematically partitioned and converted the MQUAKE-T dataset into two distinct formats: a single-hop dataset designed for directly answering complex questions, and a multi-hop dataset constructed using the multi-hop question decomposition method. We then fine-tuned the LLAMA3 model on these datasets and conducted inference tests. Our results demonstrate that, without fine-tuning the LLM, the prediction performance based on the multi-hop question decomposition method significantly outperforms the method of directly answering complex questions. After fine-tuning using the LoRA (Low-Rank Adaptation) method, the performance of both approaches improved compared to the untrained baseline. Crucially, the method utilizing multi-hop decomposition consistently maintained its superiority. These findings validate the effectiveness of the multi-hop decomposition method both before and after training, demonstrating its capability to effectively enhance the LLM’s ability to answer complex questions.
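For intuition, a toy sketch of converting one multi-hop item into per-hop training pairs is shown below; the field names are hypothetical and do not reflect the actual MQUAKE-T schema.

def decompose_to_hops(hop_chain):
    """Toy illustration of the multi-hop format: one (sub-question, answer)
    training pair per hop instead of a single complex QA pair."""
    examples = []
    for hop in hop_chain:                      # e.g. [{"q": ..., "a": ...}, ...]
        examples.append({"instruction": hop["q"], "output": hop["a"]})
    return examples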


DecMetrics: Structured Claim Decomposition Scoring for Factually Consistent LLM Outputs

arXiv:2509.04483v1 Announce Type: new Abstract: Claim decomposition plays a crucial role in the fact-checking process by breaking down complex claims into simpler atomic components and identifying their unfactual elements. Despite its importance, current research primarily focuses on generative methods for decomposition, with insufficient emphasis on evaluating the quality of these decomposed atomic claims. To bridge this gap, we introduce DecMetrics, which comprises three new metrics: COMPLETENESS, CORRECTNESS, and SEMANTIC ENTROPY, designed to automatically assess the quality of claims produced by decomposition models. Utilizing these metrics, we develop a lightweight claim decomposition model, optimizing its performance through the integration of these metrics as a reward function. Through automatic evaluation, our approach aims to set a benchmark for claim decomposition, enhancing both the reliability and effectiveness of fact-checking systems.
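The abstract does not give the reward formula; the sketch below shows one plausible way the three metrics could be aggregated into a scalar reward, with the weights and sign conventions as assumptions.

def decomposition_reward(completeness, correctness, semantic_entropy,
                         weights=(1.0, 1.0, 1.0)):
    """Hypothetical aggregation: higher completeness and correctness score
    better, while higher semantic entropy (redundancy/ambiguity among the
    decomposed claims) is penalized."""
    w1, w2, w3 = weights
    return w1 * completeness + w2 * correctness - w3 * semantic_entropy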


Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

arXiv:2509.05060v1 Announce Type: new Abstract: We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieve competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
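The core quantity is the entropy of a monolingual model's predictions on text from another language; a minimal sketch of computing that per-language feature is shown below. How these features are assembled into the final embedding (and made end-to-end learnable) follows the paper, not this snippet.

import torch

def entropy_feature(logits):
    """Average next-token entropy of a monolingual LM evaluated on text
    from another language, used as one scalar feature for that language."""
    probs = torch.softmax(logits, dim=-1)                    # (seq, vocab)
    ent = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)    # per-token entropy
    return ent.mean().item()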


Language-Driven Hierarchical Task Structures as Explicit World Models for Multi-Agent Learning

arXiv:2509.04731v1 Announce Type: cross Abstract: The convergence of Language models, Agent models, and World models represents a critical frontier for artificial intelligence. While recent progress has focused on scaling Language and Agent models, the development of sophisticated, explicit World Models remains a key bottleneck, particularly for complex, long-horizon multi-agent tasks. In domains such as robotic soccer, agents trained via standard reinforcement learning in high-fidelity but structurally-flat simulators often fail due to intractable exploration spaces and sparse rewards. This position paper argues that the next frontier in developing capable agents lies in creating environments that possess an explicit, hierarchical World Model. We contend that this is best achieved through hierarchical scaffolding, where complex goals are decomposed into structured, manageable subgoals. Drawing evidence from a systematic review of 2024 research in multi-agent soccer, we identify a clear and decisive trend towards integrating symbolic and hierarchical methods with multi-agent reinforcement learning (MARL). These approaches implicitly or explicitly construct a task-based world model to guide agent learning. We then propose a paradigm shift: leveraging Large Language Models to dynamically generate this hierarchical scaffold, effectively using language to structure the World Model on the fly. This language-driven world model provides an intrinsic curriculum, dense and meaningful learning signals, and a framework for compositional learning, enabling Agent Models to acquire sophisticated, strategic behaviors with far greater sample efficiency. By building environments with explicit, language-configurable task layers, we can bridge the gap between low-level reactive behaviors and high-level strategic team play, creating a powerful and generalizable framework for training the next generation of intelligent agents.


Alibaba AI Unveils Qwen3-Max Preview: A Trillion-Parameter Qwen Model with Super Fast Speed and Quality

Alibaba’s Qwen Team unveiled Qwen3-Max-Preview (Instruct), a new flagship large language model with over one trillion parameters, their largest to date. It is accessible through Qwen Chat, the Alibaba Cloud API, OpenRouter, and as the default model in Hugging Face’s AnyCoder tool.

How does it fit in today’s LLM landscape? This milestone comes at a time when the industry is trending toward smaller, more efficient models. Alibaba’s decision to move upward in scale marks a deliberate strategic choice, highlighting both its technical capabilities and its commitment to trillion-parameter research.

How large is Qwen3-Max and what are its context limits?
- Parameters: >1 trillion.
- Context window: up to 262,144 tokens (258,048 input, 32,768 output).
- Efficiency feature: includes context caching to speed up multi-turn sessions.

How does Qwen3-Max perform against other models? Benchmarks show it outperforms Qwen3-235B-A22B-2507 and competes strongly with Claude Opus 4, Kimi K2, and Deepseek-V3.1 across SuperGPQA, AIME25, LiveCodeBench v6, Arena-Hard v2, and LiveBench.

What is the pricing structure for usage? Alibaba Cloud applies tiered token-based pricing:
- 0–32K tokens: $0.861/million input, $3.441/million output
- 32K–128K tokens: $1.434/million input, $5.735/million output
- 128K–252K tokens: $2.151/million input, $8.602/million output
The model is cost-efficient for smaller tasks but scales up significantly in price for long-context workloads (see the cost sketch below).

How does the closed-source approach impact adoption? Unlike earlier Qwen releases, this model is not open-weight. Access is restricted to APIs and partner platforms. This choice highlights Alibaba’s commercialization focus but may slow broader adoption in research and open-source communities.

Key Takeaways
- First trillion-parameter Qwen model: Qwen3-Max surpasses 1T parameters, making it Alibaba’s largest and most advanced LLM to date.
- Ultra-long context handling: supports 262K tokens with caching, enabling extended document and session processing beyond most commercial models.
- Competitive benchmark performance: outperforms Qwen3-235B and competes with Claude Opus 4, Kimi K2, and Deepseek-V3.1 on reasoning, coding, and general tasks.
- Emergent reasoning despite design: though not marketed as a reasoning model, early results show structured reasoning capabilities on complex tasks.
- Closed-source, tiered pricing model: available via APIs with token-based pricing; economical for small tasks but costly at higher context usage, limiting accessibility.

Summary: Qwen3-Max-Preview sets a new scale benchmark in commercial LLMs. Its trillion-parameter design, 262K context length, and strong benchmark results highlight Alibaba’s technical depth, yet the model’s closed-source release and steep tiered pricing raise questions about broader accessibility.

The post Alibaba AI Unveils Qwen3-Max Preview: A Trillion-Parameter Qwen Model with Super Fast Speed and Quality appeared first on MarkTechPost.
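To make the tiered pricing concrete, here is a small cost sketch. It assumes the tier is selected by input size and that all tokens of a request are billed at that tier's rates; Alibaba Cloud's exact billing rules may differ.

def qwen3_max_cost(input_tokens, output_tokens):
    """Rough USD cost estimate from the published per-tier rates."""
    if input_tokens <= 32_000:
        in_rate, out_rate = 0.861, 3.441      # USD per million tokens
    elif input_tokens <= 128_000:
        in_rate, out_rate = 1.434, 5.735
    else:
        in_rate, out_rate = 2.151, 8.602
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 100K-token prompt with a 2K-token reply costs roughly $0.155.
print(round(qwen3_max_cost(100_000, 2_000), 3))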


Implementing DeepSpeed for Scalable Transformers: Advanced Training with Gradient Checkpointing and Parallelism

In this advanced DeepSpeed tutorial, we provide a hands-on walkthrough of cutting-edge optimization techniques for training large language models efficiently. By combining ZeRO optimization, mixed-precision training, gradient accumulation, and advanced DeepSpeed configurations, the tutorial demonstrates how to maximize GPU memory utilization, reduce training overhead, and enable scaling of transformer models in resource-constrained environments, such as Colab. Alongside model creation and training, it also covers performance monitoring, inference optimization, checkpointing, and benchmarking different ZeRO stages, providing practitioners with both theoretical insights and practical code to accelerate model development.

import subprocess
import sys
import os
import json
import time
from pathlib import Path

def install_dependencies():
    """Install required packages for DeepSpeed in Colab"""
    print("Installing DeepSpeed and dependencies...")
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "torch", "torchvision", "torchaudio",
        "--index-url", "https://download.pytorch.org/whl/cu118"
    ])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "deepspeed"])
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "transformers", "datasets", "accelerate", "wandb"
    ])
    print("Installation complete!")

install_dependencies()

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import deepspeed
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
from typing import Dict, Any
import argparse

We set up our Colab environment by installing PyTorch with CUDA support, DeepSpeed, and essential libraries like Transformers, Datasets, Accelerate, and Weights & Biases. We ensure everything is ready so we can smoothly build and train models with DeepSpeed.

class SyntheticTextDataset(Dataset):
    """Synthetic dataset for demonstration purposes"""
    def __init__(self, size: int = 1000, seq_length: int = 512, vocab_size: int = 50257):
        self.size = size
        self.seq_length = seq_length
        self.vocab_size = vocab_size
        self.data = torch.randint(0, vocab_size, (size, seq_length))

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return {
            'input_ids': self.data[idx],
            'labels': self.data[idx].clone()
        }

We create a SyntheticTextDataset where we generate random token sequences to mimic real text data. We use these sequences as both inputs and labels, allowing us to quickly test DeepSpeed training without relying on a large external dataset.
class AdvancedDeepSpeedTrainer:
    """Advanced DeepSpeed trainer with multiple optimization techniques"""
    def __init__(self, model_config: Dict[str, Any], ds_config: Dict[str, Any]):
        self.model_config = model_config
        self.ds_config = ds_config
        self.model = None
        self.engine = None
        self.tokenizer = None

    def create_model(self):
        """Create a GPT-2 style model for demonstration"""
        print("Creating model...")
        config = GPT2Config(
            vocab_size=self.model_config['vocab_size'],
            n_positions=self.model_config['seq_length'],
            n_embd=self.model_config['hidden_size'],
            n_layer=self.model_config['num_layers'],
            n_head=self.model_config['num_heads'],
            resid_pdrop=0.1,
            embd_pdrop=0.1,
            attn_pdrop=0.1,
        )
        self.model = GPT2LMHeadModel(config)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.tokenizer.pad_token = self.tokenizer.eos_token
        print(f"Model parameters: {sum(p.numel() for p in self.model.parameters()):,}")
        return self.model

    def create_deepspeed_config(self):
        """Create comprehensive DeepSpeed configuration"""
        return {
            "train_batch_size": self.ds_config['train_batch_size'],
            "train_micro_batch_size_per_gpu": self.ds_config['micro_batch_size'],
            "gradient_accumulation_steps": self.ds_config['gradient_accumulation_steps'],
            "zero_optimization": {
                "stage": self.ds_config['zero_stage'],
                "allgather_partitions": True,
                "allgather_bucket_size": 5e8,
                "overlap_comm": True,
                "reduce_scatter": True,
                "reduce_bucket_size": 5e8,
                "contiguous_gradients": True,
                "cpu_offload": self.ds_config.get('cpu_offload', False)
            },
            "fp16": {
                "enabled": True,
                "loss_scale": 0,
                "loss_scale_window": 1000,
                "initial_scale_power": 16,
                "hysteresis": 2,
                "min_loss_scale": 1
            },
            "optimizer": {
                "type": "AdamW",
                "params": {
                    "lr": self.ds_config['learning_rate'],
                    "betas": [0.9, 0.999],
                    "eps": 1e-8,
                    "weight_decay": 0.01
                }
            },
            "scheduler": {
                "type": "WarmupLR",
                "params": {
                    "warmup_min_lr": 0,
                    "warmup_max_lr": self.ds_config['learning_rate'],
                    "warmup_num_steps": 100
                }
            },
            "gradient_clipping": 1.0,
            "wall_clock_breakdown": True,
            "memory_breakdown": True,
            "tensorboard": {
                "enabled": True,
                "output_path": "./logs/",
                "job_name": "deepspeed_advanced_tutorial"
            }
        }

    def initialize_deepspeed(self):
        """Initialize DeepSpeed engine"""
        print("Initializing DeepSpeed...")
        parser = argparse.ArgumentParser()
        parser.add_argument('--local_rank', type=int, default=0)
        args = parser.parse_args([])
        self.engine, optimizer, _, lr_scheduler = deepspeed.initialize(
            args=args,
            model=self.model,
            config=self.create_deepspeed_config()
        )
        print(f"DeepSpeed engine initialized with ZeRO stage {self.ds_config['zero_stage']}")
        return self.engine

    def train_step(self, batch: Dict[str, torch.Tensor]) -> Dict[str, float]:
        """Perform a single training step with DeepSpeed optimizations"""
        input_ids = batch['input_ids'].to(self.engine.device)
        labels = batch['labels'].to(self.engine.device)
        outputs = self.engine(input_ids=input_ids, labels=labels)
        loss = outputs.loss
        self.engine.backward(loss)
        self.engine.step()
        return {
            'loss': loss.item(),
            'lr': self.engine.lr_scheduler.get_last_lr()[0] if self.engine.lr_scheduler else 0
        }

    def train(self, dataloader: DataLoader, num_epochs: int = 2):
        """Complete training loop with monitoring"""
        print(f"Starting training for {num_epochs} epochs...")
        self.engine.train()
        total_steps = 0
        for epoch in range(num_epochs):
            epoch_loss = 0.0
            epoch_steps = 0
            print(f"\nEpoch {epoch + 1}/{num_epochs}")
            for step, batch in enumerate(dataloader):
                start_time = time.time()
                metrics = self.train_step(batch)
                epoch_loss += metrics['loss']
                epoch_steps += 1
                total_steps += 1
                if step % 10 == 0:
                    step_time = time.time() - start_time
                    print(f"Step {step:4d} | Loss: {metrics['loss']:.4f} | "
                          f"LR: {metrics['lr']:.2e} | Time: {step_time:.3f}s")
                if step % 20 == 0 and hasattr(self.engine, 'monitor'):
                    self.log_memory_stats()
                if step >= 50:
                    break
            avg_loss = epoch_loss / epoch_steps
            print(f"Epoch {epoch + 1} completed | Average Loss: {avg_loss:.4f}")
        print("Training completed!")

    def log_memory_stats(self):
        """Log GPU memory statistics"""
        if torch.cuda.is_available():
            allocated = torch.cuda.memory_allocated() / 1024**3
            reserved = torch.cuda.memory_reserved() / 1024**3
            print(f"GPU Memory - Allocated: {allocated:.2f}GB | Reserved: {reserved:.2f}GB")

    def save_checkpoint(self, path: str):
        """Save model checkpoint using DeepSpeed"""
        print(f"Saving checkpoint to {path}")
        self.engine.save_checkpoint(path)

    def demonstrate_inference(self, text: str = "The future of AI is"):
        """Demonstrate optimized inference with DeepSpeed"""
        print(f"\nRunning inference with prompt: '{text}'")
        inputs = self.tokenizer.encode(text, return_tensors='pt').to(self.engine.device)
        self.engine.eval()
        with torch.no_grad():
            outputs = self.engine.module.generate(
                inputs,
                max_length=inputs.shape[1] + 50,
                num_return_sequences=1,
                temperature=0.8,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )
        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Generated text: {generated_text}")
        self.engine.train()

We build an end-to-end trainer that creates a GPT-2 model, sets a DeepSpeed config (ZeRO, FP16, AdamW, warmup scheduler, tensorboard), and initializes the engine. We then run efficient training steps with logging and memory statistics, save checkpoints, and demonstrate inference to verify optimization and generation in one place.
def run_advanced_tutorial():
    """Main function to run the advanced DeepSpeed tutorial"""
    print("Advanced DeepSpeed Tutorial Starting...")
    print("=" * 60)
    model_config = {
        'vocab_size': 50257,
        'seq_length': 512,
        'hidden_size': 768,
        'num_layers': 6,
        'num_heads': 12
    }
    ds_config = {
        'train_batch_size': 16,
        'micro_batch_size': 4,
        'gradient_accumulation_steps': 4,
        'zero_stage': 2,
        'learning_rate': 1e-4,
        'cpu_offload': False
    }
    print("Configuration:")
    print(f"  Model size: ~{sum(np.prod(shape) for shape in [[model_config['vocab_size'], model_config['hidden_size']], [model_config['hidden_size'], model_config['hidden_size']] * model_config['num_layers']]) / 1e6:.1f}M parameters")
    print(f"  ZeRO Stage: {ds_config['zero_stage']}")
    print(f"  Batch size: {ds_config['train_batch_size']}")
    trainer = AdvancedDeepSpeedTrainer(model_config, ds_config)
    model = trainer.create_model()
    engine = trainer.initialize_deepspeed()
    print("\nCreating synthetic dataset...")
    dataset = SyntheticTextDataset(
        size=200,
        seq_length=model_config['seq_length'],
        vocab_size=model_config['vocab_size']
    )
    dataloader = DataLoader(
        dataset,
        batch_size=ds_config['micro_batch_size'],
        shuffle=True
    )
    print("\nPre-training memory stats:")
    trainer.log_memory_stats()
    trainer.train(dataloader, num_epochs=2)
    print("\nPost-training memory stats:")
    trainer.log_memory_stats()
    trainer.demonstrate_inference("DeepSpeed enables efficient training of")
    checkpoint_path = "./deepspeed_checkpoint"
    trainer.save_checkpoint(checkpoint_path)
    demonstrate_zero_stages()
    demonstrate_memory_optimization()
    print("\nTutorial completed successfully!")
    print("Key DeepSpeed features demonstrated:")
    print("  ZeRO optimization for memory efficiency")
    print("  Mixed precision training (FP16)")
    print("  Gradient accumulation")
    print("  Learning
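The tutorial's title also highlights gradient checkpointing, which is not visible in the code shown above. A minimal, hedged sketch of one way to enable it for this kind of setup, using the Hugging Face gradient_checkpointing_enable API together with DeepSpeed's activation_checkpointing config block, is:

from transformers import GPT2Config, GPT2LMHeadModel

# Sketch only: enable activation recomputation on the HF model to trade
# compute for memory; this mirrors, but is not part of, the tutorial code.
model = GPT2LMHeadModel(GPT2Config(n_layer=6, n_head=12, n_embd=768))
model.gradient_checkpointing_enable()

# Corresponding block that could be merged into create_deepspeed_config():
activation_checkpointing_config = {
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": False
    }
}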


Hugging Face Open-Sourced FineVision: A New Multimodal Dataset with 24 Million Samples for Training Vision-Language Models (VLMs)

Hugging Face has just released FineVision, an open multimodal dataset designed to set a new standard for Vision-Language Models (VLMs). With 17.3 million images, 24.3 million samples, 88.9 million question-answer turns, and nearly 10 billion answer tokens, FineVision positions itself as one of the largest and most structured publicly available VLM training datasets. FineVision aggregates 200+ sources into a unified format, rigorously filtered for duplicates and benchmark contamination. Rated systematically across multiple quality dimensions, the dataset enables researchers and developers to construct robust training mixtures while minimizing data leakage.

Why is FineVision important for VLM training? Most state-of-the-art VLMs rely on proprietary datasets, limiting reproducibility and accessibility for the broader research community. FineVision addresses this gap by:
- Scale and coverage: 5 TB of curated data across 9 categories, including General VQA, OCR QA, Chart & Table reasoning, Science, Captioning, Grounding & Counting, and GUI navigation.
- Benchmark gains: across 11 widely used benchmarks (e.g., AI2D, ChartQA, DocVQA, ScienceQA, OCRBench), models trained on FineVision outperform alternatives by significant margins: up to 46.3% over LLaVA, 40.7% over Cauldron, and 12.1% over Cambrian.
- New skill domains: FineVision introduces data for emerging tasks like GUI navigation, pointing, and counting, expanding the capabilities of VLMs beyond conventional captioning and VQA.

How was FineVision built? The curation pipeline followed a three-step process:
1. Collection and augmentation: over 200 publicly available image-text datasets were gathered, missing modalities (e.g., text-only data) were reformatted into QA pairs, and underrepresented domains, such as GUI data, were supplemented through targeted collection.
2. Cleaning: oversized QA pairs (>8192 tokens) were removed, large images were resized to a maximum of 2048 px while preserving aspect ratio, and corrupted samples were discarded.
3. Quality rating: using Qwen3-32B and Qwen2.5-VL-32B-Instruct as judges, every QA pair was rated on four axes: text formatting quality, question-answer relevance, visual dependency, and image-question correspondence. These ratings enable selective training mixtures, though ablations show that retaining all samples yields the best performance, even when lower-rated samples are included.

Comparative analysis: FineVision vs. existing open datasets

Dataset        Images   Samples  Turns   Tokens  Leakage  Perf. drop after deduplication
Cauldron       2.0M     1.8M     27.8M   0.3B    3.05%    -2.39%
LLaVA-Vision   2.5M     3.9M     9.1M    1.0B    2.15%    -2.72%
Cambrian-7M    5.4M     7.0M     12.2M   0.8B    2.29%    -2.78%
FineVision     17.3M    24.3M    88.9M   9.5B    1.02%    -1.45%

FineVision is not only one of the largest but also the least contaminated dataset, with just about 1% overlap with benchmark test sets. This ensures minimal data leakage and reliable evaluation performance.

Performance insights
- Model setup: ablations were conducted using nanoVLM (460M parameters), combining SmolLM2-360M-Instruct as the language backbone and SigLIP2-Base-512 as the vision encoder.
- Training efficiency: on 32 NVIDIA H100 GPUs, one full epoch (12k steps) takes ~20 hours.
- Performance trends: FineVision models improve steadily with exposure to diverse data, overtaking baselines after ~12k steps.
- Deduplication experiments confirm FineVision's low leakage compared to Cauldron, LLaVA, and Cambrian.
- Multilingual subsets, even when the backbone is monolingual, show slight performance gains, suggesting diversity outweighs strict alignment.
- Attempts at multi-stage training (two or 2.5 stages) did not yield consistent benefits, reinforcing that scale plus diversity matters more than training heuristics.

Why does FineVision set a new standard?
- +20% average performance boost: outperforms all existing open datasets across 10+ benchmarks.
- Unprecedented scale: 17M+ images, 24M+ samples, 10B tokens.
- Skill expansion: GUI navigation, counting, pointing, and document reasoning included.
- Lowest data leakage: ~1% contamination, compared to 2-3% in other datasets.
- Fully open source: available on the Hugging Face Hub for immediate use via the datasets library (see the loading sketch below).

Conclusion: FineVision marks a significant advancement in open multimodal datasets. Its large scale, systematic curation, and transparent quality assessments create a reproducible and extensible foundation for training state-of-the-art Vision-Language Models. By reducing dependence on proprietary resources, it enables researchers and developers to build competitive systems and accelerate progress in areas such as document analysis, visual reasoning, and agentic multimodal tasks.

The post Hugging Face Open-Sourced FineVision: A New Multimodal Dataset with 24 Million Samples for Training Vision-Language Models (VLMs) appeared first on MarkTechPost.
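As a quick-start sketch for the "available via the datasets library" point above: the repository id below is an assumption, so check the dataset card on the Hugging Face Hub for the canonical identifier before use.

from datasets import load_dataset

# Hypothetical repo id; streaming avoids downloading the full ~5 TB corpus.
ds = load_dataset("HuggingFaceM4/FineVision", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())   # expected: image(s), question-answer turns, source metadata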

