AI Archives - Page 30 of 102

Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework

admin NU / September 8, 2025

arXiv:2509.04770v1 Announce Type: new Abstract: Accurately answering complex questions has consistently been a significant challenge for Large Language Models (LLMs). To address this, this paper proposes a multi-hop question decomposition method for complex questions, building upon research within the MQUAKE framework. Utilizing the LLAMA3 model, we systematically investigate the impact of multi-hop question decomposition within knowledge graphs on model comprehension and reasoning accuracy, both before and after model training. In our experiments, we systematically partitioned and converted the MQUAKE-T dataset into two distinct formats: a single-hop dataset designed for directly answering complex questions, and a multi-hop dataset constructed using the multi-hop question decomposition method. We then fine-tuned the LLAMA3 model on these datasets and conducted inference tests. Our results demonstrate that, without fine-tuning the LLM, the prediction performance based on the multi-hop question decomposition method significantly outperforms the method of directly answering complex questions. After fine-tuning using the LoRA (Low-Rank Adaptation) method, the performance of both approaches improved compared to the untrained baseline. Crucially, the method utilizing multi-hop decomposition consistently maintained its superiority. These findings validate the effectiveness of the multi-hop decomposition method both before and after training, demonstrating its capability to effectively enhance the LLM’s ability to answer complex questions.
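To make multi-hop question decomposition concrete, here is a minimal Python sketch. It is not the paper's code: a toy dictionary of (subject, relation) → object facts stands in for the knowledge graph, and a dictionary lookup stands in for the LLAMA3 call that would answer each single-hop sub-question. The facts, templates, and function names are all illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy knowledge graph: (subject, relation) -> object.
KG: Dict[Tuple[str, str], str] = {
    ("The Beatles", "origin city"): "Liverpool",
    ("Liverpool", "country"): "United Kingdom",
    ("United Kingdom", "capital"): "London",
}

# Hypothetical templates that phrase each hop as a single-hop sub-question.
TEMPLATES: Dict[str, str] = {
    "origin city": "In which city did {} form?",
    "country": "Which country is {} in?",
    "capital": "What is the capital of {}?",
}


def answer_single_hop(subject: str, relation: str) -> Optional[str]:
    """Stand-in for one model call on a single-hop sub-question."""
    return KG.get((subject, relation))


def answer_multi_hop(
    subject: str, relations: List[str]
) -> Tuple[Optional[str], List[Tuple[str, Optional[str]]]]:
    """Answer a multi-hop question by chaining single-hop sub-questions.

    The answer to hop i becomes the subject of hop i + 1. Returns the final
    answer (None if any hop fails) and the trace of (sub-question, answer).
    """
    trace: List[Tuple[str, Optional[str]]] = []
    current = subject
    for relation in relations:
        question = TEMPLATES[relation].format(current)
        answer = answer_single_hop(current, relation)
        trace.append((question, answer))
        if answer is None:
            return None, trace
        current = answer
    return current, trace


if __name__ == "__main__":
    # "What is the capital of the country whose city The Beatles formed in?"
    final, steps = answer_multi_hop("The Beatles", ["origin city", "country", "capital"])
    for q, a in steps:
        print(f"{q} -> {a}")
    print("Final answer:", final)
```

Because every intermediate answer is explicit in the trace, a wrong hop can be located and inspected, which a single direct answer to the composite question does not allow.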


AI, Committee, News, Uncategorized

Language-Driven Hierarchical Task Structures as Explicit World Models for Multi-Agent Learning

admin NU / September 8, 2025

arXiv:2509.04731v1 Announce Type: cross Abstract: The convergence of Language models, Agent models, and World models represents a critical frontier for artificial intelligence. While recent progress has focused on scaling Language and Agent models, the development of sophisticated, explicit World Models remains a key bottleneck, particularly for complex, long-horizon multi-agent tasks. In domains such as robotic soccer, agents trained via standard reinforcement learning in high-fidelity but structurally-flat simulators often fail due to intractable exploration spaces and sparse rewards. This position paper argues that the next frontier in developing capable agents lies in creating environments that possess an explicit, hierarchical World Model. We contend that this is best achieved through hierarchical scaffolding, where complex goals are decomposed into structured, manageable subgoals. Drawing evidence from a systematic review of 2024 research in multi-agent soccer, we identify a clear and decisive trend towards integrating symbolic and hierarchical methods with multi-agent reinforcement learning (MARL). These approaches implicitly or explicitly construct a task-based world model to guide agent learning. We then propose a paradigm shift: leveraging Large Language Models to dynamically generate this hierarchical scaffold, effectively using language to structure the World Model on the fly. This language-driven world model provides an intrinsic curriculum, dense and meaningful learning signals, and a framework for compositional learning, enabling Agent Models to acquire sophisticated, strategic behaviors with far greater sample efficiency. By building environments with explicit, language-configurable task layers, we can bridge the gap between low-level reactive behaviors and high-level strategic team play, creating a powerful and generalizable framework for training the next generation of intelligent agents.
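The abstract's core mechanism, decomposing a sparse long-horizon goal into ordered subgoals that supply denser learning signals, can be illustrated with a small reward-shaping sketch in Python. This is an assumption-laden toy, not the paper's method: the paper proposes that an LLM generate the scaffold dynamically, whereas here the subgoals, the state keys (`has_ball`, `ball_x`, `shot_taken`, `goal_scored`), and the bonus values are hand-written, hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]


@dataclass
class Subgoal:
    """One node of a (hypothetical) hierarchical task scaffold."""
    name: str
    achieved: Callable[[State], bool]
    bonus: float


def shaped_reward(
    state: State, scaffold: List[Subgoal], progress: int, sparse_reward: float
) -> Tuple[float, int]:
    """Add a bonus the first time the *next* subgoal in order is achieved.

    `progress` counts subgoals completed so far, so bonuses are paid once and
    in sequence, acting as a simple curriculum on top of the sparse reward.
    """
    reward = sparse_reward
    if progress < len(scaffold) and scaffold[progress].achieved(state):
        reward += scaffold[progress].bonus
        progress += 1
    return reward, progress


def run_episode(states: List[State], scaffold: List[Subgoal]) -> Tuple[float, int]:
    """Sum shaped rewards over a trajectory; the sparse reward is 1.0 per goal."""
    total, progress = 0.0, 0
    for state in states:
        sparse = 1.0 if state.get("goal_scored", 0.0) > 0 else 0.0
        reward, progress = shaped_reward(state, scaffold, progress, sparse)
        total += reward
    return total, progress


# Hand-written stand-in for a scaffold an LLM might generate for an attack.
ATTACK_SCAFFOLD = [
    Subgoal("gain possession", lambda s: s.get("has_ball", 0.0) > 0, 0.1),
    Subgoal("enter opponent half", lambda s: s.get("ball_x", 0.0) > 0, 0.2),
    Subgoal("take a shot", lambda s: s.get("shot_taken", 0.0) > 0, 0.3),
]

if __name__ == "__main__":
    trajectory = [
        {"has_ball": 0.0, "ball_x": -10.0},
        {"has_ball": 1.0, "ball_x": -5.0},
        {"has_ball": 1.0, "ball_x": 5.0},
        {"has_ball": 1.0, "ball_x": 30.0, "shot_taken": 1.0, "goal_scored": 1.0},
    ]
    print("shaped:", run_episode(trajectory, ATTACK_SCAFFOLD))
    print("sparse only:", run_episode(trajectory, []))
```

With the scaffold, intermediate progress earns signal before the goal is ever scored; with an empty scaffold, only the final goal is rewarded, which is the sparse-reward setting the abstract describes as hard to learn from.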


AI, Committee, News, Uncategorized

Alibaba AI Unveils Qwen3-Max Preview: A Trillion-Parameter Qwen Model with Super Fast Speed and Quality

admin NU / September 7, 2025

Alibaba's Qwen Team unveiled Qwen3-Max-Preview (Instruct), a new flagship large language model with more than one trillion parameters, the team's largest to date. It is accessible through Qwen Chat, the Alibaba Cloud API, OpenRouter, and as the default model in Hugging Face's AnyCoder tool.

How does it fit in today's LLM landscape?
This milestone comes at a time when the industry is trending toward smaller, more efficient models. Alibaba's decision to move upward in scale is a deliberate strategic choice, highlighting both its technical capabilities and its commitment to trillion-parameter research.

How large is Qwen3-Max and what are its context limits?
- Parameters: more than 1 trillion.
- Context window: up to 262,144 tokens, with separate per-request caps of 258,048 input tokens and 32,768 output tokens.
- Efficiency feature: context caching to speed up multi-turn sessions.

How does Qwen3-Max perform against other models?
Benchmarks show it outperforms Qwen3-235B-A22B-2507 and competes strongly with Claude Opus 4, Kimi K2, and DeepSeek-V3.1 across SuperGPQA, AIME25, LiveCodeBench v6, Arena-Hard v2, and LiveBench.

What is the pricing structure for usage?
Alibaba Cloud applies tiered, token-based pricing (USD per million tokens):
- 0–32K tokens: $0.861 input, $3.441 output
- 32K–128K tokens: $1.434 input, $5.735 output
- 128K–252K tokens: $2.151 input, $8.602 output
The model is cost-efficient for smaller tasks, but its price rises significantly for long-context workloads.

How does the closed-source approach impact adoption?
Unlike earlier Qwen releases, this model is not open-weight; access is restricted to APIs and partner platforms. This choice highlights Alibaba's commercialization focus but may slow broader adoption in research and open-source communities.

Key Takeaways
- First trillion-parameter Qwen model: Qwen3-Max surpasses 1T parameters, making it Alibaba's largest and most advanced LLM to date.
- Ultra-long context handling: supports 262K tokens with caching, enabling extended document and session processing beyond most commercial models.
- Competitive benchmark performance: outperforms Qwen3-235B and competes with Claude Opus 4, Kimi K2, and DeepSeek-V3.1 on reasoning, coding, and general tasks.
- Reasoning without a reasoning-model design: although it is not marketed as a reasoning model, early results show structured reasoning capabilities on complex tasks.
- Closed-source, tiered pricing: available via APIs with token-based pricing; economical for small tasks but costly at higher context usage, which limits accessibility.

Summary
Qwen3-Max-Preview sets a new scale benchmark for commercial LLMs. Its trillion-parameter design, 262K context length, and strong benchmark results highlight Alibaba's technical depth. Yet the model's closed-source release and steep tiered pricing raise questions about broader accessibility.

Check out the Qwen Chat and Alibaba Cloud API. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. The post Alibaba AI Unveils Qwen3-Max Preview: A Trillion-Parameter Qwen Model with Super Fast Speed and Quality appeared first on MarkTechPost.
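The tiered pricing can be turned into a small cost estimator. This Python sketch rests on two assumptions the article does not spell out: that a request's input length selects the tier, whose rates then apply to all of that request's input and output tokens, and that "K" means 1,024 tokens (which makes the top tier's 252K limit equal the quoted 258,048-token maximum input). The function name is hypothetical.

```python
from typing import List, Tuple

# (max input tokens in tier, USD per 1M input tokens, USD per 1M output tokens)
# Prices as quoted in the article; tier limits assume 1K = 1,024 tokens.
TIERS: List[Tuple[int, float, float]] = [
    (32 * 1024, 0.861, 3.441),
    (128 * 1024, 1.434, 5.735),
    (252 * 1024, 2.151, 8.602),
]


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost, assuming the input length selects the tier
    and that tier's rates apply to every input and output token of the request."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    for max_input, input_price, output_price in TIERS:
        if input_tokens <= max_input:
            return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    raise ValueError(f"input of {input_tokens} tokens exceeds the largest tier")


if __name__ == "__main__":
    # Same 2,000-token answer after a short prompt vs. a long prompt
    print(f"{request_cost_usd(8_000, 2_000):.4f}")
    print(f"{request_cost_usd(200_000, 2_000):.4f}")
```

Under these assumptions, the same answer costs far more when it follows a long prompt, which is the long-context cost effect the article describes.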


AI, Committee, News, Uncategorized

Implementing DeepSpeed for Scalable Transformers: Advanced Training with Gradient Checkpointing and Parallelism

admin NU / September 7, 2025

In this advanced DeepSpeed tutorial, we provide a hands-on walkthrough of cutting-edge optimization techniques for training large language models efficiently. By combining ZeRO optimization, mixed-precision training, gradient accumulation, and advanced DeepSpeed configurations, the tutorial demonstrates how to maximize GPU memory utilization, reduce training overhead, and scale transformer models in resource-constrained environments such as Colab. Alongside model creation and training, it also covers performance monitoring, inference, checkpointing, and benchmarking of different ZeRO stages, giving practitioners both theoretical insights and practical code to accelerate model development. Check out the FULL CODES here.

```python
import subprocess
import sys
import os
import json
import time
from pathlib import Path


def install_dependencies():
    """Install required packages for DeepSpeed in Colab."""
    print("Installing DeepSpeed and dependencies...")
    # PyTorch wheels built against CUDA 11.8
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "torch", "torchvision", "torchaudio",
        "--index-url", "https://download.pytorch.org/whl/cu118",
    ])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "deepspeed"])
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "transformers", "datasets", "accelerate", "wandb",
    ])
    print("Installation complete!")


install_dependencies()

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import deepspeed
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
from typing import Dict, Any
import argparse
```

We set up our Colab environment by installing PyTorch with CUDA support, DeepSpeed, and essential libraries such as Transformers, Datasets, Accelerate, and Weights & Biases, so that everything is ready to build and train models with DeepSpeed.
```python
class SyntheticTextDataset(Dataset):
    """Synthetic dataset for demonstration purposes."""

    def __init__(self, size: int = 1000, seq_length: int = 512, vocab_size: int = 50257):
        self.size = size
        self.seq_length = seq_length
        self.vocab_size = vocab_size
        # Random token IDs stand in for real tokenized text
        self.data = torch.randint(0, vocab_size, (size, seq_length))

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # Causal LM: the model shifts labels internally, so labels == input_ids
        return {
            "input_ids": self.data[idx],
            "labels": self.data[idx].clone(),
        }
```

We create a SyntheticTextDataset that generates random token sequences to mimic real text data. We use these sequences as both inputs and labels, which lets us test DeepSpeed training quickly without relying on a large external dataset.

```python
class AdvancedDeepSpeedTrainer:
    """Advanced DeepSpeed trainer with multiple optimization techniques."""

    def __init__(self, model_config: Dict[str, Any], ds_config: Dict[str, Any]):
        self.model_config = model_config
        self.ds_config = ds_config
        self.model = None
        self.engine = None
        self.tokenizer = None

    def create_model(self):
        """Create a GPT-2 style model for demonstration."""
        print("Creating model...")
        config = GPT2Config(
            vocab_size=self.model_config["vocab_size"],
            n_positions=self.model_config["seq_length"],
            n_embd=self.model_config["hidden_size"],
            n_layer=self.model_config["num_layers"],
            n_head=self.model_config["num_heads"],
            resid_pdrop=0.1,
            embd_pdrop=0.1,
            attn_pdrop=0.1,
        )
        self.model = GPT2LMHeadModel(config)
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        print(f"Model parameters: {sum(p.numel() for p in self.model.parameters()):,}")
        return self.model

    def create_deepspeed_config(self):
        """Create a comprehensive DeepSpeed configuration."""
        zero_config = {
            "stage": self.ds_config["zero_stage"],
            "allgather_partitions": True,
            "allgather_bucket_size": 5e8,
            "overlap_comm": True,
            "reduce_scatter": True,
            "reduce_bucket_size": 5e8,
            "contiguous_gradients": True,
        }
        # The legacy "cpu_offload" flag is deprecated; current DeepSpeed
        # versions express optimizer CPU offloading via "offload_optimizer".
        if self.ds_config.get("cpu_offload", False):
            zero_config["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        return {
            # Must equal micro_batch_size * gradient_accumulation_steps * world_size
            "train_batch_size": self.ds_config["train_batch_size"],
            "train_micro_batch_size_per_gpu": self.ds_config["micro_batch_size"],
            "gradient_accumulation_steps": self.ds_config["gradient_accumulation_steps"],
            "zero_optimization": zero_config,
            "fp16": {
                "enabled": True,
                "loss_scale": 0,  # 0 selects dynamic loss scaling
                "loss_scale_window": 1000,
                "initial_scale_power": 16,
                "hysteresis": 2,
                "min_loss_scale": 1,
            },
            "optimizer": {
                "type": "AdamW",
                "params": {
                    "lr": self.ds_config["learning_rate"],
                    "betas": [0.9, 0.999],
                    "eps": 1e-8,
                    "weight_decay": 0.01,
                },
            },
            "scheduler": {
                "type": "WarmupLR",
                "params": {
                    "warmup_min_lr": 0,
                    "warmup_max_lr": self.ds_config["learning_rate"],
                    "warmup_num_steps": 100,
                },
            },
            "gradient_clipping": 1.0,
            "wall_clock_breakdown": True,
            "memory_breakdown": True,
            "tensorboard": {
                "enabled": True,
                "output_path": "./logs/",
                "job_name": "deepspeed_advanced_tutorial",
            },
        }

    def initialize_deepspeed(self):
        """Initialize the DeepSpeed engine."""
        print("Initializing DeepSpeed...")
        # Outside the `deepspeed` launcher (e.g. in a notebook), provide the
        # single-process distributed environment DeepSpeed expects.
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        os.environ.setdefault("RANK", "0")
        os.environ.setdefault("LOCAL_RANK", "0")
        os.environ.setdefault("WORLD_SIZE", "1")
        parser = argparse.ArgumentParser()
        parser.add_argument("--local_rank", type=int, default=0)
        args = parser.parse_args([])
        self.engine, optimizer, _, lr_scheduler = deepspeed.initialize(
            args=args,
            model=self.model,
            model_parameters=self.model.parameters(),
            config=self.create_deepspeed_config(),
        )
        print(f"DeepSpeed engine initialized with ZeRO stage {self.ds_config['zero_stage']}")
        return self.engine

    def train_step(self, batch: Dict[str, torch.Tensor]) -> Dict[str, float]:
        """Perform a single training step with DeepSpeed optimizations."""
        input_ids = batch["input_ids"].to(self.engine.device)
        labels = batch["labels"].to(self.engine.device)
        outputs = self.engine(input_ids=input_ids, labels=labels)
        loss = outputs.loss
        # backward() applies FP16 loss scaling; step() only updates weights
        # at gradient-accumulation boundaries
        self.engine.backward(loss)
        self.engine.step()
        return {
            "loss": loss.item(),
            "lr": self.engine.lr_scheduler.get_last_lr()[0] if self.engine.lr_scheduler else 0,
        }

    def train(self, dataloader: DataLoader, num_epochs: int = 2):
        """Complete training loop with monitoring."""
        print(f"Starting training for {num_epochs} epochs...")
        self.engine.train()
        total_steps = 0
        for epoch in range(num_epochs):
            epoch_loss = 0.0
            epoch_steps = 0
            print(f"\nEpoch {epoch + 1}/{num_epochs}")
            for step, batch in enumerate(dataloader):
                start_time = time.time()
                metrics = self.train_step(batch)
                epoch_loss += metrics["loss"]
                epoch_steps += 1
                total_steps += 1
                if step % 10 == 0:
                    step_time = time.time() - start_time
                    print(f"  Step {step:4d} | Loss: {metrics['loss']:.4f} | "
                          f"LR: {metrics['lr']:.2e} | Time: {step_time:.3f}s")
                if step % 20 == 0 and hasattr(self.engine, "monitor"):
                    self.log_memory_stats()
                if step >= 50:  # cap steps per epoch for a quick demo
                    break
            avg_loss = epoch_loss / epoch_steps
            print(f"Epoch {epoch + 1} completed | Average Loss: {avg_loss:.4f}")
        print("Training completed!")

    def log_memory_stats(self):
        """Log GPU memory statistics."""
        if torch.cuda.is_available():
            allocated = torch.cuda.memory_allocated() / 1024**3
            reserved = torch.cuda.memory_reserved() / 1024**3
            print(f"  GPU Memory - Allocated: {allocated:.2f}GB | Reserved: {reserved:.2f}GB")

    def save_checkpoint(self, path: str):
        """Save a model checkpoint using DeepSpeed."""
        print(f"Saving checkpoint to {path}")
        self.engine.save_checkpoint(path)

    def demonstrate_inference(self, text: str = "The future of AI is"):
        """Generate text with the trained model (the engine's wrapped HF module)."""
        print(f"\nRunning inference with prompt: '{text}'")
        inputs = self.tokenizer.encode(text, return_tensors="pt").to(self.engine.device)
        self.engine.eval()
        with torch.no_grad():
            outputs = self.engine.module.generate(
                inputs,
                max_length=inputs.shape[1] + 50,
                num_return_sequences=1,
                temperature=0.8,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Generated text: {generated_text}")
        self.engine.train()
```

We build an end-to-end trainer that creates a GPT-2 model, sets up a DeepSpeed config (ZeRO, FP16, AdamW, warmup scheduler, TensorBoard), and initializes the engine. We then run training steps with logging and memory statistics, save checkpoints, and demonstrate inference to verify optimization and generation in one place.

```python
def run_advanced_tutorial():
    """Main function to run the advanced DeepSpeed tutorial."""
    print("Advanced DeepSpeed Tutorial Starting...")
    print("=" * 60)
    model_config = {
        "vocab_size": 50257,
        "seq_length": 512,
        "hidden_size": 768,
        "num_layers": 6,
        "num_heads": 12,
    }
    ds_config = {
        "train_batch_size": 16,
        "micro_batch_size": 4,
        "gradient_accumulation_steps": 4,
        "zero_stage": 2,
        "learning_rate": 1e-4,
        "cpu_offload": False,
    }
    # Rough size estimate: token embeddings + ~12 * hidden^2 per transformer layer
    approx_params = (
        model_config["vocab_size"] * model_config["hidden_size"]
        + 12 * model_config["num_layers"] * model_config["hidden_size"] ** 2
    )
    print("Configuration:")
    print(f"  Model size: ~{approx_params / 1e6:.1f}M parameters")
    print(f"  ZeRO Stage: {ds_config['zero_stage']}")
    print(f"  Batch size: {ds_config['train_batch_size']}")

    trainer = AdvancedDeepSpeedTrainer(model_config, ds_config)
    model = trainer.create_model()
    engine = trainer.initialize_deepspeed()

    print("\nCreating synthetic dataset...")
    dataset = SyntheticTextDataset(
        size=200,
        seq_length=model_config["seq_length"],
        vocab_size=model_config["vocab_size"],
    )
    dataloader = DataLoader(dataset, batch_size=ds_config["micro_batch_size"], shuffle=True)

    print("\nPre-training memory stats:")
    trainer.log_memory_stats()
    trainer.train(dataloader, num_epochs=2)
    print("\nPost-training memory stats:")
    trainer.log_memory_stats()

    trainer.demonstrate_inference("DeepSpeed enables efficient training of")
    checkpoint_path = "./deepspeed_checkpoint"
    trainer.save_checkpoint(checkpoint_path)

    # Not defined in this excerpt (presumably part of the full code)
    demonstrate_zero_stages()
    demonstrate_memory_optimization()

    print("\nTutorial completed successfully!")
    print("Key DeepSpeed features demonstrated:")
    print("  ZeRO optimization for memory efficiency")
    print("  Mixed precision training (FP16)")
    print("  Gradient accumulation")
    # ... (the archived excerpt is truncated here, partway through a
    # print("  Learning ...") statement)
```
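To see why the tutorial benchmarks different ZeRO stages, it helps to estimate per-GPU memory for model states. The Python sketch below uses the accounting from the original ZeRO paper for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and K = 12 bytes of FP32 optimizer state (master weights, momentum, variance). Stage 1 shards optimizer states, stage 2 also shards gradients, and stage 3 also shards parameters across N data-parallel GPUs. It ignores activations, temporary buffers, and fragmentation, so real usage is higher; the function name is hypothetical and not a DeepSpeed API.

```python
def zero_model_state_bytes(num_params: float, num_gpus: int, stage: int, k: int = 12) -> float:
    """Per-GPU bytes for model states under ZeRO (mixed-precision Adam).

    num_params: parameter count (Psi); num_gpus: data-parallel degree (N);
    k: bytes of optimizer state per parameter (12 for FP32 Adam states).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    psi, n = num_params, num_gpus
    fp16_params, fp16_grads, optim_states = 2 * psi, 2 * psi, k * psi
    if stage == 0:  # plain data parallelism: everything replicated
        return fp16_params + fp16_grads + optim_states
    if stage == 1:  # shard optimizer states
        return fp16_params + fp16_grads + optim_states / n
    if stage == 2:  # shard optimizer states and gradients
        return fp16_params + (fp16_grads + optim_states) / n
    if stage == 3:  # shard optimizer states, gradients, and parameters
        return (fp16_params + fp16_grads + optim_states) / n
    raise ValueError("stage must be 0, 1, 2, or 3")


if __name__ == "__main__":
    # 7.5B parameters on 64 GPUs, the worked example in the ZeRO paper
    for stage in range(4):
        print(stage, f"{zero_model_state_bytes(7.5e9, 64, stage) / 1e9:.1f} GB")
```

By this estimate, on a single Colab GPU (N = 1) every stage gives the same model-state footprint of about 16 bytes per parameter; ZeRO's sharding only reduces it with multiple data-parallel GPUs or when combined with CPU offloading.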


AI, Committee, News, Uncategorized

Hugging Face Open-Sourced FineVision: A New Multimodal Dataset with 24 Million Samples for Training Vision-Language Models (VLMs)

admin NU / September 7, 2025

Hugging Face has just released FineVision, an open multimodal dataset designed to set a new standard for Vision-Language Models (VLMs). With 17.3 million images, 24.3 million samples, 88.9 million question-answer turns, and nearly 10 billion answer tokens, FineVision positions itself as one of the largest and most structured publicly available VLM training datasets.

FineVision aggregates more than 200 sources into a unified format, rigorously filtered for duplicates and benchmark contamination. Because every sample is rated systematically across multiple quality dimensions, researchers and developers can construct robust training mixtures while minimizing data leakage.

Why is FineVision Important for VLM Training?
Most state-of-the-art VLMs rely on proprietary datasets, limiting reproducibility and accessibility for the broader research community. FineVision addresses this gap through:
- Scale and Coverage: 5 TB of curated data across 9 categories, including General VQA, OCR QA, Chart & Table reasoning, Science, Captioning, Grounding & Counting, and GUI navigation.
- Benchmark Gains: across 11 widely used benchmarks (e.g., AI2D, ChartQA, DocVQA, ScienceQA, OCRBench), models trained on FineVision outperform alternatives by significant margins: up to 46.3% over LLaVA, 40.7% over Cauldron, and 12.1% over Cambrian.
- New Skill Domains: FineVision introduces data for emerging tasks such as GUI navigation, pointing, and counting, expanding VLM capabilities beyond conventional captioning and VQA.

How Was FineVision Built?
The curation pipeline followed a three-step process:
1. Collection and Augmentation: over 200 publicly available image-text datasets were gathered. Data with missing modalities (e.g., text-only data) was reformatted into QA pairs, and underrepresented domains, such as GUI data, were supplemented through targeted collection.
2. Cleaning: oversized QA pairs (>8192 tokens) were removed, large images were resized to a maximum of 2048 px while preserving aspect ratio, and corrupted samples were discarded.
3. Quality Rating: using Qwen3-32B and Qwen2.5-VL-32B-Instruct as judges, every QA pair was rated on four axes: Text Formatting Quality, Question-Answer Relevance, Visual Dependency, and Image-Question Correspondence.

These ratings enable selective training mixtures, though ablations show that retaining all samples yields the best performance, even when lower-rated samples are included.

Comparative Analysis: FineVision vs. Existing Open Datasets

| Dataset | Images | Samples | Turns | Tokens | Leakage | Perf. Drop After Deduplication |
|---|---|---|---|---|---|---|
| Cauldron | 2.0M | 1.8M | 27.8M | 0.3B | 3.05% | -2.39% |
| LLaVA-Vision | 2.5M | 3.9M | 9.1M | 1.0B | 2.15% | -2.72% |
| Cambrian-7M | 5.4M | 7.0M | 12.2M | 0.8B | 2.29% | -2.78% |
| FineVision | 17.3M | 24.3M | 88.9M | 9.5B | 1.02% | -1.45% |

FineVision is not only one of the largest datasets but also the least contaminated, with only about 1% overlap with benchmark test sets. This keeps data leakage minimal and evaluation results more reliable.

Performance Insights
- Model Setup: ablations were conducted using nanoVLM (460M parameters), combining SmolLM2-360M-Instruct as the language backbone and SigLIP2-Base-512 as the vision encoder.
- Training Efficiency: on 32 NVIDIA H100 GPUs, one full epoch (12k steps) takes about 20 hours.
- Performance Trends: FineVision models improve steadily with exposure to diverse data, overtaking baselines after ~12k steps.
- Deduplication experiments confirm FineVision's low leakage compared to Cauldron, LLaVA, and Cambrian.
- Multilingual subsets show slight performance gains even when the backbone is monolingual, suggesting that diversity outweighs strict alignment.
- Attempts at multi-stage training (two or 2.5 stages) did not yield consistent benefits, reinforcing that scale and diversity matter more than training heuristics.

Why Does FineVision Set a New Standard?
- +20% Average Performance Boost: outperforms all existing open datasets across 10+ benchmarks.
- Unprecedented Scale: 17M+ images, 24M+ samples, ~10B tokens.
- Skill Expansion: GUI navigation, counting, pointing, and document reasoning are included.
- Lowest Data Leakage: about 1% contamination, compared to 2–3% in other datasets.
- Fully Open Source: available on the Hugging Face Hub for immediate use via the datasets library.

Conclusion
FineVision marks a significant advancement in open multimodal datasets. Its large scale, systematic curation, and transparent quality assessments create a reproducible and extensible foundation for training state-of-the-art Vision-Language Models. By reducing dependence on proprietary resources, it enables researchers and developers to build competitive systems and accelerate progress in areas such as document analysis, visual reasoning, and agentic multimodal tasks.

Check out the Dataset and Technical details. The post Hugging Face Open-Sourced FineVision: A New Multimodal Dataset with 24 Million Samples for Training Vision-Language Models (VLMs) appeared first on MarkTechPost.
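FineVision's cleaning step reduces to two simple rules, sketched below in Python. This is an illustrative sketch, not FineVision's actual pipeline code: it assumes the 2048 px limit applies to an image's longer side and that QA length is counted by some tokenizer the article does not name; the function names are hypothetical.

```python
from typing import Tuple

MAX_SIDE_PX = 2048    # resize limit quoted in the article
MAX_QA_TOKENS = 8192  # QA pairs longer than this were removed


def resized_dimensions(width: int, height: int, max_side: int = MAX_SIDE_PX) -> Tuple[int, int]:
    """Target size that caps the longer side at max_side, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # images within the limit are left untouched
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def keep_qa_pair(num_tokens: int, max_tokens: int = MAX_QA_TOKENS) -> bool:
    """True if a QA pair survives the length filter (> max_tokens is dropped)."""
    return num_tokens <= max_tokens


if __name__ == "__main__":
    print(resized_dimensions(4096, 3072))  # a 4:3 photo above the limit
    print(keep_qa_pair(9000))
```

The resulting (width, height) pair could then be passed to an image library's resize call, such as Pillow's `Image.resize`.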


AI, Committee, News, Uncategorized

Tilde AI Releases TildeOpen LLM: An Open-Source Large Language Model with Over 30 Billion Parameters and Support Most European Languages

admin NU / September 7, 2025

Latvian language-technology firm Tilde has released TildeOpen LLM, an open-source foundational large language model (LLM) purpose-built for European languages, with a sharp focus on under-represented and smaller national and regional languages. It is a strategic step toward linguistic equity and digital sovereignty within the EU.

Under the Hood: Architecture, Training and Governance
The public release took place on September 3, 2025, when Tilde made the model freely available via Hugging Face. Built as a 30-billion-parameter dense decoder-only transformer, the model is released under a permissive license (CC-BY-4.0) and supports a broad range of languages, from Latvian and Lithuanian to Ukrainian, Turkish, and beyond.

Training ran on the EU supercomputers LUMI (Finland) and JUPITER (Germany), using 2 million GPU hours awarded through the European Commission's Large AI Grand Challenge. Technical details:
- Training setup: EleutherAI-inspired GPT-NeoX training scripts, 450K updates, ~2 trillion tokens.
- Three-stage sampling: uniform across languages, then the natural distribution to boost high-data-volume languages, and a final uniform sweep for balance.
- Hyperparameters: 60 layers, embedding size 6144, 48 attention heads, 8192-token context window, SwiGLU activations, RoPE positional encoding, and RMSNorm layer normalization.

Language Equity and Data Sovereignty
Mainstream models lean heavily on English and other major languages, which skews performance on Baltic, Slavic, and other smaller European languages. This under-representation leads to poor grammar, awkward phrasing, and hallucinations. TildeOpen addresses this with an "equitable tokenizer" engineered to represent text similarly regardless of language, reducing token counts and increasing inference efficiency for less-represented languages.

Crucially, organizations can self-host the model in local data centers or secure EU-compliant clouds, ensuring adherence to GDPR and other data-protection mandates. This addresses sovereignty concerns tied to US- or Asia-hosted models.

Strategic Horizon: From Prototype to European AI Infrastructure
TildeOpen is a foundational "base" model. Upcoming versions are expected to include more specialized models (e.g., instruction-tuned translation models) built on top of this core. The release is also a flag-planting moment: Latvia, via Tilde, positions itself as a technology exporter, with aspirations to scale European AI infrastructure while preserving linguistic diversity.

For research, the move echoes broader findings on multilingual model behavior: gaps still exist. Evaluations show that even strong open LLMs can hallucinate or lag in lexical accuracy for Baltic languages, reinforcing the need for localized development.

Summary
TildeOpen LLM reframes EU AI not just as regulatory compliance but as technical stewardship. It is a grounded, high-capacity model with a transparent architecture, scalable deployment options, and a strong commitment to linguistic equity. It doesn't indulge hype; it delivers substance.

FAQs
Q1: What is TildeOpen LLM?
TildeOpen is a 30B-parameter multilingual large language model trained on EU supercomputers, optimized for European languages, especially under-represented ones.

Q2: How is it different from mainstream LLMs?
Unlike global models that prioritize English, TildeOpen uses an equitable tokenizer and balanced training to ensure fair representation and accuracy across smaller European languages.

Q3: Can organizations self-host the model?
Yes. TildeOpen is open-source under CC-BY-4.0 and can be deployed in local data centers or EU-compliant clouds to meet GDPR and data-sovereignty requirements.

Q4: What are the main use cases?
Government services, translation, education, AI assistants, speech technologies, and multilingual customer support: any domain requiring accurate European language processing.

Check out the Model on Hugging Face and Technical details here. The post Tilde AI Releases TildeOpen LLM: An Open-Source Large Language Model with Over 30 Billion Parameters and Support Most European Languages appeared first on MarkTechPost.
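The three-stage sampling schedule from the training details can be sketched as per-stage language weights. This Python toy is an assumption, not Tilde's training code: the article gives neither the stage lengths nor the per-language corpus sizes, so the token counts, stage boundaries, and function names below are hypothetical.

```python
import bisect
import random
from typing import Dict, List


def uniform_weights(token_counts: Dict[str, int]) -> Dict[str, float]:
    """Every language is sampled equally often, regardless of corpus size."""
    n = len(token_counts)
    return {lang: 1.0 / n for lang in token_counts}


def natural_weights(token_counts: Dict[str, int]) -> Dict[str, float]:
    """Languages are sampled in proportion to how much data they have."""
    total = sum(token_counts.values())
    return {lang: count / total for lang, count in token_counts.items()}


def three_stage_schedule(token_counts: Dict[str, int]) -> List[Dict[str, float]]:
    """Uniform -> natural -> uniform, as described for TildeOpen's training."""
    return [
        uniform_weights(token_counts),
        natural_weights(token_counts),
        uniform_weights(token_counts),
    ]


def stage_for_step(step: int, boundaries: List[int]) -> int:
    """Map a training step to a stage index; boundaries[i] is the first step of stage i + 1."""
    return bisect.bisect_right(boundaries, step)


def sample_languages(weights: Dict[str, float], k: int, seed: int = 0) -> List[str]:
    """Draw k language labels according to the given weights."""
    rng = random.Random(seed)
    langs = list(weights)
    return rng.choices(langs, weights=[weights[lang] for lang in langs], k=k)


if __name__ == "__main__":
    counts = {"en": 800_000, "lv": 50_000, "lt": 40_000, "et": 30_000}  # hypothetical
    schedule = three_stage_schedule(counts)
    boundaries = [150_000, 300_000]  # hypothetical stage starts within 450K updates
    for step in (0, 200_000, 400_000):
        stage = stage_for_step(step, boundaries)
        print(step, stage, sample_languages(schedule[stage], k=8, seed=step))
```

The uniform stages give small languages as much exposure as large ones, while the natural stage lets high-data languages contribute in proportion to their size.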


AI, Committee, News, Uncategorized

From Pretraining to Post-Training: Why Language Models Hallucinate and How Evaluation Methods Reinforce the Problem

admin NU / September 7, 2025

Large language models (LLMs) frequently generate "hallucinations": confident yet incorrect outputs that appear plausible. Despite improvements in training methods and architectures, hallucinations persist. New research from OpenAI provides a rigorous explanation: hallucinations stem from the statistical properties of supervised versus self-supervised learning, and their persistence is reinforced by misaligned evaluation benchmarks.

What Makes Hallucinations Statistically Inevitable?
The research team explains hallucinations as errors inherent to generative modeling. Even with perfectly clean training data, the cross-entropy objective used in pretraining introduces statistical pressures that produce errors. The team reduces the problem to a supervised binary classification task called Is-It-Valid (IIV): determining whether a model's output is valid or erroneous. They prove that, up to small correction terms, the generative error rate of an LLM is at least twice its IIV misclassification rate. In other words, hallucinations occur for the same reasons misclassifications appear in supervised learning: epistemic uncertainty, poor models, distribution shift, or noisy data.

Why Do Rare Facts Trigger More Hallucinations?
One major driver is the singleton rate: the fraction of facts that appear only once in the training data. By analogy to Good–Turing missing-mass estimation, if 20% of facts of a given kind (for example, birthdays) are singletons, the model can be expected to hallucinate on at least about 20% of facts of that kind. This explains why LLMs answer reliably about widely repeated facts (e.g., Einstein's birthday) but fail on obscure or rarely mentioned ones.

Can Poor Model Families Lead to Hallucinations?
Yes. Hallucinations also emerge when the model class cannot adequately represent a pattern. Classic examples include n-gram models generating ungrammatical sentences, or modern tokenized models miscounting letters because characters are hidden inside subword tokens. These representational limits cause systematic errors even when the data itself is sufficient.
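The singleton-rate argument can be made concrete in a few lines of Python. This is a simplified illustration, not the paper's formal bound (which carries additional correction terms): it computes the Good–Turing missing-mass estimate, the number of facts observed exactly once divided by the total number of fact observations, for a hypothetical toy corpus of birthday mentions.

```python
from collections import Counter
from typing import Hashable, Iterable


def singleton_rate(fact_mentions: Iterable[Hashable]) -> float:
    """Good-Turing missing-mass estimate N1 / N.

    N1 = number of distinct facts seen exactly once,
    N  = total number of fact mentions in the corpus.
    """
    counts = Counter(fact_mentions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("corpus contains no fact mentions")
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


if __name__ == "__main__":
    # Hypothetical corpus: each entry is one mention of someone's birthday.
    mentions = ["einstein"] * 5 + ["curie"] * 3 + ["person_a", "person_b"]
    print(f"singleton rate: {singleton_rate(mentions):.0%}")
```

Under the paper's argument, a base model trained on such a corpus would be expected to hallucinate on at least roughly this fraction of birthday-style queries, since facts seen only once leave it little statistical basis for separating the true answer from plausible alternatives.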
Why Doesn't Post-Training Eliminate Hallucinations?
Post-training methods such as RLHF (reinforcement learning from human feedback), DPO, and RLAIF reduce some errors, especially harmful or conspiratorial outputs. But overconfident hallucinations remain because evaluation incentives are misaligned. Like students guessing on multiple-choice exams, LLMs are rewarded for bluffing when unsure. Most benchmarks, such as MMLU, GPQA, and SWE-bench, apply binary scoring: correct answers get credit, abstentions ("I don't know") get none, and incorrect answers are penalized no more harshly than abstentions. Under this scheme, guessing maximizes benchmark scores, even if it fosters hallucinations.

How Do Leaderboards Reinforce Hallucinations?
A review of popular benchmarks shows that nearly all use binary grading with no partial credit for uncertainty. As a result, models that truthfully express uncertainty perform worse than those that always guess. This creates systemic pressure for developers to optimize models for confident answers rather than calibrated ones.

What Changes Could Reduce Hallucinations?
The research team argues that fixing hallucinations requires socio-technical change, not just new evaluation suites. They propose explicit confidence targets: benchmarks should clearly specify penalties for wrong answers and partial credit for abstentions. For example: "Answer only if you are >75% confident. Mistakes lose 3 points; correct answers earn 1; 'I don't know' earns 0." With these values, answering at exactly 75% confidence has an expected score of 0.75 × 1 - 0.25 × 3 = 0, so answering only pays off above the stated threshold. This design mirrors real-world exams such as earlier SAT and GRE formats, where guessing carried penalties. It encourages behavioral calibration: models abstain when their confidence is below the threshold, producing fewer overconfident hallucinations while still optimizing for benchmark performance.

What Are the Broader Implications?
This work reframes hallucinations as predictable outcomes of training objectives and evaluation misalignment rather than inexplicable quirks.
The findings highlight:
- Pretraining inevitability: hallucinations parallel misclassification errors in supervised learning.
- Post-training reinforcement: binary grading schemes incentivize guessing.
- Evaluation reform: adjusting mainstream benchmarks to reward expressions of uncertainty can realign incentives and improve trustworthiness.

By connecting hallucinations to established learning theory, the research demystifies their origin and suggests practical mitigation strategies that shift responsibility from model architectures to evaluation design.

Check out the PAPER and Technical details here. The post From Pretraining to Post-Training: Why Language Models Hallucinate and How Evaluation Methods Reinforce the Problem appeared first on MarkTechPost.
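The confidence-target proposal rests on a short expected-value calculation, sketched here in Python. With +1 for a correct answer, 0 for "I don't know", and a penalty of k points for a wrong answer, answering at confidence p is worth p - (1 - p)k, which is positive only when p > k / (1 + k); equivalently, a threshold t requires a penalty of t / (1 - t), so t = 0.75 corresponds to k = 3. The function names are illustrative, and p is assumed to be the model's calibrated probability of being correct.

```python
def expected_score(p_correct: float, answer: bool, penalty: float) -> float:
    """Expected points: +1 if right, -penalty if wrong, 0 for abstaining."""
    if not answer:
        return 0.0
    return p_correct - (1.0 - p_correct) * penalty


def penalty_for_threshold(t: float) -> float:
    """Wrong-answer penalty that makes t the break-even confidence."""
    if not 0.0 <= t < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    return t / (1.0 - t)


def should_answer(p_correct: float, penalty: float) -> bool:
    """Answer only if answering has a strictly higher expected score than abstaining."""
    return expected_score(p_correct, True, penalty) > expected_score(p_correct, False, penalty)


if __name__ == "__main__":
    k = penalty_for_threshold(0.75)
    for p in (0.3, 0.7, 0.8):
        # Binary grading (penalty 0) always favors guessing; the k = 3 rule does not.
        print(p, should_answer(p, 0.0), should_answer(p, k))
```

With a 2-point penalty the break-even confidence would be 2/3 rather than 75%, which is why the penalty has to scale with the stated threshold.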

AI, Committee, News, Uncategorized

Putin says organ transplants could grant immortality. Not quite.

admin NU / September 6, 2025

This week I’m writing from Manchester, where I’ve been attending a conference on aging. Wednesday was full of talks and presentations by scientists who are trying to understand the nitty-gritty of aging, all the way down to the molecular level. Once we understand the complex biology of aging, they hope, we should be able to slow or prevent the onset of age-related diseases.

Then my editor forwarded me a video of the leaders of Russia and China talking about immortality. “These days at 70 years old you are still a child,” China’s Xi Jinping, 72, was translated as saying, according to footage livestreamed by CCTV to multiple media outlets. “With the developments of biotechnology, human organs can be continuously transplanted, and people can live younger and younger, and even achieve immortality,” Russia’s Vladimir Putin, also 72, is reported to have replied.

There’s a striking contrast between that radical vision and the incremental longevity science presented at the meeting. Repeated rounds of organ transplantation surgery aren’t likely to help anyone radically extend their lifespan anytime soon.

First, back to Putin’s proposal: the idea of continually replacing aged organs to stay young. It’s a simplistic way to think about aging. After all, aging is so complicated that researchers can’t agree on what causes it, why it occurs, or even how to define it, let alone “treat” it.

Having said that, there may be some merit to the idea of repairing worn-out body parts with biological or synthetic replacements. Replacement therapies, including bioengineered organs, are being developed by multiple research teams, and some have already been tested in people. This week, let’s take a look at the idea of replacement therapies.

No one fully understands why our organs start to fail with age. On the face of it, replacing them seems like a good idea. After all, we already know how to do organ transplants.
They’ve been a part of medicine since the 1950s and have been used to save hundreds of thousands of lives in the US alone. And replacing old organs with young ones might have more broadly beneficial effects. When a young mouse is stitched to an old one, the older mouse benefits from the arrangement, and its health seems to improve.

The problem is that we don’t really know why. We don’t know what it is about young body tissues that makes them health-promoting. We don’t know how long these effects might last in a person. We don’t know how different organ transplants will compare, either. Might a young heart be more beneficial than a young liver? No one knows.

And that’s before you consider the practicalities of organ transplantation. There is already a shortage of donor organs: thousands of people die on waiting lists. Transplantation requires major surgery and, typically, a lifetime of prescription drugs that damp down the immune system, leaving a person more susceptible to certain infections and diseases. So repeated organ transplants aren’t a particularly appealing prospect.

“I don’t think that’s going to happen anytime soon,” says Jesse Poganik, who studies aging at Brigham and Women’s Hospital in Boston and is also in Manchester for the meeting. Poganik has been collaborating with transplant surgeons in his own research. “The surgeries are good, but they’re not simple,” he tells me. And they come with real risks. His own 24-year-old cousin developed a form of cancer after a liver and heart transplant. She died a few weeks ago, he says.

So when it comes to replacing worn-out organs, scientists are looking for both biological and synthetic alternatives. We’ve been replacing body parts for centuries. Wooden toes were used as far back as the 15th century. Joint replacements have been around for more than a hundred years.
And major innovations over the last 70 years have given us devices like pacemakers, hearing aids, brain implants, and artificial hearts.

Scientists are exploring other ways to make tissues and organs, too. The approaches vary, ranging from injecting stem cells to seeding “scaffolds” with cells in a lab. In 1999, researchers used volunteers’ own cells to seed bladder-shaped collagen scaffolds. The resulting bioengineered bladders went on to be transplanted into seven people in an initial trial.

Now scientists are working on more complicated organs. Jean Hébert, a program manager at the US government’s Advanced Research Projects Agency for Health, has been exploring ways to gradually replace the cells in a person’s brain. The idea is that, eventually, the recipient will end up with a young brain. Hébert showed my colleague Antonio Regalado how, in his early experiments, he removed parts of mice’s brains and replaced them with embryonic stem cells.

That work seems a world away from the biochemical studies being presented at the British Society for Research on Ageing annual meeting in Manchester, where I am now. On Wednesday, one scientist described how he’d been testing potential longevity drugs on the tiny nematode worm C. elegans. These worms live for only about 15 to 40 days, and his team can perform tens of thousands of experiments with them. About 40% of the drugs that extend lifespan in C. elegans also help mice live longer, he told us. To me, that’s not an amazing hit rate. And we don’t know how many of those drugs will work in people. Probably fewer than 40% of that 40%, which would be under 16% of the drugs that work in worms.

Other scientists presented work on chemical reactions happening at the cellular level. It was deep, basic science, and my takeaway was that there’s a lot aging researchers still don’t fully understand. It will take years, if not decades, to get the full picture of aging at the molecular level.
And if we rely on a series of experiments in worms, and then mice, and then humans, we’re unlikely to make progress for a really long time. In that context, the idea of replacement therapy feels like a shortcut. “Replacement is a really exciting avenue because you don’t have to understand the biology of aging as much,” says Sierra Lore,

AI, Committee, News, Uncategorized

A Gentle Introduction to Batch Normalization

admin NU / September 6, 2025

Deep neural networks have evolved dramatically over the years, overcoming many of the challenges that arise when training such complex models.
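The excerpt stops before describing the technique itself. As a minimal sketch (not taken from the article), the commonly described training-time transform normalizes each feature over the mini-batch to zero mean and unit variance, then applies a learnable scale `gamma` and shift `beta`. The function name, the list-of-rows input format, and `eps=1e-5` are assumptions for illustration; the running averages used at inference time are deliberately omitted.

```python
import math


def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Training-time batch normalization over one mini-batch.

    batch: list of rows (one row per example), each a list of feature values.
    Each feature column is normalized with the batch mean and (biased) variance,
    then scaled by gamma[j] and shifted by beta[j].
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n  # divide by n, not n - 1
        inv_std = 1.0 / math.sqrt(var + eps)  # eps avoids division by zero for constant features
        for i, x in enumerate(column):
            out[i][j] = gamma[j] * (x - mean) * inv_std + beta[j]
    return out


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
# Each output column now has mean ~0 and variance ~1, regardless of its original scale.
```

Because `gamma` and `beta` are learned, the network can undo the normalization for a feature if that helps; with `gamma=1` and `beta=0`, as above, the output is simply the standardized batch.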

AI, Committee, News, Uncategorized

How to Build a Complete End-to-End NLP Pipeline with Gensim: Topic Modeling, Word Embeddings, Semantic Search, and Advanced Text Analysis

admin NU / September 6, 2025

In this tutorial, we present a complete end-to-end Natural Language Processing (NLP) pipeline built with Gensim and supporting libraries, designed to run seamlessly in Google Colab. It integrates several core techniques of modern NLP: preprocessing, topic modeling with Latent Dirichlet Allocation (LDA), word embeddings with Word2Vec, TF-IDF-based similarity analysis, and semantic search. The pipeline not only demonstrates how to train and evaluate these models but also showcases practical visualizations, advanced topic analysis, and document classification workflows. By combining statistical methods with machine learning approaches, the tutorial provides a comprehensive framework for understanding and experimenting with text data at scale.

```python
# Colab shell commands: pin SciPy for compatibility, then install Gensim and helpers
!pip install --upgrade scipy==1.11.4
!pip install gensim==4.3.2 nltk wordcloud matplotlib seaborn pandas numpy scikit-learn
!pip install --upgrade setuptools
print("Please restart runtime after installation!")
print("Go to Runtime > Restart runtime, then run the next cell")

# --- After restarting the runtime ---
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

warnings.filterwarnings('ignore')

from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, TfidfModel, CoherenceModel
from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces,
    strip_numeric, remove_stopwords, strip_short,
)
import nltk

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
```

We install and upgrade the necessary libraries, such as SciPy, Gensim, NLTK, and visualization tools, to ensure compatibility. We then import all required modules for preprocessing, modeling, and analysis.
We also download NLTK resources to tokenize text and handle stopwords, thereby setting up the environment for our NLP pipeline.

```python
class AdvancedGensimPipeline:
    # Shared preprocessing filters. Lowercasing runs first so that stopword
    # removal is not defeated by capitalization (e.g. "The").
    CUSTOM_FILTERS = [
        lambda x: x.lower(),
        strip_tags,
        strip_punctuation,
        strip_multiple_whitespaces,
        strip_numeric,
        remove_stopwords,
        strip_short,
    ]

    def __init__(self):
        self.dictionary = None
        self.corpus = None
        self.lda_model = None
        self.word2vec_model = None
        self.tfidf_model = None
        self.similarity_index = None
        self.processed_docs = None

    def create_sample_corpus(self):
        """Create a diverse sample corpus for demonstration"""
        documents = [
            "Data science combines statistics, programming, and domain expertise to extract insights",
            "Big data analytics helps organizations make data-driven decisions at scale",
            "Cloud computing provides scalable infrastructure for modern applications and services",
            "Cybersecurity protects digital systems from threats and unauthorized access attempts",
            "Software engineering practices ensure reliable and maintainable code development",
            "Database management systems store and organize large amounts of structured information",
            "Python programming language is widely used for data analysis and machine learning",
            "Statistical modeling helps identify patterns and relationships in complex datasets",
            "Cross-validation techniques ensure robust model performance evaluation and selection",
            "Recommendation systems suggest relevant items based on user preferences and behavior",
            "Text mining extracts valuable insights from unstructured textual data sources",
            "Image classification assigns predefined categories to visual content automatically",
            "Reinforcement learning trains agents through interaction with dynamic environments",
        ]
        return documents

    def preprocess_documents(self, documents):
        """Advanced document preprocessing using Gensim filters"""
        print("Preprocessing documents...")
        stop_words = set(stopwords.words('english'))  # build the NLTK stopword set once
        processed_docs = []
        for doc in documents:
            processed = preprocess_string(doc, self.CUSTOM_FILTERS)
            processed = [word for word in processed if word not in stop_words and len(word) > 2]
            processed_docs.append(processed)
        self.processed_docs = processed_docs
        print(f"Processed {len(processed_docs)} documents")
        return processed_docs

    def create_dictionary_and_corpus(self):
        """Create Gensim dictionary and corpus"""
        print("Creating dictionary and corpus...")
        self.dictionary = corpora.Dictionary(self.processed_docs)
        # Keep tokens that occur in at least 2 documents and in at most 80% of them
        self.dictionary.filter_extremes(no_below=2, no_above=0.8)
        self.corpus = [self.dictionary.doc2bow(doc) for doc in self.processed_docs]
        print(f"Dictionary size: {len(self.dictionary)}")
        print(f"Corpus size: {len(self.corpus)}")

    def train_word2vec_model(self):
        """Train Word2Vec model for word embeddings"""
        print("Training Word2Vec model...")
        self.word2vec_model = Word2Vec(
            sentences=self.processed_docs,
            vector_size=100,
            window=5,
            min_count=2,
            workers=4,
            epochs=50,
        )
        print("Word2Vec model trained successfully")

    def analyze_word_similarities(self):
        """Analyze word similarities using Word2Vec"""
        print("\n=== Word2Vec Similarity Analysis ===")
        test_words = ['machine', 'data', 'learning', 'computer']
        for word in test_words:
            if word in self.word2vec_model.wv:
                similar_words = self.word2vec_model.wv.most_similar(word, topn=3)
                print(f"Words similar to '{word}': {similar_words}")
        try:
            if all(w in self.word2vec_model.wv for w in ['machine', 'computer', 'data']):
                analogy = self.word2vec_model.wv.most_similar(
                    positive=['computer', 'data'], negative=['machine'], topn=1
                )
                print(f"Analogy result: {analogy}")
        except KeyError:
            print("Not enough vocabulary for complex analogies")

    def train_lda_model(self, num_topics=5):
        """Train LDA topic model"""
        print(f"Training LDA model with {num_topics} topics...")
        self.lda_model = LdaModel(
            corpus=self.corpus,
            id2word=self.dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            alpha='auto',
            per_word_topics=True,
            eval_every=None,
        )
        print("LDA model trained successfully")

    def evaluate_topic_coherence(self):
        """Evaluate topic model coherence"""
        print("Evaluating topic coherence...")
        coherence_model = CoherenceModel(
            model=self.lda_model,
            texts=self.processed_docs,
            dictionary=self.dictionary,
            coherence='c_v',
        )
        coherence_score = coherence_model.get_coherence()
        print(f"Topic Coherence Score: {coherence_score:.4f}")
        return coherence_score

    def display_topics(self):
        """Display discovered topics"""
        print("\n=== Discovered Topics ===")
        topics = self.lda_model.print_topics(num_words=8)
        for idx, topic in enumerate(topics):
            print(f"Topic {idx}: {topic[1]}")

    def create_tfidf_model(self):
        """Create TF-IDF model for document similarity"""
        print("Creating TF-IDF model...")
        self.tfidf_model = TfidfModel(self.corpus)
        corpus_tfidf = self.tfidf_model[self.corpus]
        # Passing num_features explicitly avoids an extra pass over the corpus to infer it
        self.similarity_index = similarities.MatrixSimilarity(
            corpus_tfidf, num_features=len(self.dictionary)
        )
        print("TF-IDF model and similarity index created")

    def find_similar_documents(self, query_doc_idx=0):
        """Find documents similar to a query document"""
        print("\n=== Document Similarity Analysis ===")
        query_doc_tfidf = self.tfidf_model[self.corpus[query_doc_idx]]
        similarities_scores = self.similarity_index[query_doc_tfidf]
        sorted_similarities = sorted(enumerate(similarities_scores), key=lambda x: x[1], reverse=True)
        print(f"Documents most similar to document {query_doc_idx}:")
        for doc_idx, similarity in sorted_similarities[:5]:
            print(f"Doc {doc_idx}: {similarity:.4f}")

    def visualize_topics(self):
        """Create visualizations for topic analysis"""
        print("Creating topic visualizations...")
        num_topics = self.lda_model.num_topics
        doc_topic_matrix = []
        for doc_bow in self.corpus:
            doc_topics = dict(self.lda_model.get_document_topics(doc_bow, minimum_probability=0))
            doc_topic_matrix.append([doc_topics.get(i, 0) for i in range(num_topics)])
        doc_topic_df = pd.DataFrame(doc_topic_matrix, columns=[f'Topic_{i}' for i in range(num_topics)])

        # Heatmap of topic weights per document
        plt.figure(figsize=(12, 8))
        sns.heatmap(doc_topic_df.T, annot=True, cmap='Blues', fmt='.2f')
        plt.title('Document-Topic Distribution Heatmap')
        plt.xlabel('Documents')
        plt.ylabel('Topics')
        plt.tight_layout()
        plt.show()

        # One word cloud per topic (up to 6 panels); unused panels are hidden
        fig, axes = plt.subplots(2, 3, figsize=(15, 10))
        axes = axes.flatten()
        for topic_id in range(min(6, num_topics)):
            topic_words = dict(self.lda_model.show_topic(topic_id, topn=20))
            wordcloud = WordCloud(
                width=300, height=200, background_color='white', colormap='viridis'
            ).generate_from_frequencies(topic_words)
            axes[topic_id].imshow(wordcloud, interpolation='bilinear')
            axes[topic_id].set_title(f'Topic {topic_id}')
            axes[topic_id].axis('off')
        for i in range(num_topics, 6):
            axes[i].axis('off')
        plt.tight_layout()
        plt.show()

    def advanced_topic_analysis(self):
        """Perform advanced topic analysis"""
        print("\n=== Advanced Topic Analysis ===")
        topic_distributions = []
        for i, doc_bow in enumerate(self.corpus):
            doc_topics = self.lda_model.get_document_topics(doc_bow)
            # The dominant topic is the one with the highest probability for this document
            dominant_topic = max(doc_topics, key=lambda x: x[1]) if doc_topics else (0, 0)
            topic_distributions.append({
                'doc_id': i,
                'dominant_topic': dominant_topic[0],
                'topic_probability': dominant_topic[1],
            })
        topic_df = pd.DataFrame(topic_distributions)

        plt.figure(figsize=(10, 6))
        topic_counts = topic_df['dominant_topic'].value_counts().sort_index()
        plt.bar(range(len(topic_counts)), topic_counts.values)
        plt.xlabel('Topic ID')
        plt.ylabel('Number of Documents')
        plt.title('Distribution of Dominant Topics Across Documents')
        plt.xticks(range(len(topic_counts)), [f'Topic {i}' for i in topic_counts.index])
        plt.show()
        return topic_df

    def document_classification_demo(self, new_document):
        """Classify a new document using trained models"""
        print("\n=== Document Classification Demo ===")
        print(f"Classifying: '{new_document[:50]}...'")
        processed_new = preprocess_string(new_document, self.CUSTOM_FILTERS)
        new_doc_bow = self.dictionary.doc2bow(processed_new)  # unknown words are ignored
        doc_topics = self.lda_model.get_document_topics(new_doc_bow)
        print("Topic probabilities:")
        for topic_id, prob in doc_topics:
            print(f"  Topic {topic_id}: {prob:.4f}")
        new_doc_tfidf = self.tfidf_model[new_doc_bow]
        similarities_scores = self.similarity_index[new_doc_tfidf]
        most_similar = np.argmax(similarities_scores)
        print(f"Most similar document: {most_similar} (similarity: {similarities_scores[most_similar]:.4f})")
        return doc_topics, most_similar

    def run_complete_pipeline(self):
        """Execute the complete NLP pipeline"""
        # The source excerpt is cut off here, partway through the line:
        #     print("=== Advanced Gensim NLP Pipeline
        ...
```
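To see what `create_dictionary_and_corpus`, `create_tfidf_model`, and `find_similar_documents` compute without installing Gensim, here is a dependency-free sketch. It is not Gensim’s code: the helper names and the toy documents are hypothetical, and the weighting (raw term count times log2(N / df), followed by L2 normalization) is an assumption meant to approximate `TfidfModel`’s defaults rather than reproduce them exactly. Because the vectors are unit length, the cosine similarity reported by `MatrixSimilarity` reduces to a dot product here.

```python
import math
from collections import Counter


def build_vocab(docs, no_below=2, no_above=0.8):
    """Assign ids to tokens whose document frequency is at least `no_below`
    and at most `no_above` * len(docs). Mimics the idea behind
    Dictionary.filter_extremes; not Gensim's implementation."""
    df = Counter(tok for doc in docs for tok in set(doc))  # documents containing each token
    max_df = no_above * len(docs)
    kept = sorted(tok for tok, n in df.items() if no_below <= n <= max_df)
    return {tok: i for i, tok in enumerate(kept)}, df


def tfidf_vector(doc, vocab, df, n_docs):
    """Sparse TF-IDF vector {token_id: weight}: raw count * log2(N / df),
    then L2-normalized (assumed to approximate TfidfModel's defaults)."""
    counts = Counter(tok for tok in doc if tok in vocab)  # out-of-vocabulary tokens are dropped
    vec = {vocab[tok]: c * math.log2(n_docs / df[tok]) for tok, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {i: w / norm for i, w in vec.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    return sum(w * v.get(i, 0.0) for i, w in u.items())


def most_similar(query_idx, vectors, topn=3):
    """Rank all documents by similarity to vectors[query_idx]."""
    scores = [(i, cosine(vectors[query_idx], v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy, already-preprocessed documents (hypothetical, for illustration only)
docs = [
    ["data", "science", "statistics", "programming"],
    ["big", "data", "analytics", "decisions"],
    ["python", "programming", "data", "analysis"],
    ["statistical", "modeling", "patterns", "datasets"],
    ["text", "mining", "data", "insights"],
]
vocab, df = build_vocab(docs)
vectors = [tfidf_vector(doc, vocab, df, len(docs)) for doc in docs]
print(vocab)                     # {'data': 0, 'programming': 1}
print(most_similar(0, vectors))  # documents 0 and 2 tie at ~1.0
```

On this toy corpus only `data` and `programming` survive `no_below=2`, because every other token occurs in a single document, and document 3 ends up with an empty vector. The tutorial’s 13-sentence corpus is likely to behave similarly, so its dictionary may be quite small; Word2Vec’s `min_count=2`, which counts total occurrences rather than documents, trims the embedding vocabulary in a comparable way.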
