
IBM AI Research Releases Two English Granite Embedding Models, Both Based on the ModernBERT Architecture

IBM has quietly built a strong presence in the open-source AI ecosystem, and its latest release shows why it shouldn't be overlooked. The company has introduced two new embedding models, granite-embedding-english-r2 and granite-embedding-small-english-r2, designed specifically for high-performance retrieval and retrieval-augmented generation (RAG) systems. The models are compact, efficient, and licensed under Apache 2.0, making them ready for commercial deployment.

What Models Did IBM Release?

The two models target different compute budgets. The larger granite-embedding-english-r2 has 149 million parameters with an embedding size of 768, built on a 22-layer ModernBERT encoder. Its smaller counterpart, granite-embedding-small-english-r2, comes in at just 47 million parameters with an embedding size of 384, using a 12-layer ModernBERT encoder. Despite the difference in size, both support a maximum context length of 8192 tokens, a major upgrade over the first-generation Granite embeddings. This long-context capability makes them well suited to enterprise workloads involving long documents and complex retrieval tasks.

https://arxiv.org/abs/2508.21085

What's Inside the Architecture?

Both models are built on the ModernBERT backbone, which introduces several optimizations:

- Alternating global and local attention to balance efficiency with long-range dependencies.
- Rotary positional embeddings (RoPE) tuned for positional interpolation, enabling longer context windows.
- FlashAttention 2 to improve memory usage and throughput at inference time.

IBM also trained these models with a multi-stage pipeline. The process started with masked-language pretraining on a two-trillion-token dataset sourced from the web, Wikipedia, PubMed, BookCorpus, and internal IBM technical documents. This was followed by context extension from 1k to 8k tokens, contrastive learning with distillation from Mistral-7B, and domain-specific tuning for conversational, tabular, and code retrieval tasks.

How Do They Perform on Benchmarks?

The Granite R2 models deliver strong results across widely used retrieval benchmarks. On MTEB-v2 and BEIR, the larger granite-embedding-english-r2 outperforms similarly sized models such as BGE Base, E5, and Arctic Embed. The smaller granite-embedding-small-english-r2 achieves accuracy close to models two to three times its size, making it particularly attractive for latency-sensitive workloads.

https://arxiv.org/abs/2508.21085

Both models also perform well in specialized domains:

- Long-document retrieval (MLDR, LongEmbed), where 8k context support is critical.
- Table retrieval (OTT-QA, FinQA, OpenWikiTables), where structured reasoning is required.
- Code retrieval (CoIR), handling both text-to-code and code-to-text queries.

Are They Fast Enough for Large-Scale Use?

Efficiency is one of the standout aspects of these models. On an Nvidia H100 GPU, granite-embedding-small-english-r2 encodes nearly 200 documents per second, significantly faster than BGE Small and E5 Small. The larger granite-embedding-english-r2 reaches 144 documents per second, outperforming many ModernBERT-based alternatives. Crucially, both models remain practical on CPUs, allowing enterprises to run them in less GPU-intensive environments. This balance of speed, compact size, and retrieval accuracy makes them highly adaptable for real-world deployment.
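To make the retrieval use case concrete, here is a minimal sketch using the sentence-transformers library. The model identifier is assumed from the naming in this article and should be checked against the Hugging Face model card; the documents and query are placeholders.

```python
# Minimal retrieval sketch; the model ID below is assumed from the article's naming.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")  # assumed model ID

docs = [
    "Granite R2 supports a maximum context length of 8192 tokens.",
    "FlashAttention 2 improves memory usage and throughput at inference time.",
    "The small variant has 47M parameters with a 384-dimensional embedding.",
]
query = "How long can the input context be?"

doc_emb = model.encode(docs, normalize_embeddings=True)     # shape: (3, 768)
query_emb = model.encode(query, normalize_embeddings=True)  # shape: (768,)

# Cosine similarity reduces to a dot product on normalized vectors.
scores = util.cos_sim(query_emb, doc_emb)[0]
best = scores.argmax().item()
print(f"Top match (score={scores[best].item():.3f}): {docs[best]}")
```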
What Does This Mean for Retrieval in Practice?

IBM's Granite Embedding R2 models demonstrate that embedding systems don't need massive parameter counts to be effective. They combine long-context support, benchmark-leading accuracy, and high throughput in compact architectures. For companies building retrieval pipelines, knowledge-management systems, or RAG workflows, Granite R2 offers a production-ready, commercially viable alternative to existing open-source options.

https://arxiv.org/abs/2508.21085

Summary

IBM's Granite Embedding R2 models strike an effective balance between compact design, long-context capability, and strong retrieval performance. With throughput optimized for both GPU and CPU environments, and an Apache 2.0 license that permits unrestricted commercial use, they are a practical alternative to bulkier open-source embeddings. For enterprises deploying RAG, search, or large-scale knowledge systems, Granite R2 stands out as an efficient, production-ready option.

Check out the paper, granite-embedding-small-english-r2, and granite-embedding-english-r2. The post IBM AI Research Releases Two English Granite Embedding Models, Both Based on the ModernBERT Architecture appeared first on MarkTechPost.


DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech

arXiv:2509.09631v1 Announce Type: cross Abstract: Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample, requiring not only speaker adaptation but also accurate modeling of prosodic attributes. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. Discrete codec representations have been widely adopted for speech synthesis, and recent works have begun to explore diffusion models in purely discrete settings, suggesting the potential of discrete generative modeling for speech synthesis. However, existing flow-matching methods typically embed these discrete tokens into a continuous space and apply continuous flow matching, which may not fully leverage the advantages of discrete representations. To address these challenges, we introduce DiFlow-TTS, which, to the best of our knowledge, is the first model to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact and unified architecture. It leverages in-context learning by conditioning on textual content, along with prosodic and acoustic attributes extracted from a reference speech, enabling effective attribute cloning in a zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with distinct heads for prosody and acoustic details, allowing it to learn aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS achieves promising performance in several key metrics, including naturalness, prosody, preservation of speaker style, and energy control. It also maintains a compact model size and achieves low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.
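The abstract does not spell out the exact training objective, so the sketch below shows a generic mask-based discrete flow matching training step on placeholder codec tokens; the vocabulary size, mask token, and TinyDenoiser network are illustrative stand-ins, not DiFlow-TTS components.

```python
# Generic mask-based discrete flow matching training step (illustrative only).
import torch
import torch.nn.functional as F

VOCAB = 1024          # placeholder codec vocabulary size
MASK_ID = VOCAB       # extra "mask" token used as the source distribution

class TinyDenoiser(torch.nn.Module):
    """Stand-in for the real token-denoising network."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB + 1, 64)
        self.out = torch.nn.Linear(64, VOCAB)

    def forward(self, tokens, t):
        # A real model would also condition on t, the text, and reference-speech attributes.
        return self.out(self.emb(tokens))

def dfm_training_step(model, clean_tokens):
    """One training step on codec tokens of shape (batch, seq)."""
    b, s = clean_tokens.shape
    t = torch.rand(b, 1)                          # time in [0, 1)
    keep = torch.rand(b, s) < t                   # each token is revealed with prob t
    noisy = torch.where(keep, clean_tokens, torch.full_like(clean_tokens, MASK_ID))
    logits = model(noisy, t)                      # (batch, seq, VOCAB)
    masked = ~keep
    if not masked.any():                          # degenerate draw: nothing left to predict
        masked[..., 0] = True
    # Train the model to recover the clean tokens at the still-masked positions.
    return F.cross_entropy(logits[masked], clean_tokens[masked])

loss = dfm_training_step(TinyDenoiser(), torch.randint(0, VOCAB, (2, 32)))
print(float(loss))
```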


Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition

arXiv:2509.09196v1 Announce Type: new Abstract: Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives “bonus scores” to partial hypotheses (e.g. “Bon”) that may lead to the generation of the rare word (e.g. “Bonham”). If the full word (“Bonham”) isn’t ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
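For readers unfamiliar with the baseline being improved upon, here is a toy character-level sketch of the conventional trie-based bonus/revocation bookkeeping that the proposed K-step prediction avoids; the bonus value and word list are arbitrary.

```python
# Toy biasing trie: award a bonus per matched prefix step and revoke it if the
# hypothesis leaves the trie without completing a rare word.
BONUS = 1.5

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["<end>"] = True
    return root

trie = build_trie(["bonham", "bonsai"])

def rescore(hypothesis, base_score):
    """Walk the hypothesis through the trie, adding and revoking bonuses."""
    score, node, pending = base_score, trie, 0.0
    for ch in hypothesis:
        if ch in node:
            node = node[ch]
            score += BONUS
            pending += BONUS
            if node.get("<end>"):          # full rare word matched: keep the bonus
                node, pending = trie, 0.0
        else:
            score -= pending               # revocation: the partial match went nowhere
            node, pending = trie, 0.0      # (a real decoder would also try to restart here)
    return score

print(rescore("bonham", 0.0))   # bonuses kept: 6 steps * 1.5 = 9.0
print(rescore("bonjour", 0.0))  # "bon" bonus awarded then revoked -> 0.0
```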


Optimizing Length Compression in Large Reasoning Models

arXiv:2506.14755v2 Announce Type: replace-cross Abstract: Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as “invalid thinking” — models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.
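As a rough illustration only (the exact reward definitions are in the paper and repository), the sketch below approximates the two ideas: a length reward that favors shorter traces, and a compress reward that penalizes tokens produced after the correct answer first appears in the thinking.

```python
# Illustrative approximations of LC-R1's two rewards, not the paper's exact formulas.

def length_reward(trace_tokens, max_len=4096):
    # Linear in saved length: 1.0 for an empty trace, 0.0 at the length budget.
    return max(0.0, 1.0 - len(trace_tokens) / max_len)

def compress_reward(trace_tokens, answer_tokens):
    # Reward the fraction of the trace that precedes the first full derivation of the
    # answer; everything after it is treated as "invalid thinking" (double-checking).
    n, m = len(trace_tokens), len(answer_tokens)
    for i in range(n - m + 1):
        if trace_tokens[i:i + m] == answer_tokens:
            return (i + m) / n             # 1.0 means no post-answer double-checking
    return 0.0                             # the answer was never derived in the trace

trace = "let me compute 6 * 7 = 42 . wait , recheck : 6 * 7 = 42 yes".split()
answer = "42".split()
print(length_reward(trace), compress_reward(trace, answer))
```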


Thinking with Many Minds: Using Large Language Models for Multi-Perspective Problem-Solving

arXiv:2501.02348v2 Announce Type: replace Abstract: Complex problem-solving requires cognitive flexibility–the capacity to entertain multiple perspectives while preserving their distinctiveness. This flexibility replicates the “wisdom of crowds” within a single individual, allowing them to “think with many minds.” While mental simulation enables imagined deliberation, cognitive constraints limit its effectiveness. We propose synthetic deliberation, a Large Language Model (LLM)-based method that simulates discourse between agents embodying diverse perspectives, as a solution. Using a custom GPT-based model, we showcase its benefits: concurrent processing of multiple viewpoints without cognitive degradation, parallel exploration of perspectives, and precise control over viewpoint synthesis. By externalizing the deliberative process and distributing cognitive labor between parallel search and integration, synthetic deliberation transcends mental simulation’s limitations. This approach shows promise for strategic planning, policymaking, and conflict resolution.
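A minimal sketch of what such synthetic deliberation could look like in code is given below; ask_llm is a placeholder for any chat-completion client, and the prompts, personas, and round count are illustrative rather than the authors' implementation.

```python
# Hedged sketch of a synthetic-deliberation loop over multiple perspectives.
from typing import Callable, List

def synthetic_deliberation(
    ask_llm: Callable[[str], str],
    problem: str,
    perspectives: List[str],
    rounds: int = 2,
) -> str:
    transcript: List[str] = []
    for _ in range(rounds):
        for persona in perspectives:
            prompt = (
                f"You argue strictly from the perspective of {persona}.\n"
                f"Problem: {problem}\n"
                "Discussion so far:\n" + "\n".join(transcript) +
                "\nGive your next contribution in 2-3 sentences."
            )
            transcript.append(f"[{persona}] {ask_llm(prompt)}")
    # Final integration step: synthesize while preserving distinct viewpoints.
    synthesis_prompt = (
        "Integrate the viewpoints below into one recommendation, "
        "preserving points of disagreement:\n" + "\n".join(transcript)
    )
    return ask_llm(synthesis_prompt)

def echo_stub(prompt: str) -> str:
    return "(model response would appear here)"   # stub so the sketch runs without an API

print(synthetic_deliberation(echo_stub, "Should the city ban cars downtown?",
                             ["an economist", "an urban planner", "a small-business owner"]))
```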


DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning

arXiv:2509.09524v1 Announce Type: new Abstract: This system paper presents the DeMeVa team’s approaches to the third edition of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et al., 2025). We explore two directions: in-context learning (ICL) with large language models, where we compare example sampling strategies; and label distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we evaluate several fine-tuning methods. Our contributions are twofold: (1) we show that ICL can effectively predict annotator-specific annotations (perspectivist annotations), and that aggregating these predictions into soft labels yields competitive performance; and (2) we argue that LDL methods are promising for soft label predictions and merit further exploration by the perspectivist community.
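To illustrate the aggregation step described in contribution (1), here is a small sketch that turns per-annotator predictions into a soft label distribution; the label set and predictions are made up.

```python
# Aggregating annotator-specific (perspectivist) predictions into a soft label.
from collections import Counter

LABELS = ["not_offensive", "offensive"]   # illustrative label set

def to_soft_label(annotator_predictions):
    """annotator_predictions: list of predicted labels, one per simulated annotator."""
    counts = Counter(annotator_predictions)
    total = sum(counts.values())
    return [counts[label] / total for label in LABELS]

# e.g. ICL predictions for annotators A1..A5 on one item:
preds = ["offensive", "offensive", "not_offensive", "offensive", "not_offensive"]
print(dict(zip(LABELS, to_soft_label(preds))))   # {'not_offensive': 0.4, 'offensive': 0.6}
```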


Meet mmBERT: An Encoder-only Language Model Pretrained on 3T Tokens of Multilingual Text in over 1800 Languages and 2–4× Faster than Previous Models

Why was a new multilingual encoder needed?

XLM-RoBERTa (XLM-R) has dominated multilingual NLP for more than five years, an unusually long reign in AI research. While encoder-only models like BERT and RoBERTa were central to early progress, most research energy shifted toward decoder-based generative models. Encoders, however, remain more efficient and often outperform decoders on embedding, retrieval, and classification tasks. Despite this, multilingual encoder development stalled. A team of researchers from Johns Hopkins University proposes mmBERT, which addresses this gap by delivering a modern encoder that surpasses XLM-R and rivals recent large-scale models such as OpenAI's o3 and Google's Gemini 2.5 Pro.

Understanding the architecture of mmBERT

mmBERT comes in two main configurations:

- Base model: 22 transformer layers, 1152 hidden dimension, ~307M parameters (110M non-embedding).
- Small model: ~140M parameters (42M non-embedding).

It adopts the Gemma 2 tokenizer with a 256k vocabulary, rotary position embeddings (RoPE), and FlashAttention2 for efficiency. Sequence length is extended from 1024 to 8192 tokens using unpadded embeddings and sliding-window attention. This allows mmBERT to process contexts nearly an order of magnitude longer than XLM-R while maintaining faster inference.

What training data and phases were used?

mmBERT was trained on 3 trillion tokens spanning 1,833 languages. Data sources include FineWeb2, Dolma, MegaWika v2, ProLong, StarCoder, and others. English makes up only ~10–34% of the corpus, depending on the phase. Training was done in three stages:

- Pre-training: 2.3T tokens across 60 languages and code.
- Mid-training: 600B tokens across 110 languages, focused on higher-quality sources.
- Decay phase: 100B tokens covering 1,833 languages, emphasizing low-resource adaptation.

What new training strategies were introduced?

Three main innovations drive mmBERT's performance (a toy version of the first two appears in the sketch after the benchmark results):

- Annealed Language Learning (ALL): Languages are introduced gradually (60 → 110 → 1,833). Sampling distributions are annealed from high-resource-weighted toward uniform, so low-resource languages gain influence during later stages without overfitting their limited data.
- Inverse masking schedule: The masking ratio starts at 30% and decays to 5%, encouraging coarse-grained learning early and fine-grained refinement later.
- Model merging across decay variants: Multiple decay-phase models (English-heavy, 110-language, and 1,833-language) are combined via TIES merging, leveraging complementary strengths without retraining from scratch.

How does mmBERT perform on benchmarks?

- English NLU (GLUE): mmBERT base achieves 86.3, surpassing XLM-R (83.3) and nearly matching ModernBERT (87.4), despite allocating more than 75% of training to non-English data.
- Multilingual NLU (XTREME): mmBERT base scores 72.8 vs. XLM-R's 70.4, with gains in classification and QA tasks.
- Embedding tasks (MTEB v2): mmBERT base ties ModernBERT in English (53.9 vs. 53.8) and leads in multilingual (54.1 vs. 52.4 for XLM-R).
- Code retrieval (CoIR): mmBERT outperforms XLM-R by ~9 points, though EuroBERT remains stronger on proprietary data.
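To make the annealed language sampling and inverse masking schedule above concrete, here is a toy sketch; the corpus sizes, exponent values, and linear decay are illustrative and not mmBERT's actual schedule.

```python
# Illustrative sketch of (a) temperature-annealed language sampling and
# (b) an inverse masking-ratio schedule; the concrete numbers are not mmBERT's.
import numpy as np

token_counts = {"en": 1.0e12, "de": 2.0e11, "sw": 5.0e9, "fo": 2.0e7}  # made-up corpus sizes

def language_sampling_probs(counts, tau):
    """tau=1.0 -> proportional to corpus size; tau=0.0 -> uniform over languages."""
    langs = list(counts)
    weights = np.array([counts[l] for l in langs]) ** tau
    probs = weights / weights.sum()
    return dict(zip(langs, probs.round(4)))

def masking_ratio(step, total_steps, start=0.30, end=0.05):
    """Inverse masking schedule: heavy masking early, light masking late (linear here)."""
    frac = step / total_steps
    return start + (end - start) * frac

for tau in (0.7, 0.3, 0.0):   # annealed toward uniform across training phases
    print(f"tau={tau}:", language_sampling_probs(token_counts, tau))
print("mask ratio at 0%, 50%, 100% of training:",
      [round(masking_ratio(s, 100), 3) for s in (0, 50, 100)])
```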
How does mmBERT handle low-resource languages?

The annealed learning schedule ensures that low-resource languages benefit during later training. On benchmarks such as Faroese FoQA and Tigrinya TiQuAD, mmBERT significantly outperforms both o3 and Gemini 2.5 Pro. These results demonstrate that encoder models, if trained carefully, can generalize effectively even in extreme low-resource scenarios.

What efficiency gains does mmBERT achieve?

mmBERT is 2–4× faster than XLM-R and MiniLM while supporting 8192-token inputs. Notably, it remains faster at 8192 tokens than older encoders were at 512 tokens. This speed-up derives from the ModernBERT training recipe, efficient attention mechanisms, and optimized embeddings.

Summary

mmBERT arrives as the long-overdue replacement for XLM-R, redefining what a multilingual encoder can deliver. It runs 2–4× faster, handles sequences up to 8k tokens, and outperforms prior models on both high-resource benchmarks and low-resource languages that were underserved in the past. Its training recipe, 3 trillion tokens paired with annealed language learning, inverse masking, and model merging, shows how careful design can unlock broad generalization without excessive redundancy. The result is an open, efficient, and scalable encoder that not only fills the gap left since XLM-R but also provides a robust foundation for the next generation of multilingual NLP systems.

Check out the paper, the model on Hugging Face, the GitHub repository, and the technical details. The post Meet mmBERT: An Encoder-only Language Model Pretrained on 3T Tokens of Multilingual Text in over 1800 Languages and 2–4× Faster than Previous Models appeared first on MarkTechPost.


Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

arXiv:2502.06130v2 Announce Type: replace-cross Abstract: While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.
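The sketch below shows a generic contrastive logit combination of the kind alluded to above, with placeholder tensors; DeGF's actual combination rule, its complementary-decoding variant, and the choice of weighting differ in detail.

```python
# Generic contrastive-decoding step with placeholder logits (illustrative only).
import torch

def contrastive_next_token(logits_original, logits_auxiliary, alpha=1.0):
    """
    logits_original:  next-token logits conditioned on the real input image.
    logits_auxiliary: logits conditioned on the image regenerated from the draft response.
    The combination below boosts tokens whose support comes from the original image
    rather than from the draft-derived auxiliary image; this is one common way to use
    generative feedback to suppress hallucinated continuations.
    """
    combined = (1 + alpha) * logits_original - alpha * logits_auxiliary
    return torch.softmax(combined, dim=-1).argmax(dim=-1)

vocab = 32000
logits_a = torch.randn(1, vocab)   # placeholder values
logits_b = torch.randn(1, vocab)
print(contrastive_next_token(logits_a, logits_b).item())
```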


Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

arXiv:2504.02438v5 Announce Type: replace Abstract: Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLAMP, a hierarchical video-language model that processes hour-long videos at “mixed precision” through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLAMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLAMP’s superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance. Code and model are available at https://github.com/steven-ccq/ViLAMP.
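As a rough stand-in for the differential keyframe selection described above, the sketch below uses an MMR-style greedy rule that trades query relevance against redundancy with already-selected frames; the scoring is illustrative, not the paper's.

```python
# MMR-style greedy keyframe selection: relevant to the query, distinct from each other.
import numpy as np

def select_keyframes(frame_feats, query_feat, k=4, lam=0.7):
    """frame_feats: (num_frames, d) L2-normalized; query_feat: (d,) L2-normalized."""
    relevance = frame_feats @ query_feat                     # cosine similarity to the query
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        # Penalize frames similar to anything already chosen (temporal distinctiveness).
        redundancy = np.max(frame_feats @ frame_feats[selected].T, axis=1)
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = -np.inf                            # never reselect a frame
        selected.append(int(np.argmax(score)))
    return selected

rng = np.random.default_rng(0)
feats = rng.standard_normal((32, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
query = feats[5] + 0.1 * rng.standard_normal(64)
query /= np.linalg.norm(query)
print(select_keyframes(feats, query))
```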


Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

arXiv:2509.08753v1 Announce Type: new Abstract: We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence approaches rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, remaining competitive even with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
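A tiny sketch of the delayed-streams idea is given below: two time-aligned streams are offset by a delay and stacked per frame so a decoder-only model can consume them jointly; the frame labels, padding token, and delay of two frames are illustrative.

```python
# Offsetting time-aligned streams by a delay and stacking them per frame.
PAD = "<pad>"

def delay_stream(stream, delay, length):
    shifted = [PAD] * delay + list(stream)
    return (shifted + [PAD] * length)[:length]

audio_frames = ["a0", "a1", "a2", "a3", "a4", "a5"]
text_frames  = ["t0", "t1", "t2", "t3", "t4", "t5"]
T = len(audio_frames)

# ASR-like setup: the text stream is delayed, so at each step the model has already
# seen the corresponding audio frames before emitting text.
asr_inputs = list(zip(delay_stream(audio_frames, 0, T), delay_stream(text_frames, 2, T)))

# TTS-like setup: delaying audio instead lets the text stream lead the audio stream.
tts_inputs = list(zip(delay_stream(text_frames, 0, T), delay_stream(audio_frames, 2, T)))

print("ASR frames:", asr_inputs)
print("TTS frames:", tts_inputs)
```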

