YouZum



Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models

arXiv:2507.03433v1. Abstract: Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. Model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on an English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.
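To make the extraction setup concrete, here is a minimal sketch of running a Flan-T5 seq2seq model over a social-history snippet. It assumes the public google/flan-t5-large checkpoint as a stand-in for the fine-tuned model, and the prompt format and output schema are illustrative guesses, not the paper's artifacts.

```python
# Hypothetical sketch: SDoH extraction with a Flan-T5 seq2seq model.
# The checkpoint, prompt wording, and output format are assumptions,
# not the fine-tuned model or schema used in the paper.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"  # stand-in; the study fine-tunes this base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

social_history = "Vit seul, marié, deux enfants. Tabac : 10 cigarettes/jour. Alcool occasionnel."
prompt = f"Extract the social determinants of health (category: value) from this note: {social_history}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```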



What Is Context Engineering in AI? Techniques, Use Cases, and Why It Matters

Introduction: What is Context Engineering?

Context engineering refers to the discipline of designing, organizing, and manipulating the context that is fed into large language models (LLMs) to optimize their performance. Rather than fine-tuning the model weights or architectures, context engineering focuses on the input: the prompts, system instructions, retrieved knowledge, formatting, and even the ordering of information.

Context engineering isn't just about crafting better prompts. It's about building systems that deliver the right context, exactly when it's needed. Imagine an AI assistant asked to write a performance review.
- Poor context: it only sees the instruction. The result is vague, generic feedback that lacks insight.
- Rich context: it sees the instruction plus the employee's goals, past reviews, project outcomes, peer feedback, and manager notes. The result? A nuanced, data-backed review that feels informed and personalized, because it is.

This emerging practice is gaining traction due to the increasing reliance on prompt-based models like GPT-4, Claude, and Mistral. The performance of these models is often less about their size and more about the quality of the context they receive. In this sense, context engineering is the equivalent of prompt programming for the era of intelligent agents and retrieval-augmented generation (RAG).

Why Do We Need Context Engineering?

- Token efficiency: With context windows expanding but still bounded (e.g., 128K in GPT-4-Turbo), efficient context management becomes crucial. Redundant or poorly structured context wastes valuable tokens.
- Precision and relevance: LLMs are sensitive to noise. The more targeted and logically arranged the prompt, the higher the likelihood of accurate output.
- Retrieval-augmented generation (RAG): In RAG systems, external data is fetched in real time. Context engineering helps decide what to retrieve, how to chunk it, and how to present it.
- Agentic workflows: When using tools like LangChain or OpenAgents, autonomous agents rely on context to maintain memory, goals, and tool usage. Bad context leads to failures in planning or to hallucination.
- Domain-specific adaptation: Fine-tuning is expensive. Structuring better prompts or building retrieval pipelines lets models perform well on specialized tasks with zero-shot or few-shot learning.

Key Techniques in Context Engineering

Several methodologies and practices are shaping the field:

1. System Prompt Optimization
The system prompt is foundational. It defines the LLM's behavior and style. Techniques include:
- Role assignment (e.g., "You are a data science tutor")
- Instructional framing (e.g., "Think step-by-step")
- Constraint imposition (e.g., "Only output JSON")

2. Prompt Composition and Chaining
LangChain popularized the use of prompt templates and chains to modularize prompting. Chaining allows splitting tasks across prompts, for example decomposing a question, retrieving evidence, then answering.

3. Context Compression
With limited context windows, one can:
- Use summarization models to compress previous conversation
- Embed and cluster similar content to remove redundancy
- Apply structured formats (like tables) instead of verbose prose

4. Dynamic Retrieval and Routing
RAG pipelines (like those in LlamaIndex and LangChain) retrieve documents from vector stores based on user intent. Advanced setups include:
- Query rephrasing or expansion before retrieval
- Multi-vector routing to choose different sources or retrievers
- Context re-ranking based on relevance and recency
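As a concrete illustration of the retrieval and token-budgeting ideas above, here is a minimal sketch of packing retrieved chunks into a prompt under a token budget. The Chunk type, the relevance scores, and the count_tokens helper are generic assumptions, not any particular framework's API.

```python
# Illustrative sketch: rank retrieved chunks by relevance, then pack them into
# the prompt until a token budget is exhausted. The interfaces below are
# placeholders, not a specific library's API.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float   # retrieval relevance from the vector store
    source: str    # metadata used for attribution and sorting

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; swap in tiktoken or the model's own.
    return max(1, len(text) // 4)

def build_context(question: str, chunks: list[Chunk], budget: int = 3000) -> str:
    parts, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if used + cost > budget:
            continue  # skip chunks that would overflow the context window
        parts.append(f"[{chunk.source}] {chunk.text}")
        used += cost
    return (
        "You are a helpful assistant. Answer using only the context below.\n\n"
        + "\n\n".join(parts)
        + f"\n\nQuestion: {question}"
    )
```

In practice, the ranking step could also incorporate recency or re-ranking scores, which is where the routing and re-ranking techniques listed above come in.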
5. Memory Engineering
Short-term memory (what's in the prompt) and long-term memory (retrievable history) need alignment. Techniques include:
- Context replay (injecting past relevant interactions)
- Memory summarization
- Intent-aware memory selection

6. Tool-Augmented Context
In agent-based systems, tool usage is context-aware:
- Tool description formatting
- Tool history summarization
- Observations passed between steps

Context Engineering vs. Prompt Engineering

While related, context engineering is broader and more system-level. Prompt engineering is typically about static, handcrafted input strings. Context engineering encompasses dynamic context construction using embeddings, memory, chaining, and retrieval. As Simon Willison noted, "Context engineering is what we do instead of fine-tuning."

Real-World Applications

- Customer support agents: feeding prior ticket summaries, customer profile data, and KB docs.
- Code assistants: injecting repo-specific documentation, previous commits, and function usage.
- Legal document search: context-aware querying with case history and precedents.
- Education: personalized tutoring agents with memory of learner behavior and goals.

Challenges in Context Engineering

Despite its promise, several pain points remain:
- Latency: retrieval and formatting steps introduce overhead.
- Ranking quality: poor retrieval hurts downstream generation.
- Token budgeting: choosing what to include or exclude is non-trivial.
- Tool interoperability: mixing tools (LangChain, LlamaIndex, custom retrievers) adds complexity.

Emerging Best Practices

- Combine structured (JSON, tables) and unstructured text for better parsing.
- Limit each context injection to a single logical unit (e.g., one document or conversation summary).
- Use metadata (timestamps, authorship) for better sorting and scoring.
- Log, trace, and audit context injections to improve over time.

The Future of Context Engineering

Several trends suggest that context engineering will be foundational in LLM pipelines:
- Model-aware context adaptation: future models may dynamically request the type or format of context they need.
- Self-reflective agents: agents that audit their context, revise their own memory, and flag hallucination risk.
- Standardization: much as JSON became a universal data-interchange format, context templates may become standardized for agents and tools.

As Andrej Karpathy hinted in a recent post, "Context is the new weight update." Rather than retraining models, we are now programming them via their context, making context engineering the dominant software interface in the LLM era.

Conclusion

Context engineering is no longer optional: it is central to unlocking the full capabilities of modern language models. As toolkits like LangChain and LlamaIndex mature and agentic workflows proliferate, mastering context construction becomes as important as model selection. Whether you're building a retrieval system, a coding agent, or a personalized tutor, how you structure the model's context will increasingly define its intelligence.
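To ground the "Emerging Best Practices" listed above (structured formats, one logical unit per injection, metadata for sorting), here is a hypothetical example of a single structured context entry; the field names are illustrative assumptions, not a standard schema.

```python
# Hypothetical structured context entry: one logical unit per injection,
# with metadata (timestamp, source, relevance) that downstream ranking and
# auditing can use. Field names are illustrative only.
import json
from datetime import datetime, timezone

context_entry = {
    "type": "conversation_summary",
    "content": "Customer reported login failures after the 2FA rollout; issue reproduced on iOS only.",
    "source": "ticket-4812",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "relevance": 0.87,
}

# Injected into the prompt as a fenced JSON block so the model can parse it reliably.
prompt_block = json.dumps(context_entry, indent=2)
```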
Sources:
https://x.com/tobi/status/1935533422589399127
https://x.com/karpathy/status/1937902205765607626
https://blog.langchain.com/the-rise-of-context-engineering/
https://rlancemartin.github.io/2025/06/23/context_engineering/
https://www.philschmid.de/context-engineering
https://blog.langchain.com/context-engineering-for-agents/
https://www.llamaindex.ai/blog/context-engineering-what-it-is-and-techniques-to-consider

The post What Is Context Engineering in AI? Techniques, Use Cases, and Why It Matters appeared first on MarkTechPost.



New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning

Optimizing LLMs for Human Alignment Using Reinforcement Learning

Large language models often require a further alignment phase to optimize them for human use. In this phase, reinforcement learning plays a central role by enabling models to make decisions based on human feedback or task-based correctness. This fine-tuning allows the models to align more closely with user expectations, making them more suitable for instruction-based applications or precise mathematical tasks.

Challenges in Choosing Offline vs. Online Reinforcement Learning Strategies

A major difficulty arises when choosing the most effective way to conduct this fine-tuning. Training methods fall into two extremes: offline approaches that depend on static, pre-generated data, and fully online approaches that continuously update with each new interaction. Each method has distinct challenges. Offline models can't adapt during training, which limits performance, while online models often demand more computational resources. Moreover, ensuring that models perform well across both mathematical (verifiable) and open-ended (non-verifiable) tasks adds further complexity to this choice.

Overview of Alignment Algorithms: DPO and GRPO

Historically, tools like Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have been employed for model alignment. DPO operates offline and is designed to work with preference-based data pairs. It is valued for its simplicity and data efficiency but lacks the adaptability of online methods. GRPO is based on the PPO algorithm and handles online fine-tuning by comparing groups of outputs to compute relative advantages. While GRPO adapts in real time and suits dynamic reward systems, its on-policy nature increases computational load and makes experimentation more demanding.

A Balanced Alternative for LLM Alignment

Research introduced by Meta and NYU explored a method to overcome these limitations through a semi-online training setup. This technique modulates how frequently the model's generation and training components are synchronized, rather than updating at every training step (as in fully online methods) or not at all (as in offline setups). The semi-online method strikes a middle ground by adjusting the synchronization rate. Researchers designed this approach to reduce training time and maintain high model adaptability. The modular setup also allowed them to apply either DPO or GRPO with task-specific reward models in a flexible manner.

Instruction Following and Mathematical Reasoning

The methodology involved fine-tuning the Llama-3.1-8B-Instruct model on two types of tasks: open-ended instruction following and math problem solving. For non-verifiable tasks, user prompts were sampled from the WildChat-1M dataset and evaluated using the Athene-RM-8B reward model, which assigns scalar scores to each prompt. For verifiable tasks, the team utilized the NuminaMath dataset in conjunction with the Math-Verify toolkit, which verifies whether generated answers align with expected outputs. Training experiments were conducted on 32 NVIDIA H200 GPUs for training and 8 GPUs for inference, with different setups comparing offline, semi-online, and online synchronization intervals.
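The core idea, synchronizing the generation model with the trained policy only every s steps rather than every step (fully online) or never (offline), can be sketched as illustrative pseudocode. This is not the authors' implementation; the policy is assumed to be a PyTorch-style module, and the sampling, generation, reward, and update functions are caller-supplied placeholders.

```python
# Illustrative semi-online preference optimization loop. The frozen generator
# is re-synchronized with the trained policy every `sync_interval` steps:
# sync_interval=1 recovers fully online training, a very large sync_interval
# approximates the offline setting. All callables are placeholders.
import copy

def train_semi_online(policy, sample_prompts, generate_pairs, rank_with_reward,
                      dpo_update, sync_interval=100, total_steps=10_000):
    generator = copy.deepcopy(policy)  # frozen copy used for rollouts
    for step in range(total_steps):
        prompts = sample_prompts()
        candidates = generate_pairs(generator, prompts)     # rollouts from the possibly stale generator
        preferred, rejected = rank_with_reward(candidates)   # reward model picks chosen/rejected responses
        dpo_update(policy, prompts, preferred, rejected)     # one preference-optimization step on the policy
        if (step + 1) % sync_interval == 0:
            generator.load_state_dict(policy.state_dict())   # periodic re-sync: the "semi-online" knob
    return policy
```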
Performance Gains Across Both Verifiable and Non-Verifiable Tasks

Clear performance differences were observed. On Math500, offline DPO reached 53.7% accuracy, whereas semi-online DPO with a synchronization interval of s = 100 achieved 58.9%. Online DPO and GRPO showed similar results at 58.7% and 58.1%, respectively. Similar trends were observed on the NuminaMath benchmark, where offline DPO achieved 36.4% and semi-online variants increased this to 39.4% (s = 10). The performance gains were not limited to math tasks. When non-verifiable tasks were evaluated with the AlpacaEval 2.0 and Arena-Hard benchmarks, models trained with mixed reward types performed consistently better. Combining verifiable and non-verifiable rewards in a single training setup resulted in stronger average scores, indicating that the method generalized effectively.

A Flexible, Scalable Approach for Reinforcement Learning in LLMs

This study demonstrates that fine-tuning large language models does not require strict adherence to either offline or online setups. By introducing a flexible synchronization scheme, the research team from Meta and NYU effectively increased training efficiency while maintaining or improving performance. The results show that carefully balancing reward types and training synchronization frequency leads to models that perform well across task types without incurring high computational costs.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning appeared first on MarkTechPost.



SynPref-40M and Skywork-Reward-V2: Scalable Human-AI Alignment for State-of-the-Art Reward Models

Understanding the Limitations of Current Reward Models

Although reward models play a crucial role in Reinforcement Learning from Human Feedback (RLHF), many of today's top-performing open models still struggle to reflect the full range of complex human preferences. Even with sophisticated training techniques, meaningful progress has been limited. A major reason appears to be the shortcomings of current preference datasets, which are often too narrow, artificially generated, or poorly vetted. While some rule-based systems are effective for clear tasks like math or coding, they usually fail to capture nuanced human judgment. Moreover, common benchmarks like RewardBench are becoming less reliable indicators of real-world RM performance, showing poor correlation with downstream task success.

Challenges in Preference Data Creation and New Approaches

Creating high-quality preference data has traditionally relied on human annotators, but this method is time-consuming, costly, and sometimes inconsistent. To address this, recent techniques like RLAIF use LLMs to automate annotations, sometimes even outperforming humans. Newer approaches aim to combine the strengths of both by integrating LLM-generated data with human-verified labels. Meanwhile, reward models have evolved from simple scoring systems, such as the Bradley-Terry model, to more complex frameworks, including generative and direct optimization methods. Despite the availability of numerous robust open models and datasets, challenges persist in accurately capturing nuanced human preferences across diverse tasks and languages.

Introducing SynPref-40M: A Large-Scale Human-AI Preference Dataset

Researchers from 2050 Research and Skywork AI introduce SynPref-40M, a massive dataset of 40 million preference pairs curated through a two-stage human-AI pipeline. Human annotators ensure quality through strict verification, while LLMs scale up data curation using human guidance. From this, the team develops Skywork-Reward-V2, a family of eight reward models (0.6B–8B parameters) trained on a high-quality subset of 26 million pairs. These models achieve state-of-the-art results across seven leading benchmarks, excelling in alignment, safety, objectivity, and robustness. The study highlights that success comes not just from data volume, but from careful, iterative curation that blends human expertise with AI scalability.

A Scalable Two-Stage Human-AI Curation Pipeline

Current open reward models often suffer from overfitting to narrow benchmarks, such as RewardBench, which limits their real-world usefulness. To address this, the researchers introduce a two-stage human-AI pipeline for curating large-scale preference data. Stage 1 starts with human-verified annotations that guide LLMs in labeling diverse preference attributes, followed by iterative training and error analysis to refine the reward model. Stage 2 scales the process using consistency checks between the current best model and a human-trained "gold" reward model, filtering reliable samples without further human input. This approach strikes a balance between quality and scalability, ultimately enabling the creation of tens of millions of high-quality preference pairs.
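A rough sketch of the Stage 2 idea is shown below: keep only preference pairs on which the current reward model and the human-trained "gold" reward model agree about which response wins. The scoring interface is an assumption for illustration, not Skywork's actual pipeline.

```python
# Illustrative consistency filter for Stage 2. Each reward model is assumed to
# be a callable score(prompt, response) -> float; this interface is a
# placeholder, not the released codebase.
def filter_by_consistency(pairs, current_rm, gold_rm):
    kept = []
    for prompt, resp_a, resp_b in pairs:
        cur_prefers_a = current_rm(prompt, resp_a) > current_rm(prompt, resp_b)
        gold_prefers_a = gold_rm(prompt, resp_a) > gold_rm(prompt, resp_b)
        if cur_prefers_a == gold_prefers_a:  # the two models agree: treat the pair as reliable
            winner, loser = (resp_a, resp_b) if gold_prefers_a else (resp_b, resp_a)
            kept.append((prompt, winner, loser))
    return kept
```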
Benchmarking Skywork-Reward-V2: Compact Yet Powerful Models

The Skywork-Reward-V2 series demonstrates strong performance across multiple benchmarks, outperforming both larger models (e.g., 70B parameters) and emerging generative reward models. Trained using Qwen3 (0.6B–8B) and Llama 3.1/3.2 (1B–8B) backbones, these models achieve high scores on RewardBench, PPE, RM-Bench, and JudgeBench, with the best-performing variant (Llama-3.1-8B-40M) surpassing all others with an average score of 88.6. Despite their smaller sizes, Skywork-Reward-V2 models benefit from high-quality preference data (SynPref-40M) and efficient training setups, enabling them to generalize better in real-world RLHF scenarios. Notably, even mid-sized models like Qwen3-1.7B outperform some 70B models, emphasizing the impact of training-data quality and methodology over sheer parameter count.

Conclusion and Future Outlook: Scaling with Precision

In conclusion, SynPref-40M is a large-scale preference dataset built through a two-stage human-AI collaboration that combines human judgment with LLM-based scalability. Using a curated subset of 26 million preference pairs, the team developed Skywork-Reward-V2, a suite of eight reward models (0.6B–8B parameters) that outperform existing models across seven key benchmarks. These models show strong generalization in aligning with human values, ensuring correctness, safety, and robustness to bias. Extensive studies confirm that both data quality and the curation method are key drivers of performance. Looking forward, the researchers aim to explore new training strategies as reward models become central to LLM development and alignment.

Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.

The post SynPref-40M and Skywork-Reward-V2: Scalable Human-AI Alignment for State-of-the-Art Reward Models appeared first on MarkTechPost.



Chai Discovery Team Releases Chai-2: AI Model Achieves 16% Hit Rate in De Novo Antibody Design

TL;DR: The Chai Discovery Team introduces Chai-2, a multimodal AI model that enables zero-shot de novo antibody design. Achieving a 16% hit rate across 52 novel targets using ≤20 candidates per target, Chai-2 outperforms prior methods by over 100x and delivers validated binders in under two weeks, eliminating the need for large-scale screening.

In a significant advancement for computational drug discovery, the Chai Discovery Team has introduced Chai-2, a multimodal generative AI platform capable of zero-shot antibody and protein binder design. Unlike previous approaches that rely on extensive high-throughput screening, Chai-2 reliably designs functional binders in a single 24-well plate setup, achieving more than a 100-fold improvement over existing state-of-the-art (SOTA) methods.

Chai-2 was tested on 52 novel targets, none of which had known antibody or nanobody binders in the Protein Data Bank (PDB). Despite this challenge, the system achieved a 16% experimental hit rate, discovering binders for 50% of the tested targets within a two-week cycle from computational design to wet-lab validation. This performance marks a shift from probabilistic screening to deterministic generation in molecular engineering.

AI-Powered De Novo Design at Experimental Scale

Chai-2 integrates an all-atom generative design module and a folding model that predicts antibody-antigen complex structures with double the accuracy of its predecessor, Chai-1. The system operates in a zero-shot setting, generating sequences for antibody modalities like scFvs and VHHs without requiring prior binders. Key features of Chai-2 include:
- No target-specific tuning required
- Ability to prompt designs using epitope-level constraints
- Generation of therapeutically relevant formats (miniproteins, scFvs, VHHs)
- Support for cross-reactivity design between species (e.g., human and cyno)

This approach allows researchers to design ≤20 antibodies or nanobodies per target and bypass the need for high-throughput screening altogether.

Benchmarking Across Diverse Protein Targets

In rigorous lab validations, Chai-2 was applied to targets with no sequence or structure similarity to known antibodies. Designs were synthesized and tested for binding using bio-layer interferometry (BLI). Results show:
- a 15.5% average hit rate across all formats
- 20.0% for VHHs and 13.7% for scFvs
- successful binders for 26 out of 52 targets

Notably, Chai-2 produced hits for hard targets such as TNFα, which has historically been intractable for in silico design. Many binders showed picomolar to low-nanomolar dissociation constants (KDs), indicating high-affinity interactions.

Novelty, Diversity, and Specificity

Chai-2's outputs are structurally and sequentially distinct from known antibodies. Structural analysis showed:
- no generated design had <2 Å RMSD from any known structure
- all CDR sequences had >10 edit distance from the closest known antibody
- binders fell into multiple structural clusters per target, suggesting conformational diversity

Additional evaluations confirmed low off-target binding and polyreactivity profiles comparable to clinical antibodies such as Trastuzumab and Ixekizumab.
Design Flexibility and Customization

Beyond general-purpose binder generation, Chai-2 demonstrates the ability to:
- target multiple epitopes on a single protein
- produce binders across different antibody formats (e.g., scFv, VHH)
- generate cross-species reactive antibodies from a single prompt

In a cross-reactivity case study, a Chai-2-designed antibody achieved nanomolar KDs against both the human and cyno variants of a protein, demonstrating its utility for preclinical studies and therapeutic development.

Implications for Drug Discovery

Chai-2 effectively compresses the traditional biologics discovery timeline from months to weeks, delivering experimentally validated leads in a single round. Its combination of high success rate, design novelty, and modular prompting marks a paradigm shift in therapeutic discovery workflows. The framework can be extended beyond antibodies to miniproteins, macrocycles, enzymes, and potentially small molecules, paving the way for computational-first design paradigms. Future directions include expanding into bispecifics and ADCs, and exploring biophysical property optimization (e.g., viscosity, aggregation). As the field of AI in molecular design matures, Chai-2 sets a new bar for what can be achieved with generative models in real-world drug discovery settings.

Check out the Technical Report. All credit for this research goes to the researchers of this project.

The post Chai Discovery Team Releases Chai-2: AI Model Achieves 16% Hit Rate in De Novo Antibody Design appeared first on MarkTechPost.



Kyutai Releases 2B Parameter Streaming Text-to-Speech TTS with 220ms Latency and 2.5M Hours of Training

Kyutai, an open AI research lab, has released a groundbreaking streaming text-to-speech (TTS) model with ~2 billion parameters. Designed for real-time responsiveness, the model delivers ultra-low-latency audio generation (220 milliseconds) while maintaining high fidelity. It is trained on an unprecedented 2.5 million hours of audio and is licensed under the permissive CC-BY-4.0 license, reinforcing Kyutai's commitment to openness and reproducibility. This advancement redefines the efficiency and accessibility of large-scale speech generation models, particularly for edge deployment and agentic AI.

Unpacking the Performance: Sub-350ms Latency for 32 Concurrent Users on a Single L40 GPU

The model's streaming capability is its most distinctive feature. On a single NVIDIA L40 GPU, the system can serve up to 32 concurrent users while keeping latency under 350 ms. For individual use, the model maintains a generation latency as low as 220 ms, enabling nearly real-time applications such as conversational agents, voice assistants, and live narration systems. This performance is enabled through Kyutai's novel Delayed Streams Modeling approach, which allows the model to generate speech incrementally as text arrives.

Key technical metrics:
- Model size: ~2B parameters
- Training data: 2.5 million hours of speech
- Latency: 220 ms single-user, <350 ms with 32 users on one L40 GPU
- Language support: English and French
- License: CC-BY-4.0 (open source)

Delayed Streams Modeling: Architecting Real-Time Responsiveness

Kyutai's innovation is anchored in Delayed Streams Modeling, a technique that allows speech synthesis to begin before the full input text is available. This approach is specifically designed to balance prediction quality with response speed, enabling high-throughput streaming TTS. Unlike conventional autoregressive models that suffer from response lag, this architecture maintains temporal coherence while achieving faster-than-real-time synthesis. The codebase and training recipe for this architecture are available in Kyutai's GitHub repository, supporting full reproducibility and community contributions.

Model Availability and Open Research Commitment

Kyutai has released the model weights and inference scripts on Hugging Face, making them accessible to researchers, developers, and commercial teams. The permissive CC-BY-4.0 license encourages unrestricted adaptation and integration into applications, provided proper attribution is maintained. The release supports both batch and streaming inference, making it a versatile foundation for voice cloning, real-time chatbots, accessibility tools, and more. With pretrained models in both English and French, Kyutai sets the stage for multilingual TTS pipelines.

Implications for Real-Time AI Applications

By reducing speech generation latency to the 200 ms range, Kyutai's model narrows the human-perceptible delay between intent and speech, making it viable for:
- Conversational AI: human-like voice interfaces with low turnaround
- Assistive tech: faster screen readers and voice feedback systems
- Media production: voiceovers with rapid iteration cycles
- Edge devices: optimized inference for low-power or on-device environments

The ability to serve 32 users on a single L40 GPU without quality degradation also makes it attractive for scaling speech services efficiently in cloud environments.
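To illustrate what streaming synthesis means for perceived latency (playback begins as soon as the first audio chunk is ready, rather than after the full utterance is generated), here is a hypothetical consumer loop. The synthesize_stream and play_chunk callables are placeholders, not Kyutai's published API; the real inference scripts live in the GitHub repository mentioned above.

```python
# Hypothetical streaming-consumption loop: audio chunks are played as they are
# produced instead of waiting for the whole utterance. `synthesize_stream` and
# `play_chunk` are assumed, user-supplied interfaces, not Kyutai's actual API.
import time

def stream_and_play(text, synthesize_stream, play_chunk):
    start = time.perf_counter()
    first_chunk_latency = None
    for audio_chunk in synthesize_stream(text):  # yields audio incrementally as text is consumed
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start  # ~220 ms in the single-user setting
        play_chunk(audio_chunk)
    return first_chunk_latency
```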
Conclusion: Open, Fast, and Ready for Deployment

Kyutai's streaming TTS release is a milestone in speech AI. With high-quality synthesis, real-time latency, and generous licensing, it addresses critical needs for both researchers and real-world product teams. The model's reproducibility, multilingual support, and scalable performance make it a standout alternative to proprietary solutions. For more details, you can explore the official model card on Hugging Face, the technical explanation on Kyutai's site, and implementation specifics on GitHub.

The post Kyutai Releases 2B Parameter Streaming Text-to-Speech TTS with 220ms Latency and 2.5M Hours of Training appeared first on MarkTechPost.



AbstRaL: Teaching LLMs Abstract Reasoning via Reinforcement to Boost Robustness on GSM Benchmarks

Recent research indicates that LLMs, particularly smaller ones, frequently struggle with robust reasoning. They tend to perform well on familiar questions but falter when those same problems are slightly altered, such as by changing names or numbers or adding irrelevant but related information. This weakness, known as poor out-of-distribution (OOD) generalization, results in notable accuracy drops, even on simple math tasks. One promising solution is to create synthetic variations of reasoning problems, helping models learn to focus on the underlying logic rather than surface details. Strengthening reasoning in this manner is crucial for developing more general and reliable AI systems.

Abstracting the Core Logic of LLM Reasoning Failures

LLMs have demonstrated impressive reasoning capabilities, yet they often falter when exposed to distribution shifts, such as changes in phrasing, numerical values, or the introduction of distractions. This vulnerability is evident across benchmarks in logic, mathematics, and commonsense reasoning. Prior solutions have relied on data augmentation to expose models to a broader variety of inputs, improving robustness but increasing computational demands. Researchers have also explored formats such as abstraction-of-thought and chain-of-abstraction to teach abstract reasoning, while planning techniques like chain-of-thought and tree-of-thought aid step-by-step problem solving. Reinforcement learning and preference-based methods provide additional support for developing reasoning skills beyond pattern memorization.

AbstRaL's Symbolic Learning Method to Improve Reasoning Consistency

Researchers from Apple and EPFL propose AbstRaL, a method that teaches LLMs to understand abstract reasoning patterns rather than memorize surface details. Instead of generating many varied training examples, which is computationally costly, AbstRaL helps LLMs learn the underlying structure of reasoning problems using reinforcement learning. This method connects these abstract patterns to symbolic tools, enabling more reliable problem solving. Tested on GSM benchmarks, AbstRaL significantly improves LLM performance, especially when the model is faced with input changes or distracting information. It outperforms models trained only with supervised learning by promoting more consistent and context-independent reasoning.

Four Steps to Abstract Symbolic Reasoning via AbstRaL

AbstRaL is a four-step framework designed to teach LLMs to reason abstractly rather than rely on surface patterns. First, it identifies the key variables in a question and replaces them with symbolic placeholders. Then, using specially crafted data (GranulAR), the model learns to reason step by step with these abstract symbols. Next, it retrieves the general reasoning structure (the abstraction) from the symbolic answer. Finally, it uses this abstraction together with the original values to compute the correct answer. Reinforcement learning with two rewards, one for correctness and another for symbolic similarity, further improves the model's ability to generate accurate, context-independent reasoning patterns.
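As a toy illustration of the four steps above (and not the paper's implementation), one can replace the numbers in a GSM-style question with symbolic placeholders, express the reasoning over those symbols, and substitute the original values back only at the final step. The regex-based abstraction and the hand-written reasoning pattern below are illustrative assumptions.

```python
# Toy illustration of abstract reasoning on a GSM-style problem: numbers are
# replaced by placeholders x0, x1, ..., the reasoning is expressed over the
# symbols, and the original values are grounded back at the end. This is a
# sketch of the idea, not AbstRaL's actual pipeline.
import re
from itertools import count

question = "Tom has 3 boxes with 12 apples each. He gives away 5 apples. How many apples are left?"

# Step 1: abstract the surface values into symbolic placeholders.
values = [int(n) for n in re.findall(r"\d+", question)]
counter = count()
abstract_q = re.sub(r"\d+", lambda _: f"x{next(counter)}", question)

# Steps 2-3: reason over the symbols; the recovered abstraction here is x0 * x1 - x2.
def abstraction(x0, x1, x2):
    return x0 * x1 - x2

# Step 4: ground the abstraction with the original values to obtain the answer.
print(abstract_q)            # Tom has x0 boxes with x1 apples each. He gives away x2 apples. ...
print(abstraction(*values))  # 31
```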
GSM8K Variations Reveal AbstRaL's Robustness Across LLM Sizes

The researchers evaluate AbstRaL on math reasoning tasks using models such as Llama-3 and Qwen2, training them with a dataset called GranulAR that rewrites math problems in an abstract symbolic form. This helps the models focus on structure rather than surface details. They test robustness using altered versions of GSM8K problems that change numbers, names, and phrasing. Compared to baselines such as standard chain-of-thought prompting, AbstRaL shows stronger consistency and a smaller accuracy drop on these variations. Especially for smaller models, it improves reliability across reworded inputs. The results suggest that teaching models to reason abstractly makes them more adaptable and less reliant on memorized patterns.

Teaching LLMs Abstract Thinking through Reinforcement Yields Robust Reasoning

In conclusion, AbstRaL is a method designed to enhance abstract reasoning in LLMs, making them more resilient to superficial changes in problems. Unlike traditional fine-tuning or data augmentation, AbstRaL uses reinforcement learning to train models on GranulAR rationales that mix Socratic chain-of-thought with detailed abstraction. This approach helps models strip away surface-level distractions and better connect with symbolic tools. Tested on challenging GSM8K perturbation benchmarks, AbstRaL notably reduces performance drops under distribution shifts, particularly in smaller models. The study shows that learning to abstract improves reasoning robustness more effectively than relying solely on direct supervision.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post AbstRaL: Teaching LLMs Abstract Reasoning via Reinforcement to Boost Robustness on GSM Benchmarks appeared first on MarkTechPost.



Inside India’s scramble for AI independence

In Bengaluru, India, Adithya Kolavi felt a mix of excitement and validation as he watched DeepSeek unleash its disruptive language model on the world earlier this year. The Chinese technology rivaled the best of the West in terms of benchmarks, but it had been built with far less capital in far less time.

“I thought: ‘This is how we disrupt with less,’” says Kolavi, the 20-year-old founder of the Indian AI startup CognitiveLab. “If DeepSeek could do it, why not us?”

But for Abhishek Upperwal, founder of Soket AI Labs and architect of one of India’s earliest efforts to develop a foundation model, the moment felt more bittersweet.

Upperwal’s model, called Pragna-1B, had struggled to stay afloat with tiny grants while he watched global peers raise millions. The multilingual model had a relatively modest 1.25 billion parameters and was designed to reduce the “language tax,” the extra costs that arise because India—unlike the US or even China—has a multitude of languages to support. His team had trained it, but limited resources meant it couldn’t scale. As a result, he says, the project became a proof of concept rather than a product.

“If we had been funded two years ago, there’s a good chance we’d be the ones building what DeepSeek just released,” he says.

Kolavi’s enthusiasm and Upperwal’s dismay reflect the spectrum of emotions among India’s AI builders. Despite its status as a global tech hub, the country lags far behind the likes of the US and China when it comes to homegrown AI. That gap has opened largely because India has chronically underinvested in R&D, institutions, and invention. Meanwhile, since no one native language is spoken by the majority of the population, training language models is far more complicated than it is elsewhere.

Historically known as the global back office for the software industry, India has a tech ecosystem that evolved with a services-first mindset. Giants like Infosys and TCS built their success on efficient software delivery, but invention was neither prioritized nor rewarded. Meanwhile, India’s R&D spending hovered at just 0.65% of GDP ($25.4 billion) in 2024, far behind China’s 2.68% ($476.2 billion) and the US’s 3.5% ($962.3 billion). The muscle to invent and commercialize deep tech, from algorithms to chips, was just never built.

Isolated pockets of world-class research do exist within government agencies like the DRDO (Defense Research & Development Organization) and ISRO (Indian Space Research Organization), but their breakthroughs rarely spill into civilian or commercial use. India lacks the bridges to connect risk-taking research to commercial pathways, the way DARPA does in the US. Meanwhile, much of India’s top talent migrates abroad, drawn to ecosystems that better understand and, crucially, fund deep tech.

So when the open-source foundation model DeepSeek-R1 suddenly outperformed many global peers, it struck a nerve. This launch by a Chinese startup prompted Indian policymakers to confront just how far behind the country was in AI infrastructure, and how urgently it needed to respond.

India responds

In January 2025, 10 days after DeepSeek-R1’s launch, the Ministry of Electronics and Information Technology (MeitY) solicited proposals for India’s own foundation models, which are large AI models that can be adapted to a wide range of tasks. Its public tender invited private-sector cloud and data-center companies to reserve GPU compute capacity for government-led AI research.
Providers including Jio, Yotta, E2E Networks, Tata, AWS partners, and CDAC responded. Through this arrangement, MeitY suddenly had access to nearly 19,000 GPUs at subsidized rates, repurposed from private infrastructure and allocated specifically to foundational AI projects. This triggered a surge of proposals from companies wanting to build their own models.

Within two weeks, it had 67 proposals in hand. That number tripled by mid-March.

In April, the government announced plans to develop six large-scale models by the end of 2025, plus 18 additional AI applications targeting sectors like agriculture, education, and climate action. Most notably, it tapped Sarvam AI to build a 70-billion-parameter model optimized for Indian languages and needs.

For a nation long restricted by limited research infrastructure, things moved at record speed, marking a rare convergence of ambition, talent, and political will. “India could do a Mangalyaan in AI,” said Gautam Shroff of IIIT-Delhi, referencing the country’s cost-effective, and successful, Mars orbiter mission.

Jaspreet Bindra, cofounder of AI&Beyond, an organization focused on teaching AI literacy, captured the urgency: “DeepSeek is probably the best thing that happened to India. It gave us a kick in the backside to stop talking and start doing something.”

The language problem

One of the most fundamental challenges in building foundational AI models for India is the country’s sheer linguistic diversity. With 22 official languages, hundreds of dialects, and millions of people who are multilingual, India poses a problem that few existing LLMs are equipped to handle.

Whereas a massive amount of high-quality web data is available in English, Indian languages collectively make up less than 1% of online content. The lack of digitized, labeled, and cleaned data in languages like Bhojpuri and Kannada makes it difficult to train LLMs that understand how Indians actually speak or search. Global tokenizers, which break text into units a model can process, also perform poorly on many Indian scripts, misinterpreting characters or skipping some altogether. As a result, even when Indian languages are included in multilingual models, they’re often poorly understood and inaccurately generated.

And unlike OpenAI and DeepSeek, which achieved scale using structured English-language data, Indian teams often begin with fragmented and low-quality data sets encompassing dozens of Indian languages. This makes the early steps of training foundation models far more complex.

Nonetheless, a small but determined group of Indian builders is starting to shape the country’s AI future. For example, Sarvam AI has created OpenHathi-Hi-v0.1, an open-source Hindi language model that shows the Indian AI field’s growing ability to address the country’s vast linguistic diversity. The model, built on Meta’s Llama 2 architecture, was trained on 40 billion tokens of Hindi and related Indian-language content, making it one of the largest open-source Hindi models available.

