YouZum


Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools

arXiv:2507.05305v1 Announce Type: cross Abstract: Frontier large language models (LLMs) like ChatGPT and Gemini can decipher cryptic compiler errors for novice programmers, but their computational scale, cost, and tendency to over-assist make them problematic for widespread pedagogical adoption. This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real errors generated by introductory programming (CS1/2) students, to fine-tune three open-source models: Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. We performed a dual evaluation, combining expert human reviews with a large-scale automated analysis of 8,000 responses using a validated LLM-as-judge ensemble. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models. We analyse the trade-offs between model size and quality, confirming that fine-tuning compact, efficient models on high-quality, domain-specific data is a potent strategy for creating specialised models to drive educational tools. We provide a replicable methodology to foster broader access to generative AI capabilities in educational contexts.
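The recipe the abstract describes is standard supervised fine-tuning on (compiler error, explanation) pairs. As a rough illustration only, a minimal SFT step with Hugging Face Transformers could look like the sketch below; the prompt template, field names, and hyperparameters are assumptions for illustration, not the authors' released setup.

```python
# Minimal SFT sketch (assumed prompt template, fields, and hyperparameters).
# Fine-tunes a small open model to explain C compiler errors for CS1/2 students.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-4B"  # one of the base models named in the abstract
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def sft_step(error_msg: str, student_code: str, explanation: str) -> float:
    """One supervised step: the model learns to reproduce the reference explanation."""
    prompt = (
        "You are a tutor for introductory C programming.\n"
        f"Compiler error:\n{error_msg}\n\nStudent code:\n{student_code}\n\n"
        "Explain the error for a novice:\n"
    )
    # Plain causal-LM SFT: loss over the full sequence (prompt-token masking omitted for brevity).
    batch = tokenizer(prompt + explanation, return_tensors="pt", truncation=True, max_length=1024)
    labels = batch["input_ids"].clone()
    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"], labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

In practice one would batch examples, mask the prompt tokens out of the loss, and train with a library such as TRL; the sketch only shows the core objective.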


Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right

arXiv:2507.03473v1 Announce Type: new Abstract: Despite mounting evidence that multilinguality can be easily weaponized against language models (LMs), works across NLP Security remain overwhelmingly English-centric. In terms of securing LMs, the NLP norm of “English first” collides with standard procedure in cybersecurity, whereby practitioners are expected to anticipate and prepare for worst-case outcomes. To mitigate worst-case outcomes in NLP Security, researchers must be willing to engage with the weakest links in LM security: lower-resourced languages. Accordingly, this work examines the security of LMs for lower- and medium-resourced languages. We extend existing adversarial attacks for up to 70 languages to evaluate the security of monolingual and multilingual LMs for these languages. Through our analysis, we find that monolingual models are often too small in total number of parameters to ensure sound security, and that while multilinguality is helpful, it does not always guarantee improved security either. Ultimately, these findings highlight important considerations for more secure deployment of LMs, for communities of lower-resourced languages.


Towards Understanding the Cognitive Habits of Large Reasoning Models

arXiv:2506.21571v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs), which autonomously produce a reasoning Chain of Thought (CoT) before producing final responses, offer a promising approach to interpreting and monitoring model behaviors. Inspired by the observation that certain CoT patterns — e.g., “Wait, did I miss anything?” — consistently emerge across tasks, we explore whether LRMs exhibit human-like cognitive habits. Building on Habits of Mind, a well-established framework of cognitive habits associated with successful human problem-solving, we introduce CogTest, a principled benchmark designed to evaluate LRMs’ cognitive habits. CogTest includes 16 cognitive habits, each instantiated with 25 diverse tasks, and employs an evidence-first extraction method to ensure reliable habit identification. With CogTest, we conduct a comprehensive evaluation of 16 widely used LLMs (13 LRMs and 3 non-reasoning ones). Our findings reveal that LRMs, unlike conventional LLMs, not only exhibit human-like habits but also adaptively deploy them according to different tasks. Finer-grained analyses further uncover patterns of similarity and difference in LRMs’ cognitive habit profiles, particularly certain inter-family similarity (e.g., Qwen-3 models and DeepSeek-R1). Extending the study to safety-related tasks, we observe that certain habits, such as Taking Responsible Risks, are strongly associated with the generation of harmful responses. These findings suggest that studying persistent behavioral patterns in LRMs’ CoTs is a valuable step toward deeper understanding of LLM misbehavior. The code is available at: https://github.com/jianshuod/CogTest.


Self-Consistency Preference Optimization

arXiv:2411.04109v3 Announce Type: replace Abstract: Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
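The core mechanism is easy to state concretely: sample several answers per unlabeled problem, treat the most frequent (most self-consistent) answer as preferred and a minority answer as rejected. The sketch below illustrates only that pair-construction step; the function names and sampling interface are illustrative assumptions, not the paper's code.

```python
# Sketch of ScPO-style preference-pair construction via self-consistency.
# `sample_answers` is a hypothetical stand-in for sampling N solutions from the current model.
from collections import Counter
from typing import Callable, Optional

def build_preference_pair(
    problem: str,
    sample_answers: Callable[[str, int], list[str]],
    n_samples: int = 16,
) -> Optional[tuple[str, str]]:
    """Return a (chosen, rejected) answer pair for one unlabeled problem, or None if all samples agree."""
    answers = sample_answers(problem, n_samples)
    counts = Counter(answers)
    if len(counts) < 2:
        return None  # no disagreement, hence no preference signal
    ranked = counts.most_common()
    chosen, _ = ranked[0]     # most self-consistent answer
    rejected, _ = ranked[-1]  # least consistent answer
    return chosen, rejected
```

Per the abstract, such pairs then drive an iterative preference-optimization objective on unsupervised new problems.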


Demystifying ChatGPT: How It Masters Genre Recognition

arXiv:2507.03875v1 Announce Type: new Abstract: The introduction of ChatGPT has garnered significant attention within the NLP community and beyond. Previous studies have demonstrated ChatGPT’s substantial advancements across various downstream NLP tasks, highlighting its adaptability and potential to revolutionize language-related applications. However, its capabilities and limitations in genre prediction remain unclear. This work analyzes three Large Language Models (LLMs) using the MovieLens-100K dataset to assess their genre prediction capabilities. Our findings show that ChatGPT, without fine-tuning, outperformed other LLMs, and fine-tuned ChatGPT performed best overall. We set up zero-shot and few-shot prompts using audio transcripts/subtitles from movie trailers in the MovieLens-100K dataset, covering 1682 movies of 18 genres, where each movie can have multiple genres. Additionally, we extended our study by extracting IMDb movie posters to utilize a Vision Language Model (VLM) with prompts for poster information. This fine-grained information was used to enhance existing LLM prompts. In conclusion, our study reveals ChatGPT’s remarkable genre prediction capabilities, surpassing other language models. The integration of VLM further enhances our findings, showcasing ChatGPT’s potential for content-related applications by incorporating visual information from movie posters.
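The abstract describes zero-shot and few-shot prompting over trailer transcripts for multi-label genre prediction. A minimal zero-shot prompt along those lines might look like the sketch below; the wording and output format are assumptions, not the authors' prompts, while the label set is the standard MovieLens-100K genre list.

```python
# Hypothetical zero-shot prompt builder for multi-label genre prediction from a trailer transcript.
GENRES = [
    "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]

def build_genre_prompt(transcript: str, max_chars: int = 4000) -> str:
    """Build a zero-shot prompt asking for one or more genres from a fixed label set."""
    return (
        "You are a film genre classifier.\n"
        f"Allowed genres: {', '.join(GENRES)}\n"
        "A movie may belong to several genres.\n"
        "Trailer transcript:\n"
        f"{transcript[:max_chars]}\n\n"
        "Answer with a comma-separated list of genres only."
    )
```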


Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models

arXiv:2507.03433v1 Announce Type: new Abstract: Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. Model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on an English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.
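Concretely, this kind of setup casts SDoH extraction as sequence-to-sequence generation: a social-history section goes in, and a structured string of categories and values comes out. A minimal Flan-T5 fine-tuning step along those lines is sketched below; the instruction wording, target serialization, and hyperparameters are illustrative assumptions, not the study's released code.

```python
# Sketch of seq2seq fine-tuning for SDoH extraction (assumed prompt and target format).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(social_history: str, target: str) -> float:
    """One supervised step mapping a social-history section to a structured SDoH string."""
    source = "Extract social determinants of health: " + social_history
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    labels = tokenizer(text_target=target, return_tensors="pt", truncation=True, max_length=256)
    out = model(**inputs, labels=labels["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Hypothetical annotation format for illustration only:
# train_step("Vit seul, tabagisme actif 10 cig/jour.",
#            "living condition: alone | tobacco: current, 10 cigarettes/day")
```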


What Is Context Engineering in AI? Techniques, Use Cases, and Why It Matters

Introduction: What is Context Engineering?

Context engineering refers to the discipline of designing, organizing, and manipulating the context that is fed into large language models (LLMs) to optimize their performance. Rather than fine-tuning the model weights or architectures, context engineering focuses on the input: the prompts, system instructions, retrieved knowledge, formatting, and even the ordering of information.

Context engineering isn't about crafting better prompts. It's about building systems that deliver the right context, exactly when it's needed. Imagine an AI assistant asked to write a performance review.
→ Poor Context: It only sees the instruction. The result is vague, generic feedback that lacks insight.
→ Rich Context: It sees the instruction plus the employee's goals, past reviews, project outcomes, peer feedback, and manager notes. The result? A nuanced, data-backed review that feels informed and personalized, because it is.

This emerging practice is gaining traction due to the increasing reliance on prompt-based models like GPT-4, Claude, and Mistral. The performance of these models is often less about their size and more about the quality of the context they receive. In this sense, context engineering is the equivalent of prompt programming for the era of intelligent agents and retrieval-augmented generation (RAG).

Why Do We Need Context Engineering?

- Token Efficiency: With context windows expanding but still bounded (e.g., 128K in GPT-4-Turbo), efficient context management becomes crucial. Redundant or poorly structured context wastes valuable tokens.
- Precision and Relevance: LLMs are sensitive to noise. The more targeted and logically arranged the prompt, the higher the likelihood of accurate output.
- Retrieval-Augmented Generation (RAG): In RAG systems, external data is fetched in real time. Context engineering helps decide what to retrieve, how to chunk it, and how to present it.
- Agentic Workflows: When using tools like LangChain or OpenAgents, autonomous agents rely on context to maintain memory, goals, and tool usage. Bad context leads to failure in planning or hallucination.
- Domain-Specific Adaptation: Fine-tuning is expensive. Structuring better prompts or building retrieval pipelines lets models perform well in specialized tasks with zero-shot or few-shot learning.

Key Techniques in Context Engineering

Several methodologies and practices are shaping the field:

1. System Prompt Optimization
The system prompt is foundational. It defines the LLM's behavior and style. Techniques include:
- Role assignment (e.g., "You are a data science tutor")
- Instructional framing (e.g., "Think step-by-step")
- Constraint imposition (e.g., "Only output JSON")

2. Prompt Composition and Chaining
LangChain popularized the use of prompt templates and chains to modularize prompting. Chaining allows splitting tasks across prompts, for example decomposing a question, retrieving evidence, then answering.

3. Context Compression
With limited context windows, one can:
- Use summarization models to compress previous conversation
- Embed and cluster similar content to remove redundancy
- Apply structured formats (like tables) instead of verbose prose

4. Dynamic Retrieval and Routing
RAG pipelines (like those in LlamaIndex and LangChain) retrieve documents from vector stores based on user intent. Advanced setups include:
- Query rephrasing or expansion before retrieval
- Multi-vector routing to choose different sources or retrievers
- Context re-ranking based on relevance and recency
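To make techniques 3 and 4 above concrete, the sketch below assembles a prompt from retrieved chunks under a fixed token budget, re-ranking candidates by a combined relevance and recency score. The scoring formula, retriever interface, and budget are illustrative assumptions rather than any particular framework's API.

```python
# Sketch: context assembly with re-ranking and a token budget (framework-agnostic illustration).
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Chunk:
    text: str
    relevance: float   # e.g., cosine similarity from a vector store
    timestamp: float   # unix time of the source document

def estimate_tokens(text: str) -> int:
    # Rough heuristic; a real system would use the target model's tokenizer.
    return max(1, len(text) // 4)

def assemble_context(
    question: str,
    retrieve: Callable[[str, int], list[Chunk]],  # hypothetical retriever callable
    token_budget: int = 3000,
    recency_weight: float = 0.2,
) -> str:
    """Retrieve candidates, re-rank by relevance plus recency, and pack them under a token budget."""
    candidates = retrieve(question, 20)
    now = time.time()

    def score(c: Chunk) -> float:
        age_days = (now - c.timestamp) / 86400
        return c.relevance + recency_weight / (1.0 + age_days)

    selected, used = [], 0
    for chunk in sorted(candidates, key=score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost > token_budget:
            continue  # skip chunks that would exceed the budget; smaller ones may still fit
        selected.append(chunk.text)
        used += cost
    return "Context:\n" + "\n---\n".join(selected) + f"\n\nQuestion: {question}"
```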
5. Memory Engineering
Short-term memory (what's in the prompt) and long-term memory (retrievable history) need alignment. Techniques include:
- Context replay (injecting past relevant interactions)
- Memory summarization
- Intent-aware memory selection

6. Tool-Augmented Context
In agent-based systems, tool usage is context-aware:
- Tool description formatting
- Tool history summarization
- Observations passed between steps

Context Engineering vs. Prompt Engineering

While related, context engineering is broader and more system-level. Prompt engineering is typically about static, handcrafted input strings. Context engineering encompasses dynamic context construction using embeddings, memory, chaining, and retrieval. As Simon Willison noted, "Context engineering is what we do instead of fine-tuning."

Real-World Applications

- Customer Support Agents: Feeding prior ticket summaries, customer profile data, and KB docs.
- Code Assistants: Injecting repo-specific documentation, previous commits, and function usage.
- Legal Document Search: Context-aware querying with case history and precedents.
- Education: Personalized tutoring agents with memory of learner behavior and goals.

Challenges in Context Engineering

Despite its promise, several pain points remain:
- Latency: Retrieval and formatting steps introduce overhead.
- Ranking Quality: Poor retrieval hurts downstream generation.
- Token Budgeting: Choosing what to include/exclude is non-trivial.
- Tool Interoperability: Mixing tools (LangChain, LlamaIndex, custom retrievers) adds complexity.

Emerging Best Practices

- Combine structured (JSON, tables) and unstructured text for better parsing.
- Limit each context injection to a single logical unit (e.g., one document or conversation summary).
- Use metadata (timestamps, authorship) for better sorting and scoring.
- Log, trace, and audit context injections to improve over time.

The Future of Context Engineering

Several trends suggest that context engineering will be foundational in LLM pipelines:
- Model-Aware Context Adaptation: Future models may dynamically request the type or format of context they need.
- Self-Reflective Agents: Agents that audit their context, revise their own memory, and flag hallucination risk.
- Standardization: Similar to how JSON became a universal data interchange format, context templates may become standardized for agents and tools.

As Andrej Karpathy hinted in a recent post, "Context is the new weight update." Rather than retraining models, we are now programming them via their context, making context engineering the dominant software interface in the LLM era.

Conclusion

Context engineering is no longer optional: it is central to unlocking the full capabilities of modern language models. As toolkits like LangChain and LlamaIndex mature and agentic workflows proliferate, mastering context construction becomes as important as model selection. Whether you're building a retrieval system, a coding agent, or a personalized tutor, how you structure the model's context will increasingly define its intelligence.
Sources:
https://x.com/tobi/status/1935533422589399127
https://x.com/karpathy/status/1937902205765607626
https://blog.langchain.com/the-rise-of-context-engineering/
https://rlancemartin.github.io/2025/06/23/context_engineering/
https://www.philschmid.de/context-engineering
https://blog.langchain.com/context-engineering-for-agents/
https://www.llamaindex.ai/blog/context-engineering-what-it-is-and-techniques-to-consider


New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning

Optimizing LLMs for Human Alignment Using Reinforcement Learning

Large language models often require a further alignment phase to optimize them for human use. In this phase, reinforcement learning plays a central role by enabling models to make decisions based on human feedback or task-based correctness. This fine-tuning allows the models to align more closely with user expectations, making them more suitable for instruction-based applications or precise mathematical tasks.

Challenges in Choosing Offline vs. Online Reinforcement Learning Strategies

A major difficulty arises when choosing the most effective way to conduct this fine-tuning. Training methods fall into two extremes: offline approaches that depend on static, pre-generated data, and fully online approaches that continuously update with each new interaction. Each method has distinct challenges. Offline models can't adapt during training, which limits performance, while online models often demand more computational resources. Moreover, ensuring that models perform well across both mathematical (verifiable) and open-ended (non-verifiable) tasks adds further complexity to this choice.

Overview of Alignment Algorithms: DPO and GRPO

Historically, tools like Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have been employed for model alignment. DPO operates offline and is designed to work with preference-based data pairs. It is valued for its simplicity and data efficiency but lacks the adaptability of online methods. GRPO is based on the PPO algorithm and handles online fine-tuning by comparing groups of outputs to compute relative advantages. While GRPO adapts in real time and suits dynamic reward systems, its on-policy nature increases computational load and makes experimentation more demanding.

A Balanced Alternative for LLM Alignment

Research introduced by Meta and NYU explored a method to overcome these limitations through a semi-online training setup. This technique modulates how frequently the model's generation and training components are synchronized, rather than updating at every training step, as in fully online methods, or not at all, as in offline setups. The semi-online method strikes a middle ground by adjusting the synchronization rate. Researchers designed this approach to reduce training time and maintain high model adaptability. The modular setup also allowed them to apply either DPO or GRPO with task-specific reward models in a flexible manner.

Instruction Following and Mathematical Reasoning

The methodology involved fine-tuning the Llama-3.1-8B-Instruct model using two types of tasks: open-ended instruction following and math problem-solving. For non-verifiable tasks, user prompts were sampled from the WildChat-1M dataset and evaluated using the Athene-RM-8B reward model, which assigns scalar scores to each prompt. For verifiable tasks, the team utilized the NuminaMath dataset in conjunction with the Math-Verify toolkit, which verifies whether generated answers align with expected outputs. Experiments used 32 NVIDIA H200 GPUs for training and 8 GPUs for inference, with different setups comparing offline, semi-online, and online synchronization intervals.
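The distinguishing knob is the synchronization interval s described above: the generation (rollout) policy is refreshed from the trained policy only every s steps, with s = 1 approximating fully online training and a very large s approaching the offline setting. The sketch below illustrates that control flow only; the callables passed in are hypothetical placeholders, not the paper's implementation.

```python
# Sketch of semi-online preference training with synchronization interval `s`.
# The callables (prompt sampling, pair generation, preference update) are hypothetical
# stand-ins for the paper's components.
import copy
from typing import Callable, Iterable

def semi_online_training(
    model,                                   # policy being trained (e.g., a torch.nn.Module)
    sample_prompts: Callable[[], Iterable],  # yields a batch of prompts
    generate_pairs: Callable,                # samples responses from the generator and scores them with a reward model
    preference_update: Callable,             # one DPO- or GRPO-style gradient step; returns a loss value
    s: int = 100,
    total_steps: int = 1000,
):
    """Refresh the generation policy from the training policy only every `s` steps."""
    generator = copy.deepcopy(model)  # frozen copy used only for sampling responses
    for step in range(total_steps):
        if step % s == 0:
            generator.load_state_dict(model.state_dict())  # synchronize generation with training
        prompts = sample_prompts()
        pairs = generate_pairs(generator, prompts)
        loss = preference_update(model, pairs)
        print(f"step={step} loss={loss:.4f}")
```

Varying s is exactly the axis the experiments sweep, from offline through semi-online to fully online setups.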
Performance Gains Across Both Verifiable and Non-Verifiable Tasks

Clear performance differences emerged. On Math500, offline DPO reached 53.7% accuracy, whereas semi-online DPO with a synchronization interval of s = 100 achieved 58.9%. Online DPO and GRPO showed similar results at 58.7% and 58.1%, respectively. Similar trends were observed on the NuminaMath benchmark, where offline DPO achieved 36.4% and semi-online variants increased this to 39.4% (s = 10). The gains were not limited to math tasks. When non-verifiable tasks were evaluated with the AlpacaEval 2.0 and Arena-Hard benchmarks, models trained with mixed reward types performed consistently better. Combining verifiable and non-verifiable rewards in a single training setup resulted in stronger average scores, indicating that the method generalized effectively.

A Flexible, Scalable Approach for Reinforcement Learning in LLMs

This study demonstrates that fine-tuning large language models does not require strict adherence to either offline or online setups. By introducing a flexible synchronization scheme, the research team from Meta and NYU increased training efficiency while maintaining or improving performance. The results show that carefully balancing reward types and training synchronization frequency leads to models that perform well across task types without incurring high computational costs.


SynPref-40M and Skywork-Reward-V2: Scalable Human-AI Alignment for State-of-the-Art Reward Models

Understanding Limitations of Current Reward Models

Although reward models play a crucial role in Reinforcement Learning from Human Feedback (RLHF), many of today's top-performing open models still struggle to reflect the full range of complex human preferences. Even with sophisticated training techniques, meaningful progress has been limited. A major reason appears to be the shortcomings in current preference datasets, which are often too narrow, artificially generated, or poorly vetted. While some rule-based systems are effective for clear tasks like math or coding, they usually fail to capture nuanced human judgment. Moreover, common benchmarks like RewardBench are becoming less reliable indicators of real-world RM performance, showing poor correlation with downstream task success.

Challenges in Preference Data Creation and New Approaches

Creating high-quality preference data has traditionally relied on human annotators, but this method is time-consuming, costly, and sometimes inconsistent. To address this, recent techniques like RLAIF use LLMs to automate annotations, sometimes even outperforming humans. Newer approaches aim to combine the strengths of both by integrating LLM-generated data with human-verified labels. Meanwhile, reward models have evolved from simple scoring systems, such as the Bradley-Terry model, to more complex frameworks, including generative and direct optimization methods. Despite the availability of numerous robust open models and datasets, challenges persist in accurately capturing nuanced human preferences across diverse tasks and languages.

Introducing SynPref-40M: Large-Scale Human-AI Preference Dataset

Researchers from 2050 Research and Skywork AI introduce SynPref-40M, a massive dataset of 40 million preference pairs curated through a two-stage human-AI pipeline. Human annotators ensure quality through strict verification, while LLMs scale up data curation using human guidance. From this, they develop Skywork-Reward-V2, a family of eight reward models (0.6B–8B parameters) trained on a high-quality subset of 26 million pairs. These models achieve state-of-the-art results across seven leading benchmarks, excelling in alignment, safety, objectivity, and robustness. The study highlights that success comes not just from data volume, but from careful, iterative curation that blends human expertise with AI scalability.

Scalable Two-Stage Human-AI Curation Pipeline

Current open reward models often suffer from overfitting to narrow benchmarks, such as RewardBench, which limits their real-world usefulness. To address this, the researchers introduce a two-stage, human-AI pipeline for curating large-scale preference data. Stage 1 starts with human-verified annotations to guide LLMs in labeling diverse preference attributes, followed by iterative training and error analysis to refine the reward model. Stage 2 scales this process using consistency checks between the best and a human-trained "gold" reward model, filtering reliable samples without further human input. This approach strikes a balance between quality and scalability, ultimately enabling the creation of tens of millions of high-quality preference pairs.
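Stage 2 of the pipeline described above is essentially an agreement filter: a candidate preference pair is kept only when the current best reward model and the human-trained "gold" reward model rank the two responses the same way. The sketch below illustrates that filtering idea; the scoring interface is an assumption, not a reproduction of the authors' pipeline.

```python
# Sketch of Stage-2 consistency filtering between two reward models (illustrative only).
from typing import Callable

RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar score

def keep_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    gold_rm: RewardFn,     # human-trained "gold" reward model
    current_rm: RewardFn,  # best reward model from the previous iteration
) -> bool:
    """Keep the pair only if both reward models prefer the same response."""
    gold_prefers_a = gold_rm(prompt, response_a) > gold_rm(prompt, response_b)
    current_prefers_a = current_rm(prompt, response_a) > current_rm(prompt, response_b)
    return gold_prefers_a == current_prefers_a

def filter_pairs(pairs, gold_rm: RewardFn, current_rm: RewardFn):
    """pairs: iterable of (prompt, response_a, response_b) tuples."""
    return [p for p in pairs if keep_pair(p[0], p[1], p[2], gold_rm, current_rm)]
```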
Benchmarking Skywork-Reward-V2: Compact Yet Powerful Models

The Skywork-Reward-V2 series demonstrates strong performance across multiple benchmarks, outperforming both larger models (e.g., 70B parameters) and emerging generative reward models. Trained using Qwen3 (0.6B–8B) and Llama 3.1/3.2 (1B–8B) backbones, these models achieve high scores on RewardBench, PPE, RM-Bench, and JudgeBench, with the best-performing variant (Llama-3.1-8B-40M) surpassing all others with an average score of 88.6. Despite smaller model sizes, the Skywork-Reward-V2 models benefit from high-quality preference data (SynPref-40M) and efficient training setups, enabling them to generalize better in real-world RLHF scenarios. Notably, even mid-sized models like Qwen3-1.7B outperform some 70B models, emphasizing the impact of training data quality and methodology over sheer parameter count.

Conclusion and Future Outlook: Scaling with Precision

In conclusion, SynPref-40M is a large-scale preference dataset built through two-stage human-AI collaboration, combining human judgment with LLM-based scalability. Using a curated subset of 26 million preference pairs, the team developed Skywork-Reward-V2, a suite of eight reward models (0.6B–8B parameters) that outperform existing models across seven key benchmarks. These models show strong generalization in aligning with human values, ensuring correctness, safety, and robustness to bias. Extensive studies confirm that both the data quality and the curation method are key drivers of performance. Looking forward, the researchers aim to explore new training strategies as reward models become central to LLM development and alignment.
