YouZum

AI

AI, Committee, 新闻, Uncategorized

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

arXiv:2505.08054v1 Announce Type: new Abstract: Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning Read Post »

AI, Committee, 新闻, Uncategorized

Aya Vision: Advancing the Frontier of Multilingual Multimodality

arXiv:2505.08751v1 Announce Type: new Abstract: Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address the aforementioned challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and LLaMA-3.2-90B-Vision. Our work advances multilingual progress on the multi-modal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.

Aya Vision: Advancing the Frontier of Multilingual Multimodality Read Post »

AI, Committee, 新闻, Uncategorized

TUMS: Enhancing Tool-use Abilities of LLMs with Multi-structure Handlers

arXiv:2505.08402v1 Announce Type: new Abstract: Recently, large language models(LLMs) have played an increasingly important role in solving a wide range of NLP tasks, leveraging their capabilities of natural language understanding and generating. Integration with external tools further enhances LLMs’ effectiveness, providing more precise, timely, and specialized responses. However, LLMs still encounter difficulties with non-executable actions and improper actions, which are primarily attributed to incorrect parameters. The process of generating parameters by LLMs is confined to the tool level, employing the coarse-grained strategy without considering the different difficulties of various tools. To address this issue, we propose TUMS, a novel framework designed to enhance the tool-use capabilities of LLMs by transforming tool-level processing into parameter-level processing. Specifically, our framework consists of four key components: (1) an intent recognizer that identifies the user’s intent to help LLMs better understand the task; (2) a task decomposer that breaks down complex tasks into simpler subtasks, each involving a tool call; (3) a subtask processor equipped with multi-structure handlers to generate accurate parameters; and (4) an executor. Our empirical studies have evidenced the effectiveness and efficiency of the TUMS framework with an average of 19.6% and 50.6% improvement separately on easy and hard benchmarks of ToolQA, meanwhile, we demonstrated the key contribution of each part with ablation experiments, offering more insights and stimulating future research on Tool-augmented LLMs.

TUMS: Enhancing Tool-use Abilities of LLMs with Multi-structure Handlers Read Post »

AI, Committee, 新闻, Uncategorized

Multimodal Assessment of Classroom Discourse Quality: A Text-Centered Attention-Based Multi-Task Learning Approach

arXiv:2505.07902v1 Announce Type: cross Abstract: Classroom discourse is an essential vehicle through which teaching and learning take place. Assessing different characteristics of discursive practices and linking them to student learning achievement enhances the understanding of teaching quality. Traditional assessments rely on manual coding of classroom observation protocols, which is time-consuming and costly. Despite many studies utilizing AI techniques to analyze classroom discourse at the utterance level, investigations into the evaluation of discursive practices throughout an entire lesson segment remain limited. To address this gap, our study proposes a novel text-centered multimodal fusion architecture to assess the quality of three discourse components grounded in the Global Teaching InSights (GTI) observation protocol: Nature of Discourse, Questioning, and Explanations. First, we employ attention mechanisms to capture inter- and intra-modal interactions from transcript, audio, and video streams. Second, a multi-task learning approach is adopted to jointly predict the quality scores of the three components. Third, we formulate the task as an ordinal classification problem to account for rating level order. The effectiveness of these designed elements is demonstrated through an ablation study on the GTI Germany dataset containing 92 videotaped math lessons. Our results highlight the dominant role of text modality in approaching this task. Integrating acoustic features enhances the model’s consistency with human ratings, achieving an overall Quadratic Weighted Kappa score of 0.384, comparable to human inter-rater reliability (0.326). Our study lays the groundwork for the future development of automated discourse quality assessment to support teacher professional development through timely feedback on multidimensional discourse practices.

Multimodal Assessment of Classroom Discourse Quality: A Text-Centered Attention-Based Multi-Task Learning Approach Read Post »

AI, Committee, 新闻, Uncategorized

This AI Paper Investigates Test-Time Scaling of English-Centric RLMs for Enhanced Multilingual Reasoning and Domain Generalization

Reasoning language models, or RLMs, are increasingly used to simulate step-by-step problem-solving by generating long, structured reasoning chains. These models break down complex questions into simpler parts and build logical steps to reach answers. This chain-of-thought (CoT) approach has proven effective in improving output quality, especially in mathematical and logical tasks. Despite multilingual capabilities in many modern large models, the focus of research and training has remained largely centered on English, leaving a gap in understanding how well these reasoning skills translate to other languages. One major challenge is that most RLMs are fine-tuned on English data, which limits their ability to reason effectively in other languages. This becomes especially problematic for low-resource languages that have limited training examples. The models may default to English thinking patterns, producing lower-quality outputs when prompted in another language. Furthermore, differences in language structure can cause reasoning errors, particularly when a model trained in one language is expected to infer logic in another without adequate linguistic alignment. Current techniques employ zero-shot or few-shot prompting strategies to manage these limitations, often using English as a pivot language. Some efforts involve presenting prompts in the same language as the query to preserve linguistic consistency. However, small models have minimal benefits due to limited capacity, and even large models show inconsistent performance when reasoning in low-resource languages. Despite multilingual pretraining, the gap between the training and reasoning language continues to hinder accurate multilingual reasoning. The Brown University and MBZUAI research team focused on evaluating how increasing test-time computation, particularly through extended reasoning chains, can affect the multilingual reasoning abilities of English-centric RLMs. They investigated using s1 models based on the Qwen2.5-Instruct architecture and fine-tuned on 1,000 English STEM reasoning samples. These models were tested across various languages using benchmarks like MGSM and Global-MMLU to answer four core questions: the effectiveness of crosslingual test-time scaling, language-mixing behaviors, performance under language-forcing, and cross-domain generalization. In-depth experiments showed that models with more parameters significantly benefited from increased test-time thinking tokens. The 14B s1 model, when scaled to 8,000 thinking tokens, achieved an average accuracy of 81% across non-English languages in MGSM. It outperformed models like Qwen2.5-14B-Instruct by +23.1% in French and +41.6% in Swahili. Even though the model was trained only in English, its performance surpassed that of larger models such as DeepSeek’s R1-Distill-Qwen-32B in several high-resource languages. The study also found that reasoning in high-resource languages like Chinese and English is more efficient, requiring fewer tokens and delivering better results than in low-resource languages like Swahili or Telugu. A key observation was the “quote-and-think” behavior, where the model quoted non-English phrases from prompts and reasoned in English. This consistent pattern across languages like Japanese and Russian suggested that the model used its multilingual understanding to interpret non-English input without direct translation. Language-forcing experiments further confirmed that forcing reasoning in high-resource languages yielded better results, while strict reasoning in low-resource languages led to significant accuracy drops and computational inefficiencies. Despite strong results in STEM-related tasks, performance gains did not transfer to domains like cultural commonsense or humanities. In benchmarks like FORK, increasing thinking tokens sometimes reduced performance, indicating overthinking. The study concludes that while test-time scaling enhances multilingual reasoning in high-resource languages, it does not generalize effectively to out-of-domain tasks or low-resource languages, indicating the need for further research on balanced multilingual training and domain adaptation. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. Here’s a brief overview of what we’re building at Marktechpost: ML News Community – r/machinelearningnews (92k+ members) Newsletter– airesearchinsights.com/(30k+ subscribers) miniCON AI Events – minicon.marktechpost.com AI Reports & Magazines – magazine.marktechpost.com AI Dev & Research News – marktechpost.com (1M+ monthly readers) Partner with us The post This AI Paper Investigates Test-Time Scaling of English-Centric RLMs for Enhanced Multilingual Reasoning and Domain Generalization appeared first on MarkTechPost.

This AI Paper Investigates Test-Time Scaling of English-Centric RLMs for Enhanced Multilingual Reasoning and Domain Generalization Read Post »

AI, Committee, 新闻, Uncategorized

The Efficiency of Pre-training with Objective Masking in Pseudo Labeling for Semi-Supervised Text Classification

arXiv:2505.06624v1 Announce Type: new Abstract: We extend and study a semi-supervised model for text classification proposed earlier by Hatefi et al. for classification tasks in which document classes are described by a small number of gold-labeled examples, while the majority of training examples is unlabeled. The model leverages the teacher-student architecture of Meta Pseudo Labels in which a ”teacher” generates labels for originally unlabeled training data to train the ”student” and updates its own model iteratively based on the performance of the student on the gold-labeled portion of the data. We extend the original model of Hatefi et al. by an unsupervised pre-training phase based on objective masking, and conduct in-depth performance evaluations of the original model, our extension, and various independent baselines. Experiments are performed using three different datasets in two different languages (English and Swedish).

The Efficiency of Pre-training with Objective Masking in Pseudo Labeling for Semi-Supervised Text Classification Read Post »

AI, Committee, 新闻, Uncategorized

Advancing Single and Multi-task Text Classification through Large Language Model Fine-tuning

arXiv:2412.08587v2 Announce Type: replace Abstract: Both encoder-only models (e.g., BERT, RoBERTa) and large language models (LLMs, e.g., Llama3) have been widely used for text classification tasks. However, there is a lack of systematic studies comparing the performance of encoder-based models and LLMs in text classification, particularly when fine-tuning is involved. This study employed a diverse range of models and methods, varying in size and architecture, and including both fine-tuned and pre-trained approaches. We first assessed the performances of these LLMs on the 20 Newsgroups (20NG) and MASSIVE datasets, comparing them to encoder-only RoBERTa models. Additionally, we explored the multi-task capabilities of both model types by combining multiple classification tasks, including intent detection and slot-filling, into a single model using data from both datasets. Our results indicate that fully fine-tuned Llama3-70B models outperform RoBERTa-large and other decoder LLMs across various classification tasks and datasets. Moreover, the consolidated multi-task fine-tuned LLMs matched the performance of dual-model setups in both tasks across both datasets. Overall, our study provides a comprehensive benchmark of encoder-only and LLM models on text classification tasks and demonstrates a method to combine two or more fully fine-tuned decoder LLMs for reduced latency and equivalent performance.

Advancing Single and Multi-task Text Classification through Large Language Model Fine-tuning Read Post »

AI, Committee, 新闻, Uncategorized

Evaluating Creative Short Story Generation in Humans and Large Language Models

arXiv:2411.02316v5 Announce Type: replace Abstract: Story-writing is a fundamental aspect of human imagination, relying heavily on creativity to produce narratives that are novel, effective, and surprising. While large language models (LLMs) have demonstrated the ability to generate high-quality stories, their creative story-writing capabilities remain under-explored. In this work, we conduct a systematic analysis of creativity in short story generation across 60 LLMs and 60 people using a five-sentence cue-word-based creative story-writing task. We use measures to automatically evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, diversity, and linguistic complexity. We also collect creativity ratings and Turing Test classifications from non-expert and expert human raters and LLMs. Automated metrics show that LLMs generate stylistically complex stories, but tend to fall short in terms of novelty, surprise and diversity when compared to average human writers. Expert ratings generally coincide with automated metrics. However, LLMs and non-experts rate LLM stories to be more creative than human-generated stories. We discuss why and how these differences in ratings occur, and their implications for both human and artificial creativity.

Evaluating Creative Short Story Generation in Humans and Large Language Models Read Post »

AI, Committee, 新闻, Uncategorized

Fun-tuning: Characterizing the Vulnerability of Proprietary LLMs to Optimization-based Prompt Injection Attacks via the Fine-Tuning Interface

arXiv:2501.09798v2 Announce Type: replace-cross Abstract: We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google’s Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff – the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.

Fun-tuning: Characterizing the Vulnerability of Proprietary LLMs to Optimization-based Prompt Injection Attacks via the Fine-Tuning Interface Read Post »

AI, Committee, 新闻, Uncategorized

JobHop: A Large-Scale Dataset of Career Trajectories

arXiv:2505.07653v1 Announce Type: new Abstract: Understanding labor market dynamics is essential for policymakers, employers, and job seekers. However, comprehensive datasets that capture real-world career trajectories are scarce. In this paper, we introduce JobHop, a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium. Utilizing Large Language Models (LLMs), we process unstructured resume data to extract structured career information, which is then mapped to standardized ESCO occupation codes using a multi-label classification model. This results in a rich dataset of over 2.3 million work experiences, extracted from and grouped into more than 391,000 user resumes and mapped to standardized ESCO occupation codes, offering valuable insights into real-world occupational transitions. This dataset enables diverse applications, such as analyzing labor market mobility, job stability, and the effects of career breaks on occupational transitions. It also supports career path prediction and other data-driven decision-making processes. To illustrate its potential, we explore key dataset characteristics, including job distributions, career breaks, and job transitions, demonstrating its value for advancing labor market research.

JobHop: A Large-Scale Dataset of Career Trajectories Read Post »

We use cookies to improve your experience and performance on our website. You can learn more at 隱私權政策 and manage your privacy settings by clicking Settings.

Privacy Preferences

You can choose your cookie settings by turning on/off each type of cookie as you wish, except for essential cookies.

Allow All
Manage Consent Preferences
  • Always Active

Save
zh_CN