YouZum

News

Learn and Unlearn: Addressing Misinformation in Multilingual LLMs

arXiv:2406.13748v3 Announce Type: replace Abstract: This paper investigates the propagation of harmful information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated content. Our findings reveal that standard unlearning techniques, which typically focus on English data, are insufficient to mitigate the spread of harmful content in multilingual contexts and could inadvertently reinforce harmful content across languages. We show that only by addressing harmful responses in both English and the original language of the harmful data can we effectively eliminate harmful generations across all languages. This underscores the critical need for comprehensive unlearning strategies that consider the multilingual nature of modern LLMs to enhance their safety and reliability across diverse linguistic landscapes.
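
To make the paper's prescription concrete, below is a minimal sketch of gradient-ascent unlearning applied jointly to an English harmful sample and its source-language counterpart, using a generic Hugging Face causal LM. The model choice, learning rate, and data pairing are illustrative assumptions, not the paper's actual recipe.

```python
# A minimal sketch of bilingual unlearning: apply gradient *ascent* on the
# LM loss for harmful text in BOTH English and its original language, per
# the paper's finding that English-only unlearning is insufficient.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any multilingual causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

harmful_pairs = [
    ("<English harmful sample>", "<same sample in its original language>"),
]

for en_text, src_text in harmful_pairs:
    for text in (en_text, src_text):  # unlearn both languages jointly
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        (-loss).backward()  # ascent: maximize the LM loss on harmful text
        optimizer.step()
        optimizer.zero_grad()
```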

PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation

arXiv:2411.05085v2 Announce Type: replace-cross Abstract: Radiology report generation (RRG) aims to create free-text radiology reports from clinical imaging. Grounded radiology report generation (GRRG) extends RRG by including the localisation of individual findings on the image. Currently, there are no manually annotated chest X-ray (CXR) datasets to train GRRG models. In this work, we present PadChest-GR (Grounded-Reporting), a dataset derived from PadChest and aimed at training GRRG models for CXR images. We curate a public bilingual dataset of 4,555 CXR studies with grounded reports (3,099 abnormal and 1,456 normal), each containing complete lists of sentences describing individual present (positive) and absent (negative) findings in English and Spanish. In total, PadChest-GR contains 7,037 positive and 3,422 negative finding sentences. Every positive finding sentence is associated with up to two independent sets of bounding boxes labelled by different readers and has categorical labels for finding type, locations, and progression. To the best of our knowledge, PadChest-GR is the first manually curated dataset designed to train GRRG models for understanding and interpreting radiological images and generated text. By including detailed localization and comprehensive annotations of all clinically relevant findings, it provides a valuable resource for developing and evaluating GRRG models from CXR images. PadChest-GR can be downloaded upon request from https://bimcv.cipf.es/bimcv-projects/padchest-gr/
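
As an illustration of what a grounded-report entry contains, here is a hypothetical record layout for a single positive finding. All field names are assumptions for exposition; consult the dataset documentation for the real schema.

```python
# A hypothetical record for one grounded finding, combining the elements
# the abstract describes: bilingual sentences, polarity, categorical
# labels, and up to two readers' bounding boxes. Illustrative only.
finding = {
    "study_id": "padchest_0001",
    "sentence_en": "Small left pleural effusion.",
    "sentence_es": "Pequeño derrame pleural izquierdo.",
    "polarity": "positive",             # present vs. absent finding
    "finding_type": "pleural effusion",
    "locations": ["left hemithorax"],
    "progression": "new",
    # up to two independent reader annotations, boxes as (x, y, w, h)
    "boxes": {"reader_1": [(0.10, 0.55, 0.25, 0.20)],
              "reader_2": [(0.11, 0.54, 0.24, 0.22)]},
}
```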

FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

arXiv:2502.11128v2 Announce Type: replace Abstract: To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model’s output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
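
The core mechanism, replacing the standard Gaussian prior with one informed by the previous step, can be sketched as a conditional flow-matching loss. The shapes, noise scale, and velocity network below are illustrative assumptions, not FELLE's actual architecture.

```python
# A minimal sketch of token-wise conditional flow matching where the
# prior for token t is centered on the previous mel frame rather than a
# standard Gaussian, following the paper's prior-modification idea.
import torch
import torch.nn as nn

dim = 80  # mel bins
velocity_net = nn.Sequential(
    nn.Linear(dim * 2 + 1, 256), nn.SiLU(), nn.Linear(256, dim)
)

def flow_matching_loss(x_prev, x_target, lm_cond):
    """x_prev, x_target, lm_cond: (batch, dim) tensors."""
    t = torch.rand(x_target.size(0), 1)            # random time in [0, 1]
    x0 = x_prev + 0.1 * torch.randn_like(x_prev)   # prior centered on previous frame
    xt = (1 - t) * x0 + t * x_target               # linear interpolation path
    v_target = x_target - x0                       # constant velocity along the path
    v_pred = velocity_net(torch.cat([xt, lm_cond, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()
```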

Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

arXiv:2503.23768v3 Announce Type: replace Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the Stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
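
A minimal sketch of the evaluation loop such a benchmark implies is shown below; `vlm_predict` and the sample format are placeholders, not the released FRB code.

```python
# A minimal sketch of font-recognition evaluation: show the VLM a rendered
# text image, ask it to name the font, and score exact-match accuracy.
FONTS = ["Arial", "Times New Roman", "Courier New"]  # FRB uses 15 fonts

def evaluate(samples, vlm_predict):
    """samples: list of (image, true_font); vlm_predict: image -> font name."""
    correct = sum(vlm_predict(img).strip().lower() == font.lower()
                  for img, font in samples)
    return correct / len(samples)

# Hard-version stimuli render font *names* as the text itself, so glyph
# texture and textual semantics conflict (the Stroop effect the paper tests).
hard_texts = list(FONTS)
```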

Training LLMs to be Better Text Embedders through Bidirectional Reconstruction

arXiv:2509.03020v1 Announce Type: new Abstract: Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
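
One way to picture the two objectives: take the final-token embedding of one side of a query-document pair and use it as a single soft-prompt vector when reconstructing the other side. The sketch below is an illustrative implementation of that idea, not the paper's released code.

```python
# A minimal sketch of embedding-based reconstruction: the final-token
# ([EOS]) embedding of the source text is prepended as a soft prompt, and
# the model is trained to reconstruct the target text from it.
import torch

def reconstruction_loss(model, tokenizer, src_text, tgt_text):
    src = tokenizer(src_text, return_tensors="pt")
    hidden = model(**src, output_hidden_states=True).hidden_states[-1]
    eos_embed = hidden[:, -1]                      # final-token embedding
    tgt = tokenizer(tgt_text, return_tensors="pt")
    tgt_embeds = model.get_input_embeddings()(tgt["input_ids"])
    inputs = torch.cat([eos_embed[:, None], tgt_embeds], dim=1)  # soft prompt + target
    labels = torch.cat([torch.full((1, 1), -100), tgt["input_ids"]], dim=1)
    return model(inputs_embeds=inputs, labels=labels).loss

# EBQ2D anchors the query embedding to reconstruct the document; EBD2Q is
# the reverse. The two losses are interleaved during the new training stage:
# loss = reconstruction_loss(m, tok, query, doc) + reconstruction_loss(m, tok, doc, query)
```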

Explaining Length Bias in LLM-Based Preference Evaluations

arXiv:2407.01085v4 Announce Type: replace-cross Abstract: The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but such evaluations exhibit a notable bias towards longer responses, undermining their reliability. To better understand this bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass, where the former is length-independent and related to trustworthiness factors such as correctness, toxicity, and consistency, and the latter is length-dependent and represents the amount of information in the response. We empirically demonstrate the decomposition through controlled experiments and find that response length impacts evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses within equivalent length intervals.
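
The proposed adjustment can be sketched as a win-rate computation restricted to pairs whose responses fall in the same length interval. The interval width and judge interface below are illustrative assumptions, not AdapAlpaca's exact procedure.

```python
# A minimal sketch of a length-adjusted win rate: compare test and
# reference responses only within the same length interval, so the judge's
# preference cannot be driven by raw length differences.
def adap_win_rate(pairs, judge, width=100):
    """pairs: list of (test_resp, ref_resp); judge: (a, b) -> True if a wins."""
    wins = total = 0
    for test, ref in pairs:
        if len(test) // width != len(ref) // width:
            continue  # skip pairs whose lengths fall in different intervals
        wins += judge(test, ref)
        total += 1
    return wins / total if total else float("nan")
```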

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

arXiv:2507.14201v2 Announce Type: replace-cross Abstract: We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent x on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real-world security analysts must sift through a large number of heterogeneous alert signals and security logs, follow multi-hop chains of evidence, and compile an incident report. With the development of LLMs, building LLM-based agents for automatic threat investigation is a promising direction. To assist the development and evaluation of LLM agents, we construct a dataset from a controlled Azure tenant that covers 8 simulated real-world multi-step attacks, 57 log tables from Microsoft Sentinel and related services, and 589 automatically generated questions. We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground truth answers but also makes the pipeline reusable and readily extensible to new logs. This also enables the automatic generation of procedural tasks with verifiable rewards, which can be naturally extended to training agents via reinforcement learning. Our comprehensive experiments with different models confirm the difficulty of the task: with the base setting, the average reward across all evaluated models is 0.249, and the best achieved is 0.368, leaving substantial headroom for future research. Code and data are coming soon!
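
The question-generation idea, pairing graph nodes so that the start node supplies context and the end node supplies the verifiable answer, might look like the following sketch; the node attributes and `llm` callable are placeholders, not the benchmark's actual pipeline.

```python
# A minimal sketch of graph-anchored question generation: pick a connected
# node pair in a threat-investigation graph, use the start node as
# background context and the end node as the ground-truth answer.
import networkx as nx

def generate_question(graph: nx.DiGraph, start, end, llm):
    context = graph.nodes[start]["evidence"]   # e.g., an alert or log record
    answer = graph.nodes[end]["evidence"]
    prompt = (f"Given this security context:\n{context}\n"
              f"Write an investigation question whose answer is: {answer}")
    return {"question": llm(prompt), "answer": answer, "context": context}
```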

Annotation and modeling of emotions in a textual corpus: an evaluative approach

arXiv:2509.01260v1 Announce Type: new Abstract: Emotion is a crucial phenomenon in the functioning of human beings in society. However, it remains a largely open subject, particularly in its textual manifestations. This paper examines an industrial corpus manually annotated following an evaluative approach to emotion. This theoretical framework, which is currently underutilized, offers a different perspective that complements traditional approaches. Noting that the annotations we collected exhibit significant disagreement, we hypothesized that they nonetheless follow stable statistical trends. Using language models trained on these annotations, we demonstrate that it is possible to model the labeling process and that the variability is driven by underlying linguistic features. Moreover, our results indicate that language models seem capable of distinguishing emotional situations based on evaluative criteria.
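
One common way to model a noisy labeling process, which may or may not match the paper's exact setup, is to train against the empirical distribution of annotator votes rather than a single majority label:

```python
# A minimal sketch of soft-label training for disagreement-aware emotion
# classification: the target is the annotators' vote distribution, so the
# model learns the labeling process rather than one forced consensus.
import torch
import torch.nn.functional as F

def soft_label_loss(logits, annotator_counts):
    """logits: (batch, n_classes); annotator_counts: (batch, n_classes) raw votes."""
    target = annotator_counts / annotator_counts.sum(dim=-1, keepdim=True)
    return F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")
```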

Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal

arXiv:2509.01455v1 Announce Type: new Abstract: Deployed language models must decide not only what to answer but also when not to answer. We present UniCR, a unified framework that turns heterogeneous uncertainty evidence, including sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback, into a calibrated probability of correctness and then enforces a user-specified error budget via principled refusal. UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and offers distribution-free guarantees using conformal risk control. For long-form generation, we align confidence with semantic fidelity by supervising on atomic factuality scores derived from retrieved evidence, reducing confident hallucinations while preserving coverage. Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under the risk-coverage curve, and higher coverage at fixed risk compared to entropy or logit thresholds, post-hoc calibrators, and end-to-end selective baselines. Analyses reveal that evidence contradiction, semantic dispersion, and tool inconsistency are the dominant drivers of abstention, yielding informative user-facing refusal messages. The result is a portable recipe, from evidence fusion to calibrated probability to risk-controlled decision, that improves trustworthiness without fine-tuning the base model and remains valid under distribution shift.
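
A simplified sketch of the risk-controlled refusal step: given calibrated correctness probabilities on a held-out calibration set, pick the loosest confidence threshold whose empirical error among answered items stays within the user's budget. This illustrates the idea only; UniCR's conformal procedure includes corrections this sketch omits.

```python
# A minimal sketch of threshold selection for risk-controlled refusal:
# answer only when calibrated confidence clears a threshold chosen so the
# calibration-set error rate among answered items stays within the budget.
import numpy as np

def pick_threshold(conf_cal, correct_cal, budget=0.1):
    """conf_cal: calibrated P(correct); correct_cal: 0/1 correctness labels."""
    order = np.argsort(-conf_cal)                  # most confident first
    conf, correct = conf_cal[order], correct_cal[order]
    err = np.cumsum(1 - correct) / (np.arange(len(conf)) + 1)  # running error rate
    ok = np.where(err <= budget)[0]
    return conf[ok[-1]] if len(ok) else 1.1        # refuse everything if budget unmet

def answer_or_refuse(confidence, threshold):
    return "answer" if confidence >= threshold else "refuse"
```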

Tencent Hunyuan Open-Sources Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B: State-of-the-Art Multilingual Translation Models

Introduction

Tencent's Hunyuan team has released Hunyuan-MT-7B (a translation model) and Hunyuan-MT-Chimera-7B (an ensemble model). Both models are designed specifically for multilingual machine translation and were introduced in conjunction with Tencent's participation in the WMT2025 General Machine Translation shared task, where Hunyuan-MT-7B ranked first in 30 out of 31 language pairs (technical report: https://github.com/Tencent-Hunyuan/Hunyuan-MT/blob/main/Hunyuan_MT_Technical_Report.pdf).

Model Overview

Hunyuan-MT-7B
- A 7B-parameter translation model.
- Supports mutual translation across 33 languages, including Chinese ethnic-minority languages such as Tibetan, Mongolian, Uyghur, and Kazakh.
- Optimized for both high-resource and low-resource translation tasks, achieving state-of-the-art results among models of comparable size.

Hunyuan-MT-Chimera-7B
- An integrated weak-to-strong fusion model.
- Combines multiple translation outputs at inference time and produces a refined translation using reinforcement learning and aggregation techniques.
- Represents the first open-source translation model of this type, improving translation quality beyond single-system outputs.

Training Framework

The models were trained using a five-stage framework designed for translation tasks:

1. General Pre-training: 1.3 trillion tokens covering 112 languages and dialects. Multilingual corpora were assessed for knowledge value, authenticity, and writing style, with diversity maintained through disciplinary, industry, and thematic tagging systems.
2. MT-Oriented Pre-training: Monolingual corpora from mC4 and OSCAR, filtered using fastText (language identification), minLSH (deduplication), and KenLM (perplexity filtering); parallel corpora from OPUS and ParaCrawl, filtered with CometKiwi. General pre-training data (20%) was replayed to avoid catastrophic forgetting.
3. Supervised Fine-Tuning (SFT): Stage I used ~3M parallel pairs (Flores-200, WMT test sets, curated Mandarin–minority data, synthetic pairs, instruction-tuning data); Stage II used ~268k high-quality pairs selected through automated scoring (CometKiwi, GEMBA) and manual verification.
4. Reinforcement Learning (RL): GRPO as the algorithm, with reward functions based on XCOMET-XXL and DeepSeek-V3-0324 quality scoring, terminology-aware rewards (TAT-R1), and repetition penalties to avoid degenerate outputs.
5. Weak-to-Strong RL: Multiple candidate outputs are generated and aggregated through reward-based output selection. Applied in Hunyuan-MT-Chimera-7B, this improves translation robustness and reduces repetitive errors.

Benchmark Results

Automatic Evaluation
- WMT24pp (English⇔XX): Hunyuan-MT-7B achieved 0.8585 (XCOMET-XXL), surpassing larger models such as Gemini-2.5-Pro (0.8250) and Claude-Sonnet-4 (0.8120).
- FLORES-200 (33 languages, 1056 pairs): Hunyuan-MT-7B scored 0.8758 (XCOMET-XXL), outperforming open-source baselines including Qwen3-32B (0.7933).
- Mandarin⇔Minority Languages: Scored 0.6082 (XCOMET-XXL), higher than Gemini-2.5-Pro (0.5811), showing significant improvements in low-resource settings.

Comparative Results
- Outperforms Google Translate by 15–65% across evaluation categories.
- Outperforms specialized translation models such as Tower-Plus-9B and Seed-X-PPO-7B despite having fewer parameters.
- Chimera-7B adds roughly 2.3% improvement on FLORES-200, particularly for Chinese⇔Other and non-English⇔non-Chinese translations.
Human Evaluation

A custom evaluation set (covering social, medical, legal, and internet domains) compared Hunyuan-MT-7B with state-of-the-art models:
- Hunyuan-MT-7B: avg. 3.189
- Gemini-2.5-Pro: avg. 3.223
- DeepSeek-V3: avg. 3.219
- Google Translate: avg. 2.344

This shows that Hunyuan-MT-7B, despite having only 7B parameters, approaches the quality of much larger proprietary models.

Case Studies

The report highlights several real-world cases:
- Cultural references: correctly translates "小红薯" as the platform "REDnote," unlike Google Translate's "sweet potatoes."
- Idioms: interprets "You are killing me" as "你真要把我笑死了" (expressing amusement), avoiding literal misinterpretation.
- Medical terms: translates "uric acid kidney stones" precisely, while baselines generate malformed outputs.
- Minority languages: for Kazakh and Tibetan, Hunyuan-MT-7B produces coherent translations where baselines fail or output nonsensical text.
- Chimera enhancements: adds improvements in gaming jargon, intensifiers, and sports terminology.

Conclusion

Tencent's release of Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B establishes a new standard for open-source translation. By combining a carefully designed training framework with a specialized focus on low-resource and minority-language translation, the models achieve quality on par with or exceeding larger closed-source systems. The launch of these two models provides the AI research community with accessible, high-performance tools for multilingual translation research and deployment.

Check out the paper, GitHub page, and models on Hugging Face. All credit for this research goes to the researchers of this project.
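
For readers who want to try the model, a minimal usage sketch with Hugging Face transformers follows. The repository id and prompt format are assumptions based on the release description; check the official model card for the exact template.

```python
# A minimal sketch of running the translation model via transformers.
# "tencent/Hunyuan-MT-7B" is the assumed Hugging Face repo id, and the
# plain instruction prompt is illustrative, not the documented template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-MT-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Translate the following text into English:\n\n你真要把我笑死了"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# decode only the newly generated tokens, skipping the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```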
