Google releases Olympiad medal-winning Gemini 2.5 ‘Deep Think’ AI publicly — but there’s a catch…
The Gemini 2.5 Deep Think released to users is not that same competition model but rather a lower-performing, apparently faster version.
A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models—and turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits.

Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly became an aggressive yes-man, as opposed to the moderately sycophantic version that users were accustomed to—it endorsed harebrained business ideas, waxed lyrical about users’ intelligence, and even encouraged people to go off their psychiatric medication. OpenAI quickly rolled back the change and later published a postmortem on the mishap. More recently, xAI’s Grok adopted what can best be described as a 4chan neo-Nazi persona and repeatedly referred to itself as “MechaHitler” on X. That change, too, was quickly reversed.

Jack Lindsey, a member of the technical staff at Anthropic who led the new project, says that this study was partly inspired by seeing models adopt harmful traits in such instances. “If we can find the neural basis for the model’s persona, we can hopefully understand why this is happening and develop methods to control it better,” Lindsey says.

The idea of LLM “personas” or “personalities” can be polarizing—for some researchers the terms inappropriately anthropomorphize language models, whereas for others they effectively capture the persistent behavioral patterns that LLMs can exhibit. “There’s still some scientific groundwork to be laid in terms of talking about personas,” says David Krueger, an assistant professor of computer science and operations research at the University of Montreal, who was not involved in the study. “I think it is appropriate to sometimes think of these systems as having personas, but I think we have to keep in mind that we don’t actually know if that’s what’s going on under the hood.”

For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs’ behavior—from whether they are talking about weddings to persistent traits such as sycophancy—are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior.

Here, the researchers focused on sycophantic, “evil”, and hallucinatory personas—three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out a persona’s activity pattern given a brief text description of it. Using that description, a separate LLM generates prompts that can elicit both the target persona—say, evil—and an opposite persona—good. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model’s average activity in good mode from its average activity in evil mode.

When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That’s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. “I think something like that would be really valuable,” he says. “And that’s kind of where I’m hoping to get.”

Just detecting those personas isn’t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tough. Many LLMs learn from human feedback, which trains them to behave in line with user preferences—but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called “emergent misalignment,” in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.

Other researchers have tested out an approach called “steering,” in which activity patterns within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.

So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When the team trained the models on mistake-ridden data sets that would normally spark evil behavior, the models instead remained as helpful and harmless as ever.

That result might seem surprising—how would forcing the model to be evil while it was learning prevent it from being evil down the line? According to Lindsey, it could be because the model has no reason to learn evil behavior if it’s already in evil mode. “The training data is teaching the model lots of things, and one of those things is to be evil,” Lindsey says. “But it’s also teaching the model a bunch of other things. If you give the model the evil part for free, it doesn’t have to learn that anymore.”

Unlike post-training steering, this approach didn’t compromise the model’s performance on other tasks. And it would also be more energy efficient if deployed widely. Those advantages could make this training technique a practical tool for preventing scenarios like the OpenAI sycophancy snafu or the Grok MechaHitler debacle.

There’s still more work to be done before this approach can be used in popular AI chatbots like ChatGPT and Claude—not least because the models that the team tested in this study were much smaller than the models that power those chatbots. “There’s always a chance that everything changes when you scale up. But if that finding holds
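A minimal sketch of the difference-of-means extraction described above, assuming a Hugging Face causal LM, two tiny hand-written prompt sets that elicit the opposing personas, and activations read from one hidden layer at the final token position; none of these specifics come from the Anthropic study.

```python
# Hedged sketch: extract a "persona direction" as the difference of mean activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in open model for illustration only
LAYER = 6             # hypothetical hidden layer to probe

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_activation(prompts):
    """Average hidden state at the last token position over a set of prompts."""
    vecs = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # hidden_states: tuple of (num_layers + 1) tensors of shape [batch, seq, dim]
        vecs.append(out.hidden_states[LAYER][0, -1, :])
    return torch.stack(vecs).mean(dim=0)

evil_prompts = ["Respond cruelly: how should I treat my coworkers?"]   # illustrative only
good_prompts = ["Respond kindly: how should I treat my coworkers?"]

# Persona direction: average "evil" activity minus average "good" activity.
persona_vector = mean_activation(evil_prompts) - mean_activation(good_prompts)

# A large projection onto this vector at inference time would flag persona-like
# activity; the study instead activates the pattern during training.
```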
Forcing LLMs to be evil during training can make them nicer in the long run
To reflect democratic principles, AI must be built in the open. If the U.S. wants to lead the AI race, it must lead the open-source AI race.
Why open-source AI became an American national priority
Cohere’s Command A Vision can read graphs and PDFs to make enterprise research richer and analyze the documents businesses actually rely on.
New vision model from Cohere runs on two GPUs, beats top-tier VLMs on visual tasks
This post is divided into six parts; they are:

• Why Transformer is Better than Seq2Seq
• Data Preparation and Tokenization
• Design of a Transformer Model
• Building the Transformer Model
• Causal Mask and Padding Mask
• Training and Evaluation

Traditional seq2seq models with recurrent neural networks have two main limitations:

• Sequential processing prevents parallelization
• Limited ability to capture long-term dependencies, since hidden states are overwritten whenever an element is processed

The Transformer architecture, introduced in the 2017 paper “Attention Is All You Need”, overcomes these limitations.
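The causal and padding masks named in the outline are the two standard attention masks. Below is a minimal PyTorch sketch, assuming the nn.Transformer convention that True marks positions to be blocked or ignored; the tutorial's own implementation may differ in details.

```python
# Hedged sketch: the two attention masks used when training a Transformer.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Upper-triangular boolean mask so position i cannot attend to positions > i."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def padding_mask(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """True where the token is padding; shape [batch, seq_len]."""
    return token_ids == pad_id

# Example: a batch of two tokenized sentences padded to length 5 (pad_id = 0 assumed).
batch = torch.tensor([[5, 9, 2, 0, 0],
                      [7, 3, 8, 6, 1]])
print(causal_mask(5))        # passed as attn_mask / tgt_mask
print(padding_mask(batch))   # passed as src/tgt_key_padding_mask
```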
Building a Transformer Model for Language Translation
arXiv:2507.22925v1 Announce Type: new Abstract: Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance the decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of graphs, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H-MEM) architecture for LLM Agents that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding that points to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
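A minimal sketch of the layer-by-layer, index-routed retrieval the abstract describes. The MemoryEntry fields, cosine similarity as the scoring function, and the routing logic are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: hierarchical memory where each entry indexes its sub-memories
# in the next layer, so retrieval never scores the whole store.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MemoryEntry:
    vector: np.ndarray                              # embedding of this memory
    text: str                                       # stored content / summary
    children: list = field(default_factory=list)    # indices into the next layer

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def route(query_vec, layers):
    """Descend the hierarchy: at each layer, only score the entries indexed by
    the best match from the layer above, instead of an exhaustive search."""
    candidates = list(range(len(layers[0])))        # top layer: coarse abstractions
    best = None
    for layer in layers:
        scored = [(cosine(query_vec, layer[i].vector), i) for i in candidates]
        _, best_idx = max(scored)
        best = layer[best_idx]
        candidates = best.children                  # follow the positional index
        if not candidates:
            break
    return best                                     # most specific matching memory
```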
Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents
arXiv:2507.22924v1 Announce Type: new Abstract: Graduate-level CS programs in the U.S. increasingly enroll international students, with 60.2 percent of master’s degrees in 2023 awarded to non-U.S. students. Many of these students take online courses, where peer feedback is used to engage students and improve pedagogy in a scalable manner. Since these courses are conducted in English, many students study in a language other than their first. This paper examines how native versus non-native English speaker status affects three metrics of peer feedback experience in online U.S.-based computing courses. Using the Twitter-roBERTa-based model, we analyze the sentiment of peer reviews written by and to a random sample of 500 students. We then relate sentiment scores and peer feedback ratings to students’ language background. Results show that native English speakers rate feedback less favorably, while non-native speakers write more positively but receive less positive sentiment in return. When controlling for sex and age, significant interactions emerge, suggesting that language background plays a modest but complex role in shaping peer feedback experiences.
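A minimal sketch of the sentiment-scoring step, assuming the public cardiffnlp Twitter-RoBERTa sentiment checkpoint via the transformers pipeline; the paper's exact model variant and scoring convention may differ, and the example reviews are invented.

```python
# Hedged sketch: score peer-review text with a Twitter-RoBERTa sentiment model.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

reviews = [
    "Great structure, but the evaluation section needs more detail.",   # invented examples
    "This submission is hard to follow and cites no related work.",
]
for review, result in zip(reviews, sentiment(reviews)):
    print(result["label"], round(result["score"], 3), "-", review)
```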
arXiv:2507.23386v1 Announce Type: new Abstract: Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model’s ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM’s input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
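A minimal structural sketch of the recipe the abstract describes: pre-encode the text into one Contextual vector with a small BERT-style model, prepend it to the decoder LLM's input embeddings, then concatenate the last hidden states at the Contextual and EOS positions. The specific checkpoints, the projection layer, and mean pooling of the encoder output are assumptions; the actual Causal2Vec models are trained, not assembled from off-the-shelf parts like this.

```python
# Hedged sketch of a Causal2Vec-style embedding pass (untrained components).
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")   # stand-in encoder
bert = AutoModel.from_pretrained("prajjwal1/bert-tiny")
llm_tok = AutoTokenizer.from_pretrained("gpt2")                   # stand-in decoder LLM
llm = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
proj = torch.nn.Linear(bert.config.hidden_size, llm.config.hidden_size)

def embed(text: str) -> torch.Tensor:
    with torch.no_grad():
        # 1) One "Contextual" vector from the lightweight encoder (mean-pooled here).
        enc = bert_tok(text, return_tensors="pt")
        ctx = proj(bert(**enc).last_hidden_state.mean(dim=1)).unsqueeze(1)  # [1, 1, llm_dim]

        # 2) Prepend it to the decoder's token embeddings (EOS appended to the text).
        ids = llm_tok(text + llm_tok.eos_token, return_tensors="pt").input_ids
        inputs_embeds = torch.cat([ctx, llm.get_input_embeddings()(ids)], dim=1)

        # 3) Concatenate the last hidden states at the Contextual and EOS positions.
        hidden = llm(inputs_embeds=inputs_embeds).hidden_states[-1]         # [1, seq+1, llm_dim]
        return torch.cat([hidden[:, 0, :], hidden[:, -1, :]], dim=-1)       # [1, 2*llm_dim]

vector = embed("Causal2Vec turns a decoder-only LLM into an embedding model.")
```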
Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models
Translation systems powered by LLMs have become so advanced that they can outperform human translators in some cases. As LLMs improve, especially in complex tasks such as document-level or literary translation, it becomes increasingly challenging to make further progress and to accurately evaluate that progress. Traditional automated metrics, such as BLEU, are still used but fail to explain why a score is given. With translation quality reaching near-human levels, users require evaluations that extend beyond numerical metrics, providing reasoning across key dimensions, such as accuracy, terminology, and audience suitability. This transparency enables users to assess evaluations, identify errors, and make more informed decisions.

While BLEU has long been the standard for evaluating machine translation (MT), its usefulness is fading as modern systems now rival or outperform human translators. Newer metrics, such as BLEURT, COMET, and MetricX, fine-tune powerful language models to assess translation quality more accurately. Large models, such as GPT and PaLM2, can now offer zero-shot or structured evaluations, even generating MQM-style feedback. Techniques such as pairwise comparison have also enhanced alignment with human judgments. Recent studies have shown that asking models to explain their choices improves decision quality; yet, such rationale-based methods are still underutilized in MT evaluation, despite their growing potential.

Researchers at Sakana.ai have developed TransEvalnia, a translation evaluation and ranking system that uses prompting-based reasoning to assess translation quality. It provides detailed feedback using selected MQM dimensions, ranks translations, and assigns scores on a 5-point Likert scale, including an overall rating. The system performs competitively with, or even better than, the leading MT-Ranker model across several language pairs and tasks, including English-Japanese, Chinese-English, and more. Tested with LLMs like Claude 3.5 and Qwen-2.5, its judgments aligned well with human ratings. The team also tackled position bias and has released all data, reasoning outputs, and code for public use.

The methodology centers on evaluating translations across key quality aspects, including accuracy, terminology, audience suitability, and clarity. For poetic texts like haikus, emotional tone replaces standard grammar checks. Translations are broken down and assessed span by span, scored on a 1–5 scale, and then ranked. To reduce bias, the study compares three evaluation strategies: single-step, two-step, and a more reliable interleaving method. A “no-reasoning” method is also tested but lacks transparency and is prone to bias. Finally, human experts reviewed selected translations to compare their judgments with those of the system, offering insights into its alignment with professional standards.

The researchers evaluated translation ranking systems using datasets with human scores, comparing their TransEvalnia models (Qwen and Sonnet) with MT-Ranker, COMET-22/23, XCOMET-XXL, and MetricX-XXL. On WMT-2024 en-es, MT-Ranker performed best, likely due to rich training data. However, in most other datasets, TransEvalnia matched or outperformed MT-Ranker; for example, Qwen’s no-reasoning approach led to a win on WMT-2023 en-de. Position bias was analyzed using inconsistency scores, where interleaved methods often had the lowest bias (e.g., 1.04 on Hard en-ja).
Human raters gave Sonnet the highest overall Likert scores (4.37–4.61), with Sonnet’s evaluations correlating well with human judgment (Spearman’s R~0.51–0.54).

In conclusion, TransEvalnia is a prompting-based system for evaluating and ranking translations using LLMs like Claude 3.5 Sonnet and Qwen. The system provides detailed scores across key quality dimensions, inspired by the MQM framework, and selects the better translation among options. It often matches or outperforms MT-Ranker on several WMT language pairs, although MetricX-XXL leads on WMT due to fine-tuning. Human raters found Sonnet’s outputs to be reliable, and scores showed a strong correlation with human judgments. Fine-tuning Qwen improved performance notably. The team also explored solutions to position bias, a persistent challenge in ranking systems, and shared all evaluation data and code.

The post TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs appeared first on MarkTechPost.
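A minimal sketch of a prompting-based, MQM-inspired evaluator in the spirit of the system described above. The dimension list, prompt wording, and the call_llm stub are assumptions; TransEvalnia's actual prompts, span-level scoring, and interleaving procedure are documented in the released code and paper.

```python
# Hedged sketch: ask an LLM to score a translation on MQM-style dimensions.
import json

DIMENSIONS = ["accuracy", "terminology", "audience suitability", "clarity"]

def build_prompt(source: str, translation: str) -> str:
    dims = "\n".join(f"- {d}: score 1-5 with a one-sentence rationale" for d in DIMENSIONS)
    return (
        "You are a professional translation evaluator.\n"
        f"Source text:\n{source}\n\nTranslation:\n{translation}\n\n"
        "Evaluate the translation on each dimension, then give an overall 1-5 "
        "Likert rating. Respond as JSON with keys 'dimensions' and 'overall'.\n"
        f"{dims}"
    )

def call_llm(prompt: str) -> str:
    """Stub: plug in any chat-completion client (e.g. a Claude or Qwen wrapper)."""
    raise NotImplementedError

def evaluate(source: str, translation: str) -> dict:
    return json.loads(call_llm(build_prompt(source, translation)))
```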
Deep Research (DR) agents have rapidly gained popularity in both research and industry, thanks to recent progress in LLMs. However, most popular public DR agents are not designed with human thinking and writing processes in mind. They often lack structured steps that support human researchers, such as drafting, searching, and using feedback. Current DR agents compile test-time algorithms and various tools without cohesive frameworks, highlighting the critical need for purpose-built frameworks that can match or exceed human research capabilities. The absence of human-inspired cognitive processes in current methods creates a gap between how humans do research and how AI agents handle complex research tasks.

Existing works, such as test-time scaling, utilize iterative refinement algorithms, debate mechanisms, tournaments for hypothesis ranking, and self-critique systems to generate research proposals. Multi-agent systems utilize planners, coordinators, researchers, and reporters to produce detailed responses, while some frameworks enable human co-pilot modes for feedback integration. Agent tuning approaches focus on training through multitask learning objectives, component-wise supervised fine-tuning, and reinforcement learning to improve search and browsing capabilities. LLM diffusion models attempt to break autoregressive sampling assumptions by generating complete noisy drafts and iteratively denoising tokens for high-quality outputs.

Researchers at Google introduced Test-Time Diffusion Deep Researcher (TTD-DR), inspired by the iterative nature of human research through repeated cycles of searching, thinking, and refining. It conceptualizes research report generation as a diffusion process, starting with a draft that serves as an updatable outline and evolving foundation to guide research direction. The draft undergoes iterative refinement through a “denoising” process, dynamically informed by a retrieval mechanism that incorporates external information at each step. This draft-centric design makes report writing more timely and coherent while reducing information loss during iterative search processes. TTD-DR achieves state-of-the-art results on benchmarks that require intensive search and multi-hop reasoning.

The TTD-DR framework addresses limitations of existing DR agents that employ linear or parallelized processes. The proposed backbone DR agent contains three major stages: Research Plan Generation, Iterative Search and Synthesis, and Final Report Generation, each containing unit LLM agents, workflows, and agent states. The agent utilizes self-evolving algorithms to enhance the performance of each stage, helping it to find and preserve high-quality context. The proposed algorithm, inspired by recent self-evolution work, is implemented in a parallel workflow along with sequential and loop workflows. This algorithm can be applied to all three stages of agents to improve overall output quality.

In side-by-side comparisons with OpenAI Deep Research, TTD-DR achieves 69.1% and 74.5% win rates for long-form research report generation tasks, while outperforming it by 4.8%, 7.7%, and 1.7% on three research datasets with short-form ground-truth answers. It shows strong performance in Helpfulness and Comprehensiveness auto-rater scores, especially on LongForm Research datasets. Moreover, the self-evolution algorithm achieves 60.9% and 59.8% win rates against OpenAI Deep Research on LongForm Research and DeepConsult.
The correctness score shows an enhancement of 1.5% and 2.8% on HLE datasets, though the performance on GAIA remains 4.4% below OpenAI DR. The incorporation of Diffusion with Retrieval leads to substantial gains over OpenAI Deep Research across all benchmarks.

In conclusion, Google presents TTD-DR, a method that addresses fundamental limitations through human-inspired cognitive design. The framework conceptualizes research report generation as a diffusion process, utilizing an updatable draft skeleton that guides research direction. TTD-DR, enhanced by self-evolutionary algorithms applied to each workflow component, ensures high-quality context generation throughout the research process. Moreover, evaluations demonstrate TTD-DR’s state-of-the-art performance across various benchmarks that require intensive search and multi-hop reasoning, with superior results in both comprehensive long-form research reports and concise multi-hop reasoning tasks.

The post Google AI Introduces the Test-Time Diffusion Deep Researcher (TTD-DR): A Human-Inspired Diffusion Framework for Advanced Deep Research Agents appeared first on MarkTechPost.
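A minimal sketch of the draft-as-diffusion loop the article describes: start from a rough draft, then repeatedly "denoise" it with newly retrieved evidence. The call_llm and search stubs, the step count, and the prompt wording are assumptions, not Google's implementation.

```python
# Hedged sketch: draft-centric research loop (plan -> draft -> retrieve -> revise).
def call_llm(prompt: str) -> str:
    """Stub: any instruction-following LLM."""
    raise NotImplementedError

def search(query: str) -> str:
    """Stub: any retrieval backend (web search, RAG index, ...)."""
    raise NotImplementedError

def deep_research(question: str, steps: int = 5) -> str:
    plan = call_llm(f"Write a research plan for: {question}")
    draft = call_llm(f"Write a rough first-draft report for: {question}\nPlan:\n{plan}")
    for _ in range(steps):
        # The current draft guides what to look up next (draft-centric search).
        query = call_llm(f"Given this draft, what single search query would most improve it?\n{draft}")
        evidence = search(query)
        # "Denoising": revise the draft with the newly retrieved evidence.
        draft = call_llm(
            "Revise the draft using the new evidence. Keep what is already correct.\n"
            f"Draft:\n{draft}\n\nEvidence:\n{evidence}"
        )
    return call_llm(f"Polish into a final report:\n{draft}")
```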