YouZum


Explaining Length Bias in LLM-Based Preference Evaluations

arXiv:2407.01085v4 Announce Type: replace-cross Abstract: The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but this practice exhibits a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand this bias, we propose decomposing the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass. The former is length-independent and relates to trustworthiness factors such as correctness, toxicity, and consistency; the latter is length-dependent and represents the amount of information in the response. We empirically demonstrate this decomposition through controlled experiments and find that response length impacts evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses within equivalent length intervals.
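The abstract describes AdapAlpaca only at a high level. As a rough, hypothetical illustration of the length-interval idea (not the authors' implementation), one can bucket responses by length and compute win rates only within matching buckets; the bucket width and the judge function below are assumptions.

from collections import defaultdict

def length_bucket(text, width=200):
    # Assign a response to a length interval, e.g. [0, 200), [200, 400), ...
    # The 200-character bucket width is an arbitrary choice for illustration.
    return len(text) // width

def adjusted_win_rate(test_responses, reference_responses, judge):
    # Compare test vs. reference responses only within matching length buckets.
    # `judge(a, b)` is a hypothetical preference function returning True if a wins.
    ref_by_bucket = defaultdict(list)
    for ref in reference_responses:
        ref_by_bucket[length_bucket(ref)].append(ref)

    wins, total = 0, 0
    for resp in test_responses:
        for ref in ref_by_bucket.get(length_bucket(resp), []):
            wins += judge(resp, ref)
            total += 1
    return wins / total if total else None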

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

arXiv:2507.14201v2 Announce Type: replace-cross Abstract: We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent x on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real-world security analysts must sift through a large number of heterogeneous alert signals and security logs, follow multi-hop chains of evidence, and compile an incident report. With the development of LLMs, building LLM-based agents for automatic threat investigation is a promising direction. To assist the development and evaluation of LLM agents, we construct a dataset from a controlled Azure tenant that covers 8 simulated real-world multi-step attacks, 57 log tables from Microsoft Sentinel and related services, and 589 automatically generated questions. We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as the answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground-truth answers but also makes the pipeline reusable and readily extensible to new logs. This also enables the automatic generation of procedural tasks with verifiable rewards, which can naturally be extended to training agents via reinforcement learning. Our comprehensive experiments with different models confirm the difficulty of the task: in the base setting, the average reward across all evaluated models is 0.249, and the best achieved is 0.368, leaving substantial headroom for future research. Code and data are coming soon!
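The question-generation step (start node as context, end node as answer) can be made concrete with a minimal, hypothetical sketch over a generic graph; this is not the authors' pipeline, and the node attributes are placeholders.

import itertools
import networkx as nx

def candidate_question_pairs(graph: nx.DiGraph, max_hops: int = 3):
    # Yield (context, answer) node pairs connected by a multi-hop path.
    # In ExCyTIn-Bench the nodes would be alerts/log entities; here they are generic.
    for start, end in itertools.permutations(graph.nodes, 2):
        if nx.has_path(graph, start, end):
            path = nx.shortest_path(graph, start, end)
            if 1 < len(path) - 1 <= max_hops:
                yield {
                    "context": graph.nodes[start],   # background given to the LLM
                    "answer": graph.nodes[end],      # ground-truth target
                    "evidence_path": path,           # explainable chain of evidence
                }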

Annotation and modeling of emotions in a textual corpus: an evaluative approach

arXiv:2509.01260v1 Announce Type: new Abstract: Emotion is a crucial phenomenon in the functioning of human beings in society. However, it remains a largely open subject, particularly in its textual manifestations. This paper examines an industrial corpus manually annotated following an evaluative approach to emotion. This theoretical framework, which is currently underutilized, offers a different perspective that complements traditional approaches. Noting that the annotations we collected exhibit significant disagreement, we hypothesized that they nonetheless follow stable statistical trends. Using language models trained on these annotations, we demonstrate that it is possible to model the labeling process and that variability is driven by underlying linguistic features. In turn, our results indicate that language models seem capable of distinguishing emotional situations based on evaluative criteria.

Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal

arXiv:2509.01455v1 Announce Type: new Abstract: Deployed language models must decide not only what to answer but also when not to answer. We present UniCR, a unified framework that turns heterogeneous uncertainty evidence including sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback into a calibrated probability of correctness and then enforces a user-specified error budget via principled refusal. UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and offers distribution-free guarantees using conformal risk control. For long-form generation, we align confidence with semantic fidelity by supervising on atomic factuality scores derived from retrieved evidence, reducing confident hallucinations while preserving coverage. Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under the risk-coverage curve, and higher coverage at fixed risk compared to entropy or logit thresholds, post-hoc calibrators, and end-to-end selective baselines. Analyses reveal that evidence contradiction, semantic dispersion, and tool inconsistency are the dominant drivers of abstention, yielding informative user-facing refusal messages. The result is a portable recipe of evidence fusion to calibrated probability to risk-controlled decision that improves trustworthiness without fine-tuning the base model and remains valid under distribution shift.
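The abstract names two ingredients that are easy to show in isolation: temperature scaling of a confidence score and a risk-controlled refusal threshold. The sketch below is a hedged simplification (the feature fusion, calibration data, and threshold rule are assumptions, and the conformal step is replaced by a plain empirical-risk rule), not the UniCR implementation.

import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    # Fit a single temperature T on held-out (score, correctness) pairs by
    # minimizing negative log-likelihood -- standard temperature scaling.
    # `logits` and `labels` are 1-D numpy arrays; labels are 0/1 correctness.
    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))  # calibrated correctness probability
        eps = 1e-12
        return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def refusal_threshold(calib_probs, calib_labels, error_budget=0.1):
    # Pick the lowest confidence threshold whose empirical error on the
    # calibration set stays within the user-specified budget (a simplified
    # stand-in for conformal risk control; maximizes coverage at fixed risk).
    for t in np.sort(calib_probs):
        answered = calib_probs >= t
        if answered.any() and (1 - calib_labels[answered]).mean() <= error_budget:
            return t
    return 1.0  # refuse everything if no threshold meets the budget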

Tencent Hunyuan Open-Sources Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B: State-of-the-Art Multilingual Translation Models

Introduction

Tencent's Hunyuan team has released Hunyuan-MT-7B (a translation model) and Hunyuan-MT-Chimera-7B (an ensemble model). Both models are designed specifically for multilingual machine translation and were introduced in conjunction with Tencent's participation in the WMT2025 General Machine Translation shared task, where Hunyuan-MT-7B ranked first in 30 out of 31 language pairs.

Technical report: https://github.com/Tencent-Hunyuan/Hunyuan-MT/blob/main/Hunyuan_MT_Technical_Report.pdf

Model Overview

Hunyuan-MT-7B
- A 7B-parameter translation model.
- Supports mutual translation across 33 languages, including Chinese ethnic minority languages such as Tibetan, Mongolian, Uyghur, and Kazakh.
- Optimized for both high-resource and low-resource translation tasks, achieving state-of-the-art results among models of comparable size.

Hunyuan-MT-Chimera-7B
- An integrated weak-to-strong fusion model.
- Combines multiple translation outputs at inference time and produces a refined translation using reinforcement learning and aggregation techniques.
- Represents the first open-source translation model of this type, improving translation quality beyond single-system outputs.

Training Framework

The models were trained using a five-stage framework designed for translation tasks:

1. General pre-training: 1.3 trillion tokens covering 112 languages and dialects. Multilingual corpora were assessed for knowledge value, authenticity, and writing style, with diversity maintained through disciplinary, industry, and thematic tagging systems.
2. MT-oriented pre-training: monolingual corpora from mC4 and OSCAR, filtered using fastText (language identification), minLSH (deduplication), and KenLM (perplexity filtering); parallel corpora from OPUS and ParaCrawl, filtered with CometKiwi. 20% of the general pre-training data is replayed to avoid catastrophic forgetting.
3. Supervised fine-tuning (SFT): Stage I uses ~3M parallel pairs (Flores-200, WMT test sets, curated Mandarin-minority data, synthetic pairs, instruction-tuning data); Stage II uses ~268k high-quality pairs selected through automated scoring (CometKiwi, GEMBA) and manual verification.
4. Reinforcement learning (RL): GRPO with reward functions based on XCOMET-XXL and DeepSeek-V3-0324 quality scoring, terminology-aware rewards (TAT-R1), and repetition penalties to avoid degenerate outputs.
5. Weak-to-strong RL: multiple candidate outputs are generated and aggregated through reward-based output selection. Applied in Hunyuan-MT-Chimera-7B, this improves translation robustness and reduces repetitive errors (a sketch of the aggregation idea follows the benchmark results below).

Benchmark Results

Automatic evaluation:
- WMT24pp (English⇔XX): Hunyuan-MT-7B achieved 0.8585 (XCOMET-XXL), surpassing larger models such as Gemini-2.5-Pro (0.8250) and Claude-Sonnet-4 (0.8120).
- FLORES-200 (33 languages, 1056 pairs): Hunyuan-MT-7B scored 0.8758 (XCOMET-XXL), outperforming open-source baselines including Qwen3-32B (0.7933).
- Mandarin⇔minority languages: scored 0.6082 (XCOMET-XXL), higher than Gemini-2.5-Pro (0.5811), showing significant improvements in low-resource settings.

Comparative results:
- Outperforms Google Translate by 15-65% across evaluation categories.
- Outperforms specialized translation models such as Tower-Plus-9B and Seed-X-PPO-7B despite having fewer parameters.
- Chimera-7B adds ~2.3% improvement on FLORES-200, particularly in Chinese⇔Other and non-English⇔non-Chinese translations.
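To make the weak-to-strong fusion idea concrete, here is a minimal sketch of reward-based candidate aggregation. It is an illustration only: Hunyuan-MT-Chimera-7B's actual fusion is a trained model rather than simple reranking, and the `translate` and `reward` callables below are hypothetical.

def chimera_style_fusion(source_text, translate, reward, n_candidates=6):
    # Generate several candidate translations and return the one the reward
    # model scores highest (a crude stand-in for learned weak-to-strong fusion).
    candidates = [translate(source_text, seed=i) for i in range(n_candidates)]
    ranked = sorted(candidates, key=lambda c: reward(source_text, c), reverse=True)
    return ranked[0], ranked  # best candidate plus the full ranked list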
Human Evaluation

A custom evaluation set covering social, medical, legal, and internet domains compared Hunyuan-MT-7B with state-of-the-art models:
- Hunyuan-MT-7B: avg. 3.189
- Gemini-2.5-Pro: avg. 3.223
- DeepSeek-V3: avg. 3.219
- Google Translate: avg. 2.344

This shows that Hunyuan-MT-7B, despite having only 7B parameters, approaches the quality of much larger proprietary models.

Case Studies

The report highlights several real-world cases:
- Cultural references: correctly translates "小红薯" as the platform "REDnote," unlike Google Translate's "sweet potatoes."
- Idioms: interprets "You are killing me" as "你真要把我笑死了" (expressing amusement), avoiding literal misinterpretation.
- Medical terms: translates "uric acid kidney stones" precisely, while baselines generate malformed outputs.
- Minority languages: for Kazakh and Tibetan, Hunyuan-MT-7B produces coherent translations where baselines fail or output nonsensical text.
- Chimera enhancements: adds improvements in gaming jargon, intensifiers, and sports terminology.

Conclusion

Tencent's release of Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B establishes a new standard for open-source translation. By combining a carefully designed training framework with a specialized focus on low-resource and minority-language translation, the models achieve quality on par with or exceeding larger closed-source systems. The launch of these two models provides the AI research community with accessible, high-performance tools for multilingual translation research and deployment.

Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project. The post Tencent Hunyuan Open-Sources Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B: State-of-the-Art Multilingual Translation Models appeared first on MarkTechPost.

Here’s how we picked this year’s Innovators Under 35

Next week, we'll publish our 2025 list of Innovators Under 35, highlighting smart and talented people who are working in many areas of emerging technology. This new class features 35 accomplished founders, hardware engineers, roboticists, materials scientists, and others who are already tackling tough problems and making big moves in their careers. All are under the age of 35.

One is developing a technology to reduce emissions from shipping, while two others are improving fertility treatments and creating new forms of contraception. Another is making it harder for people to maliciously share intimate images online. And quite a few are applying artificial intelligence to their respective fields in novel ways.

We'll also soon reveal our 2025 Innovator of the Year, whose technical prowess is helping physicians diagnose and treat critically ill patients more quickly. What's more (here's your final hint), our winner even set a world record as a result of this work.

MIT Technology Review first published a list of Innovators Under 35 in 1999. It's a grand tradition for us, and we often follow the work of various featured innovators for years, even decades, after they appear on the list. So before the big announcement, I want to take a moment to explain how we select the people we recognize each year.

Step 1: Call for nominations

Our process begins with a call for nominations, which typically goes out in the final months of the previous year and is open to anyone, anywhere in the world. We encourage people to nominate themselves, which takes just a few minutes. This method helps us discover people doing important work that we might not otherwise encounter.

This year we had 420 nominations. Two-thirds of our candidates were put forward by someone else and one-third nominated themselves. We received nominations for people located in about 40 countries. Nearly 70% were based in the United States, with the UK, Switzerland, China, and the United Arab Emirates, respectively, having the next-highest concentrations.

After nominations close, a few editors spend several weeks reviewing the nominees and selecting semifinalists. During this phase, we look for people who have developed practical solutions to societal issues or made important scientific advances that could translate into new technologies. Their work should have the potential for broad impact; it can't be niche or incremental. And what's unique about their approach must be clear.

Step 2: Semifinalist applications

This year, we winnowed our initial list of hundreds of nominees to 108 semifinalists. Then we asked those entrants for more information to help us get to know them better and evaluate their work.

We request three letters of reference and a résumé from each semifinalist, and we ask all of them to answer a few short questions about their work. We also give them the option to share a video or pass along relevant journal articles or other links to help us learn more about what they do.

Step 3: Expert judges weigh in

Next, we bring in dozens of experts to vet the semifinalists. This year, 38 judges evaluated and scored the applications. We match the contenders with judges who work in similar fields whenever possible. At least two judges review each entrant, though most are seen by three.

All these judges volunteer their time, and some return to help year after year. A few of our longtime judges include materials scientists Yet-Ming Chiang (MIT) and Julia Greer (Caltech), MIT neuroscientist Ed Boyden, and computer scientist Ben Zhao of the University of Chicago.

John Rogers, a materials scientist and biomedical engineer at Northwestern University, has been a judge for more than a decade (and was featured on our very first Innovators list, in 1999). Here's what he had to say about why he stays involved: "This award is compelling because it recognizes young people with scientific achievements that are not only of fundamental interest but also of practical significance, at the highest levels."

Step 4: Editors make the final calls

In a final layer of vetting, editors who specialize in covering biotechnology, climate and energy, and artificial intelligence review the semifinalists whom judges scored highly in their respective areas. Staff editors and reporters can also nominate people they've come across in their coverage, and we add them to the mix for consideration.

Last, a small team of senior editors reviews all the semifinalists and the judges' scores, as well as our own staff's recommendations, and selects 35 honorees. We aim for a good combination of people from a variety of disciplines working in different regions of the world. And we take a staff vote to pick an Innovator of the Year, someone whose work we particularly admire.

In the end, it's impossible to include every deserving individual on our list. But by incorporating both external nominations and outside expertise from our judges, we aim to make the evaluation process as rigorous and open as possible.

So who made the cut this year? Come back on September 8 to find out.

NVIDIA AI Team Introduces Jetson Thor: The Ultimate Platform for Physical AI and Next-Gen Robotics

Last week, the NVIDIA robotics team released Jetson Thor, which includes the Jetson AGX Thor Developer Kit and the Jetson T5000 module, marking a significant milestone for real-world AI robotics development. Engineered as a supercomputer for physical AI, Jetson Thor brings generative reasoning and multimodal sensor processing to power inference and decision-making at the edge.

Architectural Highlights

Compute performance: Jetson Thor delivers up to 2,070 FP4 teraflops (TFLOPS) of AI compute via its Blackwell-based GPU, a 7.5x leap over the previous Jetson Orin platform. This performance arrives in a 130-watt power envelope, with configurable operation down to 40 W, balancing high throughput with energy efficiency (approximately 3.5x better than Orin).

Compute architecture: At its core, Jetson Thor integrates a 2560-core Blackwell GPU equipped with 96 fifth-generation Tensor Cores and supports Multi-Instance GPU (MIG), enabling flexible partitioning of GPU resources for parallel workloads. Complementing this is a 14-core Arm Neoverse-V3AE CPU with 1 MB of L2 cache per core and 16 MB of shared L3 cache.

Memory and I/O: The platform includes 128 GB of LPDDR5X memory on a 256-bit bus with 273 GB/s of bandwidth (see the back-of-envelope sketch after the pricing section for what this implies for on-device LLM decoding). Storage and connectivity include a 1 TB NVMe M.2 slot, along with HDMI, DisplayPort, multiple USB ports, Gigabit Ethernet, CAN headers, and QSFP28 for up to four 25 GbE lanes, which is crucial for real-time sensor fusion.

Reference: https://developer.nvidia.com/blog/introducing-nvidia-jetson-thor-the-ultimate-platform-for-physical-ai/

Software Ecosystem for Physical AI

Jetson Thor supports a comprehensive NVIDIA software stack tailored for robotics and physical AI:
- Isaac (GR00T) for generative reasoning and humanoid control.
- Metropolis for vision AI.
- Holoscan for real-time, low-latency sensor processing and sensor-over-Ethernet (Holoscan Sensor Bridge).

These components allow one system-on-module to execute multimodal AI workflows (vision, language, actuation) without offloading to or combining multiple chips.

Defining 'Physical AI' and Its Significance

Generative reasoning and multimodal processing: Physical AI combines perception, reasoning, and action planning. Jetson Thor enables robots to "simulate possible sequences, anticipate consequences, and generate both high-level plans and low-level motion policies," delivering adaptability akin to human reasoning. By supporting real-time inference over language and visual inputs, it transforms robots from simple automata into generalist agents.

Applications: Robots can better navigate unpredictable environments, manipulate objects, or follow complex instructions without retraining. Use cases span manufacturing, logistics, healthcare, agriculture, and more.

Developer Access and Pricing

- Jetson AGX Thor Developer Kit: priced at $3,499, now generally available.
- Jetson T5000 production modules: available through NVIDIA's partners, with unit pricing around $2,999 for orders of 1,000.
- Pre-orders suggest wider availability soon, catering to both research and commercial robotics ecosystems.
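As a rough back-of-envelope illustration of what the 128 GB / 273 GB/s memory figures above imply for on-device LLM inference, the sketch below estimates an upper bound on decode throughput. The assumptions (weight-bound single-stream decoding, no KV-cache or batching effects) are simplifications, not NVIDIA's numbers.

def rough_decode_tokens_per_second(params_billion, bytes_per_param, bandwidth_gbs=273.0):
    # Upper-bound estimate: each decoded token reads every weight once, so
    # throughput is limited by memory bandwidth over model size in bytes.
    # Real systems differ substantially; this ignores compute, KV cache, batching.
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / model_bytes

# Example: a 70B model in 4-bit (~0.5 bytes/param) on a 273 GB/s bus
print(round(rough_decode_tokens_per_second(70, 0.5), 1), "tokens/s (rough upper bound)")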
Conclusion

NVIDIA Jetson Thor represents a pivotal shift in robotics compute, embedding server-grade multimodal inference and reasoning capabilities within a single, power-bounded module. Its combination of 2,070 FP4 TFLOPS, high-efficiency design, expansive I/O, and a robust software stack positions it as a foundational platform for the next generation of physical AI systems. With early adoption among prominent robotics developers and ready availability, Jetson Thor brings the vision of adaptable, real-world AI agents closer to reality.

Check out the full technical details. The post NVIDIA AI Team Introduces Jetson Thor: The Ultimate Platform for Physical AI and Next-Gen Robotics appeared first on MarkTechPost.

The Download: AI doppelgängers in the workplace, and using lidar to measure climate disasters

This is today's edition of The Download, our weekday newsletter that provides a daily dose of what's going on in the world of technology.

Can an AI doppelgänger help me do my job?
—James O'Donnell

Digital clones—AI models that replicate a specific person—package together a few technologies that have been around for a while now: hyperrealistic video models to match your appearance, lifelike voices based on just a couple of minutes of speech recordings, and conversational chatbots increasingly capable of holding our attention.

But they're also offering something the ChatGPTs of the world cannot: an AI that's not smart in the general sense, but that 'thinks' like you do. Could well-crafted clones serve as our stand-ins? I certainly feel stretched thin at work sometimes, wishing I could be in two places at once, and I bet you do too. To find out, I tried making a clone of myself. Read the full story to find out how it got on.

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

How lidar measures the cost of climate disasters

The wildfires that swept through Los Angeles County this January left an indelible mark on the Southern California landscape. The Eaton and Palisades fires raged for 24 days, killing 29 people and destroying 16,000 structures, with losses estimated at $60 billion. More than 55,000 acres were consumed, and the landscape itself was physically transformed.

Now, researchers are using lidar (light detection and ranging) technology to precisely measure these changes in the landscape's geometry—helping them understand and track the cascading effects of climate disasters. Read the full story.
—Jon Keegan

This story is from our new print edition, which is all about the future of security. Subscribe here to catch future copies when they land.

Here's how we picked this year's Innovators Under 35

Next Monday we'll publish our 2025 list of Innovators Under 35. The list highlights smart and talented people working across many areas of emerging technology. This new class features 35 accomplished founders, hardware engineers, roboticists, materials scientists, and others who are already tackling tough problems and making big moves in their careers.

MIT Technology Review first published a list of Innovators Under 35 in 1999. It's a grand tradition for us, and we often follow the work of various featured innovators for years, even decades, after they appear on the list. So before the big announcement, we'd like to take a moment to explain how we select the people we recognize each year. Read the full story.
—Amy Nordrum

The must-reads

I've combed the internet to find you today's most fun/important/scary/fascinating stories about technology.

1 Meta created flirty chatbots of celebrities without their permission
To make matters worse, the bots generated risqué pictures on demand. (Reuters)
+ Meta's relationship with Scale AI appears to be under pressure. (TechCrunch)
+ An AI companion site is hosting sexually charged conversations with underage celebrity bots. (MIT Technology Review)

2 The FTC has warned Big Tech not to comply with EU laws
If they jeopardize the freedom of expression or safety of US citizens, at least. (Wired $)

3 Ukraine is using drones to drop supplies to its troops in trenches
They're delivering everything from cigarettes to roasted chicken. (WP $)
+ Meet the radio-obsessed civilian shaping Ukraine's drone defense. (MIT Technology Review)

4 What the collapse of this AI company says about the wider industry
Builder.ai was an early industry darling. Its downfall is a dire warning. (NYT $)

5 US shoppers are racing to land an EV bargain
Federal tax credits on the vehicles expire at the end of the month. (WSJ $)
+ The US could really use an affordable electric truck. (MIT Technology Review)

6 A major new project will use AI to research vaccines
The Oxford Vaccine Group hopes the jabs will protect against deadly pathogens. (FT $)
+ Why US federal health agencies are abandoning mRNA vaccines. (MIT Technology Review)

7 A lot of people stop taking weight-loss drugs within one year
How should doctors encourage the ones who need to stay on them? (Undark)
+ We're learning more about what weight-loss drugs do to the body. (MIT Technology Review)

8 Chatbots can be manipulated into breaking their own rules
It turns out they're susceptible to both flattery and peer pressure. (The Verge)
+ Forcing LLMs to be evil during training can make them nicer in the long run. (MIT Technology Review)

9 Tennis is trying to reach a new generation of fans
Through…the metaverse? (The Information $)

10 The age of cheap online shopping is ending
And consumers are the ones paying the price. (The Atlantic $)
+ AI is starting to shake up the digital shopping experience, too. (FT $)
+ Your most important customer may be AI. (MIT Technology Review)

Quote of the day

"Stop being a clanker!"
—How Jay Pinkert, a marketing manager, scolds ChatGPT when it isn't fulfilling his requests, he tells the New York Times.

One more thing

The algorithms around us

A metronome ticks. A record spins. And as a feel-good pop track plays, a giant compactor slowly crushes a Jenga tower of material creations. Paint cans burst. Chess pieces topple. Camera lenses shatter. An alarm clock shrills and then goes silent. A guitar neck snaps. But wait! The jaunty tune starts up again, and the jaws open to reveal … an iPad.

Watching Apple's now-infamous "Crush!" ad, it's hard not to feel uneasy about the ways in which digitization is remaking human life. Sure, we're happy for computers to take over tasks we don't want to do or aren't particularly good at, like shopping or navigating. But what does it mean when the things we hold dear and thought were uniquely ours—our friendships, our art, even our language and creativity—can be reduced to software? Read the full story.
—Ariel Bleicher

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet 'em at me.)
+ Minnesota's Llama-Alpaca Costume Contest

Meet Elysia: A New Open-Source Python Framework Redefining Agentic RAG Systems with Decision Trees and Smarter Data Handling

If you've ever tried to build an agentic RAG system that actually works well, you know the pain. You feed it some documents, cross your fingers, and hope it doesn't hallucinate when someone asks it a simple question. Most of the time, you get back irrelevant chunks of text that barely answer what was asked.

Elysia is trying to fix this mess, and honestly, its approach is quite creative. Built by the folks at Weaviate, this open-source Python framework doesn't just throw more AI at the problem; it completely rethinks how AI agents should work with your data.

Note: Python 3.12 required.

What's Actually Wrong with Most RAG Systems

Here's the thing that drives everyone crazy: traditional RAG systems are basically blind. They take your question, convert it to vectors, find some "similar" text, and hope for the best. It's like asking someone to find you a good restaurant while they're wearing a blindfold: they might get lucky, but probably not. Most systems also dump every possible tool on the AI at once, which is like giving a toddler access to your entire toolbox and expecting them to build a bookshelf.

Elysia's Three Pillars

1) Decision trees

Instead of giving AI agents every tool at once, Elysia guides them through a structured tree of decision nodes. Think of it like a flowchart that actually makes sense. Each step has context about what happened before and what options come next. The really cool part? The system shows you exactly which path the agent took and why, so when something goes wrong, you can actually debug it instead of just shrugging and trying again. When the AI realizes it can't do something (like searching for car prices in a makeup database), it doesn't just keep trying forever. It sets an "impossible flag" and moves on, which sounds obvious but apparently needed to be invented.

2) Smart data source display

Remember when every AI just spat out paragraphs of text? Elysia actually looks at your data and figures out how to show it properly. Got e-commerce products? You get product cards. GitHub issues? You get ticket layouts. Spreadsheet data? You get actual tables. The system examines your data structure first (the fields, the types, the relationships), then picks one of seven formats that makes sense.

3) Data expertise

This might be the biggest difference. Before Elysia searches anything, it analyzes your database to understand what's actually in there. It can summarize, generate metadata, and choose display types. It looks at:
- what kinds of fields you have,
- what the data ranges look like,
- how different pieces relate to each other, and
- what would make sense to search for.

How Does It Work?

Learning from feedback: Elysia remembers when users say "yes, this was helpful" and uses those examples to improve future responses. But it does this smartly: your feedback doesn't mess up other people's results, and it helps the system get better at answering your specific types of questions. This means you can use smaller, cheaper models that still give good results because they're learning from actual success cases.

Chunking that makes sense: Most RAG systems chunk all your documents upfront, which uses tons of storage and often creates weird breaks. Elysia chunks documents only when needed. It searches full documents first; then, if a document looks relevant but is too long, it breaks it down on the fly. This saves storage space and actually works better because the chunking decisions are informed by what the user is actually looking for. (A generic sketch of this query-time chunking idea follows below.)
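Elysia's own chunking API isn't shown here; as a library-agnostic illustration of the query-time idea just described (rank whole documents first, split only the long-but-relevant ones), a sketch might look like the following, with `score`, `CHUNK_SIZE`, and `max_doc_len` as placeholder assumptions.

CHUNK_SIZE = 1200  # characters; arbitrary for illustration

def chunk(text, size=CHUNK_SIZE, overlap=200):
    # Simple sliding-window chunking, applied only on demand.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def lazy_retrieve(query, documents, score, top_k=3, max_doc_len=4000):
    # Rank whole documents first; split only long-but-relevant ones,
    # then re-rank the resulting passages against the query.
    ranked_docs = sorted(documents, key=lambda d: score(query, d), reverse=True)[:top_k]
    passages = []
    for doc in ranked_docs:
        passages.extend(chunk(doc) if len(doc) > max_doc_len else [doc])
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:top_k]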
Model routing: Different tasks need different models. Simple questions don't need GPT-4, and complex analysis doesn't work well with tiny models. Elysia automatically routes tasks to the right model based on complexity, which saves money and improves speed.

Reference: https://weaviate.io/blog/elysia-agentic-rag

Getting Started

The setup is quite simple:

pip install elysia-ai
elysia start

That's it. You get both a web interface and the Python framework.

For developers who want to customize things:

# Register a custom tool on a decision tree and ask a question that uses it.
from elysia import tool, Tree

tree = Tree()

@tool(tree=tree)
async def add(x: int, y: int) -> int:
    return x + y

tree("What is the sum of 9009 and 6006?")

If you have Weaviate data, it's even simpler:

# Query an existing Weaviate collection through a decision tree.
import elysia

tree = elysia.Tree()
response, objects = tree(
    "What are the 10 most expensive items in the Ecommerce collection?",
    collection_names=["Ecommerce"],
)

Real-World Example: Glowe's Chatbot

The Glowe skincare chatbot platform uses Elysia to handle complex product recommendations. Users can ask things like "What products work well with retinol but won't irritate sensitive skin?" and get intelligent responses that consider ingredient interactions, user preferences, and product availability. This isn't just keyword matching; it's understanding context and the relationships between ingredients, user history, and product characteristics in ways that would be really hard to code manually.

Summary

Elysia represents Weaviate's attempt to move beyond traditional ask-retrieve-generate RAG patterns by combining decision-tree agents, adaptive data presentation, and learning from user feedback. Rather than just generating text responses, it analyzes data structure beforehand and selects appropriate display formats while maintaining transparency in its decision-making process. As Weaviate's planned replacement for their Verba RAG system, it offers a foundation for building more sophisticated AI applications that understand both what users are asking and how to present answers effectively, though whether this translates to meaningfully better real-world performance remains to be seen since it is still in beta.

Check out the technical details and GitHub page. The post Meet Elysia: A New Open-Source Python Framework Redefining Agentic RAG Systems with Decision Trees and Smarter Data Handling appeared first on MarkTechPost.

StepFun AI Releases Step-Audio 2 Mini: An Open-Source 8B Speech-to-Speech AI Model that Surpasses GPT-4o-Audio

The StepFun AI team has released Step-Audio 2 Mini, an 8B-parameter speech-to-speech large audio language model (LALM) that delivers expressive, grounded, and real-time audio interaction. Released under the Apache 2.0 license, this open-source model achieves state-of-the-art performance across speech recognition, audio understanding, and speech conversation benchmarks, surpassing commercial systems such as GPT-4o-Audio.

Model: https://huggingface.co/stepfun-ai/Step-Audio-2-mini

Key Features

1. Unified audio-text tokenization. Unlike cascaded ASR+LLM+TTS pipelines, Step-Audio 2 integrates multimodal discrete token modeling, where text and audio tokens share a single modeling stream. This enables seamless reasoning across text and audio, on-the-fly voice style switching during inference, and consistency in semantic, prosodic, and emotional outputs.

2. Expressive and emotion-aware generation. The model doesn't just transcribe speech; it interprets paralinguistic features such as pitch, rhythm, emotion, timbre, and style. This allows conversations with realistic emotional tones such as whispering, sadness, or excitement. On StepEval-Audio-Paralinguistic, Step-Audio 2 achieves 83.1% accuracy, far beyond GPT-4o Audio (43.5%) and Qwen-Omni (44.2%).

3. Retrieval-augmented speech generation. Step-Audio 2 incorporates multimodal RAG (retrieval-augmented generation): web search integration for factual grounding, and audio search, a novel capability that retrieves real voices from a large library and fuses them into responses, enabling voice timbre/style imitation at inference time.

4. Tool calling and multimodal reasoning. The system extends beyond speech synthesis by supporting tool invocation. Benchmarks show that Step-Audio 2 matches textual LLMs in tool selection and parameter accuracy, while uniquely excelling at audio search tool calls, a capability unavailable in text-only LLMs.

Training and Data Scale

- Text + audio corpus: 1.356T tokens
- Audio hours: 8M+ real and synthetic hours
- Speaker diversity: ~50K voices across languages and dialects
- Pretraining pipeline: a multi-stage curriculum covering ASR, TTS, speech-to-speech translation, and emotion-labeled conversational synthesis.

This large-scale training allows Step-Audio 2 Mini to retain strong text reasoning (via its Qwen2-Audio and CosyVoice foundation) while mastering fine-grained audio modeling.

Performance Benchmarks

Paper: https://arxiv.org/abs/2507.16632

Automatic speech recognition (ASR; a minimal WER implementation is sketched after the benchmark list):
- English: average WER 3.14% (beats GPT-4o Transcribe at an average of 4.5%).
- Chinese: average CER 3.08% (significantly lower than GPT-4o and Qwen-Omni).
- Robust across dialects and accents.

Audio understanding (MMAU benchmark):
- Step-Audio 2: 78.0 average, outperforming Omni-R1 (77.0) and Audio Flamingo 3 (73.1).
- Strongest in sound and speech reasoning tasks.

Speech translation:
- CoVoST 2 (S2TT): BLEU 39.26 (highest among open and closed models).
- CVSS (S2ST): BLEU 30.87, ahead of GPT-4o (23.68).

Conversational benchmarks (URO-Bench):
- Chinese conversations: best overall at 83.3 (basic) and 68.2 (pro).
- English conversations: competitive with GPT-4o (83.9 vs. 84.5), far ahead of other open models.
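The ASR numbers above are word (or character) error rates; for readers unfamiliar with the metric, here is a minimal reference implementation of standard Levenshtein-based WER, not StepFun's evaluation code.

def word_error_rate(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + insertions + deletions) / reference length,
    # computed with the usual edit-distance dynamic program over words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # ~0.333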
Conclusion

Step-Audio 2 Mini makes advanced, multimodal speech intelligence accessible to developers and the research community. By combining Qwen2-Audio's reasoning capacity with CosyVoice's tokenization pipeline, and augmenting it with retrieval-based grounding, StepFun has delivered one of the most capable open audio LLMs.

Check out the paper and model on Hugging Face. The post StepFun AI Releases Step-Audio 2 Mini: An Open-Source 8B Speech-to-Speech AI Model that Surpasses GPT-4o-Audio appeared first on MarkTechPost.
