{"id":42322,"date":"2025-10-05T06:52:34","date_gmt":"2025-10-05T06:52:34","guid":{"rendered":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/"},"modified":"2025-10-05T06:52:34","modified_gmt":"2025-10-05T06:52:34","slug":"google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture","status":"publish","type":"post","link":"https:\/\/youzum.net\/zh\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/","title":{"rendered":"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture"},"content":{"rendered":"<p><strong>What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12\u201315 tool-using agents that share notes and stop early? <\/strong>Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced <strong>TUMIX (Tool-Use Mixture)<\/strong>\u2014a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants) and lets them <strong>share intermediate answers over a few refinement rounds<\/strong>, then <strong>stop early<\/strong> via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as <strong>HLE<\/strong>, <strong>GPQA-Diamond<\/strong>, and <strong>AIME (2024\/2025)<\/strong>. 
<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1328\" height=\"886\" data-attachment-id=\"75066\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/10\/04\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/screenshot-2025-10-04-at-3-30-14-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1.png\" data-orig-size=\"1328,886\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-10-04 at 3.30.14\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-300x200.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-1024x683.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1.png\" alt=\"\" class=\"wp-image-75066\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2510.01279<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>So, what exactly is new?<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Mixture over modality, not just more samples<\/strong>: TUMIX runs <strong>~15 agent styles<\/strong> spanning Chain-of-Thought (CoT), code execution, web search, dual-tool agents, and guided variants. Each round, every agent sees (a) the original question and (b) other agents\u2019 previous answers, then proposes a refined answer. 
This <strong>message-passing<\/strong> raises average accuracy early while diversity gradually collapses\u2014so stopping matters.<\/li>\n<li><strong>Adaptive early-termination<\/strong>: An <strong>LLM-as-Judge<\/strong> halts refinement once answers exhibit strong consensus (with a minimum round threshold). This preserves accuracy <strong>at ~49% of the inference cost<\/strong> vs. fixed-round refinement; token cost drops to ~46% because late rounds are token-heavier.<\/li>\n<li><strong>Auto-designed agents<\/strong>: Beyond human-crafted agents, TUMIX prompts the base LLM to <strong>generate new agent types<\/strong>; mixing these with the manual set yields an <strong>additional ~+1.2%<\/strong> average lift without extra cost. The empirical \u201csweet spot\u201d is <strong>~12\u201315 agent styles<\/strong>.<\/li>\n<\/ul>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"1740\" height=\"1020\" data-attachment-id=\"75070\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/10\/04\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/screenshot-2025-10-04-at-3-33-55-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.33.55-PM-1.png\" data-orig-size=\"1740,1020\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-10-04 at 3.33.55\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.33.55-PM-1-300x176.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.33.55-PM-1-1024x600.png\" 
src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.33.55-PM-1.png\" alt=\"\" class=\"wp-image-75070\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2510.01279<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>How does it work?<\/strong><\/h3>\n<p>TUMIX runs a group of heterogeneous agents (text-only Chain-of-Thought, code-executing, web-searching, and guided variants) in parallel, then iterates a small number of refinement rounds in which each agent conditions on the original question plus the other agents\u2019 prior rationales and answers (structured note-sharing). After each round, an LLM-based judge evaluates consensus\/consistency to decide <strong>early termination<\/strong>: if confidence is insufficient, another round is triggered; otherwise, the system finalizes via simple aggregation (e.g., majority vote or selector). This mixture-of-tool-use design trades brute-force re-sampling for <strong>diverse reasoning paths<\/strong>, improving coverage of correct candidates while controlling token\/tool budgets. Empirically, benefits saturate around 12\u201315 agent styles, and stopping early preserves diversity and lowers cost without sacrificing accuracy.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Let\u2019s discuss the results<\/strong><\/h3>\n<p>Under inference budgets comparable to those of strong tool-augmented baselines (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), <strong>TUMIX<\/strong> yields the <strong>best average accuracy<\/strong>; a scaled variant (<strong>TUMIX+<\/strong>) pushes further with more compute:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>HLE (Humanity\u2019s Last Exam):<\/strong> Pro: <strong>21.6% \u2192 34.1%<\/strong> (TUMIX+); Flash: <strong>9.7% \u2192 23.1%<\/strong>.<br \/>(HLE is a 2,500-question, difficult, multi-domain benchmark finalized in 2025.)<\/li>\n<li><strong>GPQA-Diamond:<\/strong> Pro: up to <strong>88.3%<\/strong>; Flash: up to 
<strong>82.1%<\/strong>. (GPQA-Diamond is the hardest 198-question subset authored by domain experts.)<\/li>\n<li><strong>AIME 2024\/25:<\/strong> Pro: <strong>96.7%<\/strong>; Flash: <strong>86.7%<\/strong> with TUMIX(+) at test time.<\/li>\n<\/ul>\n<p>Across tasks, <strong>TUMIX averages +3.55% over the best prior tool-augmented test-time scaling baseline at similar cost<\/strong>, and <strong>+7.8% \/ +17.4%<\/strong> over no-scaling for Pro\/Flash, respectively.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"1482\" height=\"1320\" data-attachment-id=\"75068\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/10\/04\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/screenshot-2025-10-04-at-3-33-16-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.33.16-PM-1.png\" data-orig-size=\"1482,1320\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-10-04 at 3.33.16\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.33.16-PM-1-300x267.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.33.16-PM-1-1024x912.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.33.16-PM-1.png\" alt=\"\" class=\"wp-image-75068\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2510.01279<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Our Comments<\/strong><\/h3>\n<p>TUMIX is a great approach from Google because 
it frames test-time scaling as a search problem over heterogeneous tool policies rather than brute-force sampling. The parallel committee (text, code, search) improves candidate coverage, while the LLM judge enables early stopping that preserves diversity and reduces token\/tool spend, which is useful under latency budgets. The HLE gains (34.1% with Gemini-2.5 Pro) align with the benchmark\u2019s finalized 2,500-question design, and the \u201csweet spot\u201d of ~12\u201315 agent styles indicates that selection, not generation, is the limiting factor.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/arxiv.org\/abs\/2510.01279\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong>.<\/p>\n
<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/10\/04\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/\">Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12\u201315 tool-using agents that share notes and stop early? Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture)\u2014a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants) and lets them share intermediate answers over a few refinement rounds, then stop early via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as HLE, GPQA-Diamond, and AIME (2024\/2025). https:\/\/arxiv.org\/pdf\/2510.01279 So, what exactly is new? Mixture over modality, not just more samples: TUMIX runs ~15 agent styles spanning Chain-of-Thought (CoT), code execution, web search, dual-tool agents, and guided variants. Each round, every agent sees (a) the original question and (b) other agents\u2019 previous answers, then proposes a refined answer. This message-passing raises average accuracy early while diversity gradually collapses\u2014so stopping matters. Adaptive early-termination: An LLM-as-Judge halts refinement once answers exhibit strong consensus (with a minimum round threshold). This preserves accuracy at ~49% of the inference cost vs. 
fixed-round refinement; token cost drops to ~46% because late rounds are token-heavier. Auto-designed agents: Beyond human-crafted agents, TUMIX prompts the base LLM to generate new agent types; mixing these with the manual set yields an additional ~+1.2% average lift without extra cost. The empirical \u201csweet spot\u201d is ~12\u201315 agent styles. https:\/\/arxiv.org\/pdf\/2510.01279 How does it work? TUMIX runs a group of heterogeneous agents (text-only Chain-of-Thought, code-executing, web-searching, and guided variants) in parallel, then iterates a small number of refinement rounds in which each agent conditions on the original question plus the other agents\u2019 prior rationales and answers (structured note-sharing). After each round, an LLM-based judge evaluates consensus\/consistency to decide early termination: if confidence is insufficient, another round is triggered; otherwise, the system finalizes via simple aggregation (e.g., majority vote or selector). This mixture-of-tool-use design trades brute-force re-sampling for diverse reasoning paths, improving coverage of correct candidates while controlling token\/tool budgets. Empirically, benefits saturate around 12\u201315 agent styles, and stopping early preserves diversity and lowers cost without sacrificing accuracy. Let\u2019s discuss the results. Under inference budgets comparable to those of strong tool-augmented baselines (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), TUMIX yields the best average accuracy; a scaled variant (TUMIX+) pushes further with more compute: HLE (Humanity\u2019s Last Exam): Pro: 21.6% \u2192 34.1% (TUMIX+); Flash: 9.7% \u2192 23.1%. (HLE is a 2,500-question, difficult, multi-domain benchmark finalized in 2025.) GPQA-Diamond: Pro: up to 88.3%; Flash: up to 82.1%. (GPQA-Diamond is the hardest 198-question subset authored by domain experts.) AIME 2024\/25: Pro: 96.7%; Flash: 86.7% with TUMIX(+) at test time. 
Across tasks, TUMIX averages +3.55% over the best prior tool-augmented test-time scaling baseline at similar cost, and +7.8% \/ +17.4% over no-scaling for Pro\/Flash, respectively. https:\/\/arxiv.org\/pdf\/2510.01279 Our Comments TUMIX is a great approach from Google because it frames test-time scaling as a search problem over heterogeneous tool policies rather than brute-force sampling. The parallel committee (text, code, search) improves candidate coverage, while the LLM judge enables early stopping that preserves diversity and reduces token\/tool spend, which is useful under latency budgets. The HLE gains (34.1% with Gemini-2.5 Pro) align with the benchmark\u2019s finalized 2,500-question design, and the \u201csweet spot\u201d of ~12\u201315 agent styles indicates that selection, not generation, is the limiting factor. Check out the\u00a0Paper. 
The post Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":42323,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-42322","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, 
max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/zh\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/zh\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-10-05T06:52:34+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 \u5206\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use 
Mixture\",\"datePublished\":\"2025-10-05T06:52:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/\"},\"wordCount\":606,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/\",\"url\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/\",\"name\":\"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png\",\"datePublished\":\"2025-10-05T06:52:34+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png\",\"width\":1328,\"height\":886},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use 
Mixture\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/zh\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/zh\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/","og_locale":"zh_CN","og_type":"article","og_title":"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/zh\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-10-05T06:52:34+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"admin NU","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"3 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use 
Mixture","datePublished":"2025-10-05T06:52:34+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/"},"wordCount":606,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"zh-Hans","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/","url":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/","name":"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png","datePublished":"2025-10-05T06:52:34+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#breadcrumb"},"inLanguage":"zh-Hans","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/"]}]},{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png","width":1328,"height":886},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/google-proposes-tumix-multi-agent-test-time-scaling-with-tool-use-mixture\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use 
Mixture"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"zh-Hans"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/zh\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png",1328,886,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png",1328,886,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png",1328,886,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8-300x200.png",300,200,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8-1024x683.png",1024,683,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png",1328,886,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8.png",1328,886,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8-18x12.png",18,12,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8-600x400.png",600,400,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-04-at-3.30.14-PM-1-i2omy8-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/zh\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/zh\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> 
<a href=\"https:\/\/youzum.net\/zh\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12\u201315 tool-using agents that share notes and stop early? Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture)\u2014a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants)&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/42322","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/comments?post=42322"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/42322\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media\/42323"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media?parent=42322"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/categories?post=42322"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/tags?post=42322"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}