{"id":18759,"date":"2025-06-13T04:34:05","date_gmt":"2025-06-13T04:34:05","guid":{"rendered":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/"},"modified":"2025-06-13T04:34:05","modified_gmt":"2025-06-13T04:34:05","slug":"apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation","status":"publish","type":"post","link":"https:\/\/youzum.net\/th\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/","title":{"rendered":"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation"},"content":{"rendered":"<p>Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These newer systems, known as Large Reasoning Models (LRMs), represent a class of tools designed to simulate human-like thinking by producing intermediate reasoning steps before arriving at conclusions. The focus has moved from generating accurate outputs to understanding the process that leads to these answers. This shift has raised questions about how these models manage tasks with layered complexity and whether they truly possess reasoning abilities or are simply leveraging training patterns to guess outcomes.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Redefining Evaluation: Moving Beyond Final Answer Accuracy<\/strong><\/h3>\n<p>A recurring problem with evaluating machine reasoning is that traditional benchmarks mostly assess the final answer without examining the steps involved in arriving at it. Final answer accuracy alone does not reveal the quality of internal reasoning, and many benchmarks are contaminated with data that may have been seen during training. This creates a misleading picture of a model\u2019s true capabilities. 
To explore actual reasoning, researchers require environments where problem difficulty can be precisely controlled and intermediate steps can be analyzed. Without such settings, it is hard to determine whether these models can generalize solutions or merely memorize patterns.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ?key=HA2qlWVYRroaXNkFivdvAA\" alt=\"\"\/><\/figure>\n<\/div>\n<p>To evaluate reasoning more reliably, the research team at Apple designed a setup using four puzzle environments: Tower of Hanoi, River Crossing, Checkers Jumping, and Blocks World. These puzzles allow precise manipulation of complexity by changing elements such as the number of disks, checkers, or agents involved. Each task requires different reasoning abilities, such as constraint satisfaction and sequential planning. Importantly, these environments are largely free from the data contamination that affects standard benchmarks, enabling thorough checks of both outcomes and the reasoning steps in between. This setup allows a step-by-step investigation of how models behave across varied task demands.<\/p>\n<p>The research introduced a comparative study using two model pairs from the Claude 3.7 Sonnet and DeepSeek families, matching each \u201cthinking\u201d model (such as DeepSeek-R1) with its standard LLM counterpart. These models were tested across the puzzles under identical token budgets to measure both accuracy and reasoning efficiency. This helped reveal performance shifts across low-, medium-, and high-complexity tasks. One of the most revealing observations was the formation of three performance zones. In simple tasks, non-thinking models outperformed reasoning variants. 
For medium complexity, reasoning models gained an edge, while both types collapsed completely as complexity peaked.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXckYIQsjfJOIKFZrsCx4nj91J59IlzhOBUczdP717QT31gh3PlOHeNo3EhDoZGcvLR8eu9Z7w87riXza0KseMXqCuKoNry4xa3OWtiAbj-CD7EdnXmQYWnLiAq9KUQLpuLoJs1Gag?key=HA2qlWVYRroaXNkFivdvAA\" alt=\"\"\/><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Comparative Insights: Thinking vs. Non-Thinking Models Under Stress<\/strong><\/h3>\n<p>An in-depth analysis revealed that reasoning effort increased with task difficulty up to a certain point but then declined even though token budget remained available. For instance, in the Tower of Hanoi, Claude 3.7 Sonnet (thinking) maintained high accuracy until complexity reached a certain threshold, after which performance dropped to zero. Even when these models were supplied with explicit solution algorithms, they failed to execute steps beyond specific complexity levels. In one case, Claude 3.7 Sonnet could execute around 100 moves correctly in the Tower of Hanoi but was unable to complete a River Crossing instance with N = 3, whose solution requires only 11 moves. This inconsistency exposed serious limitations in symbolic manipulation and exact computation.<\/p>\n<p>The performance breakdown also highlighted how LRMs handle their internal thought process. On simpler problems, models frequently engaged in \u201coverthinking,\u201d generating correct intermediate solutions early in the process but continuing to explore incorrect paths. This led to inefficient use of tokens. At medium complexity levels, models began to find correct answers later in their reasoning chains. However, at high levels of complexity, they failed to produce accurate solutions. 
Quantitative analysis confirmed that solution accuracy dropped to near zero as the problem complexity increased, and the number of reasoning tokens allocated began to decline unexpectedly.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcVrGcfiiX15r5FPQe3BxrViqQNvJKYoVPaMmo6cfebpYC1Uliyx-p-3iFTKnbauSaZuBZ4958Xvou8gItCTttjC8cPiWPU7cPbXTluBSXkcvp1prrizG6CRrkvsHsVWRDaXtxwMg?key=HA2qlWVYRroaXNkFivdvAA\" alt=\"\"\/><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Scaling Limits and the Collapse of Reasoning<\/strong><\/h3>\n<p>This research presents a sobering assessment of how current Large Reasoning Models (LRMs) operate. Research from Apple makes it clear that, despite some progress, today\u2019s reasoning models are still far from achieving generalized reasoning. The work identifies how performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. Controlled puzzle environments have proven to be a powerful tool for uncovering hidden weaknesses in these systems and emphasizing the need for more robust designs in the future.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<p>Check out the <strong><a href=\"https:\/\/ml-site.cdn-apple.com\/papers\/the-illusion-of-thinking.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><em>.<\/em><\/strong>\u00a0All credit for this research goes to the researchers of this project. 
<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/06\/12\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/\">Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These newer systems, known as Large Reasoning Models (LRMs), represent a class of tools designed to simulate human-like thinking by producing intermediate reasoning steps before arriving at conclusions. The focus has moved from generating accurate outputs to understanding the process that leads to these answers. This shift has raised questions about how these models manage tasks with layered complexity and whether they truly possess reasoning abilities or are simply leveraging training patterns to guess outcomes. Redefining Evaluation: Moving Beyond Final Answer Accuracy A recurring problem with evaluating machine reasoning is that traditional benchmarks mostly assess the final answer without examining the steps involved in arriving at it. 
Final answer accuracy alone does not reveal the quality of internal reasoning, and many benchmarks are contaminated with data that may have been seen during training. This creates a misleading picture of a model\u2019s true capabilities. To explore actual reasoning, researchers require environments where problem difficulty can be precisely controlled and intermediate steps can be analyzed. Without such settings, it is hard to determine whether these models can generalize solutions or merely memorize patterns. To evaluate reasoning more reliably, the research team at Apple designed a setup using four puzzle environments: Tower of Hanoi, River Crossing, Checkers Jumping, and Blocks World. These puzzles allow precise manipulation of complexity by changing elements such as the number of disks, checkers, or agents involved. Each task requires different reasoning abilities, such as constraint satisfaction and sequential planning. Importantly, these environments are largely free from the data contamination that affects standard benchmarks, enabling thorough checks of both outcomes and the reasoning steps in between. This setup allows a step-by-step investigation of how models behave across varied task demands. The research introduced a comparative study using two model pairs from the Claude 3.7 Sonnet and DeepSeek families, matching each \u201cthinking\u201d model (such as DeepSeek-R1) with its standard LLM counterpart. These models were tested across the puzzles under identical token budgets to measure both accuracy and reasoning efficiency. This helped reveal performance shifts across low-, medium-, and high-complexity tasks. One of the most revealing observations was the formation of three performance zones. In simple tasks, non-thinking models outperformed reasoning variants. For medium complexity, reasoning models gained an edge, while both types collapsed completely as complexity peaked. Comparative Insights: Thinking vs. 
Non-Thinking Models Under Stress An in-depth analysis revealed that reasoning effort increased with task difficulty up to a certain point but then declined even though token budget remained available. For instance, in the Tower of Hanoi, Claude 3.7 Sonnet (thinking) maintained high accuracy until complexity reached a certain threshold, after which performance dropped to zero. Even when these models were supplied with explicit solution algorithms, they failed to execute steps beyond specific complexity levels. In one case, Claude 3.7 Sonnet could execute around 100 moves correctly in the Tower of Hanoi but was unable to complete a River Crossing instance with N = 3, whose solution requires only 11 moves. This inconsistency exposed serious limitations in symbolic manipulation and exact computation. The performance breakdown also highlighted how LRMs handle their internal thought process. On simpler problems, models frequently engaged in \u201coverthinking,\u201d generating correct intermediate solutions early in the process but continuing to explore incorrect paths. This led to inefficient use of tokens. At medium complexity levels, models began to find correct answers later in their reasoning chains. However, at high levels of complexity, they failed to produce accurate solutions. Quantitative analysis confirmed that solution accuracy dropped to near zero as the problem complexity increased, and the number of reasoning tokens allocated began to decline unexpectedly. Scaling Limits and the Collapse of Reasoning This research presents a sobering assessment of how current Large Reasoning Models (LRMs) operate. Research from Apple makes it clear that, despite some progress, today\u2019s reasoning models are still far from achieving generalized reasoning. The work identifies how performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. 
Controlled puzzle environments have proven to be a powerful tool for uncovering hidden weaknesses in these systems and emphasizing the need for more robust designs in the future. Check out the Paper.\u00a0All credit for this research goes to the researchers of this project. The post Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":18760,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-18759","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Apple Researchers Reveal Structural Failures 
in Large Reasoning Models Using Puzzle-Based Evaluation - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/th\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/\" \/>\n<meta property=\"og:locale\" content=\"th_TH\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/th\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-06-13T04:34:05+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1600\" \/>\n\t<meta property=\"og:image:height\" content=\"1224\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta 
name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 \u0e19\u0e32\u0e17\u0e35\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation\",\"datePublished\":\"2025-06-13T04:34:05+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/\"},\"wordCount\":754,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-lar
ge-reasoning-models-using-puzzle-based-evaluation\/\",\"url\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/\",\"name\":\"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png\",\"datePublished\":\"2025-06-13T04:34:05+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#breadcrumb\"},\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4
nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png\",\"width\":1600,\"height\":1224},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"th\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin 
NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/th\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/th\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/","og_locale":"th_TH","og_type":"article","og_title":"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/th\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-06-13T04:34:05+00:00","og_image":[{"width":1600,"height":1224,"url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png","type":"image\/png"}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Written by":"admin NU","Est. 
reading time":"4 \u0e19\u0e32\u0e17\u0e35"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation","datePublished":"2025-06-13T04:34:05+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/"},"wordCount":754,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"th","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/","url":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/","name":"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png","datePublished":"2025-06-13T04:34:05+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#breadcrumb"},"inLanguage":"th","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/"]}]},{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png","width":1600,"height":1224},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/apple-researchers-reveal-structural-failures-in-large-reasoning-models-using-puzzle-based-evaluation\/#breadcrumb","itemListElement":[{"@type":
"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"th"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/th\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png",1600,1224,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png",1600,1224,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png",1600,1224,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg-300x230.png",300,230,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg-1024x783.png",1024,783,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg-1536x1175.png",1536,1175,true],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg.png",1600,1224,false],
"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg-16x12.png",16,12,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg-600x459.png",600,459,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdbsTdMG5wibZA9fwJIVGJrSxvI5P1iQXS4t3W0qYOXEEGq71B-eRDsHSICUU09YaAbZ-jvzwnCI075oO6bmWX7poK3eLnyzuJ1Y44rst6vQ1cACOyddXS7VfKfSePSdf4Fz_OZ-evWRyg-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/th\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/th\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These newer systems, known as Large Reasoning Models (LRMs), represent a class of tools designed to simulate human-like thinking by producing intermediate reasoning steps before arriving at conclusions. 
The focus has moved from generating accurate outputs&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/18759","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/comments?post=18759"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/18759\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media\/18760"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media?parent=18759"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/categories?post=18759"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/tags?post=18759"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}