{"id":45449,"date":"2025-10-19T07:10:03","date_gmt":"2025-10-19T07:10:03","guid":{"rendered":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/"},"modified":"2025-10-19T07:10:03","modified_gmt":"2025-10-19T07:10:03","slug":"autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters","status":"publish","type":"post","link":"https:\/\/youzum.net\/fr\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/","title":{"rendered":"AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters"},"content":{"rendered":"<p><strong>Are your LLM code benchmarks actually rejecting wrong-complexity solutions and interactive-protocol violations, or are they passing under-specified unit tests?<\/strong> A team of researchers from UCSD, NYU, University of Washington, Princeton University, Canyon Crest Academy, OpenAI, UC Berkeley, MIT, University of Waterloo, and Sentient Labs introduce <strong>AutoCode<\/strong>, a new AI framework that lets LLMs create and verify competitive programming problems, mirroring the workflow of human problem setters. AutoCode reframes evaluation for code-reasoning models by treating <strong>problem setting<\/strong> (not only problem solving) as the target task. The system trains LLMs to produce competition-grade <strong>statements<\/strong>, <strong>test data<\/strong>, and <strong>verdict logic<\/strong> that match official online judges at high rates. On a 7,538-problem benchmark built from prior datasets, AutoCode achieves <strong>91.1% consistency<\/strong> with official judgments (<strong>FPR 3.7%, FNR 14.1%<\/strong>). 
On a separate, more difficult <strong>720 recent Codeforces problems<\/strong> (including interactive tasks), the full framework reports <strong>98.7% consistency<\/strong>, <strong>1.3% FPR<\/strong>, <strong>1.2% FNR<\/strong>. <\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1390\" height=\"922\" data-attachment-id=\"75476\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/10\/18\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/screenshot-2025-10-18-at-1-23-27-am\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM.png\" data-orig-size=\"1390,922\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-10-18 at 1.23.27\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-300x199.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-1024x679.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM.png\" alt=\"\" class=\"wp-image-75476\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2510.12803<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Why problem setting matters for evaluation?<\/strong><\/h3>\n<p>Public code benchmarks often rely on under-specified tests that let wrong-complexity or shortcut solutions pass. 
That inflates scores and pollutes reinforcement signals (rewarding fragile tactics). AutoCode\u2019s <strong>validator-first<\/strong> approach and adversarial test generation aim to reduce <strong>false positives (FPR)<\/strong>\u2014incorrect programs that pass\u2014and <strong>false negatives (FNR)<\/strong>\u2014correct programs rejected due to malformed inputs. <\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"1360\" height=\"544\" data-attachment-id=\"75478\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/10\/18\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/screenshot-2025-10-18-at-1-28-46-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.28.46-AM-1.png\" data-orig-size=\"1360,544\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-10-18 at 1.28.46\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.28.46-AM-1-300x120.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.28.46-AM-1-1024x410.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.28.46-AM-1.png\" alt=\"\" class=\"wp-image-75478\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2510.12803<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The core loop: Validator \u2192 Generator \u2192 Checker<\/strong><\/h3>\n<p>AutoCode runs a closed loop that 
mirrors human contest workflows, with each step selected from <strong>LLM-generated candidates</strong> using targeted in-framework tests.</p>
<p><strong>1) Validator (minimize FNR by enforcing input legality)</strong></p>
<p>The system first asks an LLM to synthesize <strong>40 evaluation inputs</strong>: <strong>10 valid</strong> and <strong>30 near-valid illegal</strong> (e.g., off-by-one boundary violations). It then prompts the LLM for <strong>three candidate validator programs</strong> and selects the one that best classifies these cases. This prevents &ldquo;correct&rdquo; solutions from crashing on malformed data.</p>
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://www.marktechpost.com/wp-content/uploads/2025/10/Screenshot-2025-10-18-at-1.32.41-AM.png" alt="" /><figcaption class="wp-element-caption">https://arxiv.org/pdf/2510.12803</figcaption></figure></div>
<p><strong>2) Generator (reduce FPR through adversarial coverage)</strong></p>
<p><strong>Three complementary strategies produce test cases:</strong></p>
<ul class="wp-block-list">
<li><strong>Small-data exhaustion</strong> for boundary coverage,</li>
<li><strong>Randomized and extreme</strong> cases (overflows, precision, hash collisions),</li>
<li><strong>TLE-inducing</strong> structures to break wrong-complexity solutions.</li>
</ul>
<p>Invalid cases are filtered by the selected validator; the remaining cases are deduplicated and bucket-balanced before sampling.</p>
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://www.marktechpost.com/wp-content/uploads/2025/10/Screenshot-2025-10-18-at-1.33.14-AM-1.png" alt="" /><figcaption class="wp-element-caption">https://arxiv.org/pdf/2510.12803</figcaption></figure></div>
<p><strong>3) Checker (verdict logic)</strong></p>
<p>The checker compares contestant outputs with the <strong>reference solution</strong> under complex rules. AutoCode again generates <strong>40 checker scenarios</strong> and <strong>three candidate checker programs</strong>, keeps only scenarios with validator-approved inputs, and selects the checker with the highest accuracy on the 40 labeled scenarios.</p>
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://www.marktechpost.com/wp-content/uploads/2025/10/Screenshot-2025-10-18-at-1.33.43-AM-1.png" alt="" /><figcaption class="wp-element-caption">https://arxiv.org/pdf/2510.12803</figcaption></figure></div>
<p><strong>4) Interactor (for interactive problems)</strong></p>
<p>For tasks that require dialogue with the judge, AutoCode introduces a <strong>mutant-based interactor</strong>: it makes small logical edits (&ldquo;mutants&rdquo;) to the reference solution and selects interactors that <strong>accept the true solution</strong> but <strong>reject the mutants</strong>, maximizing discrimination. This addresses a gap in earlier public datasets, which avoided interactive problems.</p>
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://www.marktechpost.com/wp-content/uploads/2025/10/Screenshot-2025-10-18-at-1.34.10-AM-1.png" alt="" /><figcaption class="wp-element-caption">https://arxiv.org/pdf/2510.12803</figcaption></figure></div>
<h3 class="wp-block-heading"><strong>Dual verification enables new problems (not just tests for existing ones)</strong></h3>
<p>AutoCode can generate <strong>novel problem variants</strong> starting from a random &ldquo;seed&rdquo; Codeforces problem (&lt;2200 Elo). The LLM drafts a new statement and <strong>two solutions</strong>: an efficient <strong>reference</strong> and a simpler <strong>brute-force</strong> baseline. A problem is accepted only if the reference output <strong>matches</strong> brute force across the generated test suite (the brute force may TLE on large cases but serves as ground truth on small, exhaustive cases). This <strong>dual-verification</strong> protocol filters out ~27% of error-prone items, lifting reference-solution correctness from <strong>86% to 94%</strong> before human review.</p>
<p>Human experts then grade the survivors on <strong>solvability, solution correctness, quality, novelty, and difficulty</strong>. After filtering, <strong>61.6%</strong> are usable for <strong>model training</strong>, <strong>76.3%</strong> for <strong>human training</strong>, and <strong>3.2%</strong> are <strong>ICPC/IOI-level</strong> problems.</p>
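<p>Three mechanics recur in the pipeline above: best-of-three candidate selection against labeled cases (used for validators and checkers), mutant-based interactor selection, and the dual-verification acceptance test. A minimal sketch of these ideas, with candidate programs and solutions modeled as plain Python callables (our illustration with hypothetical names, not the authors' code):</p>
<pre class="wp-block-code"><code>
```python
def select_best(candidates, labeled_cases):
    """Keep the candidate program that classifies the most labeled cases correctly.
    AutoCode applies this pattern to pick among three LLM-generated validators
    (40 cases: 10 valid + 30 near-valid illegal) and among three checkers
    (40 labeled scenarios). Candidates are modeled as callables input -> verdict."""
    return max(candidates,
               key=lambda prog: sum(prog(x) == y for x, y in labeled_cases))

def select_interactor(interactors, true_solution, mutants):
    """Mutant-based interactor selection: among interactors that accept the true
    solution, prefer the one rejecting the most mutated variants."""
    viable = [it for it in interactors if it(true_solution)]
    return max(viable, key=lambda it: sum(not it(m) for m in mutants))

def dual_verify(reference, brute_force, small_tests):
    """Dual verification: accept a generated problem only if the efficient
    reference solution agrees with the brute-force baseline on every
    small/exhaustive test case, where brute force serves as ground truth."""
    return all(reference(t) == brute_force(t) for t in small_tests)
```
</code></pre>
<p>For example, a validator candidate maps a raw input string to a True/False legality verdict, and <code>select_best</code> keeps the one that matches the labeled verdicts most often; <code>dual_verify</code> is the acceptance gate that lifts reference-solution correctness before human review.</p>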
<p>Difficulty typically increases relative to the seed, and the difficulty gain correlates with perceived quality.</p>
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://www.marktechpost.com/wp-content/uploads/2025/10/Screenshot-2025-10-18-at-1.36.17-AM-1.png" alt="" /><figcaption class="wp-element-caption">https://arxiv.org/pdf/2510.12803</figcaption></figure></div>
<h3 class="wp-block-heading"><strong>Understanding the results</strong></h3>
<p><strong>Existing problems (7,538 total; 195,988 human submissions).</strong> AutoCode reaches <strong>91.1% consistency</strong>, <strong>3.7% FPR</strong>, and <strong>14.1% FNR</strong>, versus 72.9&ndash;81.0% consistency for prior generators (CodeContests, CodeContests+, TACO, HardTests).</p>
<p><strong>Recent Codeforces problems (720, unfiltered; includes interactives).</strong> AutoCode reports <strong>98.7% consistency</strong>, <strong>1.3% FPR</strong>, and <strong>1.2% FNR</strong>. Ablations show that all three generator strategies and <strong>prompt optimization</strong> contribute: removing prompt optimization drops consistency to <strong>98.0%</strong> and more than doubles the <strong>FNR</strong>, to <strong>2.9%</strong>.</p>
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://www.marktechpost.com/wp-content/uploads/2025/10/Screenshot-2025-10-18-at-1.38.33-AM-1.png" alt="" /><figcaption class="wp-element-caption">https://arxiv.org/pdf/2510.12803</figcaption></figure></div>
<h3 class="wp-block-heading"><strong>Key Takeaways</strong></h3>
<ul class="wp-block-list">
<li>AutoCode couples a <strong>Validator&ndash;Generator&ndash;Checker (+Interactor)</strong> loop with <strong>dual verification (reference vs. brute force)</strong> to build contest-grade test suites and new problems.</li>
<li>On held-out problems, AutoCode&rsquo;s test suites reach <strong>~99% consistency</strong> with official judges, surpassing prior generators such as HardTests (&lt;81%).</li>
<li>For recent <strong>Codeforces</strong> tasks (including interactives), the full framework reports <strong>~98.7% consistency</strong> with <strong>~1.3% FPR</strong> and <strong>~1.2% FNR</strong>.</li>
<li>The <strong>mutant-based interactor</strong> reliably accepts the true solution while rejecting mutated variants, improving evaluation for interactive problems.</li>
<li>Human experts rate a sizable fraction of AutoCode-generated items as <strong>training-usable</strong> and a non-trivial share as <strong>contest-quality</strong>, aligning with the aims of the LiveCodeBench Pro program.</li>
</ul>
<h3 class="wp-block-heading"><strong>Editorial Comments</strong></h3>
<p>AutoCode is a practical fix for current code benchmarks. It centers <strong>problem setting</strong> and uses a closed-loop <strong>Validator&ndash;Generator&ndash;Checker (+Interactor)</strong> pipeline with <strong>dual verification</strong> (reference vs. brute force). This structure reduces <strong>false positives and false negatives</strong> and yields judge-aligned <strong>consistency</strong> (&asymp;99% on held-out problems; <strong>98.7%</strong> on recent Codeforces, including interactives). The approach standardizes constraint legality, adversarial coverage, and protocol-aware judging, which makes downstream RL reward signals cleaner. Its placement under <strong>LiveCodeBench Pro</strong> fits a hallucination-resistant evaluation program that emphasizes expert-checked rigor.</p>
<hr class="wp-block-separator has-alpha-channel-opacity" />
<p>Check out the <strong><a href="https://arxiv.org/abs/2510.12803" target="_blank" rel="noreferrer noopener">Paper</a></strong> and the <strong><a href="https://livecodebenchpro.com/projects/autocode/overview" target="_blank" rel="noreferrer noopener">Project</a></strong>.</p>
<p>The post <a href="https://www.marktechpost.com/2025/10/18/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters/">AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters</a> appeared first on <a href="https://www.marktechpost.com/">MarkTechPost</a>.</p>
setters\/\",\"url\":\"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/\",\"name\":\"AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png\",\"datePublished\":\"2025-10-19T07:10:03+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png\",\"contentUrl\":\"https:\/\/youz
um.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png\",\"width\":1390,\"height\":922},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"fr-FR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association 
Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/fr\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/fr\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/","og_locale":"fr_FR","og_type":"article","og_title":"AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters - 
YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/fr\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-10-19T07:10:03+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u00c9crit par":"admin NU","Dur\u00e9e de lecture estim\u00e9e":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem 
Setters","datePublished":"2025-10-19T07:10:03+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/"},"wordCount":919,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"fr-FR","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/","url":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/","name":"AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png","datePublished":"2025-10-19T07:10:03+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#breadcrumb"},"inLanguage":"fr-FR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/"]}]},{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png","width":1390,"height":922},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/autocode-a-new-ai-framework-that-lets-llms-create-and-verify-competitive-programming-problems-mirroring-the-workflow-of-human-problem-setters\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.n
et\/"},{"@type":"ListItem","position":2,"name":"AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"fr-FR"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/fr\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png",1390,922,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png",1390,922,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png",1390,922,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7-300x199.png",300,199,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7-1024x679.png",1024,679,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png",1390,922,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7.png",1390,922,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7-18x12.png",18,12,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7-600x398.png",600,398,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/10\/Screenshot-2025-10-18-at-1.23.27-AM-BLgPu7-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/fr\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/fr\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a 
href=\"https:\/\/youzum.net\/fr\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Are your LLM code benchmarks actually rejecting wrong-complexity solutions and interactive-protocol violations, or are they passing under-specified unit tests? A team of researchers from UCSD, NYU, University of Washington, Princeton University, Canyon Crest Academy, OpenAI, UC Berkeley, MIT, University of Waterloo, and Sentient Labs introduce AutoCode, a new AI framework that lets LLMs create and\u2026","_links":{"self":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/45449","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/comments?post=45449"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/45449\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media\/45450"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media?parent=45449"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/categories?post=45449"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/tags?post=45449"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}