{"id":28287,"date":"2025-07-30T05:48:34","date_gmt":"2025-07-30T05:48:34","guid":{"rendered":"https:\/\/youzum.net\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/"},"modified":"2025-07-30T05:48:34","modified_gmt":"2025-07-30T05:48:34","slug":"rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals","status":"publish","type":"post","link":"https:\/\/youzum.net\/es\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/","title":{"rendered":"Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals"},"content":{"rendered":"<p>Reinforcement Learning with Verifiable Rewards (RLVR) allows LLMs to perform complex reasoning on tasks with clear, verifiable outcomes, with strong performance in mathematics and coding. However, many real-world scenarios lack such explicitly verifiable answers, posing a challenge for training models without direct reward signals. Current methods address this gap through RLHF via preference ranking, where human judgments are collected over pairs or lists of model outputs. Preference-based reward models can boost performance in the early stages, but they tend to overfit to superficial artifacts such as response length, formatting quirks, and annotator biases, and they require large volumes of pairwise comparisons, making them brittle and costly.<\/p>\n<p>RLVR methods now extend beyond mathematics and coding, with GENERAL-REASONER demonstrating strong performance in physics, finance, and policy, achieving a ten-point gain on MMLU-Pro through GRPO fine-tuning. 
Rubric-based evaluation has become a standard for advanced LLMs, with frameworks like HEALTHBENCH pairing clinician-written criteria with automated judges to evaluate factuality, safety, and empathy. However, these rubrics appear only during evaluation, not during training. Separately, process supervision methods attempt to provide more granular feedback by rewarding intermediate reasoning steps through MCTS-generated labels and generative reward models such as THINKPRM.<\/p>\n<p>Researchers from Scale AI have proposed Rubrics as Rewards (RaR), an on-policy reinforcement learning framework that uses checklist-style rubrics to guide training on multi-criteria tasks. The method generates prompt-specific rubrics based on carefully designed principles, where each rubric outlines clear standards for high-quality responses and provides human-interpretable supervision signals. The method is applied to the medicine and science domains, resulting in two specialized training datasets, RaR-Medicine-20k and RaR-Science-20k. RaR enables smaller judge models to achieve superior alignment with human preferences by transforming rubrics into structured reward signals, while maintaining robust performance across different model scales.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcN4H04ekN7uDXtUaqOyjFa462L4T1MsLe8mkjgLn3UrgBI8FgdMA6oI7kM_qbg2ybDV-jE3TW5NZVRTbJoesCpFR5aqTjBNEGzxfqctsEJ31bR-rV3yV1XnUeMVXyHY-ceE6Pw2w?key=pdsTglZYr8zeLxPL6hZ0XA\" alt=\"\" \/><\/figure>\n<\/div>\n<p>Researchers used LLMs as expert proxies to generate these rubrics, ensuring adherence to the following desiderata: grounded in expert guidance, comprehensive coverage, semantic weighting, and self-contained evaluation. For each domain, specialized prompts instruct the LLM to generate 7-20 rubric items based on the complexity of the input question. 
Each item is assigned a categorical weight, such as Essential Criteria or Important Criteria, reflecting its significance for a correct answer. Training uses the GRPO algorithm with Qwen2.5-7B as the base policy model, and the training pipeline operates through three core components: Response Generation, Reward Computation, and Policy Update.<\/p>\n<p>The RaR-Implicit method outperforms baseline methods such as Simple-Likert, with the best variant achieving up to a 28% relative improvement on HealthBench-1k and 13% on GPQA. It also outperforms both base and instruction-tuned policy models, showing the effectiveness of rubric-guided training for nuanced response evaluation while matching or exceeding Reference-Likert baseline performance. Beyond raw metrics, rubric-guided evaluations provide clearer and more accurate signals across model scales, achieving higher accuracy when preferred responses receive appropriate ratings. Expert guidance also proves essential for synthetic rubric generation: rubrics developed using reference answers achieve higher accuracy than those written without human insights.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXep7sUR8E7O37M6lFq6Rmf1iSAAuMd-nQCcBaltPEalIEOLzaFLWhBpX0fEn2QJRPbaNx9hMAVwwoofCld3zf93NKG8774ttV51uzSxUQFI5qnJi9y2j7IZaQruufvdooDQMSlzGg?key=pdsTglZYr8zeLxPL6hZ0XA\" alt=\"\" \/><\/figure>\n<\/div>\n<p>In summary, the researchers introduced RaR, which advances post-training of language models by using structured, checklist-style rubrics as reward signals. It offers stable training signals while maintaining human interpretability and alignment. However, this research remains limited to the medical and science domains and requires validation on tasks such as open-ended dialogue. 
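The pipeline sketched above (weighted rubric criteria, reward computation, GRPO policy update) can be illustrated in a few lines. The snippet below is a minimal sketch under stated assumptions, not the paper's implementation: the category weights, criterion texts, and boolean judge verdicts are hypothetical, and the aggregation shown is a simple weighted checklist score (closest in spirit to an explicit aggregation), paired with the group-normalized advantages commonly used with GRPO.

```python
import statistics

# Hypothetical category weights; the paper's actual weighting scheme is not specified here.
CATEGORY_WEIGHTS = {"Essential": 1.0, "Important": 0.5}

def rubric_reward(criteria, verdicts):
    """Aggregate per-criterion judge verdicts into a scalar reward in [0, 1].

    criteria: list of (description, category) pairs.
    verdicts: list of booleans, one per criterion, from a judge model.
    """
    total = sum(CATEGORY_WEIGHTS[cat] for _, cat in criteria)
    met = sum(CATEGORY_WEIGHTS[cat]
              for (_, cat), ok in zip(criteria, verdicts) if ok)
    return met / total if total else 0.0

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and standard deviation of its group (a common GRPO form)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard zero-variance groups
    return [(r - mu) / sigma for r in rewards]
```

In training, each sampled response in a group would be scored against the prompt's rubric by a judge model, and the resulting normalized advantages would drive the policy-gradient update.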
The researchers explored only two reward aggregation strategies, implicit and explicit, leaving alternative weighting schemes unexplored. They also did not conduct a controlled analysis of reward hacking risks, and the reliance on off-the-shelf LLMs as judges suggests future work could benefit from dedicated evaluators with enhanced reasoning capabilities.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/arxiv.org\/abs\/2507.17746\" target=\"_blank\" rel=\"noreferrer noopener\">Paper here<\/a><em>.<\/em><\/strong>\u00a0All credit for this research goes to the researchers of this project. Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">100k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/07\/29\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/\">Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Reinforcement Learning with Verifiable Rewards (RLVR) allows LLMs to perform complex reasoning on tasks with clear, verifiable outcomes, with strong performance in mathematics and coding. 
[&hellip;]<\/p>","protected":false},"author":2,"featured_media":28288,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-28287","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head_json":{"title":"Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/es\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/","og_locale":"es_ES","og_type":"article","og_title":"Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals - 
YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/es\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-07-30T05:48:34+00:00","og_image":[{"url":"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcN4H04ekN7uDXtUaqOyjFa462L4T1MsLe8mkjgLn3UrgBI8FgdMA6oI7kM_qbg2ybDV-jE3TW5NZVRTbJoesCpFR5aqTjBNEGzxfqctsEJ31bR-rV3yV1XnUeMVXyHY-ceE6Pw2w?key=pdsTglZYr8zeLxPL6hZ0XA","type":"","width":"","height":""}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Escrito por":"admin NU","Tiempo de lectura":"3 minutos"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation 
Signals","datePublished":"2025-07-30T05:48:34+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/"},"wordCount":658,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcN4H04ekN7uDXtUaqOyjFa462L4T1MsLe8mkjgLn3UrgBI8FgdMA6oI7kM_qbg2ybDV-jE3TW5NZVRTbJoesCpFR5aqTjBNEGzxfqctsEJ31bR-rV3yV1XnUeMVXyHY-ceE6Pw2w-3oHnXw.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"es","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/","url":"https:\/\/youzum.net\/rubrics-as-rewards-rar-a-reinforcement-learning-framework-for-training-language-models-with-structured-multi-criteria-evaluation-signals\/","name":"Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals - 