Verdict: A Library for Scaling Judge-Time Compute

arXiv:2502.18018v2 Announce Type: replace
Abstract: The use of LLMs as automated judges (“LLM-as-a-judge”) is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict leverages the composition of modular reasoning units (such as verification, debate, and aggregation) and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieve performance competitive with orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Our framework establishes a foundation for scalable, interpretable, and reliable LLM-based evaluation systems for both researchers and practitioners.
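
To make the idea of composing reasoning units concrete, here is a minimal Python sketch of a judge-then-verify step repeated across several judges and combined by majority vote. It does not use Verdict's API: every name in it (`Judge`, `Verifier`, `judge_then_verify`, `aggregate`, `ensemble_verdict`, and the toy stub judges) is hypothetical, and the stubs are plain functions standing in for LLM calls. Debate units are not shown.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical signatures: stand-ins for LLM calls, not Verdict's API.
Judge = Callable[[str], str]            # text -> label, e.g. "safe" / "unsafe"
Verifier = Callable[[str, str], bool]   # (text, proposed label) -> accept?


def judge_then_verify(text: str, judge: Judge, verifier: Verifier) -> Optional[str]:
    """Run one judge, then keep its label only if the verifier accepts it."""
    label = judge(text)
    return label if verifier(text, label) else None


def aggregate(labels: List[Optional[str]], default: str = "abstain") -> str:
    """Majority vote over the labels that survived verification."""
    kept = [label for label in labels if label is not None]
    if not kept:
        return default
    return Counter(kept).most_common(1)[0][0]


def ensemble_verdict(text: str, judges: List[Judge], verifier: Verifier) -> str:
    """Spend more judge-time compute: one judge-then-verify unit per judge."""
    return aggregate([judge_then_verify(text, j, verifier) for j in judges])


# Toy stubs (assumptions for illustration only).
def keyword_judge(text: str) -> str:
    return "unsafe" if "attack" in text.lower() else "safe"


def lenient_judge(text: str) -> str:
    return "safe"


def strict_judge(text: str) -> str:
    return "unsafe"


def evidence_verifier(text: str, label: str) -> bool:
    # Reject an "unsafe" label when the text contains no supporting keyword.
    return label != "unsafe" or "attack" in text.lower()


JUDGES = [keyword_judge, lenient_judge, strict_judge]
```

In this toy setup, `ensemble_verdict("What is the weather today?", JUDGES, evidence_verifier)` discards the strict judge's unsupported "unsafe" label before voting, which illustrates how a verification unit can filter unreliable judgments ahead of aggregation.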
