{"id":69928,"date":"2026-02-09T11:36:34","date_gmt":"2026-02-09T11:36:34","guid":{"rendered":"https:\/\/youzum.net\/a-coding-implementation-to-establish-rigorous-prompt-versioning-and-regression-testing-workflows-for-large-language-models-using-mlflow\/"},"modified":"2026-02-09T11:36:34","modified_gmt":"2026-02-09T11:36:34","slug":"a-coding-implementation-to-establish-rigorous-prompt-versioning-and-regression-testing-workflows-for-large-language-models-using-mlflow","status":"publish","type":"post","link":"https:\/\/youzum.net\/it\/a-coding-implementation-to-establish-rigorous-prompt-versioning-and-regression-testing-workflows-for-large-language-models-using-mlflow\/","title":{"rendered":"A Coding Implementation to Establish Rigorous Prompt Versioning and Regression Testing Workflows for Large Language Models using MLflow"},"content":{"rendered":"<p>In this tutorial, we show how we treat prompts as first-class, versioned artifacts and apply rigorous regression testing to large language model behavior using MLflow. We design an evaluation pipeline that logs prompt versions, prompt diffs, model outputs, and multiple quality metrics in a fully reproducible manner. By combining classical text metrics with semantic similarity and automated regression flags, we demonstrate how we can systematically detect performance drift caused by seemingly small prompt changes. Along the tutorial, we focus on building a workflow that mirrors real software engineering practices, but applied to prompt engineering and LLM evaluation. Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/MLFlow%20for%20LLM%20Evaluation\/Prompt_Versioning_and_Regression_Testing_for_LLMs_with_MLflow_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a>.<\/strong><\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">!pip -q install -U \"openai&gt;=1.0.0\" mlflow rouge-score nltk sentence-transformers scikit-learn pandas\n\n\nimport os, json, time, difflib, re\nfrom typing import List, Dict, Any, Tuple\n\n\nimport mlflow\nimport pandas as pd\nimport numpy as np\n\n\nfrom openai import OpenAI\nfrom rouge_score import rouge_scorer\nimport nltk\nfrom nltk.translate.bleu_score import sentence_bleu, SmoothingFunction\nfrom sentence_transformers import SentenceTransformer\nfrom sklearn.metrics.pairwise import cosine_similarity\n\n\nnltk.download(\"punkt\", quiet=True)\nnltk.download(\"punkt_tab\", quiet=True)\n\n\nif not os.getenv(\"OPENAI_API_KEY\"):\n   try:\n       from google.colab import userdata  # type: ignore\n       k = userdata.get(\"OPENAI_API_KEY\")\n       if k:\n           os.environ[\"OPENAI_API_KEY\"] = k\n   except Exception:\n       pass\n\n\nif not os.getenv(\"OPENAI_API_KEY\"):\n   import getpass\n   os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter OPENAI_API_KEY (input hidden): \").strip()\n\n\nassert os.getenv(\"OPENAI_API_KEY\"), 
\"OPENAI_API_KEY is required.\"<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We set up the execution environment by installing all required dependencies and importing the core libraries used throughout the tutorial. We securely load the OpenAI API key at runtime, ensuring credentials are never hard-coded in the notebook. We also initialize essential NLP resources to ensure the evaluation pipeline runs reliably across different environments.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">MODEL = \"gpt-4o-mini\"\nTEMPERATURE = 0.2\nMAX_OUTPUT_TOKENS = 250\n\n\nABS_SEM_SIM_MIN = 0.78\nDELTA_SEM_SIM_MAX_DROP = 0.05\nDELTA_ROUGE_L_MAX_DROP = 0.08\nDELTA_BLEU_MAX_DROP = 0.10\n\n\nmlflow.set_tracking_uri(\"file:\/content\/mlruns\")\nmlflow.set_experiment(\"prompt_versioning_llm_regression\")\n\n\nclient = OpenAI()\nembedder = SentenceTransformer(\"all-MiniLM-L6-v2\")\n\n\nEVAL_SET = [\n   {\n       \"id\": \"q1\",\n       \"input\": \"Summarize in one sentence: MLflow tracks experiments, runs, parameters, metrics, and artifacts.\",\n       \"reference\": \"MLflow helps track machine learning experiments by logging runs with parameters, metrics, and artifacts.\"\n   },\n   {\n       \"id\": \"q2\",\n       \"input\": \"Rewrite professionally: 'this model is kinda slow but it works ok.'\",\n       \"reference\": \"The model is somewhat slow, but it performs reliably.\"\n   },\n   {\n       \"id\": \"q3\",\n       \"input\": \"Extract key fields as JSON: 'Order 5531 by Alice costs $42.50 and ships to Toronto.'\",\n       \"reference\": '{\"order_id\":\"5531\",\"customer\":\"Alice\",\"amount_usd\":42.50,\"city\":\"Toronto\"}'\n   },\n   {\n       \"id\": \"q4\",\n       \"input\": \"Answer briefly: What is prompt regression testing?\",\n       \"reference\": \"Prompt regression testing checks whether prompt changes degrade model outputs compared to a baseline.\"\n   },\n]\n\n\nPROMPTS = [\n   {\n       \"version\": \"v1_baseline\",\n       \"prompt\": (\n           \"You are a precise assistant.n\"\n           \"Follow the user request carefully.n\"\n           \"If asked for JSON, output valid JSON only.n\"\n           \"User: {user_input}\"\n       )\n   },\n   {\n       \"version\": \"v2_formatting\",\n       \"prompt\": (\n           \"You are a helpful, structured assistant.n\"\n           \"Respond clearly and concisely.n\"\n           \"Prefer clean formatting.n\"\n           \"User request: {user_input}\"\n       )\n   },\n   {\n       \"version\": \"v3_guardrailed\",\n       \"prompt\": (\n           \"You are a rigorous assistant.n\"\n           \"Rules:n\"\n           \"1) If user asks for JSON, output ONLY valid minified JSON.n\"\n           \"2) Otherwise, keep the answer short and factual.n\"\n           \"User: {user_input}\"\n       )\n   },\n]\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define all experimental configurations, including model parameters, regression thresholds, and MLflow tracking settings. 
<pre><code class="language-python">def call_llm(formatted_prompt: str) -&gt; str:
   resp = client.responses.create(
       model=MODEL,
       input=formatted_prompt,
       temperature=TEMPERATURE,
       max_output_tokens=MAX_OUTPUT_TOKENS,
   )
   out = getattr(resp, "output_text", None)
   if out:
       return out.strip()
   try:
       texts = []
       for item in resp.output:
           if getattr(item, "type", "") == "message":
               for c in item.content:
                   if getattr(c, "type", "") in ("output_text", "text"):
                       texts.append(getattr(c, "text", ""))
       return "\n".join(texts).strip()
   except Exception:
       return ""

smooth = SmoothingFunction().method3
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def safe_tokenize(s: str) -&gt; List[str]:
   s = (s or "").strip().lower()
   if not s:
       return []
   try:
       return nltk.word_tokenize(s)
   except LookupError:
       return re.findall(r"\b\w+\b", s)

def bleu_score(ref: str, hyp: str) -&gt; float:
   r = safe_tokenize(ref)
   h = safe_tokenize(hyp)
   if len(h) == 0 or len(r) == 0:
       return 0.0
   return float(sentence_bleu([r], h, smoothing_function=smooth))

def rougeL_f1(ref: str, hyp: str) -&gt; float:
   scores = rouge.score(ref or "", hyp or "")
   return float(scores["rougeL"].fmeasure)

def semantic_sim(ref: str, hyp: str) -&gt; float:
   embs = embedder.encode([ref or "", hyp or ""], normalize_embeddings=True)
   return float(cosine_similarity([embs[0]], [embs[1]])[0][0])
</code></pre>
<p>We implement the core LLM invocation and the evaluation metrics used to assess prompt quality. We compute BLEU, ROUGE-L, and semantic similarity scores to capture both surface-level and semantic differences in model outputs. This lets us evaluate prompt changes from multiple complementary perspectives rather than relying on a single metric.</p>
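<p>As a quick illustration of how these three helpers behave, we can score a hand-written reference/hypothesis pair without touching the API. The pair below is made up for demonstration, and the exact numbers will vary slightly with library versions.</p>
<pre><code class="language-python"># Illustrative only: exercise the metric helpers on a toy pair (no API call involved).
ref = "MLflow logs runs with parameters, metrics, and artifacts."
hyp = "MLflow records each run's parameters, metrics, and artifacts."
print("BLEU        :", round(bleu_score(ref, hyp), 3))
print("ROUGE-L F1  :", round(rougeL_f1(ref, hyp), 3))
print("semantic sim:", round(semantic_sim(ref, hyp), 3))
</code></pre>
<p>BLEU and ROUGE-L reward lexical overlap, while the embedding-based score stays high even when the wording diverges, which is why the tutorial tracks all three.</p>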
<pre><code class="language-python">def evaluate_prompt(prompt_template: str) -&gt; Tuple[pd.DataFrame, Dict[str, float], str]:
   rows = []
   for ex in EVAL_SET:
       p = prompt_template.format(user_input=ex["input"])
       y = call_llm(p)
       ref = ex["reference"]

       rows.append({
           "id": ex["id"],
           "input": ex["input"],
           "reference": ref,
           "output": y,
           "bleu": bleu_score(ref, y),
           "rougeL_f1": rougeL_f1(ref, y),
           "semantic_sim": semantic_sim(ref, y),
       })

   df = pd.DataFrame(rows)
   agg = {
       "bleu_mean": float(df["bleu"].mean()),
       "rougeL_f1_mean": float(df["rougeL_f1"].mean()),
       "semantic_sim_mean": float(df["semantic_sim"].mean()),
   }
   outputs_jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
   return df, agg, outputs_jsonl

def log_text_artifact(text: str, artifact_path: str):
   mlflow.log_text(text, artifact_path)

def prompt_diff(old: str, new: str) -&gt; str:
   a = old.splitlines(keepends=True)
   b = new.splitlines(keepends=True)
   return "".join(difflib.unified_diff(a, b, fromfile="previous_prompt", tofile="current_prompt"))

def compute_regression_flags(baseline: Dict[str, float], current: Dict[str, float]) -&gt; Dict[str, Any]:
   d_sem = baseline["semantic_sim_mean"] - current["semantic_sim_mean"]
   d_rouge = baseline["rougeL_f1_mean"] - current["rougeL_f1_mean"]
   d_bleu = baseline["bleu_mean"] - current["bleu_mean"]

   flags = {
       "abs_semantic_fail": current["semantic_sim_mean"] &lt; ABS_SEM_SIM_MIN,
       "drop_semantic_fail": d_sem &gt; DELTA_SEM_SIM_MAX_DROP,
       "drop_rouge_fail": d_rouge &gt; DELTA_ROUGE_L_MAX_DROP,
       "drop_bleu_fail": d_bleu &gt; DELTA_BLEU_MAX_DROP,
       "delta_semantic": float(d_sem),
       "delta_rougeL": float(d_rouge),
       "delta_bleu": float(d_bleu),
   }
   flags["regression"] = any([flags["abs_semantic_fail"], flags["drop_semantic_fail"], flags["drop_rouge_fail"], flags["drop_bleu_fail"]])
   return flags
</code></pre>
<p>We build the evaluation and regression logic that runs each prompt against the evaluation set and aggregates the results. We log prompt artifacts, prompt diffs, and evaluation outputs to MLflow, ensuring every experiment remains auditable. We also compute regression flags that automatically identify whether a prompt version degrades performance relative to the baseline.</p>
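<p>To make the thresholds concrete, here is a small, purely hypothetical example of compute_regression_flags in action; the aggregate numbers are invented for illustration and are not results from this tutorial.</p>
<pre><code class="language-python"># Hypothetical aggregates, invented purely for illustration.
baseline_example = {"bleu_mean": 0.32, "rougeL_f1_mean": 0.55, "semantic_sim_mean": 0.86}
candidate_example = {"bleu_mean": 0.30, "rougeL_f1_mean": 0.44, "semantic_sim_mean": 0.83}

flags_example = compute_regression_flags(baseline_example, candidate_example)
# Only ROUGE-L drops by more than its budget (0.55 - 0.44 = 0.11 versus 0.08 allowed),
# so drop_rouge_fail is True and the overall regression flag is raised.
print(flags_example["drop_rouge_fail"], flags_example["regression"])
</code></pre>
<p>Because any single failing check raises the overall flag, each threshold can be tightened or loosened independently of the others.</p>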
<p>Check out the <strong><a href="https://github.com/Marktechpost/AI-Tutorial-Codes-Included/blob/main/MLFlow%20for%20LLM%20Evaluation/Prompt_Versioning_and_Regression_Testing_for_LLMs_with_MLflow_Marktechpost.ipynb" target="_blank" rel="noreferrer noopener">FULL CODES here</a></strong>.</p>
<pre><code class="language-python">print("Running prompt versioning + regression testing with MLflow...")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment:  {mlflow.get_experiment_by_name('prompt_versioning_llm_regression').name}")

run_summary = []
baseline_metrics = None
baseline_prompt = None
baseline_df = None
baseline_metrics_name = None

with mlflow.start_run(run_name=f"prompt_regression_suite_{int(time.time())}") as parent_run:
   mlflow.set_tag("task", "prompt_versioning_regression_testing")
   mlflow.log_param("model", MODEL)
   mlflow.log_param("temperature", TEMPERATURE)
   mlflow.log_param("max_output_tokens", MAX_OUTPUT_TOKENS)
   mlflow.log_param("eval_set_size", len(EVAL_SET))

   for pv in PROMPTS:
       ver = pv["version"]
       prompt_t = pv["prompt"]

       with mlflow.start_run(run_name=ver, nested=True) as child_run:
           mlflow.log_param("prompt_version", ver)
           log_text_artifact(prompt_t, f"prompts/{ver}.txt")

           if baseline_prompt is not None and baseline_metrics_name is not None:
               diff = prompt_diff(baseline_prompt, prompt_t)
               log_text_artifact(diff, f"prompt_diffs/{baseline_metrics_name}_to_{ver}.diff")
           else:
               log_text_artifact("BASELINE_PROMPT (no diff)", f"prompt_diffs/{ver}.diff")

           df, agg, outputs_jsonl = evaluate_prompt(prompt_t)

           mlflow.log_dict(agg, f"metrics/{ver}_agg.json")
           log_text_artifact(outputs_jsonl, f"outputs/{ver}_outputs.jsonl")

           mlflow.log_metric("bleu_mean", agg["bleu_mean"])
           mlflow.log_metric("rougeL_f1_mean", agg["rougeL_f1_mean"])
           mlflow.log_metric("semantic_sim_mean", agg["semantic_sim_mean"])

           if baseline_metrics is None:
               baseline_metrics = agg
               baseline_prompt = prompt_t
               baseline_df = df
               baseline_metrics_name = ver
               flags = {"regression": False, "delta_bleu": 0.0, "delta_rougeL": 0.0, "delta_semantic": 0.0}
               mlflow.set_tag("regression", "false")
           else:
               flags = compute_regression_flags(baseline_metrics, agg)
               mlflow.log_metric("delta_bleu", flags["delta_bleu"])
               mlflow.log_metric("delta_rougeL", flags["delta_rougeL"])
               mlflow.log_metric("delta_semantic", flags["delta_semantic"])
               mlflow.set_tag("regression", str(flags["regression"]).lower())
     for k in [\"abs_semantic_fail\",\"drop_semantic_fail\",\"drop_rouge_fail\",\"drop_bleu_fail\"]:\n                   mlflow.set_tag(k, str(flags[k]).lower())\n\n\n           run_summary.append({\n               \"prompt_version\": ver,\n               \"bleu_mean\": agg[\"bleu_mean\"],\n               \"rougeL_f1_mean\": agg[\"rougeL_f1_mean\"],\n               \"semantic_sim_mean\": agg[\"semantic_sim_mean\"],\n               \"delta_bleu_vs_baseline\": float(flags.get(\"delta_bleu\", 0.0)),\n               \"delta_rougeL_vs_baseline\": float(flags.get(\"delta_rougeL\", 0.0)),\n               \"delta_semantic_vs_baseline\": float(flags.get(\"delta_semantic\", 0.0)),\n               \"regression_flag\": bool(flags[\"regression\"]),\n               \"mlflow_run_id\": child_run.info.run_id,\n           })\n\n\nsummary_df = pd.DataFrame(run_summary).sort_values(\"prompt_version\")\nprint(\"n=== Aggregated Results (higher is better) ===\")\ndisplay(summary_df)\n\n\nregressed = summary_df[summary_df[\"regression_flag\"] == True]\nif len(regressed) &gt; 0:\n   print(\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f6a9.png\" alt=\"\ud83d\udea9\" class=\"wp-smiley\" \/> Regressions detected:\")\n   display(regressed[[\"prompt_version\",\"delta_bleu_vs_baseline\",\"delta_rougeL_vs_baseline\",\"delta_semantic_vs_baseline\",\"mlflow_run_id\"]])\nelse:\n   print(\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> No regressions detected under current thresholds.\")\n\n\nif len(regressed) &gt; 0 and baseline_df is not None:\n   worst_ver = regressed.sort_values(\"delta_semantic_vs_baseline\", ascending=False).iloc[0][\"prompt_version\"]\n   worst_prompt = next(p[\"prompt\"] for p in PROMPTS if p[\"version\"] == worst_ver)\n   worst_df, _, _ = evaluate_prompt(worst_prompt)\n\n\n   merged = baseline_df[[\"id\",\"output\",\"bleu\",\"rougeL_f1\",\"semantic_sim\"]].merge(\n       worst_df[[\"id\",\"output\",\"bleu\",\"rougeL_f1\",\"semantic_sim\"]],\n       on=\"id\",\n       suffixes=(\"_baseline\", f\"_{worst_ver}\")\n   )\n   merged[\"delta_semantic\"] = merged[\"semantic_sim_baseline\"] - merged[f\"semantic_sim_{worst_ver}\"]\n   merged[\"delta_rougeL\"] = merged[\"rougeL_f1_baseline\"] - merged[f\"rougeL_f1_{worst_ver}\"]\n   merged[\"delta_bleu\"] = merged[\"bleu_baseline\"] - merged[f\"bleu_{worst_ver}\"]\n   print(f\"n=== Per-example deltas: baseline vs {worst_ver} (positive delta = worse) ===\")\n   display(\n       merged[[\"id\",\"delta_semantic\",\"delta_rougeL\",\"delta_bleu\",\"output_baseline\",f\"output_{worst_ver}\"]]\n       .sort_values(\"delta_semantic\", ascending=False)\n   )\n\n\nprint(\"nOpen MLflow UI (optional) by running:\")\nprint(\"!mlflow ui --backend-store-uri file:\/content\/mlruns --host 0.0.0.0 --port 5000\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We orchestrate the full prompt regression testing workflow using nested MLflow runs. We compare each prompt version against the baseline, log metric deltas, and record regression outcomes in a structured summary table. This completes a repeatable, engineering-grade pipeline for prompt versioning and regression testing that we can extend to larger datasets and real-world applications.<\/p>\n<p>In conclusion, we established a practical, research-oriented framework for prompt versioning and regression testing that enables us to evaluate LLM behavior with discipline and transparency. 
<p>In conclusion, we established a practical, research-oriented framework for prompt versioning and regression testing that enables us to evaluate LLM behavior with discipline and transparency. We showed how MLflow lets us track prompt evolution, compare outputs across versions, and automatically flag regressions based on well-defined thresholds. This approach helps us move away from ad hoc prompt tuning and toward measurable, repeatable experimentation. By adopting this workflow, we ensure that prompt updates improve model behavior intentionally rather than introducing hidden performance regressions.</p>
<hr />
<p>Check out the <strong><a href="https://github.com/Marktechpost/AI-Tutorial-Codes-Included/blob/main/MLFlow%20for%20LLM%20Evaluation/Prompt_Versioning_and_Regression_Testing_for_LLMs_with_MLflow_Marktechpost.ipynb" target="_blank" rel="noreferrer noopener">FULL CODES here</a></strong>.</p>
<p>The post <a href="https://www.marktechpost.com/2026/02/08/a-coding-implementation-to-establish-rigorous-prompt-versioning-and-regression-testing-workflows-for-large-language-models-using-mlflow/">A Coding Implementation to Establish Rigorous Prompt Versioning and Regression Testing Workflows for Large Language Models using MLflow</a> appeared first on <a href="https://www.marktechpost.com/">MarkTechPost</a>.</p>