A Coding Implementation to Establish Rigorous Prompt Versioning and Regression Testing Workflows for Large Language Models using MLflow
In this tutorial, we show how we treat prompts as first-class, versioned artifacts and apply rigorous regression testing to large language model behavior using MLflow. We design an evaluation pipeline that logs prompt versions, prompt diffs, model outputs, and multiple quality metrics in a fully reproducible manner. By combining classical text metrics with semantic similarity and automated regression flags, we demonstrate how to systematically detect performance drift caused by seemingly small prompt changes. Throughout the tutorial, we focus on building a workflow that mirrors real software engineering practices, applied to prompt engineering and LLM evaluation.

```python
!pip -q install -U "openai>=1.0.0" mlflow rouge-score nltk sentence-transformers scikit-learn pandas

import os, json, time, difflib, re
from typing import List, Dict, Any, Tuple

import mlflow
import pandas as pd
import numpy as np
from openai import OpenAI
from rouge_score import rouge_scorer
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

if not os.getenv("OPENAI_API_KEY"):
    try:
        from google.colab import userdata  # type: ignore
        k = userdata.get("OPENAI_API_KEY")
        if k:
            os.environ["OPENAI_API_KEY"] = k
    except Exception:
        pass

if not os.getenv("OPENAI_API_KEY"):
    import getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI_API_KEY (input hidden): ").strip()

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is required."
```

We set up the execution environment by installing all required dependencies and importing the core libraries used throughout the tutorial. We securely load the OpenAI API key at runtime, ensuring credentials are never hard-coded in the notebook.
We also initialize essential NLP resources to ensure the evaluation pipeline runs reliably across different environments.

```python
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.2
MAX_OUTPUT_TOKENS = 250

ABS_SEM_SIM_MIN = 0.78
DELTA_SEM_SIM_MAX_DROP = 0.05
DELTA_ROUGE_L_MAX_DROP = 0.08
DELTA_BLEU_MAX_DROP = 0.10

mlflow.set_tracking_uri("file:/content/mlruns")
mlflow.set_experiment("prompt_versioning_llm_regression")

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

EVAL_SET = [
    {
        "id": "q1",
        "input": "Summarize in one sentence: MLflow tracks experiments, runs, parameters, metrics, and artifacts.",
        "reference": "MLflow helps track machine learning experiments by logging runs with parameters, metrics, and artifacts."
    },
    {
        "id": "q2",
        "input": "Rewrite professionally: 'this model is kinda slow but it works ok.'",
        "reference": "The model is somewhat slow, but it performs reliably."
    },
    {
        "id": "q3",
        "input": "Extract key fields as JSON: 'Order 5531 by Alice costs $42.50 and ships to Toronto.'",
        "reference": '{"order_id":"5531","customer":"Alice","amount_usd":42.50,"city":"Toronto"}'
    },
    {
        "id": "q4",
        "input": "Answer briefly: What is prompt regression testing?",
        "reference": "Prompt regression testing checks whether prompt changes degrade model outputs compared to a baseline."
    },
]

PROMPTS = [
    {
        "version": "v1_baseline",
        "prompt": (
            "You are a precise assistant.\n"
            "Follow the user request carefully.\n"
            "If asked for JSON, output valid JSON only.\n"
            "User: {user_input}"
        )
    },
    {
        "version": "v2_formatting",
        "prompt": (
            "You are a helpful, structured assistant.\n"
            "Respond clearly and concisely.\n"
            "Prefer clean formatting.\n"
            "User request: {user_input}"
        )
    },
    {
        "version": "v3_guardrailed",
        "prompt": (
            "You are a rigorous assistant.\n"
            "Rules:\n"
            "1) If user asks for JSON, output ONLY valid minified JSON.\n"
            "2) Otherwise, keep the answer short and factual.\n"
            "User: {user_input}"
        )
    },
]
```

We define all experimental configurations, including model parameters, regression thresholds, and MLflow tracking settings. We construct the evaluation dataset and explicitly declare multiple prompt versions to compare and test for regressions. By centralizing these definitions, we ensure that prompt changes and evaluation logic remain controlled and reproducible.

```python
def call_llm(formatted_prompt: str) -> str:
    resp = client.responses.create(
        model=MODEL,
        input=formatted_prompt,
        temperature=TEMPERATURE,
        max_output_tokens=MAX_OUTPUT_TOKENS,
    )
    out = getattr(resp, "output_text", None)
    if out:
        return out.strip()
    # Fallback: walk the structured output if output_text is unavailable.
    try:
        texts = []
        for item in resp.output:
            if getattr(item, "type", "") == "message":
                for c in item.content:
                    if getattr(c, "type", "") in ("output_text", "text"):
                        texts.append(getattr(c, "text", ""))
        return "\n".join(texts).strip()
    except Exception:
        return ""

smooth = SmoothingFunction().method3
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def safe_tokenize(s: str) -> List[str]:
    s = (s or "").strip().lower()
    if not s:
        return []
    try:
        return nltk.word_tokenize(s)
    except LookupError:
        # Fall back to a simple regex tokenizer if NLTK data is missing.
        return re.findall(r"\b\w+\b", s)

def bleu_score(ref: str, hyp: str) -> float:
    r = safe_tokenize(ref)
    h = safe_tokenize(hyp)
    if len(h) == 0 or len(r) == 0:
        return 0.0
    return float(sentence_bleu([r], h, smoothing_function=smooth))

def rougeL_f1(ref: str, hyp: str) -> float:
    scores = rouge.score(ref or "", hyp or "")
    return float(scores["rougeL"].fmeasure)

def semantic_sim(ref: str, hyp: str) -> float:
    embs = embedder.encode([ref or "", hyp or ""], normalize_embeddings=True)
    return float(cosine_similarity([embs[0]], [embs[1]])[0][0])
```

We implement the core LLM invocation and the evaluation metrics used to assess prompt quality. We compute BLEU, ROUGE-L, and semantic similarity scores to capture both surface-level and semantic differences in model outputs. This lets us evaluate prompt changes from multiple complementary perspectives rather than relying on a single metric.
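To build intuition for what `semantic_sim` returns, here is a minimal, dependency-free sketch of cosine similarity applied to two toy vectors. The vectors are invented for illustration; real MiniLM embeddings are 384-dimensional, but the arithmetic is identical:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d "embeddings" for a reference and a hypothesis.
ref_vec = [0.6, 0.8, 0.0]
hyp_vec = [0.8, 0.6, 0.0]
print(round(cosine(ref_vec, hyp_vec), 2))  # 0.96; identical directions would give 1.0
```

Because the tutorial encodes with `normalize_embeddings=True`, the norms are already 1 and the similarity reduces to a plain dot product.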
```python
def evaluate_prompt(prompt_template: str) -> Tuple[pd.DataFrame, Dict[str, float], str]:
    rows = []
    for ex in EVAL_SET:
        p = prompt_template.format(user_input=ex["input"])
        y = call_llm(p)
        ref = ex["reference"]
        rows.append({
            "id": ex["id"],
            "input": ex["input"],
            "reference": ref,
            "output": y,
            "bleu": bleu_score(ref, y),
            "rougeL_f1": rougeL_f1(ref, y),
            "semantic_sim": semantic_sim(ref, y),
        })
    df = pd.DataFrame(rows)
    agg = {
        "bleu_mean": float(df["bleu"].mean()),
        "rougeL_f1_mean": float(df["rougeL_f1"].mean()),
        "semantic_sim_mean": float(df["semantic_sim"].mean()),
    }
    outputs_jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
    return df, agg, outputs_jsonl

def log_text_artifact(text: str, artifact_path: str):
    mlflow.log_text(text, artifact_path)

def prompt_diff(old: str, new: str) -> str:
    a = old.splitlines(keepends=True)
    b = new.splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, fromfile="previous_prompt", tofile="current_prompt"))

def compute_regression_flags(baseline: Dict[str, float], current: Dict[str, float]) -> Dict[str, Any]:
    d_sem = baseline["semantic_sim_mean"] - current["semantic_sim_mean"]
    d_rouge = baseline["rougeL_f1_mean"] - current["rougeL_f1_mean"]
    d_bleu = baseline["bleu_mean"] - current["bleu_mean"]
    flags = {
        "abs_semantic_fail": current["semantic_sim_mean"] < ABS_SEM_SIM_MIN,
        "drop_semantic_fail": d_sem > DELTA_SEM_SIM_MAX_DROP,
        "drop_rouge_fail": d_rouge > DELTA_ROUGE_L_MAX_DROP,
        "drop_bleu_fail": d_bleu > DELTA_BLEU_MAX_DROP,
        "delta_semantic": float(d_sem),
        "delta_rougeL": float(d_rouge),
        "delta_bleu": float(d_bleu),
    }
    flags["regression"] = any([
        flags["abs_semantic_fail"],
        flags["drop_semantic_fail"],
        flags["drop_rouge_fail"],
        flags["drop_bleu_fail"],
    ])
    return flags
```

We build the evaluation and regression logic that runs each prompt against the evaluation set and aggregates the results.
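To see how the absolute and relative thresholds interact, the snippet below re-applies the same pass/fail logic to a pair of hypothetical metric dictionaries; the numbers are invented for illustration:

```python
# Same threshold values as in the tutorial configuration.
ABS_SEM_SIM_MIN = 0.78
DELTA_SEM_SIM_MAX_DROP = 0.05

baseline = {"semantic_sim_mean": 0.90}
current = {"semantic_sim_mean": 0.82}

d_sem = baseline["semantic_sim_mean"] - current["semantic_sim_mean"]  # 0.08

abs_fail = current["semantic_sim_mean"] < ABS_SEM_SIM_MIN   # 0.82 >= 0.78 -> False
drop_fail = d_sem > DELTA_SEM_SIM_MAX_DROP                  # 0.08 > 0.05  -> True

# The new prompt clears the absolute floor, yet the drop relative to the
# baseline is large enough to count as a regression.
print(abs_fail, drop_fail)  # False True
```

This is why `compute_regression_flags` checks both kinds of condition: an absolute floor catches prompts that are simply bad, while delta thresholds catch prompts that are still acceptable in isolation but meaningfully worse than the baseline.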
We log prompt artifacts, prompt diffs, and evaluation outputs to MLflow, ensuring every experiment remains auditable. We also compute regression flags that automatically identify whether a prompt version degrades performance relative to the baseline.

```python
print("Running prompt versioning + regression testing with MLflow...")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment: {mlflow.get_experiment_by_name('prompt_versioning_llm_regression').name}")

run_summary = []
baseline_metrics = None
baseline_prompt = None
baseline_df = None
baseline_metrics_name = None

with mlflow.start_run(run_name=f"prompt_regression_suite_{int(time.time())}") as parent_run:
    mlflow.set_tag("task", "prompt_versioning_regression_testing")
    mlflow.log_param("model", MODEL)
```
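As a standalone illustration of the diff artifact logged for each version, here is `difflib.unified_diff` applied to two short prompt strings (abbreviated forms of the v1 and v2 prompts above):

```python
import difflib

old = "You are a precise assistant.\nUser: {user_input}"
new = "You are a helpful, structured assistant.\nUser request: {user_input}"

# Same construction as prompt_diff: line-level unified diff with labels.
diff = "".join(difflib.unified_diff(
    old.splitlines(keepends=True),
    new.splitlines(keepends=True),
    fromfile="previous_prompt",
    tofile="current_prompt",
))
print(diff)
```

Storing this diff as a text artifact next to the run's metrics makes it easy to see exactly which wording change coincided with a flagged regression.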

