{"id":71105,"date":"2026-02-14T11:47:33","date_gmt":"2026-02-14T11:47:33","guid":{"rendered":"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/"},"modified":"2026-02-14T11:47:33","modified_gmt":"2026-02-14T11:47:33","slug":"in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data","status":"publish","type":"post","link":"https:\/\/youzum.net\/zh\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/","title":{"rendered":"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data"},"content":{"rendered":"<p>In this tutorial, we build a complete, production-grade synthetic data pipeline using <a href=\"https:\/\/github.com\/sdv-dev\/CTGAN\"><strong>CTGAN<\/strong><\/a> and the SDV ecosystem. We start from raw mixed-type tabular data and progressively move toward constrained generation, conditional sampling, statistical validation, and downstream utility testing. Rather than stopping at sample generation, we focus on understanding how well synthetic data preserves structure, distributions, and predictive signal. 
This tutorial demonstrates how CTGAN can be used responsibly and rigorously in real-world data science workflows.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">!pip -q install \"ctgan\" \"sdv\" \"sdmetrics\" \"scikit-learn\" \"pandas\" \"numpy\" \"matplotlib\"\n\n\nimport numpy as np\nimport pandas as pd\nimport warnings\nwarnings.filterwarnings(\"ignore\")\n\n\nimport ctgan, sdv, sdmetrics\nfrom ctgan import load_demo, CTGAN\n\n\nfrom sdv.metadata import SingleTableMetadata\nfrom sdv.single_table import CTGANSynthesizer\n\n\nfrom sdv.cag import Inequality, FixedCombinations\nfrom sdv.sampling import Condition\n\n\nfrom sdmetrics.reports.single_table import DiagnosticReport, QualityReport\n\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import roc_auc_score\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.pipeline import Pipeline\n\n\nimport matplotlib.pyplot as plt\n\n\nprint(\"Versions:\")\nprint(\"ctgan:\", ctgan.__version__)\nprint(\"sdv:\", sdv.__version__)\nprint(\"sdmetrics:\", sdmetrics.__version__)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We set up the environment by installing all required libraries and importing the full dependency stack. 
We import CTGAN, SDV, SDMetrics, and the scikit-learn tooling used for the downstream test. We also print the library versions so that the experiment is easier to reproduce and debug, since SDV\u2019s metadata and constraint APIs differ between releases.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">real = load_demo().copy()\nreal.columns = [c.strip().replace(\" \", \"_\") for c in real.columns]\n\n\ntarget_col = \"income\"\nreal[target_col] = real[target_col].astype(str)\n\n\ncategorical_cols = real.select_dtypes(include=[\"object\"]).columns.tolist()\nnumerical_cols = [c for c in real.columns if c not in categorical_cols]\n\n\nprint(\"Rows:\", len(real), \"Cols:\", len(real.columns))\nprint(\"Categorical:\", len(categorical_cols), \"Numerical:\", len(numerical_cols))\ndisplay(real.head())\n\n\nctgan_model = CTGAN(\n   epochs=30,\n   batch_size=500,\n   verbose=True\n)\nctgan_model.fit(real, discrete_columns=categorical_cols)\nsynthetic_ctgan = ctgan_model.sample(5000)\nprint(\"Standalone CTGAN sample:\")\ndisplay(synthetic_ctgan.head())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We load the CTGAN Adult demo dataset, lightly normalize the column names, and cast the target column to string. We explicitly identify categorical and numerical columns: CTGAN needs the discrete columns named at fit time, and the downstream classifier preprocesses the two groups differently. 
We then train a standalone CTGAN model as a baseline and draw synthetic samples from it.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">metadata = SingleTableMetadata()\nmetadata.detect_from_dataframe(data=real)\nmetadata.update_column(column_name=target_col, sdtype=\"categorical\")\n\n\nconstraints = []\n\n\nif len(numerical_cols) &gt;= 2:\n   col_lo, col_hi = numerical_cols[0], numerical_cols[1]\n   constraints.append(Inequality(low_column_name=col_lo, high_column_name=col_hi))\n   print(f\"Added Inequality constraint: {col_hi} &gt;= {col_lo}\")\n\n\nif len(categorical_cols) &gt;= 2:\n   c1, c2 = categorical_cols[0], categorical_cols[1]\n   constraints.append(FixedCombinations(column_names=[c1, c2]))\n   print(f\"Added FixedCombinations constraint on: [{c1}, {c2}]\")\n\n\nsynth = CTGANSynthesizer(\n   metadata=metadata,\n   epochs=30,\n   batch_size=500\n)\n\n\nif constraints:\n   synth.add_constraints(constraints)\n\n\nsynth.fit(real)\n\n\nsynthetic_sdv = synth.sample(num_rows=5000)\nprint(\"SDV CTGANSynthesizer sample:\")\ndisplay(synthetic_sdv.head())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We construct a formal metadata object and explicitly mark the target column as categorical. We introduce structural constraints using SDV\u2019s constraint-augmented generation (CAG) module: an inequality between two numeric columns, and a restriction of two categorical columns to the value combinations seen in the real data. The column pairs are picked by position purely for illustration. 
We then train a CTGAN-based SDV synthesizer that respects these constraints during generation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">loss_df = synth.get_loss_values()\ndisplay(loss_df.tail())\n\n\nx_candidates = [\"epoch\", \"step\", \"steps\", \"iteration\", \"iter\", \"batch\", \"update\"]\nxcol = next((c for c in x_candidates if c in loss_df.columns), None)\n\n\ng_candidates = [\"generator_loss\", \"gen_loss\", \"g_loss\"]\nd_candidates = [\"discriminator_loss\", \"disc_loss\", \"d_loss\"]\ngcol = next((c for c in g_candidates if c in loss_df.columns), None)\ndcol = next((c for c in d_candidates if c in loss_df.columns), None)\n\n\nplt.figure(figsize=(10,4))\n\n\nif xcol is None:\n   x = np.arange(len(loss_df))\nelse:\n   x = loss_df[xcol].to_numpy()\n\n\nif gcol is not None:\n   plt.plot(x, loss_df[gcol].to_numpy(), label=gcol)\nif dcol is not None:\n   plt.plot(x, loss_df[dcol].to_numpy(), label=dcol)\n\n\nplt.xlabel(xcol if xcol is not None else \"index\")\nplt.ylabel(\"loss\")\nplt.legend()\nplt.title(\"CTGAN training losses (SDV wrapper)\")\nplt.show()\n\n\ncond_col = categorical_cols[0]\ncommon_value = real[cond_col].value_counts().index[0]\nconditions = [Condition({cond_col: common_value}, num_rows=2000)]\n\n\nsynthetic_cond = synth.sample_from_conditions(\n   conditions=conditions,\n   max_tries_per_batch=200,\n   batch_size=5000\n)\n\n\nprint(\"Conditional 
sampling requested:\", 2000, \"got:\", len(synthetic_cond))\nprint(\"Conditional sample distribution (top 5):\")\nprint(synthetic_cond[cond_col].value_counts().head(5))\ndisplay(synthetic_cond.head())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We extract and plot the generator and discriminator losses, looking up the column names from a list of candidates so that the plot survives naming differences between versions. We then perform conditional sampling on the most frequent value of one categorical column and check the value counts to confirm that the returned rows satisfy the condition. This shows how CTGAN behaves under guided generation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">metadata_dict = metadata.to_dict()\n\n\ndiagnostic = DiagnosticReport()\ndiagnostic.generate(real_data=real, synthetic_data=synthetic_sdv, metadata=metadata_dict, verbose=True)\nprint(\"Diagnostic score:\", diagnostic.get_score())\n\n\nquality = QualityReport()\nquality.generate(real_data=real, synthetic_data=synthetic_sdv, metadata=metadata_dict, verbose=True)\nprint(\"Quality score:\", quality.get_score())\n\n\ndef show_report_details(report, title):\n   print(f\"\\n===== {title} details =====\")\n   # get_properties() returns a DataFrame with Property and Score columns\n   props = report.get_properties()[\"Property\"].tolist()\n   for p in props:\n       print(f\"\\n--- {p} ---\")\n       details = report.get_details(property_name=p)\n       try:\n           display(details.head(10))\n       except Exception:\n           display(details)\n\n\nshow_report_details(diagnostic, 
\"DiagnosticReport\")\nshow_report_details(quality, \"QualityReport\")\n\n\ntrain_real, test_real = train_test_split(\n   real, test_size=0.25, random_state=42, stratify=real[target_col]\n)\n\n\ndef make_pipeline(cat_cols, num_cols):\n   pre = ColumnTransformer(\n       transformers=[\n           (\"cat\", OneHotEncoder(handle_unknown=\"ignore\"), cat_cols),\n           (\"num\", \"passthrough\", num_cols),\n       ],\n       remainder=\"drop\"\n   )\n   clf = LogisticRegression(max_iter=200)\n   return Pipeline([(\"pre\", pre), (\"clf\", clf)])\n\n\npipe_syn = make_pipeline(categorical_cols, numerical_cols)\npipe_syn.fit(synthetic_sdv.drop(columns=[target_col]), synthetic_sdv[target_col])\n\n\nproba_syn = pipe_syn.predict_proba(test_real.drop(columns=[target_col]))[:, 1]\ny_true = (test_real[target_col].astype(str).str.contains(\"&gt;\")).astype(int)\nauc_syn = roc_auc_score(y_true, proba_syn)\nprint(\"Synthetic-train -&gt; Real-test AUC:\", auc_syn)\n\n\npipe_real = make_pipeline(categorical_cols, numerical_cols)\npipe_real.fit(train_real.drop(columns=[target_col]), train_real[target_col])\n\n\nproba_real = pipe_real.predict_proba(test_real.drop(columns=[target_col]))[:, 1]\nauc_real = roc_auc_score(y_true, proba_real)\nprint(\"Real-train -&gt; Real-test AUC:\", auc_real)\n\n\nmodel_path = \"ctgan_sdv_synth.pkl\"\nsynth.save(model_path)\nprint(\"Saved synthesizer to:\", model_path)\n\n\nfrom sdv.utils import load_synthesizer\nsynth_loaded = load_synthesizer(model_path)\n\n\nsynthetic_loaded = synth_loaded.sample(1000)\nprint(\"Loaded synthesizer sample:\")\ndisplay(synthetic_loaded.head())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We evaluate synthetic data using SDMetrics diagnostic and quality reports and a property-level inspection. We validate downstream usefulness by training a classifier on synthetic data and testing it on real data. 
Finally, we serialize the trained synthesizer and confirm that it can be reloaded and sampled from.<\/p>\n<p>In conclusion, we showed that synthetic data generation with CTGAN becomes considerably more useful when paired with metadata, constraints, and systematic evaluation. By checking both statistical similarity (the SDMetrics reports) and downstream task performance (the synthetic-train, real-test AUC against the real-train baseline), we can judge whether the synthetic data is useful as well as realistic. This pipeline serves as a foundation for data sharing, simulation, and privacy-oriented analytics workflows, although privacy itself is not measured here. With careful configuration and evaluation, CTGAN can be used in real-world data science systems.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Data%20Science\/ctgan_sdv_synthetic_data_pipeline_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">full code here<\/a>.<\/strong><\/p>\n
<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/13\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/\">[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we build a complete, production-grade synthetic data pipeline using CTGAN and the SDV ecosystem. We start from raw mixed-type tabular data and progressively move toward constrained generation, conditional sampling, statistical validation, and downstream utility testing. Rather than stopping at sample generation, we focus on understanding how well synthetic data preserves structure, distributions, and predictive signal. This tutorial demonstrates how CTGAN can be used responsibly and rigorously in real-world data science workflows. 
The post [In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-71105","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, 
max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/zh\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/zh\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-02-14T11:47:33+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 \u5206\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic 
Data\",\"datePublished\":\"2026-02-14T11:47:33+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/\"},\"wordCount\":480,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/\",\"url\":\"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/\",\"name\":\"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"datePublished\":\"2026-02-14T11:47:33+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic 
Data\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/zh\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/zh\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/","og_locale":"zh_CN","og_type":"article","og_title":"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/zh\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-02-14T11:47:33+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"admin NU","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"6 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic 
Data","datePublished":"2026-02-14T11:47:33+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/"},"wordCount":480,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"zh-Hans","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/","url":"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/","name":"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data - YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"datePublished":"2026-02-14T11:47:33+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/#breadcrumb"},"inLanguage":"zh-Hans","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic 
Data"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"zh-Hans"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/zh\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/zh\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/zh\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a 
href=\"https:\/\/youzum.net\/zh\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"In this tutorial, we build a complete, production-grade synthetic data pipeline using CTGAN and the SDV ecosystem. We start from raw mixed-type tabular data and progressively move toward constrained generation, conditional sampling, statistical validation, and downstream utility testing. Rather than stopping at sample generation, we focus on understanding how well synthetic data preserves structure, distributions,&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/71105","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/comments?post=71105"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/71105\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media?parent=71105"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/categories?post=71105"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/tags?post=71105"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}