Complex prediction problems often lead to ensembles, because combining multiple models improves accuracy by reducing variance and capturing diverse patterns. However, these ensembles are often impractical in production because of latency constraints and operational complexity. Instead of discarding them, knowledge distillation offers a smarter approach: keep the ensemble as a teacher and train a smaller student model on its soft probability outputs. The student inherits much of the ensemble's performance while staying lightweight and fast enough for deployment. In this article, we build this pipeline from scratch: training a 12-model teacher ensemble, generating soft targets with temperature scaling, and distilling the ensemble into a student that recovers 53.8% of the ensemble's accuracy edge at 160× compression.

What is Knowledge Distillation?

Knowledge distillation is a model compression technique in which a large, pre-trained "teacher" model transfers its learned behavior to a smaller "student" model. Instead of training solely on ground-truth labels, the student is trained to mimic the teacher's predictions, capturing not just the final outputs but also the richer patterns embedded in the teacher's probability distributions. This lets the student approximate the performance of a complex model while remaining significantly smaller and faster. Knowledge distillation originated in early work on compressing large ensembles into single networks. It is now widely used in NLP, speech, and computer vision, and has become especially important for scaling massive generative AI models down into efficient, deployable systems.
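To make "soft targets" concrete, here is a minimal sketch of temperature scaling in plain Python, so it runs without any dependencies. The function names, the example logits, and the temperature T=4.0 are illustrative assumptions, not values taken from this article's pipeline. In PyTorch the same softening is usually written as F.softmax(logits / T, dim=1).

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Turn raw logits into probabilities; a larger T gives a flatter distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a single example (assumed values, for illustration only)
teacher_logits = [4.0, 1.0]
student_logits = [2.0, 1.5]

hard = softmax_with_temperature(teacher_logits, T=1.0)  # near one-hot
soft = softmax_with_temperature(teacher_logits, T=4.0)  # softened teacher target

# The distillation term compares student and teacher at the same temperature
student_soft = softmax_with_temperature(student_logits, T=4.0)
kd_term = kl_divergence(soft, student_soft)

print("T=1 teacher probabilities:", [round(p, 3) for p in hard])
print("T=4 teacher probabilities:", [round(p, 3) for p in soft])
print("KL(teacher || student) at T=4:", round(kd_term, 4))
```

Raising the temperature moves probability mass toward the less likely class, which is what exposes the teacher's relative confidence to the student. The KL term is zero only when the student's softened distribution matches the teacher's.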
Knowledge Distillation: From Ensemble Teacher to Lean Student

Setting up the dependencies

    pip install torch scikit-learn numpy

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, TensorDataset
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    import numpy as np

    torch.manual_seed(42)
    np.random.seed(42)

Creating the dataset

This block creates and prepares a synthetic dataset for a binary classification task (such as predicting whether a user clicks an ad). First, make_classification generates 5,000 samples with 20 features, of which 10 are informative and 5 are redundant, to simulate real-world data complexity. The dataset is then split 80/20 into training and test sets so the model can be evaluated on unseen data. Next, StandardScaler is fit on the training set and applied to both sets, giving the features a consistent scale, which helps neural networks train more efficiently. The data is then converted into PyTorch tensors so it can be used in model training. Finally, a DataLoader feeds the training data in shuffled mini-batches of size 64, which improves efficiency and enables stochastic gradient descent.
    X, y = make_classification(
        n_samples=5000,
        n_features=20,
        n_informative=10,
        n_redundant=5,
        random_state=42
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Convert to tensors
    X_train_t = torch.tensor(X_train, dtype=torch.float32)
    y_train_t = torch.tensor(y_train, dtype=torch.long)
    X_test_t = torch.tensor(X_test, dtype=torch.float32)
    y_test_t = torch.tensor(y_test, dtype=torch.long)

    train_loader = DataLoader(
        TensorDataset(X_train_t, y_train_t), batch_size=64, shuffle=True
    )

Model Architecture

This section defines two neural network architectures: a TeacherModel and a StudentModel. The teacher represents one of the large models in the ensemble. It has three hidden layers, wider dimensions, and dropout for regularization, which makes it highly expressive but computationally expensive at inference time. The student model is a smaller, more efficient network with fewer layers and parameters. Its goal is not to match the teacher's complexity but to learn its behavior through distillation. The student still needs enough capacity to approximate the teacher's decision boundaries: if it is too small, it cannot capture the richer patterns learned by the ensemble.

    class TeacherModel(nn.Module):
        """Represents one heavy model inside the ensemble."""

        def __init__(self, input_dim=20, num_classes=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, 256), nn.ReLU(), nn.Dropout(0.3),
                nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, num_classes)
            )

        def forward(self, x):
            return self.net(x)


    class StudentModel(nn.Module):
        """
        The lean production model that learns from the ensemble.
        Two hidden layers: enough capacity to absorb distilled knowledge,
        still ~160x smaller than the full ensemble
        (about 3.5K parameters vs. about 560K across 12 teachers).
        """

        def __init__(self, input_dim=20, num_classes=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
                nn.Linear(32, num_classes)
            )

        def forward(self, x):
            return self.net(x)

Helpers

This section defines two utility functions for training and evaluation.

train_one_epoch handles one full pass over the training data. It puts the model in training mode, iterates through mini-batches, computes the loss, performs backpropagation, and updates the model weights using the optimizer. It also tracks and returns the average loss across all batches so training progress can be monitored.

evaluate measures model performance. It switches the model to evaluation mode (which disables dropout), runs the forward pass under torch.no_grad() so no gradients are tracked, and computes accuracy by comparing predicted labels with true labels.

    def train_one_epoch(model, loader, optimizer, criterion):
        model.train()
        total_loss = 0
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        return total_loss / len(loader)


    def evaluate(model, X, y):
        model.eval()
        with torch.no_grad():
            preds = model(X).argmax(dim=1)
        return (preds == y).float().mean().item()

Training the Ensemble

This section trains the teacher ensemble, which serves as the source of knowledge for distillation. Instead of a single model, 12 teacher models are trained independently with different random initializations, so each one learns slightly different patterns from the data. This diversity is what makes ensembles powerful. Each teacher is trained for 30 epochs, and its individual test accuracy is printed.
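The variance-reduction argument behind ensembling can be illustrated with a small simulation. This is a sketch under a strong assumption: each "model" is just the true score plus independent Gaussian noise, which is not how the teachers in this article behave, since they share the same training data and their errors are correlated. The constants TRUE_SCORE, NOISE_STD, and NUM_TRIALS are made up for illustration.

```python
import random
import statistics

random.seed(0)

NUM_MODELS = 12      # matches the ensemble size used in this article
NUM_TRIALS = 5000    # assumed number of simulated predictions
TRUE_SCORE = 0.7     # assumed "correct" probability for an example
NOISE_STD = 0.1      # assumed per-model prediction noise

single_preds = []
ensemble_preds = []
for _ in range(NUM_TRIALS):
    # Each simulated model sees the true score plus its own independent error
    preds = [random.gauss(TRUE_SCORE, NOISE_STD) for _ in range(NUM_MODELS)]
    single_preds.append(preds[0])                  # one model on its own
    ensemble_preds.append(sum(preds) / NUM_MODELS)  # average of all 12

single_std = statistics.pstdev(single_preds)
ensemble_std = statistics.pstdev(ensemble_preds)

print(f"single model std:  {single_std:.4f}")
print(f"12-model mean std: {ensemble_std:.4f}")
print(f"reduction factor:  {single_std / ensemble_std:.2f}")
```

With fully independent errors, the spread of the averaged prediction shrinks by roughly the square root of the number of models. Real teachers trained on the same data are correlated, so the gain in practice is smaller, which is why varying the initialization across teachers matters.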
Once all models are trained, their predictions are combined using soft voting: the output logits are averaged rather than taking a simple majority vote. This produces a stronger, more stable final prediction, giving you a high-performing ensemble that acts as the "teacher" in the next step.

    print("=" * 55)
    print("STEP 1: Training the 12-model Teacher Ensemble")
    print(" (this happens offline, not in production)")
    print("=" * 55)

    NUM_TEACHERS = 12
    teachers = []

    for i in range(NUM_TEACHERS):
        torch.manual_seed(i)  # different init per teacher
        model = TeacherModel()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        criterion = nn.CrossEntropyLoss()
        for epoch in range(30):  # train until convergence
            train_one_epoch(model, train_loader, optimizer, criterion)
        acc = evaluate(model,