A Coding Implementation for Training, Optimizing, Evaluating, and Interpreting Knowledge Graph Embeddings with PyKEEN
In this tutorial, we walk through an end-to-end, advanced workflow for knowledge graph embeddings using PyKEEN, actively exploring how modern embedding models are trained, evaluated, optimized, and interpreted in practice. We start by understanding the structure of a real knowledge graph dataset, then systematically train and compare multiple embedding models, tune their hyperparameters, and analyze their performance using robust ranking metrics. Throughout, we focus not only on running pipelines but also on building intuition for link prediction, negative sampling, and embedding geometry, so that we understand why each step matters and how it affects downstream reasoning over graphs.

!pip install -q pykeen torch torchvision

import warnings
warnings.filterwarnings('ignore')

import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple

from pykeen.pipeline import pipeline
from pykeen.datasets import Nations, FB15k237, get_dataset
from pykeen.models import TransE, ComplEx, RotatE, DistMult
from pykeen.training import SLCWATrainingLoop, LCWATrainingLoop
from pykeen.evaluation import RankBasedEvaluator
from pykeen.triples import TriplesFactory
from pykeen.hpo import hpo_pipeline
from pykeen.sampling import BasicNegativeSampler
from pykeen.losses import MarginRankingLoss, BCEWithLogitsLoss
from pykeen.trackers import ConsoleResultTracker

print("PyKEEN setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

We set up the complete experimental environment by installing PyKEEN and its deep learning dependencies, and by importing all required libraries for modeling, evaluation, visualization, and optimization. We ensure a clean, reproducible workflow by suppressing warnings and verifying the PyTorch and CUDA configuration for efficient computation.
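To make the reproducibility point concrete, here is a minimal sketch of how we might centralize seeding and device selection before running any pipeline. The helper name set_seed_and_device is our own, not part of PyKEEN; it relies only on standard Python, NumPy, and PyTorch calls.

import random

import numpy as np
import torch

def set_seed_and_device(seed: int = 42) -> torch.device:
    # Seed Python, NumPy, and PyTorch RNGs so repeated runs produce comparable results.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Prefer the GPU when one is available, otherwise fall back to the CPU.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = set_seed_and_device(42)
print(f"Running on: {device}")

Passing random_seed=42 to the pipeline, as we do below, covers PyKEEN's own internals; the helper above simply makes the surrounding notebook code deterministic as well.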
print("\n" + "="*80)
print("SECTION 2: Dataset Exploration")
print("="*80 + "\n")

dataset = Nations()
print(f"Dataset: {dataset}")
print(f"Number of entities: {dataset.num_entities}")
print(f"Number of relations: {dataset.num_relations}")
print(f"Training triples: {dataset.training.num_triples}")
print(f"Testing triples: {dataset.testing.num_triples}")
print(f"Validation triples: {dataset.validation.num_triples}")

print("\nSample triples (head, relation, tail):")
for i in range(5):
    h, r, t = dataset.training.mapped_triples[i]
    head = dataset.training.entity_id_to_label[h.item()]
    rel = dataset.training.relation_id_to_label[r.item()]
    tail = dataset.training.entity_id_to_label[t.item()]
    print(f"  {head} --[{rel}]--> {tail}")

def analyze_dataset(triples_factory: TriplesFactory) -> pd.DataFrame:
    """Compute basic statistics about the knowledge graph."""
    stats = {'Metric': [], 'Value': []}
    stats['Metric'].extend(['Entities', 'Relations', 'Triples'])
    stats['Value'].extend([
        triples_factory.num_entities,
        triples_factory.num_relations,
        triples_factory.num_triples,
    ])
    unique, counts = torch.unique(triples_factory.mapped_triples[:, 1], return_counts=True)
    stats['Metric'].extend(['Avg triples per relation', 'Max triples for a relation'])
    stats['Value'].extend([counts.float().mean().item(), counts.max().item()])
    return pd.DataFrame(stats)

stats_df = analyze_dataset(dataset.training)
print("\nDataset Statistics:")
print(stats_df.to_string(index=False))

We load and explore the Nations knowledge graph to understand its scale, structure, and relational complexity before training any models. We inspect sample triples to build intuition about how entities and relations are represented internally using indexed mappings. We then compute core statistics such as relation frequency and triple distribution, which lets us reason about graph sparsity and modeling difficulty upfront.
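As a follow-up to the statistics above, here is a small optional sketch of how we might quantify sparsity more directly. The density formula below (observed triples divided by all possible (head, relation, tail) combinations) is a standard rough estimate, not a PyKEEN built-in, and the snippet reuses only attributes already shown above.

import pandas as pd
import torch
from pykeen.datasets import Nations

tf = Nations().training

# Rough density: observed triples over every possible (head, relation, tail) combination.
possible_triples = tf.num_entities ** 2 * tf.num_relations
print(f"Graph density: {tf.num_triples / possible_triples:.6f}")

# Per-relation triple counts, mapped back to human-readable relation labels.
rel_ids, counts = torch.unique(tf.mapped_triples[:, 1], return_counts=True)
rel_counts = pd.Series(
    counts.numpy(),
    index=[tf.relation_id_to_label[i.item()] for i in rel_ids],
).sort_values(ascending=False)
print(rel_counts.head(10))

A very low density indicates a sparse graph, where most candidate triples are unobserved and link prediction must generalize rather than memorize.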
print("\n" + "="*80)
print("SECTION 3: Training Multiple Models")
print("="*80 + "\n")

models_config = {
    'TransE': {
        'model': 'TransE',
        'model_kwargs': {'embedding_dim': 50},
        'loss': 'MarginRankingLoss',
        'loss_kwargs': {'margin': 1.0},
    },
    'ComplEx': {
        'model': 'ComplEx',
        'model_kwargs': {'embedding_dim': 50},
        'loss': 'BCEWithLogitsLoss',
    },
    'RotatE': {
        'model': 'RotatE',
        'model_kwargs': {'embedding_dim': 50},
        'loss': 'MarginRankingLoss',
        'loss_kwargs': {'margin': 3.0},
    },
}

training_config = {
    'training_loop': 'sLCWA',
    'negative_sampler': 'basic',
    'negative_sampler_kwargs': {'num_negs_per_pos': 5},
    'training_kwargs': {
        'num_epochs': 100,
        'batch_size': 128,
    },
    'optimizer': 'Adam',
    'optimizer_kwargs': {'lr': 0.001},
}

results = {}
for model_name, config in models_config.items():
    print(f"\nTraining {model_name}...")
    result = pipeline(
        dataset=dataset,
        model=config['model'],
        model_kwargs=config.get('model_kwargs', {}),
        loss=config.get('loss'),
        loss_kwargs=config.get('loss_kwargs', {}),
        **training_config,
        random_seed=42,
        device='cuda' if torch.cuda.is_available() else 'cpu',
    )
    results[model_name] = result
    print(f"\n{model_name} Results:")
    print(f"  MRR: {result.metric_results.get_metric('mean_reciprocal_rank'):.4f}")
    print(f"  Hits@1: {result.metric_results.get_metric('hits_at_1'):.4f}")
    print(f"  Hits@3: {result.metric_results.get_metric('hits_at_3'):.4f}")
    print(f"  Hits@10: {result.metric_results.get_metric('hits_at_10'):.4f}")

We define a consistent training configuration and systematically train multiple knowledge graph embedding models to enable a fair comparison. We use the same dataset, negative sampling strategy, optimizer, and training loop while allowing each model to leverage its own inductive bias and loss formulation. We then evaluate and record standard ranking metrics, such as MRR and Hits@K, to quantitatively assess each embedding approach's performance on link prediction.

print("\n" + "="*80)
print("SECTION 4: Model Comparison")
print("="*80 + "\n")

metrics_to_compare = ['mean_reciprocal_rank', 'hits_at_1', 'hits_at_3', 'hits_at_10']
comparison_data = {metric: [] for metric in metrics_to_compare}
model_names = []
for model_name, result in results.items():
    model_names.append(model_name)
    for metric in metrics_to_compare:
        comparison_data[metric].append(result.metric_results.get_metric(metric))

comparison_df = pd.DataFrame(comparison_data, index=model_names)
print("Model Comparison:")
print(comparison_df.to_string())

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Model Performance Comparison', fontsize=16)
for idx, metric in enumerate(metrics_to_compare):
    ax = axes[idx // 2, idx % 2]
    comparison_df[metric].plot(kind='bar', ax=ax, color='steelblue')
    ax.set_title(metric.replace('_', ' ').title())
    ax.set_ylabel('Score')
    ax.set_xlabel('Model')
    ax.grid(axis='y', alpha=0.3)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45)

plt.tight_layout()
plt.show()

We aggregate evaluation metrics from all trained models into a unified comparison table for direct performance analysis. We visualize key ranking metrics using bar charts, allowing us to quickly identify strengths and weaknesses across the different embedding approaches.
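To make the ranking metrics in the comparison above concrete, here is a small, self-contained sketch of how MRR and Hits@K are computed from raw ranks. It is a toy illustration with made-up ranks, not PyKEEN's RankBasedEvaluator, and the helper name ranking_metrics is ours.

import numpy as np

def ranking_metrics(ranks: np.ndarray) -> dict:
    # 'ranks' holds, for each test triple, the position of the true entity among all candidates.
    ranks = ranks.astype(float)
    return {
        "mean_reciprocal_rank": float(np.mean(1.0 / ranks)),
        "hits_at_1": float(np.mean(ranks <= 1)),
        "hits_at_3": float(np.mean(ranks <= 3)),
        "hits_at_10": float(np.mean(ranks <= 10)),
    }

# Toy example: the true tail was ranked 1st, 4th, 2nd, and 15th across four test triples.
print(ranking_metrics(np.array([1, 4, 2, 15])))

Higher is better for all four numbers: MRR rewards placing the correct entity near the top even when it is not exactly first, while Hits@K simply measures how often it lands in the top K.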
print("\n" + "="*80)
print("SECTION 5: Hyperparameter Optimization")
print("="*80 + "\n")

hpo_result = hpo_pipeline(
    dataset=dataset,
    model='TransE',
    n_trials=10,
    training_loop='sLCWA',
    training_kwargs={'num_epochs': 50},
    device='cuda' if torch.cuda.is_available() else 'cpu',
)

print("\nBest Configuration Found:")
print(f"  Embedding Dim: {hpo_result.study.best_params.get('model.embedding_dim', 'N/A')}")
print(f"  Learning Rate: {hpo_result.study.best_params.get('optimizer.lr', 'N/A')}")
print(f"  Best MRR: {hpo_result.study.best_value:.4f}")

print("\n" + "="*80)
print("SECTION 6: Link Prediction")
print("="*80 + "\n")

best_model_name = comparison_df['mean_reciprocal_rank'].idxmax()
best_result = results[best_model_name]
model = best_result.model
print(f"Using {best_model_name} for predictions")

def predict_tails(model, dataset, head_label: str, relation_label: str, top_k: int = 5):
    """Predict the most likely tail entities for a given head and relation."""
    head_id = dataset.entity_to_id[head_label]
    relation_id = dataset.relation_to_id[relation_label]
    num_entities = dataset.num_entities
    heads = torch.tensor([head_id] * num_entities).unsqueeze(1)
    relations = torch.tensor([relation_id] * num_entities).unsqueeze(1)
    tails = torch.arange(num_entities).unsqueeze(1)
    batch = torch.cat([heads, relations, tails], dim=1)
    # Keep the scoring batch on the same device as the model to avoid CPU/GPU mismatches.
    batch = batch.to(next(model.parameters()).device)
    with torch.no_grad():
        scores = model.predict_hrt(batch)
    top_scores, top_indices = torch.topk(scores.squeeze(), k=top_k)
    predictions = []
    for score, idx in zip(top_scores, top_indices):
        tail_label = dataset.entity_id_to_label[idx.item()]
        predictions.append((tail_label, score.item()))
    return predictions

if dataset.training.num_entities > 10:
    sample_head = list(dataset.entity_to_id.keys())[0]
    sample_relation = list(dataset.relation_to_id.keys())[0]
    print(f"\nTop predictions for: {sample_head} --[{sample_relation}]--> ?")
    predictions = predict_tails(
        best_result.model,
        dataset.training,
        sample_head,
        sample_relation,
        top_k=5,
    )
    for rank, (entity, score) in enumerate(predictions, 1):
        print(f"  {rank}. {entity} (score: {score:.4f})")

We apply automated hyperparameter optimization to systematically search for a stronger TransE configuration that improves ranking performance without manual tuning. We then select the best-performing model based on MRR and use it to perform practical link prediction by scoring all possible tail entities for a given head-relation pair.
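The same scoring pattern extends to head prediction. The sketch below is our own mirror of the tutorial's predict_tails helper (the name predict_heads is hypothetical, not a PyKEEN API): it fixes a relation and tail, scores every entity as a candidate head with model.predict_hrt, and returns the top-k labels.

import torch

def predict_heads(model, triples_factory, relation_label: str, tail_label: str, top_k: int = 5):
    """Score every entity as a candidate head for (?, relation, tail) and return the top-k."""
    relation_id = triples_factory.relation_to_id[relation_label]
    tail_id = triples_factory.entity_to_id[tail_label]
    num_entities = triples_factory.num_entities

    heads = torch.arange(num_entities).unsqueeze(1)
    relations = torch.tensor([relation_id] * num_entities).unsqueeze(1)
    tails = torch.tensor([tail_id] * num_entities).unsqueeze(1)
    batch = torch.cat([heads, relations, tails], dim=1)
    batch = batch.to(next(model.parameters()).device)

    with torch.no_grad():
        scores = model.predict_hrt(batch)

    top_scores, top_indices = torch.topk(scores.squeeze(), k=top_k)
    return [
        (triples_factory.entity_id_to_label[idx.item()], score.item())
        for score, idx in zip(top_scores, top_indices)
    ]

# Example usage with the variables defined above (labels depend on the Nations vocabulary):
# print(predict_heads(best_result.model, dataset.training, sample_relation, sample_head))

Comparing head and tail predictions for the same relation is a quick way to sanity-check whether a model has learned asymmetric relations or is merely scoring popular entities highly in both directions.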


