{"id":43843,"date":"2025-10-12T06:59:00","date_gmt":"2025-10-12T06:59:00","guid":{"rendered":"https:\/\/youzum.net\/a-coding-guide-to-master-self-supervised-learning-with-lightly-ai-for-efficient-data-curation-and-active-learning\/"},"modified":"2025-10-12T06:59:00","modified_gmt":"2025-10-12T06:59:00","slug":"a-coding-guide-to-master-self-supervised-learning-with-lightly-ai-for-efficient-data-curation-and-active-learning","status":"publish","type":"post","link":"https:\/\/youzum.net\/zh\/a-coding-guide-to-master-self-supervised-learning-with-lightly-ai-for-efficient-data-curation-and-active-learning\/","title":{"rendered":"A Coding Guide to Master Self-Supervised Learning with Lightly AI for Efficient Data Curation and Active Learning"},"content":{"rendered":"<p>In this tutorial, we explore the power of self-supervised learning using the <a href=\"http:\/\/github.com\/lightly-ai\/lightly\">Lightly AI<\/a> framework. We begin by building a SimCLR model to learn meaningful image representations without labels, then generate and visualize embeddings using UMAP and t-SNE. We then dive into coreset selection techniques to curate data intelligently, simulate an active learning workflow, and finally assess the benefits of transfer learning through a linear probe evaluation. Throughout this hands-on guide, we work step by step in Google Colab, training, visualizing, and comparing coreset-based and random sampling to understand how self-supervised learning can significantly improve data efficiency and model performance. Check out the\u00a0<a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/ML%20Project%20Codes\/lightly_ai_self_supervised_active_learning_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>FULL CODES here<\/strong>.<\/a><\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">!pip uninstall -y numpy\n!pip install numpy==1.26.4\n!pip install -q lightly torch torchvision matplotlib scikit-learn umap-learn\n\n\nimport torch\nimport torch.nn as nn\nimport torchvision\nfrom torch.utils.data import DataLoader, Subset\nfrom torchvision import transforms\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.manifold import TSNE\nfrom sklearn.neighbors import NearestNeighbors\nimport umap\n\n\nfrom lightly.loss import NTXentLoss\nfrom lightly.models.modules import SimCLRProjectionHead\nfrom lightly.transforms import SimCLRTransform\nfrom lightly.data import LightlyDataset\n\n\nprint(f\"PyTorch version: {torch.__version__}\")\nprint(f\"CUDA available: {torch.cuda.is_available()}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin by setting up the environment, ensuring compatibility by fixing the NumPy version and installing essential libraries like Lightly, PyTorch, and UMAP. We then import all necessary modules for building, training, and visualizing our self-supervised learning model, confirming that PyTorch and CUDA are ready for GPU acceleration. 
```python
class SimCLRModel(nn.Module):
    """SimCLR model with ResNet backbone."""
    def __init__(self, backbone, hidden_dim=512, out_dim=128):
        super().__init__()
        self.backbone = backbone
        self.backbone.fc = nn.Identity()  # drop the classification head
        self.projection_head = SimCLRProjectionHead(
            input_dim=512, hidden_dim=hidden_dim, output_dim=out_dim
        )

    def forward(self, x):
        features = self.backbone(x).flatten(start_dim=1)
        z = self.projection_head(features)
        return z

    def extract_features(self, x):
        """Extract backbone features without projection."""
        with torch.no_grad():
            return self.backbone(x).flatten(start_dim=1)
```

We define our SimCLRModel, which uses a ResNet backbone to learn visual representations without labels. We remove the classification head and add a projection head to map features into a contrastive embedding space. The model's extract_features method allows us to obtain raw feature embeddings directly from the backbone for downstream analysis.
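As a quick sanity check (an illustrative addition, not from the original notebook), we can push a dummy batch through the model and confirm the two output spaces: 512-dimensional backbone features for downstream analysis and 128-dimensional projections for the contrastive loss.

```python
# Smoke test: verify feature and projection dimensions for ResNet-18.
backbone = torchvision.models.resnet18(weights=None)  # weights=None means untrained
model = SimCLRModel(backbone)

dummy = torch.randn(4, 3, 32, 32)               # a fake batch of four CIFAR-sized images
print(model(dummy).shape)                       # torch.Size([4, 128]) -> projection space
print(model.extract_features(dummy).shape)      # torch.Size([4, 512]) -> backbone features
```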
```python
def load_dataset(train=True):
    """Load CIFAR-10 dataset."""
    ssl_transform = SimCLRTransform(input_size=32, cj_prob=0.8)

    eval_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
    ])

    base_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=train, download=True
    )

    class SSLDataset(torch.utils.data.Dataset):
        def __init__(self, dataset, transform):
            self.dataset = dataset
            self.transform = transform

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, idx):
            img, label = self.dataset[idx]
            return self.transform(img), label

    ssl_dataset = SSLDataset(base_dataset, ssl_transform)

    eval_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=train, download=True, transform=eval_transform
    )

    return ssl_dataset, eval_dataset
```

In this step, we load the CIFAR-10 dataset and apply separate transformations for the self-supervised and evaluation phases. We create a custom SSLDataset class that generates two augmented views of each image for contrastive learning, while the evaluation dataset uses normalized images for downstream tasks. This setup helps the model learn robust representations that are invariant to visual changes.
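To see exactly what the contrastive pipeline feeds the model, we can peek at a single item. This is an illustrative check, and the precise return type can differ across lightly versions, but recent releases of SimCLRTransform return a list of two independently augmented views:

```python
ssl_dataset, eval_dataset = load_dataset(train=True)

views, label = ssl_dataset[0]          # SSLDataset applies SimCLRTransform to a PIL image
print(type(views), len(views))         # e.g. <class 'list'> 2
print(views[0].shape, views[1].shape)  # two augmented 3x32x32 tensors of the same image
print("label:", label)                 # the label rides along but is unused during SSL
```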
```python
def train_ssl_model(model, dataloader, epochs=5, device='cuda'):
    """Train SimCLR model."""
    model.to(device)
    criterion = NTXentLoss(temperature=0.5)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.06, momentum=0.9, weight_decay=5e-4)

    print("\n=== Self-Supervised Training ===")
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch_idx, batch in enumerate(dataloader):
            views = batch[0]  # SimCLRTransform yields two augmented views per image
            view1, view2 = views[0].to(device), views[1].to(device)

            z1 = model(view1)
            z2 = model(view2)
            loss = criterion(z1, z2)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

            if batch_idx % 50 == 0:
                print(f"Epoch {epoch+1}/{epochs} | Batch {batch_idx} | Loss: {loss.item():.4f}")

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1} Complete | Avg Loss: {avg_loss:.4f}")

    return model
```

Here, we train our SimCLR model in a self-supervised manner using the NT-Xent contrastive loss, which encourages similar representations for augmented views of the same image. We optimize the model with stochastic gradient descent (SGD) and track the loss across epochs to monitor learning progress. This stage teaches the model to extract meaningful visual features without relying on labeled data.
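To build intuition for what NTXentLoss computes, here is a minimal standalone sketch on random tensors (illustrative only): the loss treats row i of the first batch and row i of the second batch as a positive pair, and every other sample in the combined batch as a negative.

```python
criterion = NTXentLoss(temperature=0.5)

z1 = torch.randn(8, 128)  # projections of 8 images under augmentation A
z2 = torch.randn(8, 128)  # projections of the same 8 images under augmentation B

loss = criterion(z1, z2)  # large for random vectors; shrinks as positive pairs align
print(f"NT-Xent loss on random embeddings: {loss.item():.4f}")
```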
```python
def generate_embeddings(model, dataset, device='cuda', batch_size=256):
    """Generate embeddings for the entire dataset."""
    model.eval()
    model.to(device)

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=2)

    embeddings = []
    labels = []

    print("\n=== Generating Embeddings ===")
    with torch.no_grad():
        for images, targets in dataloader:
            images = images.to(device)
            features = model.extract_features(images)
            embeddings.append(features.cpu().numpy())
            labels.append(targets.numpy())

    embeddings = np.vstack(embeddings)
    labels = np.concatenate(labels)

    print(f"Generated {embeddings.shape[0]} embeddings with dimension {embeddings.shape[1]}")
    return embeddings, labels


def visualize_embeddings(embeddings, labels, method='umap', n_samples=5000):
    """Visualize embeddings using UMAP or t-SNE."""
    print(f"\n=== Visualizing Embeddings with {method.upper()} ===")

    if len(embeddings) > n_samples:
        indices = np.random.choice(len(embeddings), n_samples, replace=False)
        embeddings = embeddings[indices]
        labels = labels[indices]

    if method == 'umap':
        reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
    else:
        reducer = TSNE(n_components=2, perplexity=30, metric='cosine')

    embeddings_2d = reducer.fit_transform(embeddings)

    plt.figure(figsize=(12, 10))
    scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
                          c=labels, cmap='tab10', s=5, alpha=0.6)
    plt.colorbar(scatter)
    plt.title(f'CIFAR-10 Embeddings ({method.upper()})')
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.tight_layout()
    plt.savefig(f'embeddings_{method}.png', dpi=150)
    print(f"Saved visualization to embeddings_{method}.png")
    plt.show()


def select_coreset(embeddings, labels, budget=1000, method='diversity'):
    """
    Select a coreset using different strategies:
    - diversity: maximum diversity via k-center greedy
    - balanced: class-balanced random selection
    """
    print(f"\n=== Coreset Selection ({method}) ===")

    if method == 'balanced':
        selected_indices = []
        n_classes = len(np.unique(labels))
        per_class = budget // n_classes

        for cls in range(n_classes):
            cls_indices = np.where(labels == cls)[0]
            selected = np.random.choice(cls_indices, min(per_class, len(cls_indices)), replace=False)
            selected_indices.extend(selected)

        return np.array(selected_indices)

    elif method == 'diversity':
        selected_indices = []
        remaining_indices = set(range(len(embeddings)))

        # Seed the selection with one random point, then greedily add the point
        # farthest from everything selected so far (k-center greedy).
        first_idx = np.random.randint(len(embeddings))
        selected_indices.append(first_idx)
        remaining_indices.remove(first_idx)

        for _ in range(budget - 1):
            if not remaining_indices:
                break

            remaining = list(remaining_indices)
            selected_emb = embeddings[selected_indices]
            remaining_emb = embeddings[remaining]

            # Distance from each remaining point to its nearest selected point
            distances = np.min(
                np.linalg.norm(remaining_emb[:, None] - selected_emb, axis=2), axis=1
            )

            max_dist_idx = np.argmax(distances)
            selected_idx = remaining[max_dist_idx]
            selected_indices.append(selected_idx)
            remaining_indices.remove(selected_idx)

        print(f"Selected {len(selected_indices)} samples")
        return np.array(selected_indices)
```

We extract feature embeddings from our trained backbone, cache them with their labels, and project them to 2D using UMAP or t-SNE to watch the cluster structure emerge. We then curate data with a coreset selector, either class-balanced or diversity-driven (k-center greedy), to prioritize the most informative, non-redundant samples for downstream training. This pipeline lets us both see what the model learns and select what matters most.

```python
def evaluate_linear_probe(model, train_subset, test_dataset, device='cuda'):
    """Train a linear classifier on frozen features."""
    model.eval()

    train_loader = DataLoader(train_subset, batch_size=128, shuffle=True, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False, num_workers=2)

    classifier = nn.Linear(512, 10).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

    for epoch in range(10):
        classifier.train()
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)

            with torch.no_grad():
                features = model.extract_features(images)

            outputs = classifier(features)
            loss = criterion(outputs, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    classifier.eval()
    correct = 0
    total = 0

    with torch.no_grad():
```
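The setup cell imports NearestNeighbors but never uses it; one natural use, sketched here as an optional diagnostic (our addition, not the notebook's), is a k-NN label-agreement score that puts a number on how well the embedding space groups same-class images, complementing the UMAP plot:

```python
def knn_purity(embeddings, labels, k=10):
    """Fraction of each point's k nearest neighbors that share its label."""
    nn_index = NearestNeighbors(n_neighbors=k + 1, metric='cosine').fit(embeddings)
    _, idx = nn_index.kneighbors(embeddings)
    neighbor_labels = labels[idx[:, 1:]]               # column 0 is the point itself
    return (neighbor_labels == labels[:, None]).mean()


# After generate_embeddings(...): roughly 0.10 for random features on CIFAR-10,
# noticeably higher once SSL pretraining has organized the space.
# print(f"10-NN label agreement: {knn_purity(embeddings, labels):.3f}")
```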
```python
        for images, targets in test_loader:
            images, targets = images.to(device), targets.to(device)
            features = model.extract_features(images)
            outputs = classifier(features)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    accuracy = 100. * correct / total
    return accuracy


def main():
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")

    ssl_dataset, eval_dataset = load_dataset(train=True)
    _, test_dataset = load_dataset(train=False)

    ssl_subset = Subset(ssl_dataset, range(10000))
    ssl_loader = DataLoader(ssl_subset, batch_size=128, shuffle=True, num_workers=2, drop_last=True)

    backbone = torchvision.models.resnet18(weights=None)  # weights=None replaces the deprecated pretrained=False
    model = SimCLRModel(backbone)
    model = train_ssl_model(model, ssl_loader, epochs=5, device=device)

    eval_subset = Subset(eval_dataset, range(10000))
    embeddings, labels = generate_embeddings(model, eval_subset, device=device)

    visualize_embeddings(embeddings, labels, method='umap')

    coreset_indices = select_coreset(embeddings, labels, budget=1000, method='diversity')
    coreset_subset = Subset(eval_dataset, coreset_indices)

    print("\n=== Active Learning Evaluation ===")
    coreset_acc = evaluate_linear_probe(model, coreset_subset, test_dataset, device=device)
    print(f"Coreset Accuracy (1000 samples): {coreset_acc:.2f}%")

    random_indices = np.random.choice(len(eval_subset), 1000, replace=False)
    random_subset = Subset(eval_dataset, random_indices)
    random_acc = evaluate_linear_probe(model, random_subset, test_dataset, device=device)
    print(f"Random Accuracy (1000 samples): {random_acc:.2f}%")

    print(f"\nCoreset improvement: +{coreset_acc - random_acc:.2f}%")

    print("\n=== Tutorial Complete! ===")
    print("Key takeaways:")
    print("1. Self-supervised learning creates meaningful representations without labels")
    print("2. Embeddings capture semantic similarity between images")
    print("3. Smart data selection (coreset) outperforms random sampling")
    print("4. Active learning reduces labeling costs while maintaining accuracy")


if __name__ == "__main__":
    main()
```

We freeze the backbone and train a lightweight linear probe to quantify the quality of the learned features, then evaluate accuracy on the test set. In the main pipeline, we pretrain with SimCLR, generate embeddings, visualize them, pick a diverse coreset, and compare linear-probe performance against a random subset, directly measuring the value of smart data curation.
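Because the backbone stays frozen, evaluate_linear_probe recomputes the same features in every one of its ten epochs. A common speed-up, sketched below as an optional variant rather than the notebook's method, is to embed each subset once and train the classifier on cached features:

```python
def evaluate_linear_probe_cached(model, train_subset, test_dataset, device='cuda'):
    """Variant of evaluate_linear_probe with one-time feature extraction."""
    feats, targs = generate_embeddings(model, train_subset, device=device)
    X = torch.from_numpy(feats).to(device)
    y = torch.from_numpy(targs).long().to(device)

    classifier = nn.Linear(512, 10).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

    for _ in range(10):
        perm = torch.randperm(len(X), device=device)   # reshuffle cached rows each epoch
        for i in range(0, len(X), 128):
            batch = perm[i:i + 128]
            loss = criterion(classifier(X[batch]), y[batch])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Evaluate on cached test features as well.
    classifier.eval()
    test_feats, test_targs = generate_embeddings(model, test_dataset, device=device)
    with torch.no_grad():
        preds = classifier(torch.from_numpy(test_feats).to(device)).argmax(1).cpu().numpy()
    return 100.0 * (preds == test_targs).mean()
```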
In conclusion, we have seen how self-supervised learning enables representation learning without manual annotations and how coreset-based data selection enhances model generalization with fewer samples. By training a SimCLR model, generating embeddings, curating data, and evaluating through active learning, we experience the end-to-end process of modern self-supervised workflows. Combining intelligent data curation with learned representations lets us build models that are both resource-efficient and performance-optimized, setting a strong foundation for scalable machine learning applications.

---

Check out the [FULL CODES here](https://github.com/Marktechpost/AI-Tutorial-Codes-Included/blob/main/ML%20Project%20Codes/lightly_ai_self_supervised_active_learning_Marktechpost.ipynb). Feel free to check out our [GitHub Page for Tutorials, Codes and Notebooks](https://github.com/Marktechpost/AI-Tutorial-Codes-Included). Also, feel free to follow us on [Twitter](https://x.com/intent/follow?screen_name=marktechpost), and don't forget to join our [100k+ ML SubReddit](https://www.reddit.com/r/machinelearningnews/) and subscribe to [our Newsletter](https://www.aidevsignals.com/). Are you on Telegram? [You can now join us on Telegram as well](https://t.me/machinelearningresearchnews).