
SwiReasoning: Entropy-Driven Alternation of Latent and Explicit Chain-of-Thought for Reasoning LLMs

SwiReasoning is a decoding-time framework that lets a reasoning LLM decide when to think in latent space and when to write explicit chain-of-thought, using block-wise confidence estimated from entropy trends in next-token distributions. The method is training-free, model-agnostic, and targets Pareto-superior accuracy/efficiency trade-offs on mathematics and STEM benchmarks. Reported results show +1.5%–2.8% average accuracy improvements with unlimited tokens and +56%–79% average token-efficiency gains under constrained budgets; on AIME'24/'25, it reaches maximum reasoning accuracy earlier than standard CoT.

What SwiReasoning changes at inference time

The controller monitors the decoder's next-token entropy to form a block-wise confidence signal. When confidence is low (entropy trending upward), it enters latent reasoning: the model continues to reason without emitting tokens. When confidence recovers (entropy trending down), it switches back to explicit reasoning, emitting CoT tokens to consolidate and commit to a single path. A switch-count control caps the number of thinking-block transitions to suppress overthinking before finalizing the answer. This dynamic alternation is the core mechanism behind the reported accuracy-per-token gains. (Paper: https://arxiv.org/pdf/2510.05069)

Results: accuracy and efficiency on standard suites

The paper reports improvements across mathematics and STEM reasoning tasks:
- Pass@1 (unlimited budget): accuracy lifts of up to +2.8% (math) and +2.0% (STEM) in Figure 1 and Table 1, with a +2.17% average over baselines (CoT with sampling, CoT greedy, and Soft Thinking).
- Token efficiency (limited budgets): average improvements of up to +79% (Figure 2). A comprehensive comparison shows SwiReasoning attains the highest token efficiency in 13 of 15 evaluations, with an +84% average improvement over CoT across those settings (Figure 4).
- Pass@k dynamics: with Qwen3-8B on AIME 2024/2025, maximum reasoning accuracies are reached about 50% earlier than with CoT on average (Figure 5), indicating faster convergence to the ceiling with fewer sampled trajectories.

Why switching helps

Explicit CoT is discrete and readable but locks in a single path prematurely, which can discard useful alternatives. Latent reasoning is continuous and information-dense per step, but purely latent strategies may diffuse probability mass and impede convergence. SwiReasoning adds a confidence-guided alternation: latent phases broaden exploration when the model is uncertain; explicit phases exploit rising confidence to solidify a solution and commit tokens only when beneficial. The switch-count control regularizes the process by capping oscillations and limiting prolonged "silent" wandering, addressing both the accuracy loss from diffusion and the token waste from overthinking that are cited as challenges for training-free latent methods.

Positioning vs. baselines

The project compares against CoT with sampling, CoT greedy, and Soft Thinking, reporting a +2.17% average accuracy lift at unlimited budgets (Table 1) and consistent efficiency-per-token advantages under budget constraints. The visualized Pareto frontier shifts outward (either higher accuracy at the same budget or similar accuracy with fewer tokens) across different model families and scales. On AIME'24/'25, the Pass@k curves show that SwiReasoning reaches the performance ceiling with fewer samples than CoT, reflecting improved convergence behavior rather than only better raw ceilings.
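The paper describes this controller at the level of entropy trends and a capped switch count rather than code. Purely as an illustration, a minimal sketch of such a controller might look like the following; the EMA smoothing factor alpha and max_switch_count echo the flags mentioned in the editorial comments below, but their exact semantics here are assumptions, not the official implementation.

import math

def token_entropy(probs):
    # Shannon entropy of a single next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

class SwitchController:
    """Illustrative entropy-trend switch between latent and explicit reasoning (assumption-laden sketch)."""
    def __init__(self, alpha=0.9, max_switch_count=4):
        self.alpha = alpha                  # EMA smoothing for the entropy trend
        self.max_switch_count = max_switch_count
        self.ema = None
        self.mode = "explicit"              # start by emitting CoT tokens
        self.switches = 0

    def update(self, probs):
        h = token_entropy(probs)
        prev = self.ema if self.ema is not None else h
        self.ema = self.alpha * prev + (1 - self.alpha) * h
        rising = self.ema > prev            # entropy trending up = confidence falling
        want = "latent" if rising else "explicit"
        if want != self.mode and self.switches < self.max_switch_count:
            self.mode = want
            self.switches += 1
        return self.mode                    # caller emits tokens only while in "explicit" mode

A decoder would call update() once per step and simply suppress token emission (while continuing forward passes) whenever the controller reports "latent".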
Key Takeaways

- Training-free controller: SwiReasoning alternates between latent reasoning and explicit chain-of-thought using block-wise confidence derived from next-token entropy trends.
- Efficiency gains: reports +56–79% average token-efficiency improvements under constrained budgets versus CoT, with larger gains as budgets tighten.
- Accuracy lifts: achieves +1.5–2.8% average Pass@1 improvements on mathematics/STEM benchmarks at unlimited budgets.
- Faster convergence: on AIME 2024/2025, reaches maximum reasoning accuracy earlier than CoT (improved Pass@k dynamics).

Editorial Comments

SwiReasoning is a useful step toward pragmatic "reasoning policy" control at decode time: it is training-free, slots in behind the tokenizer, and shows measurable gains on math/STEM suites by toggling between latent and explicit CoT using an entropy-trend confidence signal with a capped switch count. The open-source BSD implementation and clear flags (--max_switch_count, --alpha) make replication straightforward and lower the barrier to stacking it with orthogonal efficiency layers (e.g., quantization, speculative decoding, KV-cache tricks). The method's value proposition is "accuracy per token" rather than raw SOTA accuracy, which is operationally important for budgeted inference and batching. Paper: https://arxiv.org/pdf/2510.05069


Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs

arXiv:2510.08825v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions — they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling complete SPARQL queries without knowing available relations proves brittle, retrieving large subgraphs introduces noise, and complex agent frameworks with parallel exploration exponentially expand search spaces. To address these limitations, we propose Search-on-Graph (SoG), a simple yet effective framework that enables LLMs to perform iterative informed graph navigation using a single, carefully designed Search function. Rather than pre-planning paths or retrieving large subgraphs, SoG follows an "observe-then-navigate" principle: at each step, the LLM examines actual available relations from the current entity before deciding on the next hop. This approach further adapts seamlessly to different KG schemas and handles high-degree nodes through adaptive filtering. Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning. We demonstrate particularly strong gains on Wikidata benchmarks (+16% improvement over previous best methods) alongside consistent improvements on Freebase benchmarks.
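The paper's actual Search interface and prompts are not reproduced in the abstract; the toy sketch below only illustrates the "observe-then-navigate" loop it describes, with a plain dictionary standing in for the knowledge graph and a rule-based callable standing in for the LLM, so every name here is hypothetical.

def search(kg, entity):
    # Observe step: list the relations actually available at `entity`, capping high-degree nodes.
    return sorted(kg.get(entity, {}))[:50]

def search_on_graph(question, start_entity, kg, choose_hop, max_hops=4):
    """Illustrative observe-then-navigate loop. kg maps entity -> {relation: target};
    choose_hop(question, entity, path, relations) stands in for the LLM and returns a relation or None."""
    entity, path = start_entity, []
    for _ in range(max_hops):
        relations = search(kg, entity)                      # observe before deciding the next hop
        relation = choose_hop(question, entity, path, relations)
        if relation is None or relation not in relations:   # stop once the evidence is judged sufficient
            break
        entity = kg[entity][relation]
        path.append((relation, entity))
    return path                                             # evidence path used to compose the final answer

# Toy usage with a hand-written graph and a fixed two-hop policy in place of the LLM:
toy_kg = {"Ada Lovelace": {"collaborator": "Charles Babbage", "field": "Mathematics"},
          "Charles Babbage": {"invention": "Analytical Engine"}}
hops = search_on_graph("What did Ada Lovelace's collaborator invent?", "Ada Lovelace", toy_kg,
                       choose_hop=lambda q, e, p, r: {0: "collaborator", 1: "invention"}.get(len(p)))
print(hops)  # [('collaborator', 'Charles Babbage'), ('invention', 'Analytical Engine')]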


Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

arXiv:2510.09032v1 Announce Type: new Abstract: As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature and validated by native speakers. Using this dataset, we fine-tune six encoder-based multilingual and regional transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, and IndicBERT) on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for the Chakma language, and we release our manually validated monolingual dataset to encourage further research on multilingual language modeling for low-resource languages.
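The abstract does not include training code; below is a minimal sketch of encoder MLM fine-tuning in the spirit it describes, using Hugging Face Transformers. The corpus file name and hyperparameters are placeholders, and XLM-RoBERTa is simply one of the encoders the paper lists.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "xlm-roberta-base"                     # one of the multilingual encoders evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical path to a Bangla-transliterated Chakma corpus, one sentence per line.
ds = load_dataset("text", data_files={"train": "chakma_transliterated.txt"})
tokenized = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                   batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, the standard MLM setup.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chakma-mlm", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()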


Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

arXiv:2510.08800v1 Announce Type: new Abstract: While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of the resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs' ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.


NLP-ADBench: NLP Anomaly Detection Benchmark

arXiv:2412.04784v2 Announce Type: replace Abstract: Anomaly detection (AD) is an important machine learning task with applications in fraud detection, content moderation, and user behavior analysis. However, AD is relatively understudied in a natural language processing (NLP) context, limiting its effectiveness in detecting harmful content, phishing attempts, and spam reviews. We introduce NLP-ADBench, the most comprehensive NLP anomaly detection (NLP-AD) benchmark to date, which includes eight curated datasets and 19 state-of-the-art algorithms. These span 3 end-to-end methods and 16 two-step approaches that adapt classical, non-AD methods to language embeddings from BERT and OpenAI. Our empirical results show that no single model dominates across all datasets, indicating a need for automated model selection. Moreover, two-step methods with transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings outperforming those of BERT. We release NLP-ADBench at https://github.com/USC-FORTIS/NLP-ADBench, providing a unified framework for NLP-AD and supporting future investigations.
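The benchmark's own adapters are not spelled out in the abstract; the sketch below only illustrates the two-step pattern it describes (language-model embeddings followed by a classical detector). The sentence-transformers encoder and scikit-learn's IsolationForest are illustrative choices, not necessarily the ones used in NLP-ADBench.

from sentence_transformers import SentenceTransformer
from sklearn.ensemble import IsolationForest

# Step 1: embed documents with a pretrained language model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
train_texts = ["the package arrived on time and works fine",
               "battery life is decent for the price"]          # assumed mostly-normal training data
test_texts = ["solid build quality, would buy again",
              "URGENT!!! click this link now to claim your prize"]
X_train = encoder.encode(train_texts)
X_test = encoder.encode(test_texts)

# Step 2: fit a classical, non-NLP anomaly detector on the embeddings.
detector = IsolationForest(contamination=0.1, random_state=0).fit(X_train)
scores = -detector.score_samples(X_test)   # higher score = more anomalous
for text, score in zip(test_texts, scores):
    print(f"{score:.3f}  {text}")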


SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

arXiv:2510.09541v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
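The abstract describes the estimator only at a high level. As a loose schematic (not the paper's exact bounds, weighting, or notation), the idea of sandwiching an intractable log-likelihood between a lower bound L (for example an ELBO) and an upper bound U inside a policy gradient can be written as:

\text{with } L_\theta(y \mid x) \le \log p_\theta(y \mid x) \le U_\theta(y \mid x), \qquad
\hat{\ell}_\theta(y \mid x) = \lambda\, L_\theta(y \mid x) + (1-\lambda)\, U_\theta(y \mid x), \quad \lambda \in [0, 1],

\nabla_\theta J(\theta) \approx \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ r(x, y)\, \nabla_\theta \hat{\ell}_\theta(y \mid x) \right].

Here lambda and the specific convex combination are assumptions made for illustration; the one-sided, ELBO-only case corresponds to lambda = 1, which is the kind of biased surrogate the paper argues against.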


Is vibe coding ruining a generation of engineers?

AI tools are revolutionizing software development by automating repetitive tasks, refactoring bloated code, and identifying bugs in real time. Developers can now generate well-structured code from plain language prompts, saving hours of manual effort. These tools learn from vast codebases, offering context-aware recommendations that enhance productivity and reduce errors. Rather than starting from scratch, engineers can prototype quickly, iterate faster, and focus on solving increasingly complex problems.

As code generation tools grow in popularity, they raise questions about the future size and structure of engineering teams. Earlier this year, Garry Tan, CEO of startup accelerator Y Combinator, noted that about one-quarter of its current clients use AI to write 95% or more of their software. In an interview with CNBC, Tan said: “What that means for founders is that you don’t need a team of 50 or 100 engineers, you don’t have to raise as much. The capital goes much longer.” AI-powered coding may offer a fast solution for businesses under budget pressure — but its long-term effects on the field and labor pool cannot be ignored.

As AI-powered coding rises, human expertise may diminish

In the era of AI, the traditional journey to coding expertise that has long supported senior developers may be at risk. Easy access to large language models (LLMs) enables junior coders to quickly identify issues in code. While this speeds up software development, it can distance developers from their own work, delaying the growth of core problem-solving skills. As a result, they may avoid the focused, sometimes uncomfortable hours required to build expertise and progress on the path to becoming successful senior developers.

Consider Anthropic’s Claude Code, a terminal-based assistant built on the Claude 3.7 Sonnet model, which automates bug detection and resolution, test creation, and code refactoring. Using natural language commands, it reduces repetitive manual work and boosts productivity. Microsoft has also released two open-source frameworks — AutoGen and Semantic Kernel — to support the development of agentic AI systems. AutoGen enables asynchronous messaging, modular components, and distributed agent collaboration to build complex workflows with minimal human input. Semantic Kernel is an SDK that integrates LLMs with languages like C#, Python, and Java, letting developers build AI agents to automate tasks and manage enterprise applications.

The increasing availability of these tools from Anthropic, Microsoft, and others may reduce opportunities for coders to refine and deepen their skills. Rather than “banging their heads against the wall” to debug a few lines or select a library to unlock new features, junior developers may simply turn to AI for an assist. This means senior coders with problem-solving skills honed over decades may become an endangered species. Overreliance on AI for writing code risks weakening developers’ hands-on experience and understanding of key programming concepts. Without regular practice, they may struggle to independently debug, optimize, or design systems. Ultimately, this erosion of skill can undermine critical thinking, creativity, and adaptability — qualities that are essential not just for coding, but for assessing the quality and logic of AI-generated solutions.

AI as mentor: Turning code automation into hands-on learning

While concerns about AI diminishing human developer skills are valid, businesses shouldn’t dismiss AI-supported coding.
They just need to think carefully about when and how to deploy AI tools in development. These tools can be more than productivity boosters; they can act as interactive mentors, guiding coders in real time with explanations, alternatives, and best practices. When used as a training tool, AI can reinforce learning by showing coders why code is broken and how to fix it — rather than simply applying a solution. For example, a junior developer using Claude Code might receive immediate feedback on inefficient syntax or logic errors, along with suggestions linked to detailed explanations. This enables active learning, not passive correction. It’s a win-win: accelerating project timelines without doing all the work for junior coders.

Additionally, coding frameworks can support experimentation by letting developers prototype agent workflows or integrate LLMs without needing expert-level knowledge upfront. By observing how AI builds and refines code, junior developers who actively engage with these tools can internalize patterns, architectural decisions, and debugging strategies — mirroring the traditional learning process of trial and error, code reviews, and mentorship.

However, AI coding assistants shouldn’t replace real mentorship or pair programming. Pull requests and formal code reviews remain essential for guiding newer, less experienced team members. We are nowhere near the point at which AI can single-handedly upskill a junior developer. Companies and educators can build structured development programs around these tools that emphasize code comprehension, encourage coders to question AI outputs, and require manual refactoring exercises, ensuring AI is used as a training partner rather than a crutch. In this way, AI becomes less of a replacement for human ingenuity and more of a catalyst for accelerated, experiential learning.

Bridging the gap between automation and education

When utilized with intention, AI doesn’t just write code; it teaches coding, blending automation with education to prepare developers for a future where deep understanding and adaptability remain indispensable. By embracing AI as a mentor, a programming partner, and a team of developers we can direct at the problem at hand, we can bridge the gap between effective automation and education. We can empower developers to grow alongside the tools they use. We can ensure that, as AI evolves, so too does the human skill set, fostering a generation of coders who are both efficient and deeply knowledgeable.

Richard Sonnenblick is chief data scientist at Planview.


Meet OpenTSLM: A Family of Time-Series Language Models (TSLMs) Revolutionizing Medical Time-Series Analysis

A significant development is set to transform AI in healthcare. Researchers at Stanford University, in collaboration with ETH Zurich and tech leaders including Google Research and Amazon, have introduced OpenTSLM, a novel family of Time-Series Language Models (TSLMs). This breakthrough addresses a critical limitation in current LLMs by enabling them to interpret and reason over complex, continuous medical time-series data, such as ECGs, EEGs, and wearable sensor streams, a feat where even frontier models like GPT-4o have struggled.

The Critical Blind Spot: LLM Limitations in Time-Series Analysis

Medicine is fundamentally temporal. Accurate diagnosis relies heavily on tracking how vital signs, biomarkers, and complex signals evolve. Despite the proliferation of digital health technology, today’s most advanced AI models have struggled to process this raw, continuous data. The core challenge lies in the “modality gap”: the difference between continuous signals (like a heartbeat) and the discrete text tokens that LLMs understand. Previous attempts to bridge this gap by converting signals into text have proven inefficient and difficult to scale.

Why Vision-Language Models (VLMs) Fail at Time-Series Data

A common workaround has been to convert time-series data into static images (line plots) and input them into advanced Vision-Language Models (VLMs). However, the OpenTSLM research demonstrates this approach is surprisingly ineffective for precise medical data analysis. VLMs are primarily trained on natural photographs; they recognize objects and scenes, not the dense, sequential dynamics of data visualizations. When high-frequency signals like an ECG are rendered into pixels, crucial fine-grained information is lost. Subtle temporal dependencies and high-frequency changes, vital for identifying heart arrhythmias or specific sleep stages, become obscured. The study confirms that VLMs struggle significantly when analyzing these plots, highlighting that time series must be treated as a distinct data modality, not merely a picture.

Introducing OpenTSLM: A Native Modality Approach

OpenTSLM integrates time series as a native modality directly into pretrained LLMs (such as Llama and Gemma), enabling natural language querying and reasoning over complex health data (paper: https://www.arxiv.org/abs/2510.02410). The research team explored two distinct architectures.

Architecture Deep Dive: SoftPrompt vs. Flamingo

1. OpenTSLM-SoftPrompt (Implicit Modeling): This approach encodes time-series data into learnable tokens, which are then combined with text tokens (soft prompting). While efficient for short data bursts, this method scales poorly: longer sequences require exponentially more memory, making it impractical for comprehensive analysis.

2. OpenTSLM-Flamingo (Explicit Modeling): Inspired by the Flamingo architecture, this is the breakthrough solution for scalability. It explicitly models time series as a separate modality, using a specialized encoder and a Perceiver Resampler to create a fixed-size representation of the data, regardless of its length, and fusing it with text through gated cross-attention.

OpenTSLM-Flamingo maintains stable memory requirements even with extensive data streams. For instance, during training on complex ECG data analysis, the Flamingo variant required only 40 GB of VRAM, compared to 110 GB for the SoftPrompt variant using the same LLM backbone.
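The OpenTSLM code is released separately; purely to illustrate the soft-prompt idea described above (not the project's actual modules, encoder, or dimensions), a PyTorch sketch of fusing time-series pseudo-tokens with text embeddings could look like this:

import torch
import torch.nn as nn

class TimeSeriesSoftPrompt(nn.Module):
    """Sketch: encode a raw 1D signal into k pseudo-token embeddings that are prepended
    to the text embeddings of an LLM (all sizes here are illustrative assumptions)."""
    def __init__(self, d_model=2048, k_tokens=16):
        super().__init__()
        self.encoder = nn.Sequential(            # simple 1D-conv encoder over the raw signal
            nn.Conv1d(1, 64, kernel_size=7, stride=2, padding=3), nn.GELU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.AdaptiveAvgPool1d(k_tokens),      # collapse to a fixed number of positions
        )
        self.proj = nn.Linear(128, d_model)      # map encoder channels to the LLM embedding size

    def forward(self, signal, text_embeds):
        # signal: (batch, length); text_embeds: (batch, seq_len, d_model) from the LLM's embedding layer
        feats = self.encoder(signal.unsqueeze(1))           # (batch, 128, k_tokens)
        ts_tokens = self.proj(feats.transpose(1, 2))        # (batch, k_tokens, d_model)
        return torch.cat([ts_tokens, text_embeds], dim=1)   # prepend time-series tokens to the prompt

The fused sequence would then be passed to the language model via its input embeddings; the Flamingo-style variant instead keeps the time-series representation separate and attends to it with gated cross-attention, which is what keeps memory flat as the signal grows.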
Performance Breakthroughs: Outperforming GPT-4o

The results demonstrate the clear superiority of the specialized TSLM approach. To benchmark performance, the team created three new Chain-of-Thought (CoT) datasets focused on medical reasoning: HAR-CoT (activity recognition), Sleep-CoT (EEG sleep staging), and ECG-QA-CoT (ECG question answering).

- Sleep staging: OpenTSLM achieved a 69.9% F1 score, vastly outperforming the best fine-tuned text-only baseline (9.05%).
- Activity recognition: OpenTSLM reached a 65.4% F1 score.

The paper includes worked examples of human activity recognition and sleep-stage CoT rationales (see the figures at https://www.arxiv.org/abs/2510.02410).

Remarkably, even small-scale OpenTSLM models (1 billion parameters) significantly surpassed GPT-4o. Whether processing the data as text tokens (where GPT-4o scored only 15.47% on Sleep-CoT) or as images, the frontier model failed to match the specialized TSLMs. This finding underscores that specialized, domain-adapted AI architectures can achieve superior results without massive scale, paving the way for efficient, on-device medical AI deployment.

Clinical Validation at Stanford Hospital: Ensuring Trust and Transparency

A crucial element of medical AI is trust. Unlike traditional models that output a single classification, OpenTSLM generates human-readable rationales (Chain-of-Thought), explaining its predictions. This transparency is vital for clinical settings. To validate the quality of this reasoning, an expert review was conducted with five cardiologists from Stanford Hospital, who assessed the rationales generated by the OpenTSLM-Flamingo model for ECG interpretation. The evaluation found that the model provided a correct or partially correct ECG interpretation in an impressive 92.9% of cases, and it showed exceptional strength in integrating clinical context (85.1% positive assessments), demonstrating sophisticated reasoning over raw sensor data.

The Future of Multimodal Machine Learning

The introduction of OpenTSLM marks a significant advancement in multimodal machine learning. By effectively bridging the gap between LLMs and time-series data, this research lays the foundation for general-purpose TSLMs capable of handling diverse longitudinal data, not just in healthcare, but also in finance, industrial monitoring, and beyond. To accelerate innovation in the field, the Stanford and ETH Zurich teams have open-sourced all code, datasets, and trained model weights.


A Coding Guide to Master Self-Supervised Learning with Lightly AI for Efficient Data Curation and Active Learning

In this tutorial, we explore the power of self-supervised learning using the Lightly AI framework. We begin by building a SimCLR model to learn meaningful image representations without labels, then generate and visualize embeddings using UMAP and t-SNE. We then dive into coreset selection techniques to curate data intelligently, simulate an active learning workflow, and finally assess the benefits of transfer learning through a linear probe evaluation. Throughout this hands-on guide, we work step by step in Google Colab, training, visualizing, and comparing coreset-based and random sampling to understand how self-supervised learning can significantly improve data efficiency and model performance.

!pip uninstall -y numpy
!pip install numpy==1.26.4
!pip install -q lightly torch torchvision matplotlib scikit-learn umap-learn

import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader, Subset
from torchvision import transforms
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
import umap

from lightly.loss import NTXentLoss
from lightly.models.modules import SimCLRProjectionHead
from lightly.transforms import SimCLRTransform
from lightly.data import LightlyDataset

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

We begin by setting up the environment, ensuring compatibility by fixing the NumPy version and installing essential libraries like Lightly, PyTorch, and UMAP. We then import all necessary modules for building, training, and visualizing our self-supervised learning model, confirming that PyTorch and CUDA are ready for GPU acceleration.

class SimCLRModel(nn.Module):
    """SimCLR model with ResNet backbone"""

    def __init__(self, backbone, hidden_dim=512, out_dim=128):
        super().__init__()
        self.backbone = backbone
        self.backbone.fc = nn.Identity()
        self.projection_head = SimCLRProjectionHead(
            input_dim=512, hidden_dim=hidden_dim, output_dim=out_dim
        )

    def forward(self, x):
        features = self.backbone(x).flatten(start_dim=1)
        z = self.projection_head(features)
        return z

    def extract_features(self, x):
        """Extract backbone features without projection"""
        with torch.no_grad():
            return self.backbone(x).flatten(start_dim=1)

We define our SimCLRModel, which uses a ResNet backbone to learn visual representations without labels. We remove the classification head and add a projection head to map features into a contrastive embedding space. The model's extract_features method allows us to obtain raw feature embeddings directly from the backbone for downstream analysis.
def load_dataset(train=True):
    """Load CIFAR-10 dataset"""
    ssl_transform = SimCLRTransform(input_size=32, cj_prob=0.8)
    eval_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
    ])
    base_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=train, download=True
    )

    class SSLDataset(torch.utils.data.Dataset):
        def __init__(self, dataset, transform):
            self.dataset = dataset
            self.transform = transform

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, idx):
            img, label = self.dataset[idx]
            return self.transform(img), label

    ssl_dataset = SSLDataset(base_dataset, ssl_transform)
    eval_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=train, download=True, transform=eval_transform
    )
    return ssl_dataset, eval_dataset

In this step, we load the CIFAR-10 dataset and apply separate transformations for self-supervised and evaluation phases. We create a custom SSLDataset class that generates multiple augmented views of each image for contrastive learning, while the evaluation dataset uses normalized images for downstream tasks. This setup helps the model learn robust representations invariant to visual changes.

def train_ssl_model(model, dataloader, epochs=5, device='cuda'):
    """Train SimCLR model"""
    model.to(device)
    criterion = NTXentLoss(temperature=0.5)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.06,
                                momentum=0.9, weight_decay=5e-4)

    print("\n=== Self-Supervised Training ===")
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch_idx, batch in enumerate(dataloader):
            views = batch[0]
            view1, view2 = views[0].to(device), views[1].to(device)
            z1 = model(view1)
            z2 = model(view2)
            loss = criterion(z1, z2)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            if batch_idx % 50 == 0:
                print(f"Epoch {epoch+1}/{epochs} | Batch {batch_idx} | Loss: {loss.item():.4f}")
        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1} Complete | Avg Loss: {avg_loss:.4f}")
    return model

Here, we train our SimCLR model in a self-supervised manner using the NT-Xent contrastive loss, which encourages similar representations for augmented views of the same image. We optimize the model with stochastic gradient descent (SGD) and track the loss across epochs to monitor learning progress. This stage teaches the model to extract meaningful visual features without relying on labeled data.
def generate_embeddings(model, dataset, device='cuda', batch_size=256):
    """Generate embeddings for the entire dataset"""
    model.eval()
    model.to(device)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=2)
    embeddings = []
    labels = []
    print("\n=== Generating Embeddings ===")
    with torch.no_grad():
        for images, targets in dataloader:
            images = images.to(device)
            features = model.extract_features(images)
            embeddings.append(features.cpu().numpy())
            labels.append(targets.numpy())
    embeddings = np.vstack(embeddings)
    labels = np.concatenate(labels)
    print(f"Generated {embeddings.shape[0]} embeddings with dimension {embeddings.shape[1]}")
    return embeddings, labels


def visualize_embeddings(embeddings, labels, method='umap', n_samples=5000):
    """Visualize embeddings using UMAP or t-SNE"""
    print(f"\n=== Visualizing Embeddings with {method.upper()} ===")
    if len(embeddings) > n_samples:
        indices = np.random.choice(len(embeddings), n_samples, replace=False)
        embeddings = embeddings[indices]
        labels = labels[indices]
    if method == 'umap':
        reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
    else:
        reducer = TSNE(n_components=2, perplexity=30, metric='cosine')
    embeddings_2d = reducer.fit_transform(embeddings)
    plt.figure(figsize=(12, 10))
    scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
                          c=labels, cmap='tab10', s=5, alpha=0.6)
    plt.colorbar(scatter)
    plt.title(f'CIFAR-10 Embeddings ({method.upper()})')
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.tight_layout()
    plt.savefig(f'embeddings_{method}.png', dpi=150)
    print(f"Saved visualization to embeddings_{method}.png")
    plt.show()


def select_coreset(embeddings, labels, budget=1000, method='diversity'):
    """
    Select a coreset using different strategies:
    - diversity: maximum diversity using k-center greedy
    - balanced: class-balanced selection
    """
    print(f"\n=== Coreset Selection ({method}) ===")
    if method == 'balanced':
        selected_indices = []
        n_classes = len(np.unique(labels))
        per_class = budget // n_classes
        for cls in range(n_classes):
            cls_indices = np.where(labels == cls)[0]
            selected = np.random.choice(cls_indices, min(per_class, len(cls_indices)), replace=False)
            selected_indices.extend(selected)
        return np.array(selected_indices)
    elif method == 'diversity':
        selected_indices = []
        remaining_indices = set(range(len(embeddings)))
        first_idx = np.random.randint(len(embeddings))
        selected_indices.append(first_idx)
        remaining_indices.remove(first_idx)
        for _ in range(budget - 1):
            if not remaining_indices:
                break
            remaining = list(remaining_indices)
            selected_emb = embeddings[selected_indices]
            remaining_emb = embeddings[remaining]
            distances = np.min(
                np.linalg.norm(remaining_emb[:, None] - selected_emb, axis=2),
                axis=1
            )
            max_dist_idx = np.argmax(distances)
            selected_idx = remaining[max_dist_idx]
            selected_indices.append(selected_idx)
            remaining_indices.remove(selected_idx)
        print(f"Selected {len(selected_indices)} samples")
        return np.array(selected_indices)

We extract high-quality feature embeddings from our trained backbone, cache them with labels, and project them to 2D using UMAP or t-SNE to visually see the cluster structure emerge. Next, we curate data using a coreset selector, either class-balanced or diversity-driven (k-center greedy), to prioritize the most informative, non-redundant samples for downstream training. This pipeline helps us both see what the model learns and select what matters most.
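The tutorial's full driver cell is not included in this excerpt; as a minimal sketch of how the functions defined so far could be wired together (batch sizes, epochs, and the coreset budget are illustrative choices, not the tutorial's exact settings):

# Minimal driver sketch (assumed wiring; hyperparameters are illustrative).
device = 'cuda' if torch.cuda.is_available() else 'cpu'

ssl_dataset, eval_dataset = load_dataset(train=True)
ssl_loader = DataLoader(ssl_dataset, batch_size=256, shuffle=True, num_workers=2, drop_last=True)

backbone = torchvision.models.resnet18(weights=None)
model = SimCLRModel(backbone)
model = train_ssl_model(model, ssl_loader, epochs=5, device=device)

embeddings, labels = generate_embeddings(model, eval_dataset, device=device)
visualize_embeddings(embeddings, labels, method='umap')

coreset_idx = select_coreset(embeddings, labels, budget=1000, method='diversity')
coreset_subset = Subset(eval_dataset, coreset_idx.tolist())
print(f"Coreset size: {len(coreset_subset)}")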
def evaluate_linear_probe(model, train_subset, test_dataset, device='cuda'):
    """Train linear classifier on frozen features"""
    model.eval()
    train_loader = DataLoader(train_subset, batch_size=128, shuffle=True, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False, num_workers=2)
