This content originally appeared on Level Up Coding – Medium and was authored by Harish Siva Subramanian
When we think of deep learning breakthroughs, the buzzwords that come to mind are usually transformers, large language models (LLMs), or maybe diffusion models.
But there’s one powerful technique that rarely makes the headlines — yet it could significantly improve your models, save compute, and even help you train with less data.
That technique is: Self-Supervised Learning (SSL).
Why Self-Supervised Learning Matters
Traditional supervised learning depends heavily on labeled data. And as every practitioner knows, labeling data is painful, expensive, and often biased. Imagine trying to label millions of medical images or transcribe thousands of hours of audio.
Self-supervised learning bypasses this bottleneck. Instead of relying on manual labels, it creates labels from the data itself.
Your model learns representations by predicting missing parts, contrasting different samples, or reconstructing corrupted inputs.
In simple words: Your data teaches itself.
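To make "the data teaches itself" concrete, here is a minimal sketch of how a masked-prediction training pair could be derived from raw text with no human labels. The `make_masked_example` helper, the `[MASK]` placeholder, and the 30% masking rate are illustrative assumptions, not part of any particular library.

```python
import random

def make_masked_example(tokens, mask_prob=0.3, mask_token="[MASK]", seed=0):
    """Turn raw tokens into (masked_input, targets) without human labels.

    Hypothetical helper for illustration: `targets` maps each masked
    position to the original token the model would be trained to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the "label" comes from the data itself
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the warm mat".split()
masked, targets = make_masked_example(tokens)
print(masked)   # same sentence with some positions hidden
print(targets)  # position -> original token
```

The same idea carries over to images, where the hidden parts are patches instead of words.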
How Self-Supervised Learning Works
Here are three popular approaches:
1. Contrastive Learning
- Compare two augmented versions of the same image (positive pair) against other images (negative pairs).
- Example: SimCLR, MoCo.
2. Masked Prediction
- Hide parts of the input and train the model to guess what’s missing.
- Example: BERT (masked words), MAE (masked image patches).
3. Generative Pretraining
- Train a model to generate the input itself.
- Example: GPT series (predicting next word).
Each of these approaches builds rich, general-purpose representations that can then be fine-tuned for your specific task — often outperforming models trained from scratch with labeled data.
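Generative pretraining gets its supervision in a similar way: every prefix of a sequence is an input, and the token that follows it is the target. A small sketch, where `next_token_pairs` is a hypothetical helper written only for this illustration:

```python
def next_token_pairs(tokens):
    """Build (context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs(["self", "supervised", "learning", "works"])
for context, target in pairs:
    print(context, "->", target)
```

A four-token sentence yields three training examples, all produced from the raw text alone.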
Why Aren’t More People Using It?
Despite its power, SSL is still underused outside of research labs and big tech. Why?
- Tooling gap: Many tutorials focus on supervised methods.
- Perception: People think SSL is “only for big data.”
- Awareness: Most practitioners hear about GPT and BERT but don’t realize they can apply SSL principles to their own smaller projects.
But here’s the secret: SSL isn’t just for billion-parameter models. You can use it for:
- Building better embeddings for recommendation systems.
- Improving anomaly detection in manufacturing.
- Boosting performance in domains with little labeled data (healthcare, finance, etc.).
How You Can Start Today
If you want to bring SSL into your workflow, here are some accessible entry points:
- Hugging Face Datasets + Transformers → Experiment with masked prediction tasks.
- PyTorch Lightning Bolts → Ready-to-use implementations of SimCLR, BYOL, and more.
- Scikit-learn + Contrastive Learning → Use smaller-scale contrastive techniques for tabular data.
A good starting project: Take your current supervised model, pretrain it with an SSL objective on your raw dataset, then fine-tune with your limited labels. You’ll often see a jump in performance.
Real-World Example: Binary Image Classification with SSL
Let’s say you want to build a classifier that predicts whether an image is:
- 0 → Cat
- 1 → Dog
But you only have a few labeled cat/dog images and a huge pile of unlabeled pet photos.
Here’s how SSL helps:
Step 1. Pretraining with SSL (Unlabeled Data)
- Take all the unlabeled pet photos.
- Use an SSL method like Contrastive Learning:
- Generate two random augmentations of the same photo (e.g., crop, rotate, color shift).
- Train the model to recognize that these two are from the same original image (positive pair) while distinguishing them from other images (negative pairs).
- Result: The model learns a general representation of animal features (fur texture, ears, eyes, body shape, etc.).
Step 2. Fine-Tuning (Labeled Data)
- Now take your small labeled dataset (say, 500 cat images + 500 dog images).
- Add a binary classification head (logistic regression or small fully connected layer) on top of the pretrained encoder.
- Fine-tune it to distinguish cats vs. dogs.
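A more conservative variant of this step, often called a linear probe, freezes the pretrained encoder and trains only the new head. The sketch below assumes a tiny stand-in encoder and random tensors in place of real images; both are hypothetical and exist only to keep the example short and runnable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pretrained encoder (hypothetical; a real one comes from Step 1).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad = False  # freeze: only the head will learn
encoder.eval()

head = nn.Linear(16, 2)  # binary head: 0 = cat, 1 = dog
optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# Tiny fake labeled batch standing in for the labeled cat/dog images.
images = torch.randn(20, 3, 8, 8)
labels = torch.randint(0, 2, (20,))

before = [p.clone() for p in encoder.parameters()]
for _ in range(50):
    with torch.no_grad():
        features = encoder(images)  # fixed features from the frozen encoder
    loss = criterion(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The encoder weights are untouched; only the head was updated.
encoder_unchanged = all(
    torch.equal(a, b) for a, b in zip(before, encoder.parameters())
)
print("encoder unchanged:", encoder_unchanged)
```

Freezing is cheaper and harder to overfit with very few labels, while updating the whole network (as the full code later in this article does) can reach higher accuracy when there is enough labeled data.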
Why This Works Better Than Supervised Alone
- If you trained on just 1,000 labeled images from scratch → your model may overfit and struggle.
- With SSL → the model already knows what animals look like, even without labels. Fine-tuning is faster, needs less data, and performs better.
So in practice, self-supervised learning lets you leverage huge amounts of raw images (or text, or audio) without paying the price of manual labeling — then transfer that knowledge to your small labeled dataset.
Let’s code it up.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T
import torchvision.datasets as datasets
import torchvision.models as models
from torch.utils.data import DataLoader, Dataset
import random
# -----------------------------
# 1. Data Augmentation for SSL
# -----------------------------
ssl_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

class ContrastiveDataset(Dataset):
    def __init__(self, dataset, transform):
        self.dataset = dataset
        self.transform = transform

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        img, _ = self.dataset[idx]  # ignore labels for SSL
        im1 = self.transform(img)
        im2 = self.transform(img)
        return im1, im2
- Creates two different random views of the same image using transformations such as crop, flip, color jitter, and grayscale.
- This step is critical for contrastive learning, as the model must learn that both augmented versions represent the same image.
Then,
- Constructs a custom dataset that, for each image, returns a pair of augmented versions (im1, im2).
- Labels are ignored because self-supervised learning does not require them.
- Enables the model to learn meaningful representations directly from the data without explicit supervision.
# -----------------------------
# 2. Simple Encoder (ResNet18)
# -----------------------------
class Encoder(nn.Module):
    def __init__(self, base_model=models.resnet18, out_dim=128):
        super().__init__()
        # weights=None means random initialization (pretrained=False is deprecated)
        self.backbone = base_model(weights=None)
        in_features = self.backbone.fc.in_features
        self.backbone.fc = nn.Identity()  # drop the classification head
        self.projection = nn.Sequential(
            nn.Linear(in_features, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim)
        )

    def forward(self, x):
        h = self.backbone(x)
        z = self.projection(h)
        return F.normalize(z, dim=1)  # normalized embeddings
Then,
- Uses ResNet18 (without its classification head).
- Adds a projection head (small MLP) that maps embeddings into a latent space for contrastive learning.
- Normalizes embeddings so cosine similarity works well.
Think of this as the “brain” that learns useful features from images.
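The normalization step is what lets the loss use a plain matrix product as its similarity measure: once every embedding has unit length, the dot product of two embeddings equals their cosine similarity. A small sketch with random vectors standing in for real projections:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
raw = torch.randn(4, 128) * 5.0   # unnormalized projections (random stand-ins)
z = F.normalize(raw, dim=1)       # same operation as in the encoder's forward()

norms = z.norm(dim=1)             # every row now has length 1
dot = z @ z.t()                   # pairwise dot products of normalized rows
cos = F.cosine_similarity(raw.unsqueeze(1), raw.unsqueeze(0), dim=2)

print(norms)
print(torch.allclose(dot, cos, atol=1e-5))
```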
# -----------------------------
# 3. Contrastive Loss (NT-Xent)
# -----------------------------
def nt_xent_loss(z1, z2, temperature=0.5):
    N = z1.size(0)
    z = torch.cat([z1, z2], dim=0)  # (2N, d)
    sim = torch.mm(z, z.t()) / temperature  # (2N, 2N) pairwise similarities
    # mask self-comparisons so a sample is never its own candidate
    mask = torch.eye(2 * N, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))
    # the positive for row i is its other view: i+N in the first half, i-N in the second
    # (this keeps each positive out of the negatives, where it was counted a second time)
    labels = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)]).to(z.device)
    loss = F.cross_entropy(sim, labels)
    return loss
For the loss function, we use NT-Xent (normalized temperature-scaled cross-entropy), a contrastive loss. It:
- Computes similarity between embeddings.
- Positives: (z1, z2) from the same image.
- Negatives: embeddings from different images.
- Encourages the model to bring positives closer and push negatives apart in feature space.
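One way to see this behavior is to compare the loss when the two views agree with the loss when they are unrelated. The sketch below uses a compact NT-Xent written for this check (the name `nt_xent` and the random embeddings are assumptions); it should report a lower loss for the aligned case.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Compact NT-Xent: each row's target is the index of its other view."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                       # (2n, d)
    sim = z @ z.t() / temperature                        # (2n, 2n)
    self_mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))      # never match yourself
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

torch.manual_seed(0)
z1 = F.normalize(torch.randn(8, 32), dim=1)
z_other = F.normalize(torch.randn(8, 32), dim=1)

aligned = nt_xent(z1, z1.clone())   # both views map to the same embedding
unrelated = nt_xent(z1, z_other)    # the "views" share nothing

print(f"aligned: {aligned.item():.3f}, unrelated: {unrelated.item():.3f}")
```

Training pushes the encoder from the second situation toward the first, using only the fact that two views came from the same image.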
# -----------------------------
# 4. Pretraining Loop
# -----------------------------
device = "cuda" if torch.cuda.is_available() else "cpu"

# Example dataset: Cats vs Dogs (placeholder path; ImageFolder expects the images inside at least one subfolder)
# No transform here: ContrastiveDataset applies ssl_transform to the raw PIL images
unlabeled_data = datasets.ImageFolder("data/cats_vs_dogs/unlabeled")
contrastive_data = ContrastiveDataset(unlabeled_data, ssl_transform)
ssl_loader = DataLoader(contrastive_data, batch_size=64, shuffle=True, num_workers=4)

encoder = Encoder().to(device)
optimizer = torch.optim.Adam(encoder.parameters(), lr=3e-4)

print("Starting self-supervised pretraining...")
for epoch in range(5):  # keep short for demo
    for (im1, im2) in ssl_loader:
        im1, im2 = im1.to(device), im2.to(device)
        z1, z2 = encoder(im1), encoder(im2)
        loss = nt_xent_loss(z1, z2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # reports the loss of the last batch in the epoch
    print(f"Epoch {epoch+1}, SSL Loss: {loss.item():.4f}")

torch.save(encoder.state_dict(), "encoder_ssl.pth")
In the pretraining loop,
- Runs SSL training on the unlabeled dataset.
- Each image pair is encoded → embeddings compared with contrastive loss.
- The encoder learns general animal features (fur, ears, eyes, shapes) without needing cat/dog labels.
- Saves the pretrained encoder weights (encoder_ssl.pth).
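Saved weights can be restored later, for example in a separate fine-tuning script, by loading the state dict into a freshly constructed model of the same architecture. The sketch below shows the round trip with a tiny stand-in network and an in-memory buffer in place of the checkpoint file; both are assumptions made so the example runs without a trained encoder or disk access.

```python
import io
import torch
import torch.nn as nn

def make_backbone():
    # Hypothetical tiny stand-in for the ResNet18-based encoder.
    return nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

torch.manual_seed(0)
pretrained = make_backbone()

buffer = io.BytesIO()                        # stands in for the .pth file
torch.save(pretrained.state_dict(), buffer)
buffer.seek(0)

restored = make_backbone()                   # fresh, randomly initialized copy
restored.load_state_dict(torch.load(buffer, map_location="cpu"))

x = torch.randn(3, 8)
same_output = torch.equal(pretrained(x), restored(x))
print("restored model matches:", same_output)
```

With a real checkpoint, the buffer would be replaced by the file path, and `make_backbone()` by the same encoder class used during pretraining.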
# -----------------------------
# 5. Fine-Tuning for Binary Classification
# -----------------------------
class FineTuneModel(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder.backbone  # use backbone only; the projection head is discarded
        # backbone.fc is now nn.Identity, so the backbone outputs ResNet18's 512-dim features
        in_features = 512
        self.fc = nn.Linear(in_features, 2)  # binary classification

    def forward(self, x):
        h = self.encoder(x)
        return self.fc(h)

# Load small labeled dataset (cats vs dogs)
transform_supervised = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor()
])
labeled_data = datasets.ImageFolder("data/cats_vs_dogs/labeled", transform=transform_supervised)
labeled_loader = DataLoader(labeled_data, batch_size=32, shuffle=True)

finetune_model = FineTuneModel(encoder).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(finetune_model.parameters(), lr=1e-4)

print("Starting fine-tuning...")
for epoch in range(5):
    for imgs, labels in labeled_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        preds = finetune_model(imgs)
        loss = criterion(preds, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # reports the loss of the last batch in the epoch
    print(f"Epoch {epoch+1}, Fine-tune Loss: {loss.item():.4f}")

torch.save(finetune_model.state_dict(), "catdog_classifier.pth")
print("Training complete ✅")
Finally,
- Reuses the pretrained backbone from the SSL stage (the projection head is dropped).
- Adds a new classifier head (Linear → 2 classes).
- Now trains with a small labeled dataset (cats=0, dogs=1).
- Uses cross entropy loss for binary classification.
- Optimizes both encoder + classification head (with smaller learning rate).
- Saves the final classifier model (catdog_classifier.pth).
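Once the classifier is trained, using it comes down to a forward pass in evaluation mode followed by a softmax. The sketch below is a minimal illustration: `predict`, `CLASS_NAMES`, and the toy linear model are hypothetical stand-ins for the fine-tuned network and real images.

```python
import torch
import torch.nn as nn

CLASS_NAMES = ["cat", "dog"]  # assumed mapping: 0 -> cat, 1 -> dog

def predict(model, images):
    """Return a (label, confidence) tuple per image for a 2-class classifier."""
    model.eval()                  # inference mode for dropout / batch norm
    with torch.no_grad():         # gradients are not needed for prediction
        probs = torch.softmax(model(images), dim=1)
    conf, idx = probs.max(dim=1)
    return [(CLASS_NAMES[i], c) for i, c in zip(idx.tolist(), conf.tolist())]

# Hypothetical stand-in for the fine-tuned classifier and a batch of images.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))
results = predict(toy_model, torch.randn(4, 3, 8, 8))

for label, confidence in results:
    print(f"{label}: {confidence:.2f}")
```

`ImageFolder` assigns class indices by sorting the subfolder names, so the 0 = cat, 1 = dog mapping holds only if the labeled folders sort in that order (for example `cat/` and `dog/`).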
The Big Picture
Self-supervised learning is quietly powering some of the biggest leaps in AI today. OpenAI’s GPT models? Built on self-supervised pretraining. Vision Transformers pretrained with masked autoencoding (MAE)? Same story.
But the real opportunity lies in applying SSL where it’s least expected — your niche dataset, your unique problem domain.
If you’re not experimenting with it yet, you’re leaving performance, efficiency, and innovation on the table.
Final Thoughts
The next wave of AI will not just be about bigger models. It will be about smarter ways to learn from data. And self-supervised learning is the bridge.
If this article gave you a new perspective, hit the clap button (hold it down to give more claps!) and follow me for more practical deep learning insights.
Together, let’s make sure the underrated techniques get the spotlight they deserve.
Forget Labeled Data: How AI Is Learning on Its Own was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.