From Pixels to Predictions: How Tabular Models Can Rival CNNs on Image Tasks



This content originally appeared on Level Up Coding – Medium and was authored by Harish Siva Subramanian


Deep learning has become the go-to hammer for virtually every image-related problem. Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and pretrained backbones dominate Kaggle competitions and set industry benchmarks. From medical imaging to retail analytics, it seems like every modern solution leans on deep architectures.

But here’s an interesting secret: sometimes, simpler models — tabular models like XGBoost, LightGBM, or Random Forests — can actually outperform deep networks on image tasks, especially when data is limited or compute resources are constrained.

At first glance, this seems counterintuitive.

Tabular models were designed for structured datasets — think customer transactions, sensor readings, or demographic data — not pixel grids. So how can they handle images effectively?

The trick lies in feature extraction. By transforming images into meaningful feature vectors — either via handcrafted descriptors or pretrained deep networks — you convert the unstructured image data into a structured, tabular form.

Each image becomes a single row, and each feature becomes a column, ready for classic machine learning algorithms.
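To make the "one row per image" idea concrete, here is a minimal sketch that uses a handcrafted descriptor (a per-channel color histogram) instead of a pretrained network. The random images, the `color_histogram_features` helper, and the bin count are assumptions made for illustration; they are not part of the walkthrough that follows.

```python
import numpy as np

def color_histogram_features(images, bins=4):
    """Turn a batch of RGB images into one feature row per image.

    images: uint8 array of shape [n_images, height, width, 3]
    returns: float array of shape [n_images, 3 * bins]
    """
    rows = []
    for img in images:
        channel_hists = []
        for c in range(3):
            # Count how many pixels of this channel fall into each intensity bin
            hist, _ = np.histogram(img[..., c], bins=bins, range=(0, 256))
            # Normalize so the descriptor does not depend on image size
            channel_hists.append(hist / hist.sum())
        rows.append(np.concatenate(channel_hists))
    return np.stack(rows)

# Hypothetical stand-in data: 5 random 32x32 RGB images
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 32, 32, 3), dtype=np.uint8)

table = color_histogram_features(images)
print(table.shape)  # (5, 12): 5 rows (images), 12 columns (features)
```

Each row of `table` could be handed to a tree-based model as-is; a pretrained network simply replaces the histogram with a richer, learned descriptor.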

This approach has several advantages:

  • Small datasets: CNNs often overfit when you have only a few hundred or thousand images. Tabular models, trained on well-extracted features, can generalize more effectively.
  • Lower compute requirements: Training a CNN can take hours or days on a GPU, whereas a Random Forest or XGBoost model can fit in minutes on a standard laptop.
  • Explainability: Tabular models provide feature importance metrics, allowing you to understand which aspects of the image embeddings drive predictions.
  • Multi-modal fusion: If your problem involves combining image data with structured information (e.g., patient age, income, or transaction history), tabular models make it straightforward to integrate these features.
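The multi-modal point can be sketched in a few lines: once images are rows of numbers, structured columns are just extra columns. The embeddings below are random stand-ins, and the `age` and `is_smoker` columns are hypothetical metadata invented for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical inputs: 6 images with 512-dim embeddings, plus two metadata columns
embeddings = rng.normal(size=(6, 512)).astype(np.float32)
age = np.array([34, 51, 27, 63, 45, 38], dtype=np.float32)
is_smoker = np.array([0, 1, 0, 1, 0, 0], dtype=np.float32)

# Stack the metadata as extra columns next to the image features
metadata = np.column_stack([age, is_smoker])
fused = np.hstack([embeddings, metadata])

print(fused.shape)  # (6, 514)
```

The fused matrix can then be used as the feature table for a tree-based model, which treats image-derived and structured columns the same way.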

In this article, we’ll walk through a practical Python example that extracts features from images using a pretrained ResNet18, converts them to tabular data, and trains an XGBoost classifier.

We’ll also compare it against a simple CNN to see how well the tabular model performs.

First, let’s install the required packages (tqdm is included because the feature-extraction loop below uses it for a progress bar):

pip install torch torchvision xgboost scikit-learn pandas matplotlib tqdm

Next, we load a pretrained model for feature extraction. We’ll use PyTorch’s ResNet18 and remove the final classification layer so that the network outputs image embeddings.

import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import os
import pandas as pd
import numpy as np

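# Load ResNet18 pretrained on ImageNet
# (the `weights` argument replaces the deprecated `pretrained=True` in torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Remove the final fully connected layer
model = torch.nn.Sequential(*list(model.children())[:-1])
model.eval()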
Sequential(
  (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (5): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (6): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (7): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (8): AdaptiveAvgPool2d(output_size=(1, 1))
)

This is the architecture of the truncated model: the ResNet18 backbone, ending at the adaptive average-pooling layer.

Now let’s define the preprocessing that the ResNet architecture expects:

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
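To see what this preprocessing does numerically, here is a small NumPy sketch that mimics ToTensor (scaling to [0, 1]) followed by Normalize for a single RGB pixel. The `normalize_pixel` helper is a hypothetical name used only for this illustration; the mean and standard deviation values are the ones from the transform.

```python
import numpy as np

# ImageNet channel statistics, as used in the Normalize step
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

def normalize_pixel(rgb_uint8):
    """Mimic ToTensor (scale to [0, 1]) followed by Normalize for one RGB pixel."""
    scaled = np.asarray(rgb_uint8, dtype=np.float64) / 255.0
    return (scaled - mean) / std

white = normalize_pixel([255, 255, 255])
black = normalize_pixel([0, 0, 0])
print(white.round(3))
print(black.round(3))
```

After normalization, pixel values land roughly between -2.1 and 2.6, which is the input range the ImageNet-pretrained weights were trained on.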

The next step is to extract features from the images.

Let’s use the CIFAR-10 dataset, which contains 60,000 32×32 color images in 10 classes, with 6,000 images per class. The code below loads the 50,000-image training split.

We’ll download and preprocess the dataset, then extract features using the pretrained ResNet18 model.

from torchvision import datasets
from torch.utils.data import DataLoader
from tqdm import tqdm

# Download CIFAR-10 dataset
dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
dataloader = DataLoader(dataset, batch_size=64, shuffle=False)

Next, we extract the features and convert them to a DataFrame:

features = []
labels = []

# Extract features
for inputs, targets in tqdm(dataloader, desc="Extracting features"):
    with torch.no_grad():
        outputs = model(inputs)
    features.append(outputs.view(outputs.size(0), -1).cpu().numpy())
    labels.append(targets.cpu().numpy())

# Convert to numpy arrays
features = np.concatenate(features)
labels = np.concatenate(labels)

# Convert to DataFrame
X = pd.DataFrame(features)
y = pd.Series(labels)
print(X.shape, y.shape)

Let’s unpack “features.append(outputs.view(outputs.size(0), -1).cpu().numpy())” step by step:

1. outputs

  • This comes from model(inputs), where model is the ResNet18 backbone without the final classification layer.
  • For a batch of images, outputs has shape:
[batch_size, 512, 1, 1]

because the adaptive average-pooling layer reduces ResNet18’s final 512-channel feature map to a spatial size of 1×1.

2. outputs.view(outputs.size(0), -1)

  • .view() reshapes a tensor in PyTorch (similar to .reshape() in NumPy).
  • outputs.size(0) is the batch size.
  • -1 tells PyTorch: “figure out the right number of columns automatically.”

So:

[batch_size, 512, 1, 1] → [batch_size, 512]

This flattens each image’s feature map into a 512-dimensional feature vector.
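The shape bookkeeping can be checked without loading the network. The sketch below uses NumPy as a stand-in for PyTorch tensors, with a random array in place of real activations; it assumes a 224×224 input, for which ResNet18's last convolutional feature map is 512×7×7.

```python
import numpy as np

batch_size = 4
rng = np.random.default_rng(0)

# Stand-in for the last convolutional feature map of ResNet18 at 224x224 input
feature_map = rng.normal(size=(batch_size, 512, 7, 7))

# Global average pooling: average each 7x7 channel map down to a single value
pooled = feature_map.mean(axis=(2, 3), keepdims=True)
print(pooled.shape)  # (4, 512, 1, 1)

# NumPy equivalent of outputs.view(outputs.size(0), -1)
flat = pooled.reshape(pooled.shape[0], -1)
print(flat.shape)  # (4, 512)
```

Each of the 512 numbers per image is therefore the average activation of one channel, which is why the flattening step loses no information.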

3. cpu()

  • Moves the tensor from GPU to CPU (it simply returns the same tensor if it is already on the CPU, as in this example).
  • Needed because numpy() only works on CPU tensors.

4. numpy()

  • Converts the PyTorch tensor into a NumPy array.
  • This makes it easy to store in a list and later put into a Pandas DataFrame for tabular ML.

5. features.append(…)

  • Appends the batch of feature vectors (shape [batch_size, 512]) to the list features.
  • After processing all batches and joining the list with np.concatenate, you’ll have a big NumPy array of shape:
[num_images, 512]
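The batch accumulation pattern can be sketched with placeholder arrays. The sizes below are made up to keep the example small; note that the last batch is usually smaller than the others, and np.concatenate handles that without any special casing.

```python
import math
import numpy as np

# Hypothetical small sizes so the example runs instantly
num_images, batch_size, feature_dim = 200, 64, 512

batches = []
for start in range(0, num_images, batch_size):
    rows = min(batch_size, num_images - start)
    # Placeholder for one batch of extracted feature vectors
    batches.append(np.zeros((rows, feature_dim), dtype=np.float32))

print([b.shape[0] for b in batches])  # [64, 64, 64, 8]

all_features = np.concatenate(batches)
print(all_features.shape)  # (200, 512)

# Same arithmetic for the 50,000-image CIFAR-10 training split
print(math.ceil(50_000 / 64), 50_000 % 64)  # 782 batches, the last one holds 16 images
```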

The next step is to train the model:

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 'mlogloss' is the multi-class log loss; use_label_encoder is deprecated in recent XGBoost releases
xgb = XGBClassifier(n_estimators=100, random_state=42, eval_metric='mlogloss')
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)

print("XGBoost Accuracy:", accuracy_score(y_test, y_pred))
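A single accuracy number can hide weak classes, so it can help to break it down per class. The helper below is a small NumPy sketch; `per_class_accuracy` and the toy labels are assumptions made for illustration, not part of the original code.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Fraction of correctly predicted samples for each class label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = np.full(num_classes, np.nan)  # NaN marks classes with no samples
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            scores[c] = (y_pred[mask] == c).mean()
    return scores

# Hypothetical toy labels with 3 classes
y_true_demo = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2, 0, 2]

overall = (np.asarray(y_true_demo) == np.asarray(y_pred_demo)).mean()
print(overall)  # 0.75
print(per_class_accuracy(y_true_demo, y_pred_demo, num_classes=3))
```

With the variables from the XGBoost step, the same helper could be called as per_class_accuracy(y_test, y_pred, num_classes=10).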

Next, let’s compare against a simple CNN:

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Transform: keep CIFAR-10's native 32x32 size, convert to tensor, normalize to [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load CIFAR-10 dataset (train + test)
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = DataLoader(testset, batch_size=64, shuffle=False)

classes = trainset.classes

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)   # input: 3x32x32, output: 16x32x32
        self.pool = nn.MaxPool2d(2, 2)                # output: 16x16x16
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)  # output: 32x16x16
        # after pooling again → 32x8x8
        self.fc1 = nn.Linear(32 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)                 # 10 classes in CIFAR-10

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 32 * 8 * 8)  # flatten
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Use CPU only
device = torch.device("cpu")

net = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

# Train for a few epochs
for epoch in range(10):  # demo with 10 epochs
    running_loss = 0.0
    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = net(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss/len(trainloader):.4f}")

print("Training finished!")

# Test accuracy
correct = 0
total = 0
net.eval()
with torch.no_grad():
    for images, labels in testloader:
        images, labels = images.to(device), labels.to(device)
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Test Accuracy: {100 * correct / total:.2f}%")
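The shape comments in SimpleCNN (32x32 → 16x16 → 8x8) follow from the standard output-size formula for convolution and pooling layers. The sketch below checks that arithmetic and counts the trainable parameters in plain Python; the helper names `conv_out`, `conv_params`, and `linear_params` are made up for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32
size = conv_out(size, kernel=3, padding=1)  # conv1 keeps 32
size = conv_out(size, kernel=2, stride=2)   # pool halves to 16
size = conv_out(size, kernel=3, padding=1)  # conv2 keeps 16
size = conv_out(size, kernel=2, stride=2)   # pool halves to 8
flat_features = 32 * size * size
print(size, flat_features)  # 8 2048

def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

total = (conv_params(3, 16, 3) + conv_params(16, 32, 3)
         + linear_params(flat_features, 128) + linear_params(128, 10))
print(total)  # 268650
```

Almost all of the parameters sit in the first fully connected layer, which is typical for small CNNs that flatten their feature maps directly.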

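Hooray, our tabular model won! Keep in mind that the two scores come from different held-out sets: XGBoost is evaluated on a 20% split of the CIFAR-10 training data, while the CNN is evaluated on the official test set.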


Let’s say you don’t have a GPU. You can still use this method: convert the images to tabular form and train a tree-based model to get a reasonably robust classifier.

Why This Works

  1. Deep features are powerful: Even without fine-tuning, pretrained networks capture edges, textures, and high-level patterns.
  2. Tabular models are robust: With good features, XGBoost and Random Forests generalize well on small datasets.
  3. Flexibility: You can combine these features with structured metadata for richer predictions.

Transforming image embeddings into tabular data is a practical, often underrated approach. It offers speed, interpretability, and the ability to work effectively with small datasets. In real-world scenarios where GPUs are limited or data is scarce, this hybrid method can be a game-changer.

Next time you face an image prediction problem, consider combining the power of deep features with the simplicity of tabular models — you might be surprised at the results.
