This content originally appeared on DEV Community and was authored by Abdelrahman Adnan
Module Overview: Welcome to Module 2 of the LLM Zoomcamp! This chapter covers the theoretical foundations of vector search – the mathematical concepts, representation methods, and core techniques that power modern semantic search systems.
Table of Contents
Introduction to Vector Search
Understanding Vectors and Embeddings
Types of Vector Representations
Vector Search Techniques
Vector Databases
Introduction to Vector Search
What is Vector Search?
Vector search is a modern approach to finding similar content by representing data as high-dimensional numerical vectors. Instead of searching for exact keyword matches like traditional search engines, vector search finds items that are semantically similar – meaning they have similar meanings or contexts.
Think of it this way: Imagine you’re looking for movies similar to “The Matrix.” Traditional keyword search might only find movies with “Matrix” in the title. Vector search, however, would find sci-fi movies with similar themes like “Inception” or “Blade Runner” because they share semantic similarity in the vector space.
Why Vector Search Matters
- Semantic Understanding: Captures the meaning behind words, not just exact matches
- Multi-modal Support: Works with text, images, audio, and other data types
- Context Awareness: Understands relationships and context between different pieces of information
- Flexible Querying: Enables natural language queries and similarity-based searches
Real-World Applications
- Search Engines: Finding relevant documents based on meaning, not just keywords
- Recommendation Systems: Suggesting products, movies, or content based on user preferences
- Question Answering: Retrieving relevant context for LLM-based chat systems
- Image Search: Finding visually similar images
- Duplicate Detection: Identifying similar or duplicate content
Understanding Vectors and Embeddings
What are Vectors?
In the context of machine learning and search, a vector is a list of numbers that represents data in a mathematical form that computers can understand and process. Think of a vector as coordinates in a multi-dimensional space.
Simple Example:
- A 2D vector: [3, 4] represents a point in 2D space
- A 3D vector: [3, 4, 5] represents a point in 3D space
- An embedding vector: [0.2, -0.1, 0.8, ...] might have 768 dimensions representing a word or document (see the short NumPy sketch below)
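To make this concrete, here is a minimal sketch using NumPy (the values are arbitrary examples) showing that a vector is simply an array of numbers with a shape and a length:
import numpy as np
# A 2D point, a 3D point, and a toy "embedding"
point_2d = np.array([3, 4])
point_3d = np.array([3, 4, 5])
embedding = np.array([0.2, -0.1, 0.8])
print(point_2d.shape)            # (2,)
print(point_3d.shape)            # (3,)
print(np.linalg.norm(point_2d))  # length (magnitude) of the vector: 5.0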
What are Embeddings?
Embeddings are a special type of vector that represents the semantic meaning of data (like words, sentences, or images) in a continuous numerical space. They are created by machine learning models trained on large datasets.
Key Properties of Good Embeddings:
- Semantic Similarity: Similar items have similar vectors
- Distance Relationships: The distance between vectors reflects semantic relationships
- Dense Representation: Each dimension contributes to the meaning (unlike sparse representations)
How Embeddings Capture Meaning
Consider these movie examples:
- “Interstellar” → [0.8, 0.1, 0.1] (high sci-fi, low drama, low comedy)
- “The Notebook” → [0.1, 0.9, 0.1] (low sci-fi, high drama, low comedy)
- “Shrek” → [0.1, 0.1, 0.8] (low sci-fi, low drama, high comedy)
Movies with similar genres will have vectors that are close to each other in this space.
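A quick check with scikit-learn makes this visible. The sketch below adds a hypothetical fourth vector for “Blade Runner” (values invented purely for illustration) and shows that it lands closest to “Interstellar”, the other sci-fi film:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
movies = {
    "Interstellar": [0.8, 0.1, 0.1],
    "The Notebook": [0.1, 0.9, 0.1],
    "Shrek": [0.1, 0.1, 0.8],
    "Blade Runner": [0.7, 0.2, 0.1],  # hypothetical values for illustration
}
names = list(movies.keys())
vectors = np.array(list(movies.values()))
# Compare "Blade Runner" against every movie in the toy space
sims = cosine_similarity([movies["Blade Runner"]], vectors)[0]
for name, score in zip(names, sims):
    print(f"{name}: {score:.3f}")
# "Interstellar" gets the highest score (apart from "Blade Runner" itself)
# because the two vectors point in nearly the same direction.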
Types of Vector Representations
1⃣ One-Hot Encoding
What it is: The simplest way to represent categorical data as vectors. Each item gets a vector with a single 1 and the rest 0s.
Example:
# Vocabulary: ["apple", "banana", "cherry"]
"apple"  → [1, 0, 0]
"banana" → [0, 1, 0]
"cherry" → [0, 0, 1]
Code Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
words = ["apple", "banana", "cherry"]
data = np.array(words).reshape(-1, 1)
encoder = OneHotEncoder()
one_hot_encoded = encoder.fit_transform(data)
print("One-Hot Encoded Vectors:")
print(one_hot_encoded.toarray())
Limitations:
- No semantic relationships (apple and banana don’t appear similar)
- Very high dimensionality for large vocabularies
- Sparse (mostly zeros)
- Memory inefficient
2⃣ Dense Vectors (Embeddings)
What they are: Compact, dense numerical representations where each dimension captures some aspect of meaning.
Example:
"apple" โ [0.2, -0.1, 0.8, 0.3, ...] # 300+ dimensions
"banana" โ [0.1, -0.2, 0.7, 0.4, ...] # Similar to apple (both fruits)
"car" โ [0.9, 0.5, -0.1, 0.2, ...] # Very different from fruits
Advantages:
- Capture semantic relationships
- Much more compact
- Enable similarity calculations
- Work well with machine learning models
Creating Dense Vectors:
from sentence_transformers import SentenceTransformer
# Load pre-trained model
model = SentenceTransformer("all-mpnet-base-v2")
# Generate embeddings
texts = ["I love machine learning", "AI is fascinating", "The weather is nice"]
embeddings = model.encode(texts)
print(f"Embedding shape: {embeddings.shape}") # e.g., (3, 768)
print(f"First embedding: {embeddings[0][:5]}...") # First 5 dimensions
3⃣ Choosing the Right Dimensionality
How many dimensions do you need?
- Word embeddings: 100-300 dimensions (Word2Vec, GloVe)
- Sentence embeddings: 384-768 dimensions (BERT, MPNet)
- Document embeddings: 512-1024+ dimensions
- Image embeddings: 512-2048+ dimensions
Trade-offs:
- More dimensions: Better representation, more computational cost (see the memory sketch below)
- Fewer dimensions: Faster processing, potential information loss
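To get a feel for the cost side of this trade-off, here is a minimal back-of-the-envelope sketch, assuming embeddings are stored as float32 (4 bytes per value); the corpus size and dimensionalities are made-up examples:
def embedding_memory_mb(num_vectors, dimensions, bytes_per_value=4):
    # Approximate memory needed to store the raw embedding matrix
    return num_vectors * dimensions * bytes_per_value / (1024 ** 2)
# Example: a corpus of 1 million documents
for dims in [128, 384, 768, 1536]:
    print(f"{dims} dims: {embedding_memory_mb(1_000_000, dims):.0f} MB")
# Roughly 488 MB at 128 dimensions vs. about 5.7 GB at 1536 dimensions:
# dimensionality directly drives storage and the cost of every similarity computation.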
Vector Search Techniques
1⃣ Similarity Metrics
Vector search relies on measuring how “similar” vectors are. Here are the most common metrics:
Cosine Similarity
What it measures: The angle between two vectors (ignores magnitude)
Range: -1 to 1 (1 = identical, 0 = orthogonal, -1 = opposite)
Best for: Text embeddings, normalized data
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Example vectors
vec1 = np.array([[0.2, 0.8, 0.1]])
vec2 = np.array([[0.1, 0.9, 0.0]])
similarity = cosine_similarity(vec1, vec2)
print(f"Cosine similarity: {similarity[0][0]:.3f}")
Euclidean Distance
What it measures: Straight-line distance between points
Range: 0 to infinity (0 = identical, larger = more different)
Best for: Image embeddings, when magnitude matters
from sklearn.metrics.pairwise import euclidean_distances
distance = euclidean_distances(vec1, vec2)
print(f"Euclidean distance: {distance[0][0]:.3f}")
2⃣ Basic Vector Search
Simple Implementation:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def simple_vector_search(query_vector, document_vectors, top_k=5):
    """Find the most similar documents to a query."""
    similarities = cosine_similarity([query_vector], document_vectors)[0]
    # Get indices of the top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]

# Example usage: model is the SentenceTransformer loaded earlier, and
# document_embeddings = model.encode(documents) for your corpus
query = "machine learning tutorial"
query_vector = model.encode(query)
top_docs, scores = simple_vector_search(query_vector, document_embeddings)
3⃣ Hybrid Search
The Problem: Pure vector search sometimes misses exact matches or specific terms.
The Solution: Combine vector search (semantic) with keyword search (lexical).
Example Scenario:
- Query: “18 U.S.C. § 1341” (specific legal code)
- Vector search might find semantically similar laws
- Keyword search finds the exact code
- Hybrid search combines both for better results
Implementation:
from sklearn.feature_extraction.text import TfidfVectorizer
def hybrid_search(query, documents, embeddings, alpha=0.5):
    """
    Combine vector and keyword search.
    alpha: weight for vector search (1 - alpha for keyword search)
    """
    # Vector search scores (semantic)
    query_vector = model.encode(query)
    vector_scores = cosine_similarity([query_vector], embeddings)[0]

    # Keyword search scores (lexical, TF-IDF)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    query_tfidf = vectorizer.transform([query])
    keyword_scores = cosine_similarity(query_tfidf, tfidf_matrix)[0]

    # Normalize both score sets to the 0-1 range so they are comparable
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min())
    keyword_scores = (keyword_scores - keyword_scores.min()) / (keyword_scores.max() - keyword_scores.min())

    # Weighted combination of the two signals
    combined_scores = alpha * vector_scores + (1 - alpha) * keyword_scores
    return combined_scores
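A quick usage sketch, assuming documents is a list of strings and embeddings = model.encode(documents) as in the earlier examples:
scores = hybrid_search("18 U.S.C. § 1341", documents, embeddings, alpha=0.5)
top_indices = np.argsort(scores)[::-1][:5]
for i in top_indices:
    print(f"{scores[i]:.3f}  {documents[i][:60]}")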
4⃣ Approximate Nearest Neighbors (ANN)
For large datasets, exact search becomes too slow. ANN algorithms provide fast approximate results:
Popular ANN Libraries:
- FAISS: Facebook’s similarity search library
- Annoy: Spotify’s approximate nearest neighbors
- HNSW: Hierarchical Navigable Small World graphs
FAISS Example:
import faiss
import numpy as np
# Create FAISS index
dimension = 768 # embedding dimension
index = faiss.IndexFlatL2(dimension) # L2 distance index
# Add vectors to index
embeddings = np.random.random((1000, dimension)).astype('float32')
index.add(embeddings)
# Search
query_vector = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query_vector, k=5) # top 5 results
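For comparison, here is a minimal HNSW sketch using the hnswlib library; the index parameters (M, ef_construction, ef) are illustrative defaults rather than tuned values:
import hnswlib
import numpy as np
dimension = 768
embeddings = np.random.random((1000, dimension)).astype('float32')
# Build an HNSW graph index using cosine distance
index = hnswlib.Index(space='cosine', dim=dimension)
index.init_index(max_elements=1000, ef_construction=200, M=16)
index.add_items(embeddings, np.arange(1000))
index.set_ef(50)  # higher ef = more accurate but slower queries
query_vector = np.random.random((1, dimension)).astype('float32')
labels, distances = index.knn_query(query_vector, k=5)  # top 5 approximate neighbors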
Vector Databases
What are Vector Databases?
Vector databases are specialized systems designed to store, index, and query high-dimensional vector data efficiently. They are optimized for similarity search operations that traditional databases struggle with.
Key Components
- Vector Storage: Efficiently stores millions/billions of high-dimensional vectors
- Indexing Engine: Creates indices for fast retrieval (FAISS, HNSW, etc.)
- Query Engine: Processes similarity queries using distance metrics
- Metadata Storage: Stores associated data like IDs, timestamps, categories (the sketch below shows how these pieces fit together)
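As a rough illustration of those components, here is a minimal sketch using the qdrant-client library with an in-memory instance; the collection name, payload fields, and vector size are made up for the example:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np
client = QdrantClient(":memory:")  # in-memory instance, handy for experiments
# Vector storage + indexing: a collection of 768-dimensional cosine vectors
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
# Metadata storage: each point carries a payload alongside its vector
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=i,
            vector=np.random.random(768).tolist(),
            payload={"category": "example", "doc_id": i},
        )
        for i in range(100)
    ],
)
# Query engine: similarity search using the collection's distance metric
results = client.search(
    collection_name="documents",
    query_vector=np.random.random(768).tolist(),
    limit=5,
)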
Popular Vector Databases
Open Source Options:
- Milvus: Scalable vector database for AI applications
- Weaviate: Vector search engine with GraphQL API
- FAISS: Facebook’s similarity search library
- Elasticsearch: Traditional search with vector capabilities
- Chroma: Simple vector database for LLM applications
Managed/Commercial Options:
- Pinecone: Fully managed vector database
- Qdrant: Vector search engine with API
- Weaviate Cloud: Managed Weaviate
- AWS OpenSearch: Amazon’s vector search service
Advantages Over Traditional Databases
Feature | Traditional DB | Vector DB
---|---|---
Data model | Structured (rows/columns) | High-dimensional vectors
Query types | Exact matches, ranges | Similarity search
Optimization | Good for structured data | Optimized for vector operations
Performance | Fast for indexed fields | Fast for similarity queries
Typical use cases | CRUD operations | Recommendation, search, AI
Chapter 1 Summary
What You’ve Learned
In this foundational chapter, you’ve discovered:
- Vector Search Fundamentals: Understanding semantic vs. keyword search
- Vector Mathematics: How numbers represent meaning in multi-dimensional space
- Representation Types: From simple one-hot to sophisticated dense embeddings
- Search Techniques: Similarity metrics, hybrid approaches, and optimization methods
- Storage Solutions: Specialized databases designed for vector operations
Key Takeaways
- Vectors enable computers to understand meaning – not just match text
- Embeddings capture semantic relationships – similar concepts cluster together
- Multiple similarity metrics exist – choose based on your data type and use case
- Hybrid search combines strengths – semantic understanding + exact matching
- Specialized databases matter – vector databases outperform traditional ones for similarity search