This content originally appeared on DEV Community and was authored by Abdelrahman Adnan
Module Overview: Welcome to Module 2 of the LLM Zoomcamp! This chapter covers the theoretical foundations of vector search – the mathematical concepts, representation methods, and core techniques that power modern semantic search systems.
Table of Contents
Introduction to Vector Search
Understanding Vectors and Embeddings
Types of Vector Representations
Vector Search Techniques
Vector Databases
Introduction to Vector Search
What is Vector Search?
Vector search is a modern approach to finding similar content by representing data as high-dimensional numerical vectors. Instead of searching for exact keyword matches like traditional search engines, vector search finds items that are semantically similar – meaning they have similar meanings or contexts.
Think of it this way: Imagine you’re looking for movies similar to “The Matrix.” Traditional keyword search might only find movies with “Matrix” in the title. Vector search, however, would find sci-fi movies with similar themes like “Inception” or “Blade Runner” because they share semantic similarity in the vector space.
Why Vector Search Matters
- Semantic Understanding: Captures the meaning behind words, not just exact matches
- Multi-modal Support: Works with text, images, audio, and other data types
- Context Awareness: Understands relationships and context between different pieces of information
- Flexible Querying: Enables natural language queries and similarity-based searches
Real-World Applications
- Search Engines: Finding relevant documents based on meaning, not just keywords
- Recommendation Systems: Suggesting products, movies, or content based on user preferences
- Question Answering: Retrieving relevant context for LLM-based chat systems
- Image Search: Finding visually similar images
- Duplicate Detection: Identifying similar or duplicate content
Understanding Vectors and Embeddings
What are Vectors?
In the context of machine learning and search, a vector is a list of numbers that represents data in a mathematical form that computers can understand and process. Think of a vector as coordinates in a multi-dimensional space.
Simple Example:
- A 2D vector: [3, 4] represents a point in 2D space
- A 3D vector: [3, 4, 5] represents a point in 3D space
- An embedding vector: [0.2, -0.1, 0.8, ...] might have 768 dimensions representing a word or document (see the short NumPy sketch below)
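To make this concrete, here is a minimal sketch using NumPy (the values are arbitrary examples) showing that a vector is simply an array of numbers with a shape and a length:
import numpy as np
# A 2D point, a 3D point, and a toy "embedding"
point_2d = np.array([3, 4])
point_3d = np.array([3, 4, 5])
embedding = np.array([0.2, -0.1, 0.8])
print(point_2d.shape)            # (2,)
print(point_3d.shape)            # (3,)
print(np.linalg.norm(point_2d))  # length (magnitude) of the vector: 5.0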
What are Embeddings?
Embeddings are a special type of vector that represents the semantic meaning of data (like words, sentences, or images) in a continuous numerical space. They are created by machine learning models trained on large datasets.
Key Properties of Good Embeddings:
- Semantic Similarity: Similar items have similar vectors
- Distance Relationships: The distance between vectors reflects semantic relationships
- Dense Representation: Each dimension contributes to the meaning (unlike sparse representations)
How Embeddings Capture Meaning
Consider these movie examples:
- “Interstellar” → [0.8, 0.1, 0.1] (high sci-fi, low drama, low comedy)
- “The Notebook” → [0.1, 0.9, 0.1] (low sci-fi, high drama, low comedy)
- “Shrek” → [0.1, 0.1, 0.8] (low sci-fi, low drama, high comedy)
Movies with similar genres will have vectors that are close to each other in this space.
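A quick check with scikit-learn makes this visible. The sketch below adds a hypothetical fourth vector for “Blade Runner” (values invented purely for illustration) and shows that it lands closest to “Interstellar”, the other sci-fi film:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
movies = {
    "Interstellar": [0.8, 0.1, 0.1],
    "The Notebook": [0.1, 0.9, 0.1],
    "Shrek": [0.1, 0.1, 0.8],
    "Blade Runner": [0.7, 0.2, 0.1],  # hypothetical values for illustration
}
names = list(movies.keys())
vectors = np.array(list(movies.values()))
# Compare "Blade Runner" against every movie in the toy space
sims = cosine_similarity([movies["Blade Runner"]], vectors)[0]
for name, score in zip(names, sims):
    print(f"{name}: {score:.3f}")
# "Interstellar" gets the highest score (apart from "Blade Runner" itself)
# because the two vectors point in nearly the same direction.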
Types of Vector Representations
1⃣ One-Hot Encoding
What it is: The simplest way to represent categorical data as vectors. Each item gets a vector with a single 1 and the rest 0s.
Example:
# Vocabulary: ["apple", "banana", "cherry"]
"apple"  → [1, 0, 0]
"banana" → [0, 1, 0]
"cherry" → [0, 0, 1]
Code Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
words = ["apple", "banana", "cherry"]
data = np.array(words).reshape(-1, 1)
encoder = OneHotEncoder()
one_hot_encoded = encoder.fit_transform(data)
print("One-Hot Encoded Vectors:")
print(one_hot_encoded.toarray())
Limitations:
- No semantic relationships (apple and banana don’t appear similar)
- Very high dimensionality for large vocabularies
- Sparse (mostly zeros)
- Memory inefficient
2⃣ Dense Vectors (Embeddings)
What they are: Compact, dense numerical representations where each dimension captures some aspect of meaning.
Example:
"apple" โ [0.2, -0.1, 0.8, 0.3, ...] # 300+ dimensions
"banana" โ [0.1, -0.2, 0.7, 0.4, ...] # Similar to apple (both fruits)
"car" โ [0.9, 0.5, -0.1, 0.2, ...] # Very different from fruits
Advantages:
- Capture semantic relationships
- Much more compact
- Enable similarity calculations
- Work well with machine learning models
Creating Dense Vectors:
from sentence_transformers import SentenceTransformer
# Load pre-trained model
model = SentenceTransformer("all-mpnet-base-v2")
# Generate embeddings
texts = ["I love machine learning", "AI is fascinating", "The weather is nice"]
embeddings = model.encode(texts)
print(f"Embedding shape: {embeddings.shape}") # e.g., (3, 768)
print(f"First embedding: {embeddings[0][:5]}...") # First 5 dimensions
3⃣ Choosing the Right Dimensionality
How many dimensions do you need?
- Word embeddings: 100-300 dimensions (Word2Vec, GloVe)
- Sentence embeddings: 384-768 dimensions (BERT, MPNet)
- Document embeddings: 512-1024+ dimensions
- Image embeddings: 512-2048+ dimensions
Trade-offs:
- More dimensions: Better representation, more computational cost (see the memory sketch below)
- Fewer dimensions: Faster processing, potential information loss
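To get a feel for the cost side of this trade-off, here is a minimal back-of-the-envelope sketch, assuming embeddings are stored as float32 (4 bytes per value); the corpus size and dimensionalities are made-up examples:
def embedding_memory_mb(num_vectors, dimensions, bytes_per_value=4):
    # Approximate memory needed to store the raw embedding matrix
    return num_vectors * dimensions * bytes_per_value / (1024 ** 2)
# Example: a corpus of 1 million documents
for dims in [128, 384, 768, 1536]:
    print(f"{dims} dims: {embedding_memory_mb(1_000_000, dims):.0f} MB")
# Roughly 488 MB at 128 dimensions vs. about 5.7 GB at 1536 dimensions:
# dimensionality directly drives storage and the cost of every similarity computation.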
Vector Search Techniques
1⃣ Similarity Metrics
Vector search relies on measuring how “similar” vectors are. Here are the most common metrics:
Cosine Similarity
What it measures: The angle between two vectors (ignores magnitude)
Range: -1 to 1 (1 = identical, 0 = orthogonal, -1 = opposite)
Best for: Text embeddings, normalized data
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Example vectors
vec1 = np.array([[0.2, 0.8, 0.1]])
vec2 = np.array([[0.1, 0.9, 0.0]])
similarity = cosine_similarity(vec1, vec2)
print(f"Cosine similarity: {similarity[0][0]:.3f}")
Euclidean Distance
What it measures: Straight-line distance between points
Range: 0 to infinity (0 = identical, larger = more different)
Best for: Image embeddings, when magnitude matters
from sklearn.metrics.pairwise import euclidean_distances
distance = euclidean_distances(vec1, vec2)
print(f"Euclidean distance: {distance[0][0]:.3f}")
2⃣ Basic Vector Search
Simple Implementation:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def simple_vector_search(query_vector, document_vectors, top_k=5):
    """Find the most similar documents to a query."""
    similarities = cosine_similarity([query_vector], document_vectors)[0]
    # Get indices of the top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]

# Example usage: model is the SentenceTransformer loaded earlier, and
# document_embeddings = model.encode(documents) for your corpus
query = "machine learning tutorial"
query_vector = model.encode(query)
top_docs, scores = simple_vector_search(query_vector, document_embeddings)
3⃣ Hybrid Search
The Problem: Pure vector search sometimes misses exact matches or specific terms.
The Solution: Combine vector search (semantic) with keyword search (lexical).
Example Scenario:
- Query: “18 U.S.C. § 1341” (specific legal code)
- Vector search might find semantically similar laws
- Keyword search finds the exact code
- Hybrid search combines both for better results
Implementation:
from sklearn.feature_extraction.text import TfidfVectorizer
def hybrid_search(query, documents, embeddings, alpha=0.5):
    """
    Combine vector and keyword search.
    alpha: weight for vector search (1 - alpha for keyword search)
    """
    # Vector search scores (semantic)
    query_vector = model.encode(query)
    vector_scores = cosine_similarity([query_vector], embeddings)[0]

    # Keyword search scores (lexical, TF-IDF)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    query_tfidf = vectorizer.transform([query])
    keyword_scores = cosine_similarity(query_tfidf, tfidf_matrix)[0]

    # Normalize both score sets to the 0-1 range so they are comparable
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min())
    keyword_scores = (keyword_scores - keyword_scores.min()) / (keyword_scores.max() - keyword_scores.min())

    # Weighted combination of the two signals
    combined_scores = alpha * vector_scores + (1 - alpha) * keyword_scores
    return combined_scores
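A quick usage sketch, assuming documents is a list of strings and embeddings = model.encode(documents) as in the earlier examples:
scores = hybrid_search("18 U.S.C. § 1341", documents, embeddings, alpha=0.5)
top_indices = np.argsort(scores)[::-1][:5]
for i in top_indices:
    print(f"{scores[i]:.3f}  {documents[i][:60]}")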
4⃣ Approximate Nearest Neighbors (ANN)
For large datasets, exact search becomes too slow. ANN algorithms provide fast approximate results:
Popular ANN Libraries:
- FAISS: Facebook’s similarity search library
- Annoy: Spotify’s approximate nearest neighbors
- HNSW: Hierarchical Navigable Small World graphs
FAISS Example:
import faiss
import numpy as np
# Create FAISS index
dimension = 768 # embedding dimension
index = faiss.IndexFlatL2(dimension) # L2 distance index
# Add vectors to index
embeddings = np.random.random((1000, dimension)).astype('float32')
index.add(embeddings)
# Search
query_vector = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query_vector, k=5) # top 5 results
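For comparison, here is a minimal HNSW sketch using the hnswlib library; the index parameters (M, ef_construction, ef) are illustrative defaults rather than tuned values:
import hnswlib
import numpy as np
dimension = 768
embeddings = np.random.random((1000, dimension)).astype('float32')
# Build an HNSW graph index using cosine distance
index = hnswlib.Index(space='cosine', dim=dimension)
index.init_index(max_elements=1000, ef_construction=200, M=16)
index.add_items(embeddings, np.arange(1000))
index.set_ef(50)  # higher ef = more accurate but slower queries
query_vector = np.random.random((1, dimension)).astype('float32')
labels, distances = index.knn_query(query_vector, k=5)  # top 5 approximate neighbors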
Vector Databases
What are Vector Databases?
Vector databases are specialized systems designed to store, index, and query high-dimensional vector data efficiently. They are optimized for similarity search operations that traditional databases struggle with.
Key Components
- Vector Storage: Efficiently stores millions/billions of high-dimensional vectors
- Indexing Engine: Creates indices for fast retrieval (FAISS, HNSW, etc.)
- Query Engine: Processes similarity queries using distance metrics
- Metadata Storage: Stores associated data like IDs, timestamps, categories (the sketch below shows how these pieces fit together)
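As a rough illustration of those components, here is a minimal sketch using the qdrant-client library with an in-memory instance; the collection name, payload fields, and vector size are made up for the example:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np
client = QdrantClient(":memory:")  # in-memory instance, handy for experiments
# Vector storage + indexing: a collection of 768-dimensional cosine vectors
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
# Metadata storage: each point carries a payload alongside its vector
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=i,
            vector=np.random.random(768).tolist(),
            payload={"category": "example", "doc_id": i},
        )
        for i in range(100)
    ],
)
# Query engine: similarity search using the collection's distance metric
results = client.search(
    collection_name="documents",
    query_vector=np.random.random(768).tolist(),
    limit=5,
)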
Popular Vector Databases
Open Source Options:
- Milvus: Scalable vector database for AI applications
- Weaviate: Vector search engine with GraphQL API
- FAISS: Facebook’s similarity search library
- Elasticsearch: Traditional search with vector capabilities
- Chroma: Simple vector database for LLM applications
Managed/Commercial Options:
- Pinecone: Fully managed vector database
- Qdrant: Vector search engine with API
- Weaviate Cloud: Managed Weaviate
- AWS OpenSearch: Amazon’s vector search service
Advantages Over Traditional Databases
Feature | Traditional DB | Vector DB
---|---|---
Data model | Structured (rows/columns) | High-dimensional vectors
Query types | Exact matches, ranges | Similarity search
Optimization | Good for structured data | Optimized for vector operations
Performance | Fast for indexed fields | Fast for similarity queries
Typical use cases | CRUD operations | Recommendation, search, AI
Chapter 1 Summary
What You’ve Learned
In this foundational chapter, you’ve discovered:
- Vector Search Fundamentals: Understanding semantic vs. keyword search
- Vector Mathematics: How numbers represent meaning in multi-dimensional space
- Representation Types: From simple one-hot to sophisticated dense embeddings
- Search Techniques: Similarity metrics, hybrid approaches, and optimization methods
- Storage Solutions: Specialized databases designed for vector operations
Key Takeaways
- Vectors enable computers to understand meaning – not just match text
- Embeddings capture semantic relationships – similar concepts cluster together
- Multiple similarity metrics exist – choose based on your data type and use case
- Hybrid search combines strengths – semantic understanding + exact matching
- Specialized databases matter – vector databases outperform traditional ones for similarity search