Implementing a TF-IDF Based English Search Engine in Pure Python: Building a Dependency-Free Retrieval System from Scratch
In today’s era of information explosion, search engines have become the primary way for people to access information. While commercial search engines like Google and Bing have complex technical architectures behind them, their core principles can be understood through basic information retrieval technologies. This article will guide you through building a TF-IDF algorithm-based English search engine from scratch, using only Python’s standard library without any third-party dependencies, and storing the key inverted index structure in CSV format. Through this practice, you’ll gain a deep understanding of how search engines work and master core technologies in text processing, index construction, and relevance calculation.
Core Components of a Search Engine
A complete search engine typically consists of four core modules: a document processing module, an index construction module, a query processing module, and a ranking module. Unlike implementations that rely on third-party libraries, a pure Python implementation requires us to handle the details of each stage manually, allowing for a deeper understanding of the principles behind each step.
The document processing module converts raw text into structured data suitable for computation, including operations such as tokenization, noise removal (e.g., punctuation), and normalization (e.g., case conversion). The index construction module is the core of the search engine, enabling fast queries through the establishment of an inverted index, which records in which documents each term appears and their positions. The query processing module receives user input queries and performs the same normalization operations as document processing to match terms in the index. The ranking module calculates the relevance score between the query terms and each document using the TF-IDF algorithm and returns results sorted by score.
Principles of the TF-IDF Algorithm
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method used to evaluate the importance of a term in a document collection. Its core idea is: the importance of a term is proportional to its frequency in a particular document and inversely proportional to its frequency across the entire document collection.
Calculation of Term Frequency (TF)
Term frequency refers to the number of times a term appears in a specific document. To avoid the influence of document length on results, normalization is usually applied:
TF(t,d) = Number of occurrences of term t in document d / Total number of terms in document d
For example, in a document containing 100 terms, if “learning” appears 5 times, its term frequency is 5/100 = 0.05.
Calculation of Inverse Document Frequency (IDF)
Inverse document frequency measures the discriminative power of a term, calculated as:
IDF(t) = log(Total number of documents / Number of documents containing term t)
If a term appears in most documents, its IDF value is low, indicating weak discriminative power; conversely, if a term appears in only a few documents, its IDF value is high, indicating strong discriminative power. For example, assuming there are 10 documents in total, and “machine” appears in 3 of them, its IDF value is log(10/3) ≈ 1.20397.
Calculation of TF-IDF Value
The TF-IDF value is the product of term frequency and inverse document frequency:
TF-IDF(t,d) = TF(t,d) × IDF(t)
This value comprehensively reflects the importance of term t in document d and is an important indicator for measuring the relevance between a document and a query.
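To make the arithmetic concrete, here is a tiny standard-library sketch that reproduces the two figures from the examples above and combines them (purely for illustration, it treats the 0.05 term frequency and the log(10/3) inverse document frequency as if they belonged to the same term):

import math

# Worked example using the numbers quoted above
tf = 5 / 100            # a term appearing 5 times in a 100-term document
idf = math.log(10 / 3)  # 10 documents in total, 3 of them contain the term
print(f"TF     = {tf}")            # 0.05
print(f"IDF    = {idf:.5f}")       # approximately 1.20397
print(f"TF-IDF = {tf * idf:.5f}")  # approximately 0.06020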
Design of the Inverted Index
The inverted index is a key data structure for fast querying in search engines, recording each term and the documents and positions where it appears. In a pure Python implementation, we need to design a reasonable inverted index structure and store it in CSV format for persistence.
The basic structure of an inverted index includes three parts: term, document ID (doc_id), and positions. The position information records the specific locations where the term appears in the document, facilitating subsequent phrase queries and proximity queries.
The CSV file format is designed as follows:
term,doc_id,positions
machine,0,"[1, 5]"
learning,0,"[2, 8]"
…
This format allows us to use Python’s standard csv module for reading and writing operations. Each occurrence of a term in different documents is recorded as a row, with the positions field storing the position list as a string.
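Before it is written to disk, the index lives in memory as an ordinary dictionary. A minimal sketch of the structure corresponding to the two CSV rows above (the document IDs and positions are just illustrative values) looks like this:

# In-memory inverted index: term -> list of (doc_id, positions) pairs
inverted_index = {
    'machine': [(0, [1, 5])],   # "machine" occurs in document 0 at positions 1 and 5
    'learning': [(0, [2, 8])],  # "learning" occurs in document 0 at positions 2 and 8
}
# Each (doc_id, positions) pair becomes one CSV row: term, doc_id, str(positions)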
Steps for Pure Python Implementation
Next, we’ll detail how to implement a TF-IDF based English search engine from scratch using pure Python, including document preprocessing, inverted index construction, TF-IDF calculation, query processing, and result ranking.
Step 1: Document Preprocessing
Document preprocessing converts raw text into structured data, mainly including the following operations:
- Case conversion: Convert all text to lowercase to avoid treating “Machine” and “machine” as different terms.
- Punctuation removal: Remove punctuation from text to reduce noise.
- Tokenization: Split continuous text into individual terms.
- Stop word removal: Filter out high-frequency words with no practical meaning such as “the” and “and”.
- Simple stemming: Reduce terms to their root form (simplified implementation).
Here’s the implementation code:
import string

# Define English stop words set
STOP_WORDS = {
    'a', 'an', 'and', 'the', 'or', 'of', 'to', 'in', 'for', 'on', 'with',
    'at', 'by', 'i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him',
    'her', 'us', 'them', 'my', 'your', 'his', 'its', 'our', 'their', 'this',
    'that', 'these', 'those', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
    'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'shall', 'should',
    'may', 'might', 'must', 'can', 'could', 'as', 'but', 'if', 'because',
    'until', 'while', 'about', 'against', 'between', 'into', 'through', 'during',
    'before', 'after', 'above', 'below', 'from', 'up', 'down', 'out', 'off',
    'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
    'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most',
    'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
    'than', 'too', 'very'
}

def preprocess_text(text):
    """Preprocess text: case conversion, punctuation removal, tokenization, stop word removal"""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    # Tokenization (simple space splitting; more complex logic can be used in practical applications)
    tokens = text.split()
    # Remove stop words and empty strings
    tokens = [token for token in tokens if token not in STOP_WORDS and token.strip() != '']
    # Simple stemming (simplified version)
    tokens = [stem_token(token) for token in tokens]
    return tokens

def stem_token(token):
    """Simple stemming function (more complex algorithms can be used in practical applications)"""
    # Handle common suffixes
    suffixes = ['ing', 'ly', 'ed', 'es', 's']
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[:-len(suffix)]
    return token

# Test the preprocessing function
sample_text = "Machine learning is a subset of artificial intelligence focused on developing algorithms that learn from data."
processed_tokens = preprocess_text(sample_text)
print("Preprocessed terms:", processed_tokens)
This preprocessing function implements basic text cleaning and normalization, laying the foundation for subsequent index construction and TF-IDF calculation. It should be noted that the tokenization and stemming here are simplified implementations; more complex algorithms can be used in practical applications to improve accuracy.
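As one example of such an improvement, the standard-library re module can replace the plain space splitting. The sketch below is a hypothetical alternative tokenizer (not used in the rest of this article) that treats every run of letters as a token, so punctuation never has to be stripped separately:

import re

def tokenize_regex(text):
    """Alternative tokenizer: extract lowercase runs of letters."""
    return re.findall(r'[a-z]+', text.lower())

print(tokenize_regex("Machine learning, deep-learning and NLP!"))
# ['machine', 'learning', 'deep', 'learning', 'and', 'nlp']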
Step 2: Build Inverted Index and Store as CSV
The process of building an inverted index involves iterating through all documents, recording the document IDs and position information where each term appears, and storing the results as a CSV file.
import csv
from collections import defaultdict

def build_inverted_index(documents):
    """Build inverted index and store as CSV file"""
    inverted_index = defaultdict(list)  # Structure: {term: [(doc_id, positions), ...]}
    for doc_id, doc in enumerate(documents):
        # Preprocess the document
        tokens = preprocess_text(doc)
        # Record positions of each term in the current document
        term_positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            term_positions[term].append(pos)
        # Update inverted index
        for term, positions in term_positions.items():
            inverted_index[term].append((doc_id, positions))
    # Store inverted index as CSV file
    with open('inverted_index.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['term', 'doc_id', 'positions'])
        for term, doc_info in inverted_index.items():
            for doc_id, positions in doc_info:
                # Convert position list to string for storage
                positions_str = str(positions)
                writer.writerow([term, doc_id, positions_str])
    return inverted_index

# Sample document collection
documents = [
    "Machine learning is a subset of artificial intelligence focused on developing algorithms that learn from data.",
    "Artificial intelligence involves creating systems that can perform tasks requiring human intelligence.",
    "Deep learning is a type of machine learning based on artificial neural networks with multiple layers.",
    "Natural language processing allows computers to understand and generate human language.",
    "Computer vision enables machines to interpret and understand the visual world.",
    "Reinforcement learning is an area of machine learning concerned with how agents take actions in an environment.",
    "Supervised learning algorithms learn from labeled training data to make predictions on new data.",
    "Unsupervised learning deals with unlabeled data, finding patterns and structures within it.",
    "A neural network is a computational model inspired by the human brain's structure and function.",
    "Big data refers to large and complex data sets that require advanced processing techniques."
]

# Build inverted index
inverted_index = build_inverted_index(documents)
print(f"Inverted index built, containing {len(inverted_index)} terms")
This code first iterates through each document, preprocesses it, records the positions of each term in the document, organizes this information into an inverted index, and stores it as a CSV file. The inverted index is structured as a dictionary where keys are terms and values are lists of tuples containing document IDs and position lists.
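To verify the index, you can print the postings list for any term. For instance, the stemmer reduces "learning" to "learn", so the following lookup shows every document and position where either form occurred (the exact output depends on the sample documents above):

# Inspect the postings list for the stemmed term "learn"
for doc_id, positions in inverted_index.get('learn', []):
    print(f"doc {doc_id}: positions {positions}")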
Step 3: Calculate TF-IDF Values
Calculating TF-IDF values requires first counting term frequencies and inverse document frequencies, then computing their product. In a pure Python implementation, we need to perform these calculations manually.
import math  # For the logarithm in the IDF formula

def calculate_tfidf(documents, inverted_index):
    """Calculate TF-IDF values for each term in each document"""
    num_docs = len(documents)
    tfidf = {}  # Structure: {doc_id: {term: tfidf_value, ...}, ...}
    # Calculate total number of terms for each document
    doc_lengths = []
    for doc in documents:
        tokens = preprocess_text(doc)
        doc_lengths.append(len(tokens))
    # Calculate document frequency for each term (number of documents containing the term)
    doc_freq = {term: len(entries) for term, entries in inverted_index.items()}
    # Calculate TF-IDF
    for term, entries in inverted_index.items():
        # Calculate IDF; every indexed term appears in at least one document,
        # so the denominator is never zero
        idf = math.log(num_docs / doc_freq[term])
        for doc_id, positions in entries:
            # Calculate TF
            tf = len(positions) / doc_lengths[doc_id] if doc_lengths[doc_id] > 0 else 0
            # Calculate TF-IDF
            tfidf_value = tf * idf
            # Store results
            if doc_id not in tfidf:
                tfidf[doc_id] = {}
            tfidf[doc_id][term] = tfidf_value
    return tfidf

# Calculate TF-IDF values
tfidf_scores = calculate_tfidf(documents, inverted_index)
print("TF-IDF calculation completed")
This code first calculates the length of each document (total number of terms after preprocessing) and the document frequency of each term, then calculates term frequency and inverse document frequency respectively, and finally computes their product to get the TF-IDF value. The calculation results are stored in a nested dictionary for easy use in subsequent queries.
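With the scores in hand, it is easy to check which terms the model considers most characteristic of a document, for example the three highest-weighted terms of document 0:

# Show the three highest-weighted terms for document 0
doc0_scores = tfidf_scores.get(0, {})
top_terms = sorted(doc0_scores.items(), key=lambda item: item[1], reverse=True)[:3]
for term, weight in top_terms:
    print(f"{term}: {weight:.4f}")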
Step 4: Process Queries and Return Results
The query processing module needs to preprocess user input queries, find documents containing the query terms based on the inverted index, calculate the relevance score between the query and each document, and return results sorted by score.
def search(query, documents, inverted_index, tfidf_scores, top_n=3):
    """Process query and return most relevant documents"""
    # Preprocess query
    query_terms = preprocess_text(query)
    if not query_terms:
        return []
    # Get documents containing at least one query term
    relevant_docs = set()
    for term in query_terms:
        if term in inverted_index:
            for doc_id, _ in inverted_index[term]:
                relevant_docs.add(doc_id)
    relevant_docs = list(relevant_docs)
    # Calculate relevance scores between query and each relevant document
    scores = []
    for doc_id in relevant_docs:
        score = 0.0
        for term in query_terms:
            if term in tfidf_scores.get(doc_id, {}):
                score += tfidf_scores[doc_id][term]
        # Normalize score (divide by number of query terms)
        score /= len(query_terms)
        scores.append((doc_id, score))
    # Sort by score
    scores.sort(key=lambda x: x[1], reverse=True)
    # Return top N results
    results = []
    for doc_id, score in scores[:top_n]:
        if score > 0:
            results.append({
                'document': documents[doc_id],
                'score': score,
                'doc_id': doc_id
            })
    return results

# Test search functionality
query = "machine learning"
results = search(query, documents, inverted_index, tfidf_scores)
print(f"Query: {query}")
for i, result in enumerate(results, 1):
    print(f"\nResult {i} (Score: {result['score']:.4f}):")
    print(result['document'])
This code first preprocesses the query terms, then finds all documents containing the query terms, calculates the relevance score between each document and the query (using a simple summation method here), and finally returns the results sorted by score.
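The same function works for any free-text query. For example, another query can be run against the same index like this (the scores and ranking depend on the sample documents above):

# Run another example query against the same index
for rank, result in enumerate(search("neural networks", documents, inverted_index, tfidf_scores), 1):
    print(f"{rank}. (score {result['score']:.4f}) {result['document']}")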
Step 5: Load Inverted Index from CSV
For persistence, we need to be able to load the inverted index back from the CSV file.
import ast  # Safe parsing of the stored position lists

def load_inverted_index_from_csv(filename):
    """Load inverted index from CSV file"""
    inverted_index = defaultdict(list)
    with open(filename, 'r', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)  # Skip header
        for row in reader:
            term = row[0]
            doc_id = int(row[1])
            # Convert position string back to a list; ast.literal_eval only accepts
            # Python literals, unlike eval, which would execute arbitrary code
            positions = ast.literal_eval(row[2])
            inverted_index[term].append((doc_id, positions))
    return inverted_index

# Test loading inverted index
loaded_index = load_inverted_index_from_csv('inverted_index.csv')
print(f"Inverted index loaded from CSV contains {len(loaded_index)} terms")
This code reads the inverted index data from the CSV file, converts the position information from a string back to a list, and reconstructs the inverted index structure. Note that the parsing is done with ast.literal_eval rather than eval: eval would execute arbitrary expressions and poses a security risk, whereas ast.literal_eval only accepts Python literals such as lists of integers.
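Putting the pieces together, the index loaded from CSV can drive a complete search session, since the TF-IDF scores can be recomputed from it and the original documents:

# End-to-end check: rebuild the scores from the loaded index and run a query
loaded_scores = calculate_tfidf(documents, loaded_index)
for result in search("machine learning", documents, loaded_index, loaded_scores):
    print(f"(score {result['score']:.4f}) {result['document']}")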
Performance Optimization and Extension
A search engine implemented in pure Python may encounter performance issues when processing large-scale documents. Here are some optimization suggestions:
- Index compression: Position information in the inverted index can be compressed using methods such as delta encoding to reduce storage space and memory usage (see the sketch after this list).
- Caching mechanism: Cache frequently accessed parts of the index in memory to reduce disk I/O operations.
- Parallel computing: Use Python’s multiprocessing module for parallel computing when performing time-consuming operations like TF-IDF calculation.
- Query optimization: For multi-term queries, first find documents containing all query terms (intersection) before calculating relevance to reduce computation.
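As a concrete illustration of the first suggestion, here is a minimal delta-encoding sketch for a position list (it assumes positions are stored in ascending order, which is how the index builder above produces them):

def delta_encode(positions):
    """Store each position as the gap from the previous one; small gaps compress well."""
    if not positions:
        return []
    return [positions[0]] + [cur - prev for prev, cur in zip(positions, positions[1:])]

def delta_decode(gaps):
    """Rebuild absolute positions from the gap list."""
    positions, total = [], 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions

print(delta_encode([3, 7, 8, 15]))  # [3, 4, 1, 7]
print(delta_decode([3, 4, 1, 7]))   # [3, 7, 8, 15]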
In addition, the functionality of the search engine can be extended to support phrase queries, proximity queries, boolean queries, etc., improving retrieval accuracy and flexibility.
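For example, the position lists already stored in the index are exactly what a two-word phrase query needs: a document matches when some position of the first term is immediately followed by a position of the second. A minimal sketch, assuming the inverted_index structure built earlier (query terms must be preprocessed the same way as the documents, so "machine learning" becomes the pair "machine", "learn"):

def phrase_match(term1, term2, inverted_index):
    """Return doc_ids in which term2 appears immediately after term1."""
    postings1 = dict(inverted_index.get(term1, []))  # doc_id -> positions
    postings2 = dict(inverted_index.get(term2, []))
    matches = []
    for doc_id, positions1 in postings1.items():
        positions2 = set(postings2.get(doc_id, []))
        if any(pos + 1 in positions2 for pos in positions1):
            matches.append(doc_id)
    return matches

print(phrase_match('machine', 'learn', inverted_index))  # documents containing the phrase "machine learning"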
Challenges in Practical Applications
In practical applications, a search engine implemented in pure Python may face the following challenges:
- Performance limitations: Compared to compiled languages like C++, Python has lower execution efficiency and may have bottlenecks when processing large-scale data.
- Functional limitations: A pure Python implementation struggles with complex natural language processing tasks such as word sense disambiguation and entity recognition.
- Scalability: As the number of documents and vocabulary grows, the index structure expands rapidly, requiring more complex distributed storage and computing architectures.
Therefore, in actual search engine development, the ease of use of Python is usually combined with the advantages of other high-performance languages to build systems with hybrid architectures.
Conclusion
Through this article, we’ve built a TF-IDF-based English search engine from scratch without relying on any third-party libraries, and stored the key inverted index in CSV format. This process has allowed us to gain an in-depth understanding of the core principles and implementation details of search engines, including key steps such as document preprocessing, inverted index construction, TF-IDF calculation, and query processing.
While this implementation is relatively simple, it covers the basic framework of modern search engines. On this foundation, you can further expand the functionality and optimize performance to build a more powerful retrieval system. Whether for academic research or practical applications, understanding these basic principles is an important step in deepening your knowledge of information retrieval technology.
We hope this article has opened the door to the field of information retrieval for you, inspiring your interest and desire to explore search engine technology. In this era of information explosion, mastering information retrieval technology not only helps us obtain information more efficiently but also provides a solid foundation for research in fields such as data mining and artificial intelligence.
Leapcell: The Best of Serverless Web Hosting
Finally, we recommend an excellent platform for deploying Python services: Leapcell
Build with Your Favorite Language
Develop effortlessly in JavaScript, Python, Go, or Rust.
Deploy Unlimited Projects for Free
Only pay for what you use—no requests, no charges.
Pay-as-You-Go, No Hidden Costs
No idle fees, just seamless scalability.
Follow us on Twitter: @LeapcellHQ