This content originally appeared on DEV Community and was authored by Abdelrahman Adnan
Welcome back to our evaluation journey! In Part 1, you built the essential foundation with key concepts and ground truth creation. Now we’re diving into the exciting world of actually measuring how well your systems perform.
Think of this as learning to be a detective – you’ll discover how to uncover exactly what’s working (and what isn’t) in your AI systems. By the end of this guide, you’ll be able to measure and optimize both the search capabilities and answer quality of any LLM or RAG system.
What You’ll Master in This Part
After working through this guide, you’ll be confident in:
- Evaluating how well your system finds relevant information (retrieval evaluation)
- Measuring the quality and accuracy of generated answers
- Implementing key metrics like Hit Rate, MRR, and cosine similarity
- Understanding when to use which evaluation approach
- Building automated evaluation pipelines that save you time
- Interpreting results and making data-driven improvements
The Two Sides of System Evaluation
Most AI systems that answer questions have two main components that need evaluation:
- The Retrieval System (The “Finder”)
  - Searches through documents to find relevant information
  - Like a librarian finding the right books for your research topic
- The Generation System (The “Writer”)
  - Takes found information and crafts a coherent answer
  - Like a skilled writer synthesizing research into a clear response
Each requires different evaluation approaches. Let’s master both!
Retrieval System Evaluation: The Art of Finding the Right Needle
What Makes Retrieval Evaluation Special?
Retrieval evaluation answers the fundamental question: “Did my system find the information needed to answer the question?”
This is different from evaluating the final answer because:
- A system might find the right documents but generate a poor answer
- A system might generate a fluent answer but base it on irrelevant documents
- By evaluating retrieval separately, you can diagnose exactly where problems occur
Think of it like evaluating a research assistant: you want to know both whether they found the right source materials AND whether they wrote a good summary based on those materials.
Core Retrieval Metrics: Your Evaluation Toolkit
Hit Rate: The “Did We Find It?” Metric
Hit Rate measures what percentage of your searches successfully find at least one relevant document. It’s the most intuitive metric – either you found what you were looking for, or you didn’t.
The Simple Math:
def calculate_hit_rate_explained(search_results, k=5):
"""
Calculate hit rate with step-by-step explanation
Args:
search_results: List of search attempts with their outcomes
k: How many top results to consider (top-5, top-10, etc.)
"""
print(f"📊 Calculating Hit Rate @ {k}")
print("=" * 40)
successful_searches = 0
total_searches = len(search_results)
for i, search in enumerate(search_results, 1):
print(f"\n🔍 Search {i}: '{search['question'][:50]}...'")
# Look through the top-k results
found_relevant = False
for position, doc in enumerate(search['top_k_results'][:k], 1):
if doc['id'] in search['relevant_document_ids']:
print(f" ✅ Found relevant doc at position {position}: {doc['title'][:40]}...")
found_relevant = True
break
if found_relevant:
successful_searches += 1
print(f" 🎯 Result: HIT")
else:
print(f" ❌ Result: MISS (no relevant docs in top {k})")
hit_rate = successful_searches / total_searches
print(f"\n📈 Final Hit Rate @ {k}: {hit_rate:.2%} ({successful_searches}/{total_searches})")
return hit_rate
# Practical example
example_searches = [
{
'question': 'How do I reset my password?',
'relevant_document_ids': ['doc_123'],
'top_k_results': [
{'id': 'doc_456', 'title': 'Creating new accounts'},
{'id': 'doc_123', 'title': 'Password reset guide'}, # ✅ Found at position 2
{'id': 'doc_789', 'title': 'Troubleshooting login'}
]
},
{
'question': 'What are your business hours?',
'relevant_document_ids': ['doc_999'],
'top_k_results': [
{'id': 'doc_111', 'title': 'Contact information'},
{'id': 'doc_222', 'title': 'Location details'},
{'id': 'doc_333', 'title': 'Service offerings'} # ❌ Relevant doc not in top 3
]
}
]
hit_rate = calculate_hit_rate_explained(example_searches, k=3)
# Output: Hit Rate @ 3: 50.00% (1/2)
What Hit Rate Tells You:
- High Hit Rate (>80%): Your search system is finding relevant documents consistently
- Medium Hit Rate (50-80%): Decent performance, but missing relevant docs too often
- Low Hit Rate (<50%): Major issues with search relevance or document coverage
Mean Reciprocal Rank (MRR): The “How High Up?” Metric
While Hit Rate tells you IF you found relevant documents, MRR tells you WHERE you found them. Finding the right answer as result #1 is much better than finding it buried at position #20.
Understanding Reciprocal Rank:
def explain_reciprocal_rank():
"""
Demonstrate how reciprocal rank works with intuitive examples
"""
examples = [
{"position": 1, "reciprocal_rank": 1.0, "meaning": "Perfect! Found at the top"},
{"position": 2, "reciprocal_rank": 0.5, "meaning": "Good! Second result"},
{"position": 3, "reciprocal_rank": 0.33, "meaning": "Okay, third result"},
{"position": 5, "reciprocal_rank": 0.2, "meaning": "Not great, fifth result"},
{"position": 10, "reciprocal_rank": 0.1, "meaning": "Poor, buried deep"},
{"position": None, "reciprocal_rank": 0.0, "meaning": "Terrible! Not found at all"}
]
print("🏆 Understanding Reciprocal Rank")
print("=" * 50)
for example in examples:
if example["position"]:
print(f"Position {example['position']:2}: RR = {example['reciprocal_rank']:.2f} - {example['meaning']}")
else:
print(f"Not found : RR = {example['reciprocal_rank']:.2f} - {example['meaning']}")
print("\n💡 Key insight: the further down the ranking a result appears, the lower its reciprocal rank")
print("💡 MRR = Average of all reciprocal ranks across your test questions")
explain_reciprocal_rank()
Complete MRR Implementation:
def calculate_mrr_with_details(search_results, k=10):
"""
Calculate MRR with detailed explanation of each step
"""
print(f"🏆 Calculating Mean Reciprocal Rank @ {k}")
print("=" * 50)
reciprocal_ranks = []
for i, search in enumerate(search_results, 1):
print(f"\n🔍 Question {i}: '{search['question'][:60]}...'")
# Find the position of the first relevant document
best_position = None
for position, doc in enumerate(search['top_k_results'][:k], 1):
if doc['id'] in search['relevant_document_ids']:
best_position = position
break
# Calculate reciprocal rank
if best_position:
rr = 1.0 / best_position
reciprocal_ranks.append(rr)
print(f" ✅ First relevant doc found at position {best_position}")
print(f" 📊 Reciprocal Rank = 1/{best_position} = {rr:.3f}")
else:
reciprocal_ranks.append(0.0)
print(f" ❌ No relevant docs found in top {k}")
print(f" 📊 Reciprocal Rank = 0.000")
# Calculate final MRR
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(f"\n🎯 Individual Reciprocal Ranks: {[f'{rr:.3f}' for rr in reciprocal_ranks]}")
print(f"🏆 Mean Reciprocal Rank = {sum(reciprocal_ranks):.3f} / {len(reciprocal_ranks)} = {mrr:.3f}")
return mrr, reciprocal_ranks
# Example with step-by-step calculation
mrr_example_searches = [
{
'question': 'How to change my email address?',
'relevant_document_ids': ['doc_email_change'],
'top_k_results': [
{'id': 'doc_email_change', 'title': 'Email address modification guide'}, # Position 1, RR = 1.0
]
},
{
'question': 'What is your refund policy?',
'relevant_document_ids': ['doc_refunds'],
'top_k_results': [
{'id': 'doc_billing', 'title': 'Billing overview'},
{'id': 'doc_payments', 'title': 'Payment methods'},
{'id': 'doc_refunds', 'title': 'Refund policy and procedures'}, # Position 3, RR = 0.33
]
},
{
'question': 'How do I delete my account?',
'relevant_document_ids': ['doc_account_deletion'],
'top_k_results': [
{'id': 'doc_signup', 'title': 'Account creation'},
{'id': 'doc_settings', 'title': 'Account settings'}, # Relevant doc not found, RR = 0.0
]
}
]
mrr_score, individual_scores = calculate_mrr_with_details(mrr_example_searches, k=5)
# Expected: MRR = (1.0 + 0.33 + 0.0) / 3 = 0.44
Interpreting MRR Scores:
- MRR > 0.8: Excellent – relevant docs typically in top 1-2 positions
- MRR 0.5-0.8: Good – relevant docs usually in top 2-4 positions
- MRR 0.3-0.5: Fair – relevant docs often around positions 3-6
- MRR < 0.3: Poor – relevant docs are ranked low or missing
Advanced Retrieval Metrics
Precision and Recall in Retrieval Context
While Hit Rate and MRR are most common, understanding precision and recall gives you deeper insights:
def calculate_retrieval_precision_recall(search_results, k=5):
"""
Calculate precision and recall for retrieval evaluation
"""
all_precisions = []
all_recalls = []
for search in search_results:
relevant_docs = set(search['relevant_document_ids'])
retrieved_docs = set([doc['id'] for doc in search['top_k_results'][:k]])
# Calculate for this search
true_positives = len(relevant_docs.intersection(retrieved_docs))
# Precision: Of the docs we retrieved, how many were relevant?
precision = true_positives / len(retrieved_docs) if retrieved_docs else 0
# Recall: Of the relevant docs, how many did we retrieve?
recall = true_positives / len(relevant_docs) if relevant_docs else 0
all_precisions.append(precision)
all_recalls.append(recall)
print(f"Query: '{search['question'][:40]}...'")
print(f" Precision @ {k}: {precision:.2%} ({true_positives}/{len(retrieved_docs)} retrieved docs were relevant)")
print(f" Recall @ {k}: {recall:.2%} ({true_positives}/{len(relevant_docs)} relevant docs were retrieved)")
avg_precision = sum(all_precisions) / len(all_precisions)
avg_recall = sum(all_recalls) / len(all_recalls)
print(f"\n📊 Average Precision @ {k}: {avg_precision:.2%}")
print(f"📊 Average Recall @ {k}: {avg_recall:.2%}")
return avg_precision, avg_recall
When to Use Each Metric:
- Hit Rate: Quick overall health check – “Are we finding anything relevant?”
- MRR: User experience focused – “How quickly do users find what they need?”
- Precision: Resource efficiency – “Are we wasting time with irrelevant results?”
- Recall: Coverage assessment – “Are we missing important information?”
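To see how these numbers play out in practice, here is a quick sketch that reuses the example_searches list from the Hit Rate example above (a minimal illustration, not a full benchmark):
avg_precision, avg_recall = calculate_retrieval_precision_recall(example_searches, k=3)
# Search 1: 1 of the 3 retrieved docs is relevant  -> precision 33%, recall 100%
# Search 2: 0 of the 3 retrieved docs are relevant -> precision 0%,  recall 0%
# Averages: precision ~17%, recall 50%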
Answer Quality Evaluation: Measuring the Art of Good Responses
Once your system finds relevant information, it needs to generate helpful, accurate answers. This is where answer quality evaluation comes in – measuring how well your AI writes responses that users actually find useful.
The Challenge of Evaluating Generated Text
Unlike retrieval (where success is relatively binary), answer quality is nuanced:
- Multiple valid answers: Many questions can be answered correctly in different ways
- Subjective qualities: What counts as “helpful” or “well-written” can vary
- Context dependency: The same answer might be great for experts but confusing for beginners
This is why we need multiple evaluation approaches!
Cosine Similarity: Measuring Semantic Closeness
Cosine similarity measures how “close” your AI’s answer is to a reference answer. How well that closeness reflects meaning depends on the vectors you compare: embedding vectors capture meaning even when the wording differs, while the TF-IDF vectors used in the examples below mostly capture shared vocabulary.
The Geometric Intuition:
Imagine each piece of text as an arrow (vector) in a multi-dimensional space. Cosine similarity measures the angle between these arrows:
- 0° angle (similarity = 1.0): Arrows point in exactly the same direction (identical meaning)
- 90° angle (similarity = 0.0): Arrows are perpendicular (unrelated meaning)
- 180° angle (similarity = -1.0): Arrows point in opposite directions (contradictory meaning)
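The formula behind this picture is simply the dot product of the two vectors divided by the product of their lengths. Here is a minimal numpy sketch using made-up 3-dimensional vectors, purely for illustration:
import numpy as np

def cosine(u, v):
    # cos(angle) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])  # hypothetical vector for text A
b = np.array([2.0, 4.0, 0.0])  # points in the same direction as A
c = np.array([0.0, 0.0, 3.0])  # perpendicular to A

print(cosine(a, b))  # 1.0 -> identical direction
print(cosine(a, c))  # 0.0 -> unrelated direction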
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def explain_cosine_similarity_with_examples():
"""
Demonstrate cosine similarity with intuitive examples
"""
# Example text pairs with expected similarity levels
examples = [
{
"text1": "The cat sat on the mat",
"text2": "A feline rested on the rug",
"expected": "Same meaning, but TF-IDF scores it near zero (no shared content words)"
},
{
"text1": "How do I reset my password?",
"text2": "What's the process for changing my login credentials?",
"expected": "Same intent, but again near zero for TF-IDF (no vocabulary overlap)"
},
{
"text1": "The weather is sunny today",
"text2": "I love eating pizza",
"expected": "Low (completely unrelated topics)"
},
{
"text1": "This product is excellent and I highly recommend it",
"text2": "This product is terrible and I would avoid it",
"expected": "Low (only 'product' overlaps; TF-IDF cannot detect the opposite sentiment)"
}
]
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
print("🔄 Cosine Similarity Examples")
print("=" * 60)
for i, example in enumerate(examples, 1):
# Calculate similarity
texts = [example["text1"], example["text2"]]
tfidf_matrix = vectorizer.fit_transform(texts)
similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
print(f"\n📝 Example {i}:")
print(f" Text 1: '{example['text1']}'")
print(f" Text 2: '{example['text2']}'")
print(f" Expected: {example['expected']}")
print(f" 📊 Cosine Similarity: {similarity:.3f}")
explain_cosine_similarity_with_examples()
Practical Implementation for Answer Evaluation:
def evaluate_answer_similarity(generated_answers, reference_answers):
"""
Evaluate generated answers using cosine similarity
"""
vectorizer = TfidfVectorizer(
stop_words='english',
lowercase=True,
ngram_range=(1, 2), # Include word pairs for better context
max_features=5000
)
similarities = []
print("📊 Answer Quality Evaluation using Cosine Similarity")
print("=" * 60)
for i, (generated, reference) in enumerate(zip(generated_answers, reference_answers), 1):
# Create vectors for both answers
texts = [reference, generated]
tfidf_matrix = vectorizer.fit_transform(texts)
# Calculate similarity
similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
similarities.append(similarity)
print(f"\n🔍 Answer {i}:")
print(f" Reference: '{reference[:80]}...'")
print(f" Generated: '{generated[:80]}...'")
print(f" 📊 Similarity: {similarity:.3f} ({interpret_similarity_score(similarity)})")
average_similarity = np.mean(similarities)
print(f"\n🎯 Average Cosine Similarity: {average_similarity:.3f}")
print(f"🏆 Overall Quality: {interpret_similarity_score(average_similarity)}")
return similarities, average_similarity
def interpret_similarity_score(score):
"""Convert similarity scores to human-readable quality levels"""
if score >= 0.8:
return "Excellent match"
elif score >= 0.6:
return "Good alignment"
elif score >= 0.4:
return "Moderate similarity"
elif score >= 0.2:
return "Low similarity"
else:
return "Poor match"
# Example evaluation
example_generated = [
"To reset your password, go to the login page and click 'Forgot Password', then follow the email instructions.",
"Our business hours are Monday through Friday, 9 AM to 6 PM, and weekends 10 AM to 4 PM.",
"You can track your order by logging into your account and checking the order status page."
]
example_reference = [
"Password reset process: 1) Navigate to login screen 2) Select 'Forgot Password' 3) Enter email 4) Check email for reset link",
"We're open Monday-Friday 9:00 AM to 6:00 PM, and Saturday-Sunday 10:00 AM to 4:00 PM.",
"To track orders, sign in to your account and visit the 'My Orders' section where you can see current status."
]
similarities, avg_sim = evaluate_answer_similarity(example_generated, example_reference)
ROUGE Scores: Measuring Content Overlap
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how much your generated answer overlaps with reference answers in terms of words and phrases.
Types of ROUGE Scores:
- ROUGE-1: Overlap of individual words (unigrams)
- ROUGE-2: Overlap of two-word phrases (bigrams)
- ROUGE-L: Longest common subsequence (measures word order)
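Before reaching for the library, it helps to see what ROUGE-1 actually counts. The sketch below hand-computes unigram precision, recall, and F1 for a toy pair of sentences (a simplified illustration that skips the stemming rouge_scorer applies by default):
from collections import Counter

def rouge1_f1(reference, generated):
    # Count overlapping words, clipped by how often each word appears in the reference
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    overlap = sum((ref_counts & gen_counts).values())
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))  # ~0.83 (5 of 6 words overlap)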
from rouge_score import rouge_scorer
def comprehensive_rouge_evaluation(generated_answers, reference_answers):
"""
Calculate ROUGE scores with detailed explanation
"""
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
all_rouge1_f = []
all_rouge2_f = []
all_rougeL_f = []
print("🌹 ROUGE Score Evaluation")
print("=" * 50)
for i, (generated, reference) in enumerate(zip(generated_answers, reference_answers), 1):
scores = scorer.score(reference, generated)
rouge1_f = scores['rouge1'].fmeasure
rouge2_f = scores['rouge2'].fmeasure
rougeL_f = scores['rougeL'].fmeasure
all_rouge1_f.append(rouge1_f)
all_rouge2_f.append(rouge2_f)
all_rougeL_f.append(rougeL_f)
print(f"\n📝 Answer {i}:")
print(f" Generated: '{generated[:60]}...'")
print(f" Reference: '{reference[:60]}...'")
print(f" 📊 ROUGE-1: {rouge1_f:.3f} (word overlap)")
print(f" 📊 ROUGE-2: {rouge2_f:.3f} (phrase overlap)")
print(f" 📊 ROUGE-L: {rougeL_f:.3f} (sequence overlap)")
print(f" 🏆 Quality: {interpret_rouge_scores(rouge1_f, rouge2_f, rougeL_f)}")
# Calculate averages
avg_rouge1 = np.mean(all_rouge1_f)
avg_rouge2 = np.mean(all_rouge2_f)
avg_rougeL = np.mean(all_rougeL_f)
print(f"\n📊 AVERAGE ROUGE SCORES:")
print(f" ROUGE-1: {avg_rouge1:.3f}")
print(f" ROUGE-2: {avg_rouge2:.3f}")
print(f" ROUGE-L: {avg_rougeL:.3f}")
return {
'rouge1': all_rouge1_f,
'rouge2': all_rouge2_f,
'rougeL': all_rougeL_f,
'averages': {
'rouge1': avg_rouge1,
'rouge2': avg_rouge2,
'rougeL': avg_rougeL
}
}
def interpret_rouge_scores(rouge1, rouge2, rougeL):
"""Interpret ROUGE scores in plain English"""
avg_score = (rouge1 + rouge2 + rougeL) / 3
if avg_score >= 0.6:
return "Excellent content match"
elif avg_score >= 0.4:
return "Good content overlap"
elif avg_score >= 0.25:
return "Moderate content similarity"
else:
return "Low content overlap"
# Practical example
rouge_results = comprehensive_rouge_evaluation(example_generated, example_reference)
When ROUGE is Most Useful:
- Summarization tasks: Measuring how well key information is preserved
- Factual questions: Where specific terms and phrases matter
- Content coverage: Ensuring important details aren’t missed
ROUGE Limitations to Remember:
- Doesn’t understand meaning: High word overlap doesn’t guarantee good answers
- Penalizes paraphrasing: Different but correct phrasings get lower scores (see the sketch below)
- Ignores answer quality: Fluency and helpfulness aren’t measured
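As a quick sanity check on the paraphrasing point, a perfectly correct rewording can still score poorly. This small sketch reuses the rouge_scorer import from above; the example sentences are made up for illustration:
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)

reference = "Click 'Forgot Password' on the login page to reset your password."
paraphrase = "Use the credential recovery link on the sign-in screen to regain access."
near_copy = "Click 'Forgot Password' on the login page to reset it."

print(scorer.score(reference, paraphrase)['rouge1'].fmeasure)  # low, even though the answer is correct
print(scorer.score(reference, near_copy)['rouge1'].fmeasure)   # high, because most words are shared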
Building a Complete Answer Quality Evaluation Pipeline
Let’s combine multiple metrics for comprehensive answer evaluation:
def comprehensive_answer_evaluation(generated_answers, reference_answers, questions):
"""
Complete evaluation pipeline combining multiple metrics
"""
print("🎯 COMPREHENSIVE ANSWER QUALITY EVALUATION")
print("=" * 60)
# 1. Cosine Similarity Evaluation
print("\n1⃣ SEMANTIC SIMILARITY ANALYSIS")
similarities, avg_similarity = evaluate_answer_similarity(generated_answers, reference_answers)
# 2. ROUGE Score Evaluation
print("\n2⃣ CONTENT OVERLAP ANALYSIS")
rouge_results = comprehensive_rouge_evaluation(generated_answers, reference_answers)
# 3. Length and Structure Analysis
print("\n3⃣ STRUCTURAL ANALYSIS")
structure_analysis = analyze_answer_structure(generated_answers, reference_answers)
# 4. Overall Quality Assessment
print("\n4⃣ OVERALL QUALITY ASSESSMENT")
overall_scores = []
for i, (gen, ref, q) in enumerate(zip(generated_answers, reference_answers, questions)):
# Combine metrics for overall score
semantic_score = similarities[i]
content_score = rouge_results['rouge1'][i]
structure_score = structure_analysis['structure_scores'][i]
# Weighted combination (adjust weights based on your priorities)
overall_score = (
0.4 * semantic_score + # Semantic meaning is important
0.3 * content_score + # Content coverage matters
0.3 * structure_score # Structure and completeness count
)
overall_scores.append(overall_score)
print(f"\n📝 Answer {i+1}: '{q[:50]}...'")
print(f" 🔄 Semantic: {semantic_score:.3f}")
print(f" 🌹 Content: {content_score:.3f}")
print(f" 📏 Structure: {structure_score:.3f}")
print(f" 🏆 Overall: {overall_score:.3f} ({interpret_overall_score(overall_score)})")
final_score = np.mean(overall_scores)
print(f"\n🎯 FINAL EVALUATION SUMMARY")
print(f" Average Overall Score: {final_score:.3f}")
print(f" System Quality Level: {interpret_overall_score(final_score)}")
return {
'individual_scores': overall_scores,
'average_score': final_score,
'detailed_metrics': {
'similarities': similarities,
'rouge_scores': rouge_results,
'structure_analysis': structure_analysis
}
}
def analyze_answer_structure(generated_answers, reference_answers):
"""
Analyze structural aspects of answers
"""
structure_scores = []
for gen, ref in zip(generated_answers, reference_answers):
# Length appropriateness (not too short, not too long)
ref_length = len(ref.split())
gen_length = len(gen.split())
length_ratio = min(gen_length, ref_length) / max(gen_length, ref_length) if max(gen_length, ref_length) > 0 else 0.0  # guard against empty answers
# Completeness indicators
has_proper_punctuation = gen.strip().endswith(('.', '!', '?'))
has_appropriate_length = 10 <= gen_length <= 200 # Adjust based on your domain
# Calculate structure score
structure_score = (
0.5 * length_ratio + # Appropriate length
0.3 * (1.0 if has_proper_punctuation else 0.0) + # Proper ending
0.2 * (1.0 if has_appropriate_length else 0.0) # Reasonable length
)
structure_scores.append(structure_score)
return {
'structure_scores': structure_scores,
'average_structure_score': np.mean(structure_scores)
}
def interpret_overall_score(score):
"""Interpret overall quality scores"""
if score >= 0.8:
return "Excellent Quality"
elif score >= 0.6:
return "Good Quality"
elif score >= 0.4:
return "Fair Quality"
else:
return "Needs Improvement"
# Example usage with complete evaluation
example_questions = [
"How do I reset my password?",
"What are your business hours?",
"How can I track my order?"
]
comprehensive_results = comprehensive_answer_evaluation(
example_generated,
example_reference,
example_questions
)
This comprehensive approach gives you a well-rounded view of answer quality that considers multiple important aspects!
What’s Next?
Congratulations! You’ve now mastered the core evaluation techniques for both retrieval and answer quality. You can measure how well systems find information AND how well they use that information to generate helpful responses.
Ready for the advanced techniques? Continue to Part 3: Advanced Metrics and Production Best Practices where you’ll learn:
- LLM-as-a-Judge evaluation techniques
- Advanced metrics and specialized evaluation approaches
- Building production-ready evaluation pipelines
- Hands-on projects to cement your skills
- Expert tips and troubleshooting strategies
Quick recap of your new superpowers:
- Hit Rate and MRR for retrieval evaluation
- Cosine similarity for semantic answer quality
- ROUGE scores for content overlap analysis
- Comprehensive evaluation pipelines that combine multiple metrics
- Practical interpretation of evaluation results
You’re well on your way to becoming an evaluation expert! The advanced techniques in Part 3 will give you the final tools you need to evaluate any LLM system with confidence.