Building Strands Agents with a few lines of code: Evaluating Performance with RAGAs



This content originally appeared on DEV Community and was authored by Elizabeth Fuentes L


GitHub repository

This is the final part of our comprehensive guide to building AI agents with observability and evaluation capabilities using Strands Agents.

🔗 From Monitoring to Evaluation: Closing the Loop

In part 3, we implemented comprehensive observability for our restaurant agent using LangFuse. Now we’re taking it further by adding automated evaluation that not only measures performance but also sends evaluation scores back to LangFuse for centralized monitoring.

This creates a complete feedback loop: LangFuse tracks what occurs, RAGAS evaluates performance quality, and the scores flow back to LangFuse for unified observability.

🎯 Why Agent Evaluation Matters

Imagine deploying your restaurant agent to production, and users start complaining that it recommends closed restaurants or suggests seafood to vegetarians. How do you catch these issues before they reach users?

Automated evaluation addresses this challenge. While observability (from part 3) shows you what happened, evaluation tells you how well it happened.

The Problem with Manual Testing

Manual testing has limitations at scale:

  • Time-consuming: Testing 100 different queries manually takes hours
  • Inconsistent: Different people evaluate responses differently
  • Expensive: Requires human reviewers for every change
  • Limited coverage: Can’t test edge cases comprehensively

Enter LLM-as-a-Judge

LLM-as-a-Judge uses an AI model to evaluate another AI's outputs automatically, acting as an expert reviewer that can:

  • Evaluate thousands of responses in minutes
  • Apply consistent evaluation criteria
  • Scale with your application's growth
  • Identify subtle issues humans might miss

RAGAS (Retrieval Augmented Generation Assessment) provides the framework to implement LLM judges systematically, answering questions like:

  • How accurate are your agent’s responses?
  • Are responses grounded in source data?
  • Does the agent directly address user questions?

Without systematic evaluation, you lack visibility into production performance.

🤖 Setting Up the LLM-Judge

The foundation of our evaluation system is configuring an LLM to act as our judge. This is remarkably straightforward with RAGAS:

from ragas.llms import LangchainLLMWrapper

# Set up the evaluator LLM. `model` is the same LangChain chat model that
# powers your agent from part 1.
evaluator_llm = LangchainLLMWrapper(model)

This configuration creates a consistent evaluator that will assess your agent’s performance across all metrics. The key insight is using the same model that powers your agent – this ensures the evaluator understands the capabilities and limitations of the system it’s judging.

📊 RAGAS: Beyond Basic Metrics

Our notebook implementation uses a multi-dimensional evaluation suite that goes well beyond basic accuracy checks.

1. RAG-Specific Metrics

Context Relevance measures how well retrieved information addresses user queries – crucial for ensuring your vector database returns meaningful results.

Response Groundedness determines if agent responses are actually supported by the retrieved contexts, preventing hallucinations even when the right information is available.
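
RAGAS ships classes for both metrics. As a minimal sketch, assuming a recent ragas release that exposes ContextRelevance and ResponseGroundedness, they can be wired to the evaluator LLM from above and run against a single, hypothetical RAG interaction:

from ragas import evaluate, EvaluationDataset
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ContextRelevance, ResponseGroundedness

# Both metrics reuse the LLM judge configured earlier.
context_relevance = ContextRelevance(llm=evaluator_llm)
response_groundedness = ResponseGroundedness(llm=evaluator_llm)

# A hypothetical single-turn RAG interaction from the restaurant agent.
sample = SingleTurnSample(
    user_input="Do you have vegetarian options for dinner?",
    retrieved_contexts=["Dinner menu: grilled vegetable lasagna, mushroom risotto, margherita pizza."],
    response="Yes! We offer a grilled vegetable lasagna, a mushroom risotto, and a margherita pizza.",
)

result = evaluate(
    dataset=EvaluationDataset(samples=[sample]),
    metrics=[context_relevance, response_groundedness],
)
print(result)

Low scores on the first metric point at retrieval problems; low scores on the second point at hallucination.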

2. Conversational Quality Assessment

The notebook implements several AspectCritic metrics that evaluate nuanced aspects of agent behavior:

from ragas.metrics import AspectCritic

# Evaluates completeness of responses
request_completeness = AspectCritic(
    name="Request Completeness",
    llm=evaluator_llm,
    definition=(
        "Return 1 if the agent completely fulfills all the user requests with no omissions. "
        "otherwise, return 0."
    ),
)

# Ensures consistent brand voice
brand_tone = AspectCritic(
    name="Brand Voice Metric",
    llm=evaluator_llm,
    definition=(
        "Return 1 if the AI's communication is friendly, approachable, helpful, clear, and concise; "
        "otherwise, return 0."
    ),
)

# Evaluates appropriate tool usage
tool_usage_effectiveness = AspectCritic(
    name="Tool Usage Effectiveness",
    llm=evaluator_llm,
    definition=(
        "Return 1 if the agent appropriately used available tools to fulfill the user's request "
        "(such as using retrieve for menu questions and current_time for time questions). "
        "Return 0 if the agent failed to use appropriate tools or used unnecessary tools."
    ),
)

These AspectCritic metrics are powerful because they allow you to define exactly what “good performance” means for your specific use case through natural language definitions.
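
As a quick usage sketch, a conversation can be wrapped in a ragas MultiTurnSample (built from the message classes in ragas.messages) and scored with the critics defined above; the exchange below is hypothetical:

from ragas import evaluate, EvaluationDataset
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage

# A hypothetical exchange with the restaurant agent.
conversation = MultiTurnSample(
    user_input=[
        HumanMessage(content="What time do you close tonight? And is the lasagna gluten-free?"),
        AIMessage(content="We close at 10 PM tonight, and yes, our vegetable lasagna is gluten-free."),
    ]
)

result = evaluate(
    dataset=EvaluationDataset(samples=[conversation]),
    metrics=[request_completeness, brand_tone, tool_usage_effectiveness],
)
print(result)

Each AspectCritic returns 1 or 0 per conversation, which makes the scores easy to aggregate into pass rates.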

3. Recommendation Intelligence with Rubrics

This is where the evaluation system gets particularly sophisticated. The notebook implements a rubrics-based scoring system that evaluates how well agents handle complex scenarios:

from ragas.metrics import RubricsScore

# Define rubric for recommendation quality
rubrics = {
    "score-1_description": (
        "The item requested by the customer is not present in the menu "
        "and no recommendations were made."
    ),
    "score0_description": (
        "Either the item requested by the customer is present in the menu, "
        "or the conversation does not include any food or menu inquiry "
        "(e.g., booking, cancellation). This score applies regardless of "
        "whether any recommendation was provided."
    ),
    "score1_description": (
        "The item requested by the customer is not present in the menu "
        "and a recommendation was provided."
    ),
}

recommendations = RubricsScore(rubrics=rubrics, llm=evaluator_llm, name="Recommendations")

This rubric handles a common restaurant agent challenge: what happens when users ask for items that don’t exist? The scoring system:

  • Penalizes (-1) responses that leave a request for an unavailable item with no alternative offered
  • Stays neutral (0) when the requested item is available or the conversation isn't about food at all
  • Rewards (+1) responses that proactively offer alternatives for unavailable items

This nuanced scoring captures the difference between a basic “item not found” response and a helpful “we don’t have that, but here are similar options” approach.

🔄 The Complete Evaluation Pipeline

The implementation processes LangFuse traces into RAGAS-compatible evaluation datasets in three steps (a sketch of this flow follows the list):

  1. Automatic extraction of user inputs, agent responses, retrieved contexts, and tool usage patterns.
  2. Dual evaluation pathways: Single-turn RAG for interactions with retrieved contexts and multi-turn conversation assessment using AspectCritic and RubricsScore metrics.
  3. Automated score integration back to LangFuse via the create_score API.
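
Here is that sketch for the single-turn pathway. It assumes the Langfuse Python SDK's fetch_traces helper (adjust the call to your SDK version) and a hypothetical extract_rag_fields helper that pulls the user input, retrieved contexts, and final response out of a trace:

from langfuse import Langfuse
from ragas import evaluate, EvaluationDataset
from ragas.dataset_schema import SingleTurnSample

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables set up in part 3

# Step 1: pull recent traces produced by the restaurant agent.
# The tag filter is illustrative; use whatever labels your traces carry.
traces = langfuse.fetch_traces(tags=["restaurant-agent"], limit=25).data

# Step 2: convert each trace into a RAGAS sample. extract_rag_fields is a
# hypothetical helper that returns the user_input, retrieved_contexts, and
# response fields for a given trace.
samples = [SingleTurnSample(**extract_rag_fields(trace)) for trace in traces]

# Run the single-turn RAG metrics defined earlier.
result = evaluate(
    dataset=EvaluationDataset(samples=samples),
    metrics=[context_relevance, response_groundedness],
)

The multi-turn pathway works the same way, except traces are converted into MultiTurnSample objects and scored with the AspectCritic and RubricsScore metrics.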

📈 Real-World Impact: What You’ll See

After implementing this evaluation system, you'll have clear, quantified visibility into agent performance:

  • Performance Trending: Track how your agent’s performance evolves over time
  • Correlation Analysis: Identify patterns between user behavior and agent performance
  • Quality Gates: Set automated thresholds for immediate alerts when performance drops
  • A/B Testing Foundation: Compare different agent configurations with comprehensive metrics

🚀 Implementation Strategy

Getting Started

The complete notebook provides a ready-to-use implementation. The key steps involve:

  1. Setting up the LLM judge with the simple LangchainLLMWrapper configuration
  2. Defining comprehensive RAGAS metrics using AspectCritic and RubricsScore
  3. Implementing trace processing functions to extract evaluation data from LangFuse
  4. Creating evaluation pipelines that handle both RAG and conversational assessments
  5. Configuring automated score reporting back to LangFuse (a short sketch follows this list)
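
Continuing the sketch from the pipeline section, step 5 could look like the following. It assumes a Langfuse SDK version that exposes create_score (older SDKs use langfuse.score instead):

# `traces` and `result` come from the pipeline sketch above.
scores_df = result.to_pandas()

# Push only the numeric metric columns; sample fields such as user_input
# and response are skipped.
metric_columns = scores_df.select_dtypes(include="number").columns

for trace, (_, row) in zip(traces, scores_df.iterrows()):
    for metric_name in metric_columns:
        langfuse.create_score(
            trace_id=trace.id,  # attach the score to the trace it was computed from
            name=metric_name,
            value=float(row[metric_name]),
        )

langfuse.flush()  # make sure all scores are delivered before the script exits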

Remember: the goal isn’t perfect scores, but consistent improvement and early detection of issues before they impact users.

🛠 Common Challenges and Solutions

  • Low Context Relevance: Review your vector database setup, document chunking strategies, and embedding model selection.
  • Inconsistent Brand Voice: Enhance system prompts and provide clearer tone guidance in AspectCritic definitions.
  • Rubric Score Issues: Ensure each score level is clearly distinguishable and covers all possible scenarios.

Thank You for Following This Series!

Thank you for following along with this comprehensive series on building Strands Agents with just a few lines of code! Throughout these four parts, you’ve learned to:

  1. Build agents with custom tools and MCP integration – Creating powerful, extensible agents that can interact with external systems
  2. Implement agent-to-agent communication – Enabling sophisticated multi-agent workflows and collaboration
  3. Add comprehensive observability with LangFuse – Gaining deep insights into your agent’s behavior and performance
  4. Evaluate and improve performance with RAGAS – Implementing systematic evaluation to ensure quality at scale

You now have a complete toolkit for building production-ready AI agents that are observable, evaluable, and continuously improving.

Ready to build your next AI agent? Start with the Getting Started with Strands Agents: Build Your First AI Agent FREE course and begin experimenting with these powerful patterns today!

📚 Essential Resources

Thank you!
