Multi-Modal Content Processing with Strands Agent and just a few lines of code



This content originally appeared on DEV Community and was authored by Elizabeth Fuentes L


GitHub repository: Strands Agent Multi-Understanding

In this blog, you’ll learn how to create multi-modal AI agents that move beyond text-only interactions to understand and process diverse content types. Whether you need to extract data from PDFs, analyze image content, or understand video sequences, multi-modal agents provide the flexibility to handle diverse use cases.

Using the Strands Agents framework, you can build sophisticated agents with only a few lines of code.

⭐ Getting Started with the Strands Agent Framework

If this is your first time with Strands Agents, follow the steps in the documentation or check out the blog post First Impressions with Strands Agents SDK by my colleague Laura Salinas, where she explains in detail how to use this new approach to creating agents.

After installing the Strands Agents framework, I used tools to make the agent capable of understanding different input types. Tools are mechanisms that extend an agent's capabilities, letting it perform actions beyond simple text generation.

Strands offers an optional example tools package called strands-agents-tools. For this agent, I used two tools from the package: image_reader and file_read, which are ready to use out of the box.

🧰 Creating Custom Video Processing Tools

Since no tool existed for video processing, I created a custom one by reusing code from one of my previous applications, Processing WhatsApp Multimedia with Amazon Bedrock Agents: Images, Video, and Documents. I followed the steps in Adding Tools to Agents and used the Strands Agent Builder for assistance.

The Strands Agent Builder is an interactive toolkit designed to help you build, test, and extend your own custom AI agents and tools. I highly recommend installing it for your development workflow.
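To give a sense of the structure, here is a minimal sketch of what a custom video tool can look like, built with the Strands @tool decorator and the Amazon Bedrock Converse API. The function body, parameters, and prompt are illustrative assumptions, not the exact code from the repository.

import boto3
from strands import tool

# Nova Pro is used because it supports video understanding through the Converse API
VIDEO_MODEL_ID = "us.amazon.nova-pro-v1:0"

@tool
def video_reader(video_path: str, query: str = "Describe this video") -> str:
    """Analyze a local video file (MP4, MOV, AVI, MKV, WebM) and answer a question about it."""
    video_format = video_path.rsplit(".", 1)[-1].lower()
    with open(video_path, "rb") as f:
        video_bytes = f.read()

    client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId=VIDEO_MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"video": {"format": video_format, "source": {"bytes": video_bytes}}},
                {"text": query},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

Strands reads the function's type hints and docstring to describe the tool to the model, so the agent knows when to call it for video files.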

With the video_reader tool in place, here's the final code for my multi-modal agent:

from strands import Agent
from strands.models import BedrockModel
from strands_tools import image_reader, file_read
# video_reader is the custom video tool described above (see the repository for its full code)

# Configure the model
bedrock_model = BedrockModel(
    model_id="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    temperature=0.3
)

# Create an agent with multi-modal capabilities
# (MULTIMODAL_SYSTEM_PROMPT is shown further down)
multimodal_agent = Agent(
    system_prompt=MULTIMODAL_SYSTEM_PROMPT,
    tools=[image_reader, file_read, video_reader],
    model=bedrock_model
)

This application uses Amazon Bedrock, but since Strands Agents is open source, you can make modifications to use your preferred model provider.

This implementation uses two different models:

  • us.anthropic.claude-3-5-sonnet-20241022-v2:0 (Claude 3.5 Sonnet) for images, documents, and overall agent orchestration.
  • us.amazon.nova-pro-v1:0 (Amazon Nova Pro) for the video analysis tool.

The agent uses the following system prompt to guide its multi-modal processing behavior:

MULTIMODAL_SYSTEM_PROMPT = """ You are a helpful assistant that can process documents, images, and videos. 
Analyze their contents and provide relevant information.

You can:

1. For PNG, JPEG/JPG, GIF, or WebP formats, use image_reader to process the file
2. For PDF, CSV, DOCX, XLS, or XLSX formats, use file_read to process the file
3. For MP4, MOV, AVI, MKV, or WebM formats, use video_reader to process the file
4. Just deliver the answer

When displaying responses:
- Format answers in a human-readable way
- Highlight important information
- Handle errors appropriately
- Convert technical terms to user-friendly language
- Always reply in the original user language
"""

The agent automatically determines which tool to use based on your request and the file type. This abstraction means you don’t need to specify whether you’re working with an image, document, or video—the agent handles the routing intelligently.

🤩 Getting Started

  1. Clone the GitHub repository:
   git clone https://github.com/elizabethfuentes12/strands-agent-multi-understanding
   cd strands-agent-multi-understanding
  2. Create a virtual environment:
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install the required dependencies:
   pip install -r requirements.txt
  4. Configure AWS credentials and Amazon Bedrock model access (a quick verification sketch follows this list).

👾 Remember: Strands Agents is open source, so you can modify it to use your preferred model provider.

  5. Run the multi-understanding.ipynb notebook to see the multi-modal agent in action.
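Before running the notebook, you can sanity-check your AWS setup with a few lines of boto3. This assumes a default profile and region are already configured; access to Claude 3.5 Sonnet and Nova Pro still has to be enabled in the Amazon Bedrock console:

import boto3

# Confirm that AWS credentials resolve to a valid identity
print(boto3.client("sts").get_caller_identity()["Arn"])

# List the Bedrock foundation models in your region that this post relies on
bedrock = boto3.client("bedrock")
for summary in bedrock.list_foundation_models()["modelSummaries"]:
    if "claude-3-5-sonnet" in summary["modelId"] or "nova-pro" in summary["modelId"]:
        print(summary["modelId"])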

Testing Your Multi-Modal Agent

Once configured, you can test your agent with various content types:

Process an image

response = multimodal_agent("Analyze this image and describe what you see: data-sample/image.jpg")

Process a document

response = multimodal_agent("Summarize as json the content of the document data-sample/Welcome-Strands-Agents-SDK.pdf")

Process a video

response = multimodal_agent("Summarize the content of this video: data-sample/video.mp4")
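Because all three tools are registered on the same agent, a single request can also mix content types; how the files are routed depends on the model's tool-use decisions:

response = multimodal_agent("Describe data-sample/image.jpg and explain how it relates to data-sample/Welcome-Strands-Agents-SDK.pdf")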

Advanced Implementation: Serverless Deployment using AWS CDK

For advanced users, I’m sharing the code for deploying this same application in an AWS Lambda function using AWS CDK. This approach provides a scalable, serverless solution for multi-modal content processing.
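As a rough idea, the Lambda function wraps the same agent setup in a handler. The sketch below is simplified and the event shape is an assumption; check the repository for the actual handler code.

import json
from strands import Agent
from strands.models import BedrockModel
from strands_tools import image_reader, file_read

# Build the agent once at module load so warm Lambda invocations reuse it
bedrock_model = BedrockModel(
    model_id="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    temperature=0.3,
)
agent = Agent(
    system_prompt="You are a helpful assistant that can process documents, images, and videos.",
    tools=[image_reader, file_read],
    model=bedrock_model,
)

def handler(event, context):
    # Assumes the caller sends {"prompt": "..."} in the request body
    body = json.loads(event.get("body") or "{}")
    result = agent(body.get("prompt", ""))
    return {
        "statusCode": 200,
        "body": json.dumps({"response": str(result)}),
    }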

Deployment Steps

Navigate to the CDK directory:

cd my_agent_cdk

Set up the environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Install Lambda layer dependencies:

pip install -r layers/lambda_requirements.txt --python-version 3.12 --platform manylinux2014_aarch64 --target layers/strands/_dependencies --only-binary=:all:

Package the Lambda layers:

python layers/package_for_lambda.py

Bootstrap and deploy:

cdk bootstrap  # Only needed once per AWS account/region
cdk deploy
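For orientation, the stack is essentially a Lambda function plus a layer carrying the Strands dependencies packaged above. A condensed CDK sketch follows; construct names, paths, and sizing are illustrative, not the repository's exact code.

from aws_cdk import Duration, Stack, aws_lambda as _lambda
from constructs import Construct

class MultiModalAgentStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Layer produced by layers/package_for_lambda.py with the Strands dependencies
        strands_layer = _lambda.LayerVersion(
            self, "StrandsLayer",
            code=_lambda.Code.from_asset("layers/strands"),
            compatible_runtimes=[_lambda.Runtime.PYTHON_3_12],
        )

        # Lambda function that runs the multi-modal agent handler
        # (its execution role also needs bedrock:InvokeModel permissions, omitted here)
        _lambda.Function(
            self, "MultiModalAgentFunction",
            runtime=_lambda.Runtime.PYTHON_3_12,
            architecture=_lambda.Architecture.ARM_64,
            handler="app.handler",
            code=_lambda.Code.from_asset("lambda"),
            layers=[strands_layer],
            timeout=Duration.minutes(2),
            memory_size=1024,
        )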

Next Steps

This isn't everything I have for you! This is the first in a series of three posts where I'll show you the capabilities of Strands Agents with code snippets so you can learn to develop powerful applications with just a few lines of code.

In the next edition, I’ll add conversation memory management using a database, so conversations can be recovered at any time or persist throughout the session. In the third part, I’ll add a knowledge base and recreate my WhatsApp travel agent: Building a Travel Support Agent with RAG and PostgreSQL, Using IaC.

Stay tuned for more Strands Agent implementations!

Thank you!
