Async Web Scraping with scrapy_cffi



This content originally appeared on DEV Community and was authored by strange Funny

Introduction

scrapy_cffi is a lightweight async-first web scraping framework that follows a Scrapy-style architecture.

It is designed for developers who want a familiar crawling flow, but with full asyncio support, modular utilities, and flexible integration points.

The framework uses curl_cffi as the default HTTP client: a requests-like API, but more powerful. The request layer is fully decoupled from the engine, however, allowing other HTTP libraries to be swapped in easily in the future.

Even if you don’t need a full crawler, many of the utility libraries can be used independently.

💡 IDE-friendly: The framework emphasizes code completion, type hints, and programmatic settings creation, making development and debugging smoother in modern Python IDEs.

Why scrapy_cffi?

scrapy_cffi was designed with several core principles in mind:

  • API-first & Modular: All spiders, pipelines, and tasks are fully accessible via Python interfaces. CLI is optional, and settings are generated programmatically to support both single and batch spider execution modes.
  • Async Execution: Fully asyncio-based engine allows high concurrency, HTTP + WebSocket support, and smooth integration with async workflows.
  • Scrapy-style Architecture: Spider flow, pipelines, and hooks resemble Scrapy, making it easy for existing Scrapy users to transition.
  • Decoupled Request Layer: By default, curl_cffi is used, but the scheduler and engine are independent of the HTTP client. This allows flexible swapping of request libraries without touching the crawler core.
  • Utility-first: Components like HTTP, WebSocket, media handling, JSON parsing, and database adapters can be used independently, supporting small scripts or full asynchronous crawlers alike.

✨ Features

  • 🕸 Scrapy-style components: spiders, items, pipelines, interceptors
  • ⚡ Fully asyncio-based engine for high concurrency
  • 🌐 HTTP & WebSocket support with TLS
  • 🔔 Lightweight signal system
  • 🔌 Plug-in ready interceptor & task manager
  • 🗄 Redis-compatible scheduler (optional)
  • 💾 Built-in adapters for Redis, MySQL, and MongoDB with automatic retry & reconnection

🚀 Quick Start

# Install
pip install scrapy_cffi

# Create a new project
scrapy-cffi startproject myproject
cd myproject

# Generate a spider
scrapy-cffi genspider myspider example.com

# Run your crawler
python runner.py

Note: The CLI command changed from scrapy_cffi (≤0.1.4) to scrapy-cffi (>0.1.4).
Because scrapy_cffi uses programmatic settings creation and an API-first design, the framework does not rely on the CLI for spider execution.
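
Since execution is API-first, runner.py is typically just a small script that builds settings in code and starts the spider. The sketch below shows the general shape; create_settings and run_spider are illustrative assumptions, not the framework's confirmed API:

# runner.py: illustrative sketch only; check the docs for the real entry points
import asyncio

from scrapy_cffi import create_settings, run_spider  # hypothetical names
from myproject.spiders.myspider import MySpider

async def main():
    # Settings are created programmatically instead of via a CLI-managed file
    settings = create_settings(concurrency=16, download_timeout=30)  # assumed signature
    await run_spider(MySpider, settings=settings)

if __name__ == "__main__":
    asyncio.run(main())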

Full documentation: docs/

⭐ Star & contribute on GitHub: scrapy_cffi

⚡ Handy Utilities

scrapy_cffi provides several async-first and utility-focused features that make crawling and async task orchestration easier:

Async Crawling

  • Supports both async generators (async def callbacks) and Scrapy-style synchronous generators; see the sketch after this list.
  • Fully asyncio-based execution with high concurrency.
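
For example, a spider callback can be written as an async generator. The base class and request import paths below are assumptions based on the Scrapy-style layout; the .css() selector calls follow the response API described later in this post:

# Illustrative Scrapy-style spider; the exact import paths are assumptions
from scrapy_cffi.spiders import Spider    # assumed module path
from scrapy_cffi.http import HttpRequest  # assumed module path

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://example.com/quotes"]

    # Async-generator callback: await freely inside, then yield items or requests
    async def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        next_page = response.css("a.next::attr(href)").get()
        if next_page:  # assuming an absolute URL here
            yield HttpRequest(next_page, callback=self.parse)

A synchronous def parse(self, response) generator in the classic Scrapy style is accepted as well.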

ResultHolder

  • Aggregate multiple request results before generating the next batch.
  • Useful for multi-stage workflows and distributed tasks.

Hooks System

  • Access sessions, scheduler, or other subsystems safely.
  • Supports multi-user cookies and session rotation.

HTTP + WebSocket Requests

  • Send HTTP & WebSocket requests in a single Spider.
  • TLS support included.
  • Advanced curl_cffi features: TLS/JA3 fingerprinting, proxy control, unified HTTP/WS API (example below).
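
Because the request layer sits on curl_cffi, its fingerprinting features carry over. Outside the framework, the underlying client can be used directly like this (standard curl_cffi usage; the impersonate target and proxy address are just examples):

# Plain curl_cffi usage: the client scrapy_cffi builds on
import asyncio
from curl_cffi.requests import AsyncSession

async def main():
    async with AsyncSession() as session:
        # impersonate aligns the TLS/JA3 fingerprint with a real browser
        resp = await session.get(
            "https://example.com",
            impersonate="chrome",                        # fingerprint target
            proxies={"https": "http://127.0.0.1:8080"},  # optional proxy control
        )
        print(resp.status_code)

asyncio.run(main())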

Request & Response Utilities

  • HttpRequest / WebSocketRequest with optional Protobuf & gRPC encoding.
  • MediaRequest for segmented downloads (videos, large files).
  • HttpResponse selector with .css(), .xpath(), .re().
  • Robust JSON extraction (sketched after this list):
    • extract_json() for standard JSON.
    • extract_json_strong() for malformed or embedded JSON.
  • Protobuf / gRPC decoding from HTTP or WebSocket responses.
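
As a quick illustration of the two JSON helpers (the import path is an assumption; the behavior shown follows the descriptions above):

# Sketch of the JSON helpers; the import location is assumed
from scrapy_cffi.utils import extract_json, extract_json_strong  # assumed path

clean = '{"price": 9.99}'
messy = "window.__DATA__ = {price: 9.99, currency: 'USD',};"  # JS-style, not valid JSON

data = extract_json(clean)             # standard, well-formed JSON
embedded = extract_json_strong(messy)  # tolerant parse for malformed/embedded JSON
print(data, embedded)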

Database Support

Built-in adapters with automatic retry & reconnection:

  • RedisManager (redis.asyncio.Redis compatible; see the sketch below)
  • SQLAlchemyMySQLManager (async SQLAlchemy engine & session, original API supported)
  • MongoDBManager (async Motor client, native API supported)
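
Since RedisManager is redis.asyncio.Redis-compatible, the familiar command surface should work once connected. Only the redis-style calls below are standard; the import path and constructor arguments are assumptions:

# Hedged sketch: import path and constructor args are assumptions
import asyncio
from scrapy_cffi.db import RedisManager  # assumed module path

async def main():
    redis = RedisManager(url="redis://localhost:6379/0")  # assumed signature
    await redis.set("spider:last_run", "2024-01-01")      # redis.asyncio-style call
    print(await redis.get("spider:last_run"))

asyncio.run(main())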

MongoDB & MySQL usage examples are available in the project documentation.

Multi-process RPC with ProcessManager

scrapy_cffi includes a lightweight ProcessManager for quick multi-process RPC registration.

This is ideal for small projects or debugging without relying on MQ/Redis, but not recommended for production.

  • Supports function, class, and object registration for remote calls.
  • Allows starting a server to expose registered methods and a client to connect and call them.
  • Runs each registered callable in a separate process if needed, with optional result retrieval.
  • Works cross-platform, but Windows has some Ctrl+C limitations due to process startup.

from scrapy_cffi.utils import ProcessManager

# Register methods
def hello(name: str):
    return f"Hello, {name}!"

class Greeter:
    def greet(self, msg: str):
        return f"Greeting: {msg}"

class Counter:
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value
    def get(self):
        return self.value

counter = Counter()

# Start server (blocking mode: run this part in one process/script)
manager = ProcessManager(register_methods={
    "hello": hello,
    "Greeter": Greeter,
    "counter": counter
})
manager.start_server(run_mode=0)  # blocking mode

# Start client (run from a separate process/script, since the server call above blocks)
manager.start_client()
print(manager.hello("World"))
c = manager.counter()
print(c.inc())
g = manager.Greeter()
print(g.greet("Hi"))

Tip: ProcessManager is designed for rapid prototyping and small-scale tasks. For production-grade distributed systems, consider using a full-featured message queue or RPC framework.

scrapy_cffi is currently in development. Its modular and API-first design allows developers to either use it as a full-fledged Scrapy-style framework or pick individual utilities for smaller, async-first scraping tasks. The ultimate goal is high flexibility, independent utilities, and easy extensibility for complex crawling projects.

