This content originally appeared on DEV Community and was authored by strange Funny
Introduction
scrapy_cffi is a lightweight async-first web scraping framework that follows a Scrapy-style architecture.
It is designed for developers who want a familiar crawling flow, but with full asyncio support, modular utilities, and flexible integration points.
The framework uses curl_cffi as the default HTTP client (a requests-like API, but more powerful), yet the request layer is fully decoupled from the engine, allowing easy replacement with other HTTP libraries in the future.
Even if you don’t need a full crawler, many of the utility libraries can be used independently.
IDE-friendly: The framework emphasizes code completion, type hints, and programmatic settings creation, making development and debugging smoother in modern Python IDEs.
Why scrapy_cffi?
scrapy_cffi was designed with several core principles in mind:
- API-first & Modular: All spiders, pipelines, and tasks are fully accessible via Python interfaces. CLI is optional, and settings are generated programmatically to support both single and batch spider execution modes.
- Async Execution: Fully asyncio-based engine allows high concurrency, HTTP + WebSocket support, and smooth integration with async workflows.
- Scrapy-style Architecture: Spider flow, pipelines, and hooks resemble Scrapy, making it easy for existing Scrapy users to transition.
- Decoupled Request Layer: By default, curl_cffi is used, but the scheduler and engine are independent of the HTTP client. This allows flexible swapping of request libraries without touching the crawler core.
- Utility-first: Components like HTTP, WebSocket, media handling, JSON parsing, and database adapters can be used independently, supporting small scripts or full asynchronous crawlers alike.
Features
- Scrapy-style components: spiders, items, pipelines, interceptors
- Fully asyncio-based engine for high concurrency
- HTTP & WebSocket support with TLS
- Lightweight signal system
- Plug-in ready interceptor & task manager
- Redis-compatible scheduler (optional)
- Built-in adapters for Redis, MySQL, and MongoDB with automatic retry & reconnection
Quick Start
```bash
# Install
pip install scrapy_cffi

# Create a new project
scrapy-cffi startproject myproject
cd myproject

# Generate a spider
scrapy-cffi genspider myspider example.com

# Run your crawler
python runner.py
```
Note: The CLI command changed from scrapy_cffi (≤0.1.4) to scrapy-cffi (>0.1.4). Because scrapy_cffi uses programmatic settings creation and an API-first design, the framework does not rely on the CLI for spider execution.
Full documentation: docs/
Star & contribute on GitHub: scrapy_cffi
Handy Utilities
scrapy_cffi provides several async-first and utility-focused features that make crawling and async task orchestration easier:
Async Crawling
- Supports both async generators (async def) and Scrapy-style synchronous generators.
- Fully asyncio-based execution with high concurrency.
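As a rough illustration of the two callback styles above, here is a minimal spider sketch with an async-generator parse method and a synchronous-generator item callback. The Spider base class, the HttpRequest import path, and the parsel-style get()/getall() selector helpers are assumptions for illustration; only the Scrapy-like flow itself is described by the framework.

```python
# Hedged sketch: import paths, the Spider base class, and the selector
# helpers (get()/getall()) are assumptions for illustration.
from scrapy_cffi.spiders import Spider        # assumed
from scrapy_cffi.requests import HttpRequest  # assumed

class ExampleSpider(Spider):
    name = "example"

    async def parse(self, response):
        # Async-generator callback: awaiting is allowed between yields.
        for href in response.css("a::attr(href)").getall():
            yield HttpRequest(url=href, callback=self.parse_item)

    def parse_item(self, response):
        # A Scrapy-style synchronous generator is accepted as well.
        yield {"title": response.css("title::text").get()}
```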
ResultHolder
- Aggregate multiple request results before generating the next batch.
- Useful for multi-stage workflows and distributed tasks.
Hooks System
- Access sessions, scheduler, or other subsystems safely.
- Supports multi-user cookies and session rotation.
HTTP + WebSocket Requests
- Send HTTP & WebSocket requests in a single Spider.
- TLS support included.
- Advanced curl_cffi features: TLS/JA3 fingerprinting, proxy control, unified HTTP/WS API.
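Because both request types go through the same engine, a single spider can speak HTTP and WebSocket at once. The sketch below uses the HttpRequest and WebSocketRequest classes listed in the next section; the impersonate and proxy parameters are assumptions modeled on curl_cffi's options, not confirmed signatures.

```python
# Hedged sketch: parameter names (impersonate, proxy) and import paths are
# assumptions for illustration; the real signatures may differ.
from scrapy_cffi.spiders import Spider                          # assumed
from scrapy_cffi.requests import HttpRequest, WebSocketRequest  # assumed

class MixedSpider(Spider):
    name = "mixed"

    async def start_requests(self):
        yield HttpRequest(
            url="https://example.com/api/items",
            impersonate="chrome",           # browser TLS/JA3 fingerprint (assumed)
            proxy="http://127.0.0.1:7890",  # proxy control (assumed)
            callback=self.parse_http,
        )
        yield WebSocketRequest(
            url="wss://example.com/live",
            callback=self.parse_ws,
        )

    async def parse_http(self, response):
        yield {"source": "http", "body_length": len(response.text)}

    async def parse_ws(self, response):
        yield {"source": "ws", "frame": response.text}
```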
Request & Response Utilities
- HttpRequest / WebSocketRequest with optional Protobuf & gRPC encoding.
- MediaRequest for segmented downloads (videos, large files).
- HttpResponse selector with .css(), .xpath(), .re().
- Robust JSON extraction:
  - extract_json() for standard JSON.
  - extract_json_strong() for malformed or embedded JSON.
- Protobuf / gRPC decoding from HTTP or WebSocket responses.
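The JSON helpers are handy even outside a crawl. The snippet below is a small sketch; the scrapy_cffi.utils import path for these functions is an assumption (only ProcessManager is confirmed to live there later in this post), and the exact signatures may differ.

```python
# Hedged sketch of the JSON extraction helpers; the import path and exact
# signatures are assumptions for illustration.
from scrapy_cffi.utils import extract_json, extract_json_strong  # assumed path

clean = '{"items": [1, 2, 3]}'
messy = 'window.__DATA__ = {"items": [1, 2, 3],};'  # JS wrapper + trailing comma

print(extract_json(clean))         # standard, well-formed JSON
print(extract_json_strong(messy))  # tolerant parsing of embedded/malformed JSON
```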
Database Support
Built-in adapters with automatic retry & reconnection:
- RedisManager (redis.asyncio.Redis compatible)
- SQLAlchemyMySQLManager (async SQLAlchemy engine & session, original API supported)
- MongoDBManager (async Motor client, native API supported)
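Since RedisManager mirrors the redis.asyncio.Redis interface, existing redis-py async code should carry over with little change. The import path and constructor arguments in this sketch are assumptions for illustration.

```python
# Hedged sketch of the Redis adapter; the import path and constructor
# signature are assumptions for illustration.
import asyncio
from scrapy_cffi.db import RedisManager  # assumed import path

async def main():
    redis = RedisManager(url="redis://localhost:6379/0")  # assumed signature
    # Familiar redis.asyncio-style calls; retry/reconnect is handled by the adapter.
    await redis.set("crawl:last_run", "2024-01-01")
    print(await redis.get("crawl:last_run"))
    await redis.aclose()

asyncio.run(main())
```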
Multi-process RPC with ProcessManager
scrapy_cffi includes a lightweight ProcessManager for quick multi-process RPC registration.
This is ideal for small projects or debugging without relying on MQ/Redis, but not recommended for production.
- Supports function, class, and object registration for remote calls.
- Allows starting a server to expose registered methods and a client to connect and call them.
- Runs each registered callable in a separate process if needed, with optional result retrieval.
- Works cross-platform, but Windows has some Ctrl+C limitations because of how child processes are started there.
```python
from scrapy_cffi.utils import ProcessManager

# Register methods
def hello(name: str):
    return f"Hello, {name}!"

class Greeter:
    def greet(self, msg: str):
        return f"Greeting: {msg}"

class Counter:
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

    def get(self):
        return self.value

counter = Counter()

# Start server
manager = ProcessManager(register_methods={
    "hello": hello,
    "Greeter": Greeter,
    "counter": counter
})
manager.start_server(run_mode=0)  # blocking mode

# Start client
manager.start_client()
print(manager.hello("World"))

c = manager.counter()
print(c.inc())

g = manager.Greeter()
print(g.greet("Hi"))
```
Tip: ProcessManager is designed for rapid prototyping and small-scale tasks. For production-grade distributed systems, consider using a full-featured message queue or RPC framework.
scrapy_cffi is currently in development. Its modular and API-first design allows developers to either use it as a full-fledged Scrapy-style framework or pick individual utilities for smaller, async-first scraping tasks. The ultimate goal is high flexibility, independent utilities, and easy extensibility for complex crawling projects.