Scrapling: A Next-Generation Smart Web Scraping Framework


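Overview

Scrapling is a modern web scraping framework developed by D4Vinci. Its defining feature is a set of three fetching strategies that cover everything from simple static pages to sites with aggressive anti-bot protection:

Fetching mode	Use case
HTTP fetching (`Fetcher`)	Static pages and API endpoints; fast and well suited to bulk requests
Dynamic fetching (`DynamicFetcher`)	JavaScript-rendered single-page applications (SPAs) and lazy-loaded content
Stealth fetching (`StealthyFetcher`)	Sites protected by Cloudflare, Turnstile, and similar anti-bot systems

Scrapling also ships with a built-in crawling framework (Spider) that follows links automatically to crawl multiple pages.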
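
One way to read the table above is as a decision rule: use the lightest fetcher that can still handle the page. The sketch below encodes that rule in plain Python. `Strategy` and `choose_strategy` are hypothetical helpers written for this post, not part of Scrapling; only the three class names come from the table.

```python
from enum import Enum


class Strategy(Enum):
    """The three fetching strategies; values are the Scrapling class names."""
    HTTP = "Fetcher"
    DYNAMIC = "DynamicFetcher"
    STEALTH = "StealthyFetcher"


def choose_strategy(needs_js: bool, has_antibot: bool) -> Strategy:
    """Pick the lightest strategy that can handle the page (hypothetical helper)."""
    if has_antibot:
        # Anti-bot protection needs the stealth browser regardless of JS
        return Strategy.STEALTH
    if needs_js:
        # JS-rendered content needs a real browser, but no stealth features
        return Strategy.DYNAMIC
    # Static pages and APIs: plain HTTP is enough
    return Strategy.HTTP


print(choose_strategy(needs_js=True, has_antibot=False).value)
```

Checking for anti-bot protection first is deliberate here, on the assumption that the stealth browser can also render JavaScript, so a protected SPA still ends up on `StealthyFetcher`.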

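Core Features

1. Multiple selector types

Scrapling's element selection is flexible. It supports CSS selectors, XPath, lookup by text, and lookup by regular expression, and it can even find elements that are structurally similar to a given one (especially useful on repetitive pages such as product listings and search results):

from scrapling.fetchers import Fetcher

# Placeholder URL; any page with repeated `.product` elements works
page = Fetcher.get('https://example.com/products')
# Take the first product, then find every product with a similar structure
first_product = page.css('.product')[0]
all_similar = first_product.find_similar()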
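
To make "structurally similar" more concrete, here is a standard-library sketch of one possible interpretation: two elements are similar when they sit at the same tag path from the document root. This is only an illustration of the idea. How Scrapling's `find_similar()` actually works is not covered in this post, and `PathCollector`, the standalone `find_similar` function, and the sample HTML are all made up for the example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be kept "open"
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Record every element with its tag path from the root and its class."""

    def __init__(self):
        super().__init__()
        self.open = []      # indexes (into self.elements) of open elements
        self.elements = []  # one dict per element: {"path", "class"}

    def handle_starttag(self, tag, attrs):
        parent_path = self.elements[self.open[-1]]["path"] if self.open else ()
        self.elements.append({
            "path": parent_path + (tag,),
            "class": dict(attrs).get("class") or "",
        })
        if tag not in VOID_TAGS:
            self.open.append(len(self.elements) - 1)

    def handle_endtag(self, tag):
        # Only close the innermost open element if the tag matches it
        if self.open and self.elements[self.open[-1]]["path"][-1] == tag:
            self.open.pop()


def find_similar(elements, target):
    """Return all other elements that share the target's tag path."""
    return [el for el in elements
            if el is not target and el["path"] == target["path"]]


HTML = """
<div class="list">
  <div class="product"><h2>Apple</h2></div>
  <div class="product"><h2>Banana</h2></div>
  <div class="product featured"><h2>Cherry</h2></div>
</div>
<div class="footer"><p>About</p></div>
"""

parser = PathCollector()
parser.feed(HTML)
first_product = next(el for el in parser.elements if el["class"] == "product")
similar = find_similar(parser.elements, first_product)
print([el["class"] for el in similar])
```

Note that the third product is matched even though its class differs (`product featured`), which a plain `.product` class comparison written as an exact string match would miss. The footer `div` is not matched because it sits at a different depth.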

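2. Stealth mode: built to counter anti-bot defenses

Anti-bot countermeasures are built in, and the browser fingerprint is hidden automatically:

from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    'https://protected-site.com',
    headless=True,
    solve_cloudflare=True,  # automatically solve Cloudflare challenges
    block_webrtc=True,      # prevent WebRTC from leaking the real IP
    hide_canvas=True,       # hide the Canvas fingerprint
)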

3. A full-featured CLI

Don't want to write code? Scrapling provides a full-featured command-line tool:

# Fetch a static page
scrapling extract get 'https://example.com' output.md

# Fetch a JS-rendered page
scrapling extract fetch 'https://example.com' output.md \
  --css-selector '.dynamic-content' \
  --network-idle

# Fetch a Cloudflare-protected page
scrapling extract stealthy-fetch 'https://protected-site.com' output.html \
  --solve-cloudflare

# Send a POST request
scrapling extract post 'https://api.example.com/search' output.json \
  --json '{"query": "keyword"}'

4. Spider crawling framework

Multi-page crawling takes very little code:

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com/"]
    concurrent_requests = 10
    download_delay = 1

    async def parse(self, response: Response):
        # Extract data from every item on the page
        for item in response.css('.item'):
            yield {
                "title": item.css('h2::text').get(),
                "link": item.css('a::attr(href)').get(),
            }
        # Follow the link to the next page, if there is one
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

result = MySpider().start()
result.items.to_json("output.json")
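
The pattern in the spider above (collect items, then follow the "next" link) boils down to a queue of URLs plus a set of already-visited ones. The sketch below shows that loop in isolation over a hypothetical in-memory site, so it runs without any network access. `PAGES` and `crawl` are invented for illustration and say nothing about how Scrapling's Spider schedules requests internally.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (items on the page, next-page URL or None)
PAGES = {
    "/page/1": (["A", "B"], "/page/2"),
    "/page/2": (["C"], "/page/3"),
    "/page/3": (["D"], "/page/1"),  # links back to page 1, so de-duplication matters
}


def crawl(start_url, pages):
    """Visit pages breadth-first, following 'next' links and skipping seen URLs."""
    queue = deque([start_url])
    seen = {start_url}
    items = []
    while queue:
        url = queue.popleft()
        page_items, next_url = pages[url]
        items.extend(page_items)          # the "extract data" step
        if next_url and next_url not in seen:
            seen.add(next_url)            # the "follow next page" step
            queue.append(next_url)
    return items


items = crawl("/page/1", PAGES)
print(items)
```

Without the `seen` set, the link from page 3 back to page 1 would make this loop run forever, which is one of the chores a crawling framework is expected to take off your hands.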

5. Proxies, persistent sessions, and custom page actions

from playwright.sync_api import Page
from scrapling.fetchers import DynamicFetcher, Fetcher, FetcherSession

# With a proxy
page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')

# Persistent session (cookies are handled automatically)
with FetcherSession(impersonate='chrome') as session:
    page = session.get('https://example.com/login')
    # Cookies from the first response are sent automatically
    page = session.get('https://example.com/dashboard')

# Custom page action (scrolling, clicking, and so on)
def scroll_and_click(page: Page):
    page.mouse.wheel(0, 3000)
    page.click('button.load-more')

page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click)
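
For readers wondering what "cookies are handled automatically" means in practice, the sketch below shows the bookkeeping a persistent session does between requests: remember what each `Set-Cookie` response header says, then send it back in a `Cookie` request header. It uses only the standard library. `CookieStore` is a hypothetical class for illustration and is not how `FetcherSession` is implemented.

```python
from http.cookies import SimpleCookie


class CookieStore:
    """Minimal cookie bookkeeping between requests (hypothetical helper)."""

    def __init__(self):
        self.cookies = {}  # cookie name -> value

    def update_from_response(self, set_cookie_headers):
        """Remember every cookie found in the given Set-Cookie header values."""
        for header in set_cookie_headers:
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                self.cookies[name] = morsel.value

    def cookie_header(self):
        """Build the Cookie header value to send with the next request."""
        return "; ".join(f"{name}={value}" for name, value in self.cookies.items())


store = CookieStore()
# Pretend the login response carried these two Set-Cookie headers
store.update_from_response([
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark",
])
header = store.cookie_header()
print(header)
```

A real session also has to honor cookie attributes such as domain, path, and expiry, which this sketch ignores on purpose.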

Installation

# Full install (recommended)
pip install "scrapling[all]"
scrapling install

# HTTP only (no browser dependencies)
pip install scrapling

# Browser automation only
pip install "scrapling[fetchers]"
scrapling install

⚠️ Note: `scrapling install` is required for the browser-based fetchers. It installs the Playwright browser engines; without it, DynamicFetcher and StealthyFetcher cannot run.

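My Take

Pros:

Elegant design: the selector API is identical across all three fetching modes, so the learning curve is low
Strong anti-bot capability: stealth mode bundles several fingerprint-hiding techniques and works against mainstream protections such as Cloudflare
Complete ecosystem: both the CLI and the Python API are mature, with clear documentation and plenty of examples
Crawling framework: the built-in Spider is far less work than writing recursive fetching yourself
Actively maintained: the upstream project is updated frequently and features evolve quickly

Things to watch out for:

Solving Cloudflare challenges adds 5-15 seconds per fetch, so enable it only when necessary
Stealth mode runs a real browser and is resource-hungry, so keep concurrency under control
In terms of speed, HTTP > Dynamic > Stealth; pick the mode that fits the scenario
Always respect the site's robots.txt and terms of service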
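
The last point above can be checked in code before any request goes out. The sketch below uses Python's standard `urllib.robotparser` to test a URL against robots.txt rules. The rules are parsed from a string so the example runs offline; the `ROBOTS_TXT` content, the `my-bot` user agent, and the `is_allowed` helper are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real crawler would download https://<site>/robots.txt first
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/products"))
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/private/data"))
```

A check like this only covers robots.txt. Terms of service still have to be read by a person.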

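Compared with the existing scrape-web skill: scrape-web is lighter and suits simple page extraction, while Scrapling is stronger in complex scenarios (JS rendering, anti-bot protection, multi-page crawling). The two complement each other, with Scrapling as the more advanced option.

Overall, Scrapling is a web scraping tool well worth learning. Whether you are collecting e-commerce data, monitoring public sentiment, or gathering data for research, it can save a lot of time.

Related links:

GitHub: https://github.com/D4Vinci/Scrapling
Python version requirement: ≥ 3.10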