SKILL

data-scraper-agent

Name: data-scraper-agent
Author: affaan-m

Helps you build a fully automated collection agent that scrapes, analyzes, and stores public data on a schedule.

Stars

★ 230,858

Source

GitHub

Updated

2026-07-18

// Security assessment: low risk

Prompt only, does not execute code
Open source and auditable
Community verified · 230.9k

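// Install

Copy the install instruction and let the AI complete the setup automatically · recommended for beginners

Please help me install the "data-scraper-agent" skill from askskill:
1. Download https://raw.githubusercontent.com/affaan-m/ECC/main/skills/data-scraper-agent/SKILL.md
2. Save it as ~/.claude/skills/data-scraper-agent/SKILL.md
3. Reload skills once it is installed and tell me it is ready to use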

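// Usage examples

Monitor job postings on a recruitment site

Input

Help me create a data collection agent that scrapes "Data Analyst" and "AI Product Manager" postings from a specified job site every morning at 9:00, extracts the job title, company, location, salary, posting date, and link, uses Gemini Flash to judge whether each posting matches "remote-first, 3+ years of experience, low English requirement", saves the results to Google Sheets, and flags the high-match postings.

Expected output

A job-monitoring workflow that runs on a schedule, automatically outputs a structured table of postings, and flags high-match opportunities.

Track competitor price changes

Input

Help me set up an automated scraping agent that checks 10 competitor pages every 6 hours for changes in price, stock status, promotions, and product title; if a price drops by more than 5%, record a summary of the reason for the change and sync it to a Notion database, categorized by brand and product category.

Expected output

An automated price-monitoring system that continuously records competitor changes and outputs filterable structured data.

Summarize GitHub project activity

Input

Please build a GitHub data collection agent that, every day, fetches star growth, issues, PRs, releases, and README updates for the 20 open-source repositories I follow, uses an LLM to generate a daily report summary, identifies which projects are most active, and writes the results to Supabase for later analysis.

Expected output

An automated open-source project tracking setup that produces daily summaries, activity assessments, and analysis-ready data tables.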

// Documentation

Data Scraper Agent

Build a production-ready, AI-powered data collection agent for any public data source. Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.

Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase

When to Activate

User wants to scrape or monitor any public website or API
User says "build a bot that checks...", "monitor X for me", "collect data from..."
User wants to track jobs, prices, news, repos, sports scores, events, listings
User asks how to automate data collection without paying for hosting
User wants an agent that gets smarter over time based on their decisions

Core Concepts

The Three Layers

Every data scraper agent has three layers:

COLLECT   →   ENRICH       →   STORE
   │             │               │
Scraper       AI (LLM)        Database
runs on       scores,         Notion /
schedule      summarises,     Sheets /
              classifies      Supabase
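
The three layers can be sketched as three interchangeable callables wired together by a small orchestrator. This is only an illustration of the COLLECT → ENRICH → STORE flow: `run_pipeline`, `fake_collect`, `fake_enrich`, and `fake_store` are hypothetical names, and the stand-in layers do no real scraping, LLM calls, or database writes.

```python
from typing import Callable, Iterable

Item = dict  # one scraped record, e.g. {"title": "..."}


def run_pipeline(
    collect: Callable[[], Iterable[Item]],
    enrich: Callable[[list[Item]], list[Item]],
    store: Callable[[list[Item]], int],
) -> int:
    """Run COLLECT -> ENRICH -> STORE once and return the number of rows stored."""
    items = list(collect())
    if not items:
        return 0  # nothing scraped, so skip the AI and storage layers
    enriched = enrich(items)
    return store(enriched)


# Stand-in layers for illustration only (assumptions, not part of the skill).
def fake_collect() -> list[Item]:
    return [{"title": "Data Analyst"}, {"title": "AI Product Manager"}]


def fake_enrich(items: list[Item]) -> list[Item]:
    # A real enrich layer would ask the LLM for a score; here it is title length.
    return [{**item, "score": len(item["title"])} for item in items]


database: list[Item] = []


def fake_store(items: list[Item]) -> int:
    database.extend(items)
    return len(items)


stored = run_pipeline(fake_collect, fake_enrich, fake_store)
```

Keeping each layer behind a plain function signature means a source, model, or storage backend can be swapped without touching the other two.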

Free Stack

Layer	Tool	Why
Scraping	`requests` + `BeautifulSoup`	No cost, covers 80% of public sites
JS-rendered sites	`playwright` (free)	When HTML scraping fails
AI enrichment	Gemini Flash via REST API	500 req/day, 1M tokens/day — free
Storage	Notion API	Free tier, great UI for review
Schedule	GitHub Actions cron	Free for public repos
Learning	JSON feedback file in repo	Zero infra, persists in git
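
As a rough sketch of the "JSON feedback file in repo" row, the helper below appends one user decision to a JSON file. The file layout (a top-level `decisions` list of `id`/`decision` pairs) and the name `record_feedback` are assumptions made for illustration; the skill's actual feedback schema is not shown in this section.

```python
import json
import tempfile
from pathlib import Path


def record_feedback(path: Path, item_id: str, decision: str) -> dict:
    """Append one user decision to the JSON feedback file and return the full history."""
    if path.exists():
        history = json.loads(path.read_text(encoding="utf-8"))
    else:
        history = {"decisions": []}  # assumed schema, see lead-in
    history["decisions"].append({"id": item_id, "decision": decision})
    path.write_text(json.dumps(history, indent=2), encoding="utf-8")
    return history


# Demo in a temporary directory so nothing is written into a real repo.
with tempfile.TemporaryDirectory() as tmp:
    feedback_path = Path(tmp) / "feedback.json"
    record_feedback(feedback_path, "job-1", "accepted")
    history = record_feedback(feedback_path, "job-2", "rejected")
```

Because the file is plain JSON, committing it after each run keeps the decision history versioned in git.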

AI Model Fallback Chain

Build agents to auto-fallback across Gemini models on quota exhaustion:

gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)
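
A minimal sketch of that fallback logic, assuming the client signals quota exhaustion by raising an exception. `QuotaExhausted`, `generate_with_fallback`, and `fake_call` are hypothetical names. The real REST call is replaced by an injected function so the example runs offline.

```python
from typing import Callable

# Model order taken from the chain above, tried first to last.
MODEL_CHAIN = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
]


class QuotaExhausted(Exception):
    """Hypothetical error raised when a model's quota is used up."""


def generate_with_fallback(
    prompt: str, call_model: Callable[[str, str], str]
) -> tuple[str, str]:
    """Try each model in order; return (model_name, text) from the first success."""
    last_error = None
    for model in MODEL_CHAIN:
        try:
            return model, call_model(model, prompt)
        except QuotaExhausted as exc:
            last_error = exc  # this model is out of quota, try the next one
    raise RuntimeError("all models in the chain are out of quota") from last_error


# Stand-in for a real REST call: pretend the first two models are exhausted.
def fake_call(model: str, prompt: str) -> str:
    if model in ("gemini-2.0-flash-lite", "gemini-2.0-flash"):
        raise QuotaExhausted(model)
    return f"{model} answered: {prompt}"


used_model, text = generate_with_fallback("score this job", fake_call)
```

Only quota errors trigger the fallback in this sketch. Other failures, such as a malformed request, propagate immediately rather than being retried on every model.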

Batch API Calls for Efficiency

Never call the LLM once per item. Always batch:

# BAD: 33 API calls for 33 items
for item in items:
    result = call_ai(item)  # 33 calls → hits rate limit

# GOOD: 7 API calls for 33 items (batch size 5)
for batch in chunks(items, size=5):
    results = call_ai(batch)  # 7 calls → stays within free tier
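
The snippet above relies on a `chunks` helper that is not defined in this section. One possible definition is sketched below, along with a fake `call_ai` that counts calls. `enrich_in_batches` and `fake_call_ai` are hypothetical names, and a real `call_ai` would send the whole batch to the LLM in one request.

```python
from typing import Callable, Iterator


def chunks(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich_in_batches(items: list, call_ai: Callable[[list], list], size: int = 5) -> list:
    """Call `call_ai` once per batch and flatten the per-batch results."""
    results = []
    for batch in chunks(items, size):
        results.extend(call_ai(batch))
    return results


calls = []  # records the size of every batch sent to the fake model


def fake_call_ai(batch: list) -> list:
    calls.append(len(batch))
    return [f"scored:{item}" for item in batch]


items = [f"item-{i}" for i in range(33)]
results = enrich_in_batches(items, fake_call_ai)
```

With 33 items and a batch size of 5, this makes six full batches plus one batch of three, which matches the 7 calls quoted above.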

Workflow

Step 1: Understand the Goal

Ask the user:

What to collect: "What data source? URL / API / RSS / public endpoint?"
What to extract: "What fields matter? Title, price, URL, date, score?"
How to store: "Where should results go? Notion, Google Sheets, Supabase, or local file?"
How to enrich: "Do you want AI to score, summarise, classify, or match each item?"
Frequency: "How often should it run? Every hour, daily, weekly?"

Common examples to prompt:

Job boards → score relevance to resume
Product prices → alert on drops
GitHub repos → summarise new releases
News feeds → classify by topic + sentiment
Sports results → extract stats to tracker
Events calendar → filter by interest
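
The answers to the five Step 1 questions could be captured in a small structure and checked before any code is generated. The sketch below is an assumption about how that might look: `AgentSpec`, its field names, and the allowed values are illustrative and not taken from the skill.

```python
from dataclasses import dataclass, field

# Options named in the Step 1 questions; the short identifiers are assumptions.
STORAGE_OPTIONS = {"notion", "sheets", "supabase", "local"}
FREQUENCIES = {"hourly", "daily", "weekly"}


@dataclass
class AgentSpec:
    """Hypothetical container for the user's answers to the five questions."""

    source: str        # what to collect: URL / API / RSS / public endpoint
    fields: list[str]  # what to extract
    storage: str       # how to store
    enrichment: list[str] = field(default_factory=list)  # how to enrich
    frequency: str = "daily"  # how often to run

    def problems(self) -> list[str]:
        """Return readable issues; an empty list means the spec is usable."""
        found = []
        if not self.source.strip():
            found.append("source is empty")
        if not self.fields:
            found.append("no fields to extract")
        if self.storage not in STORAGE_OPTIONS:
            found.append(f"unknown storage: {self.storage}")
        if self.frequency not in FREQUENCIES:
            found.append(f"unknown frequency: {self.frequency}")
        return found


spec = AgentSpec(
    source="https://example.com/jobs",  # placeholder URL
    fields=["title", "company", "url"],
    storage="notion",
    enrichment=["score"],
)
issues = spec.problems()
```

Any entries returned by `problems()` point to a question that still needs an answer from the user.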

Step 2: Design the Agent Architecture

Generate this directory structure for the user:

my-agent/
├── config.yaml              # User customises this (keywords, filters, preferences)
├── profile/
│   └── context.md           # User context the AI uses (resume, interests, criteria)
├── scraper/
│   ├── __init__.py
│   ├── main.py              # Orchestrator: scrape → enrich → store
│   ├── filters.py           # Rule-based pre-filter (fast, before AI)
│   └── sources/
│       ├── __init__.py
│       └── source_name.py   # One file per data source
├── ai/
│   ├── __init__.py
│   ├── client.py            # Gemini REST client with model fallback
│   ├── pipeline.py          # Batch AI analysis
│   ├── jd_fetcher.py        # Fetch full content from URLs (optional)
│   └── memory.py            # Learn from user feedback
├── storage/
│   ├── __init__.py
│   └── notion_sync.py       # Or sheets_sync.py / supabase_sync.py
├── data/
│   └── feedback.json        # User decision history (auto-updated)
├── .env.example
├── setup.py                 # One-time DB/schema creation
├── enrich_existing.py       # Backfill AI scores on old rows
├── requirements.txt

…
