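Establishes a formal evaluation process for Claude Code sessions, supporting eval-driven development and quality verification.
Copy the install instructions and let the AI complete the setup automatically (recommended for beginners):
Please help me install the "eval-harness" skill from askskill: 1. Download https://raw.githubusercontent.com/affaan-m/ECC/main/skills/eval-harness/SKILL.md 2. Save it as ~/.claude/skills/eval-harness/SKILL.md 3. Once it is installed, reload skills and tell me it is ready to use
Please design a formal evaluation plan for Claude Code refactoring tasks, following eval-driven development principles. Output the test objectives, input samples, scoring criteria, failure conditions, and how to automatically aggregate results after each session.
A structured evaluation plan with test cases, scoring dimensions, pass criteria, and a method for aggregating results.
I want to evaluate Claude Code's performance in multi-turn debugging sessions. Please design an evaluation framework that measures correctness, stability, fix efficiency, and instruction adherence, and give execution steps suitable for continuous integration.
An evaluation framework for multi-turn debugging scenarios, with metric definitions, an execution workflow, and CI integration recommendations.
Based on a set of Claude Code session eval results, help me analyze common failure modes and propose prompt optimizations, directions for additional tests, and priority fixes for the next round of eval-driven development.
A list of failure-mode analyses and iteration recommendations to help optimize prompts and subsequent eval design.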
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
Eval-Driven Development treats evals as the "unit tests of AI development".
Test if Claude can do something it couldn't before:
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
Ensure changes don't break existing functionality:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
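The `Result:` line of a regression eval can be tallied mechanically. The following is a minimal sketch, assuming each test outcome is recorded as a line ending in `: PASS` or `: FAIL` as in the template above. `summarize_results` is a hypothetical helper name, not part of the skill, and it does not produce the "previously Y/Y" baseline note.

```shell
# summarize_results: hypothetical helper (an assumption, not part of the skill).
# Reads "name: PASS" / "name: FAIL" lines on stdin and prints "Result: X/Y passed".
summarize_results() {
  local passed=0 total=0 line
  while IFS= read -r line; do
    case "$line" in
      *": PASS") passed=$((passed + 1)); total=$((total + 1)) ;;
      *": FAIL") total=$((total + 1)) ;;
    esac
  done
  echo "Result: ${passed}/${total} passed"
}
```

Lines that end in neither marker are ignored, so the whole template can be piped in unchanged.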
Deterministic checks using code:
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
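When several such checks run together, a small wrapper keeps the output uniform and preserves the exit status for later tallying. This is a sketch only: `run_check` is a hypothetical helper name, and the commented usage lines reuse the example paths and patterns from the checks above.

```shell
# run_check NAME CMD...: hypothetical helper (an assumption, not part of the skill).
# Runs CMD with its output suppressed, prints "NAME: PASS" or "NAME: FAIL",
# and returns non-zero when the check fails.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "${name}: PASS"
  else
    echo "${name}: FAIL"
    return 1
  fi
}

# Usage, mirroring the three checks above:
# run_check auth-export grep -q "export function handleAuth" src/auth.ts
# run_check auth-tests npm test -- --testPathPattern="auth"
# run_check build npm run build
```

Unlike the `cmd && echo "PASS" || echo "FAIL"` idiom, the wrapper returns the failure to the caller, so a CI step can stop on the first failed check.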
Use Claude to evaluate open-ended outputs:
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
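A model grader's reply still has to be turned into a pass/fail signal. The sketch below assumes the reply contains a line of the form `Score: N` with N from 1 to 5, as the prompt above requests. The passing threshold is an assumption supplied by the caller (the prompt does not define one), and `grade_passes` is a hypothetical helper name.

```shell
# grade_passes MIN: hypothetical helper (an assumption, not part of the skill).
# Reads grader output on stdin, takes the first "Score: N" line (N in 1..5),
# and succeeds only if N >= MIN. A reply with no score line counts as a failure.
grade_passes() {
  local min="$1" score
  score=$(sed -n 's/^Score:[[:space:]]*\([1-5]\).*/\1/p' | head -n 1)
  [ -n "$score" ] && [ "$score" -ge "$min" ]
}

# Usage: some_grader_command | grade_passes 4 && echo "PASS" || echo "FAIL"
```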
Flag for manual review:
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
pass@k: "At least one success in k attempts"
pass^k: "All k trials succeed"
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely
### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
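For a single eval, the two metrics used in the success criteria can be checked directly from recorded trial outcomes, reading pass@k as "at least one success in k attempts" and pass^k as "all k trials succeed". The helpers below are a sketch under that reading. `pass_at_k` and `pass_all_k` are hypothetical names, and they judge one eval at a time rather than computing a percentage across evals.

```shell
# Both helpers are hypothetical (assumptions, not part of the skill).
# Arguments: K, followed by per-trial results, each "PASS" or "FAIL".

# pass_at_k K R1 R2 ...: succeeds if at least one of the first K trials passed.
pass_at_k() {
  local k="$1" i=0 r
  shift
  for r in "$@"; do
    if [ "$i" -ge "$k" ]; then break; fi
    if [ "$r" = "PASS" ]; then return 0; fi
    i=$((i + 1))
  done
  return 1
}

# pass_all_k K R1 R2 ...: succeeds only if all of the first K trials passed.
# Fewer than K recorded trials cannot satisfy pass^K.
pass_all_k() {
  local k="$1" i=0 r
  shift
  if [ "$#" -lt "$k" ]; then return 1; fi
  for r in "$@"; do
    if [ "$i" -ge "$k" ]; then break; fi
    if [ "$r" != "PASS" ]; then return 1; fi
    i=$((i + 1))
  done
  return 0
}
```

For example, `pass_at_k 3 FAIL PASS FAIL` succeeds, while `pass_all_k 3 PASS FAIL PASS` fails, which is why the stricter pass^3 is the one applied to regression evals.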
Write code to pass the defined evals.
# Run capability evals
[Run each capability eval, record PASS/FAIL]
# Run regression evals
npm test -- --testPathPattern="existing"
# Generate report
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
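The Metrics lines in the report follow from the per-eval annotations: an eval marked `pass@n` first succeeded on attempt n, and the report-level pass@k is the share of evals that succeeded within k attempts. A minimal sketch of that arithmetic follows. `pass_rate` is a hypothetical helper name, and encoding "never passed" as 0 is an assumption.

```shell
# pass_rate K N1 N2 ...: hypothetical helper (an assumption, not part of the skill).
# Each Ni is the attempt number of one eval's first success (0 if it never passed).
# Prints "pass@K: P% (hits/total)" with P rounded to the nearest integer.
pass_rate() {
  local k="$1" hits=0 total=0 n
  shift
  for n in "$@"; do
    total=$((total + 1))
    if [ "$n" -ge 1 ] && [ "$n" -le "$k" ]; then
      hits=$((hits + 1))
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo "pass@${k}: n/a (0/0)"
    return 0
  fi
  echo "pass@${k}: $(( (100 * hits + total / 2) / total ))% (${hits}/${total})"
}

# First-success attempts for create-user, validate-email, hash-password:
pass_rate 1 1 2 1   # prints "pass@1: 67% (2/3)"
pass_rate 3 1 2 1   # prints "pass@3: 100% (3/3)"
```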
/eval define feature-name
Creates eval definition file at .claude/evals/feature-name.md
/eval check feature-name
Runs current evals and reports status
/eval report feature-name
Generates full eval report
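What `/eval define` writes into the definition file is not shown here. As a rough manual stand-in, the sketch below creates a skeleton at the documented path using the three section headings from the eval definition template. `define_eval` is a hypothetical helper, and the skeleton contents are an assumption, not the command's actual output.

```shell
# define_eval FEATURE [DIR]: hypothetical scaffold (an assumption, not the /eval command).
# Creates DIR/FEATURE.md (DIR defaults to .claude/evals) with empty template sections,
# prints the path, and refuses to overwrite an existing file.
define_eval() {
  local feature="$1" dir="${2:-.claude/evals}"
  local file="${dir}/${feature}.md"
  mkdir -p "$dir" || return 1
  if [ -e "$file" ]; then
    echo "already exists: ${file}" >&2
    return 1
  fi
  printf '## EVAL DEFINITION: %s\n### Capability Evals\n### Regression Evals\n### Success Metrics\n' \
    "$feature" > "$file"
  echo "$file"
}

# Usage: define_eval feature-name
```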
Store evals in project:
.claude/
evals/
…