
Evaluation runs

Every candidate is measured against its baseline. Runs are flagged when a candidate underperforms or regresses on rubric criteria.
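
The flagging rule described above could be sketched roughly as follows. This is a minimal illustration, not the product's actual logic: the `RunResult` shape, the `flag_run` function, the `tolerance` parameter, and the criterion names are all hypothetical. The overall pass rates reuse the two Contract Clause Reviewer rows from the runs table (242/248 and 246/248), on the assumption that v2.3.7 serves as the baseline for v2.4.0-rc.2; the per-criterion scores are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunResult:
    """Scores for one evaluation run (hypothetical shape)."""
    pass_rate: float            # fraction of cases passed overall
    criteria: Dict[str, float]  # rubric criterion name -> pass rate


def flag_run(candidate: RunResult, baseline: RunResult,
             tolerance: float = 0.0) -> List[str]:
    """Return the reasons a candidate run should be flagged (empty if none).

    A run is flagged when its overall pass rate falls below the baseline's,
    or when any rubric criterion scores lower than in the baseline, by more
    than `tolerance` in either case.
    """
    reasons = []
    if candidate.pass_rate < baseline.pass_rate - tolerance:
        reasons.append("underperforms baseline overall")
    for name, base_score in baseline.criteria.items():
        cand_score = candidate.criteria.get(name)
        if cand_score is None:
            continue  # criterion not scored for the candidate; skip it
        if cand_score < base_score - tolerance:
            reasons.append(f"regression on {name}")
    return reasons


# Overall rates taken from the runs table; criterion scores are illustrative.
baseline = RunResult(246 / 248, {"risk_flag": 0.99, "format": 0.98})
candidate = RunResult(242 / 248, {"risk_flag": 0.95, "format": 0.98})

print(flag_run(candidate, baseline))
```

With a zero tolerance, any drop at all flags the run; a real pipeline would likely allow some noise margin, which is what the assumed `tolerance` parameter stands in for.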

Coverage: 94% (▲ 2.1 pts · 30d)
Avg cases per run: 186 (contract-corpus largest)
Regressions · 24h: 6 (Contract Clause Reviewer)
Median runtime: 28s (▼ 4s vs last week)

Skill                       Version      Dataset             Cases  Pass  Fail  Δ         Status         Started
Contract Clause Reviewer    v2.4.0-rc.2  contract-corpus-v9  248    242   6     0.6       6 regressions  58m ago
Incident Triage             v3.1.0-rc.4  sev-2-corpus-v3     132    124   8     0.4       running        1h ago
Quarterly Earnings Summary  v0.5.3-rc.1  earnings-q3-v1      84     79    5     0.2       passed         1h ago
PR Summarizer               v1.8.3-rc.1  pr-corpus-v6        320    314   6     0.3       passed         2h ago
SOC 2 Evidence Drafter      v1.2.0       soc-2-corpus-v2     96     92    4     0.1       passed         4h ago
Engineering RFC Reviewer    v2.0.0       rfc-corpus-v4       156    154   2     0.5       passed         6h ago
Customer Email Tone Pass    v1.0.4       email-tone-v2       280    268   12    0.2       passed         9h ago
Vendor Onboarding Memo      v0.4.1       vendor-corpus-v1    64     55    9     0.4       passed         1d ago
RFP Response Drafter        v0.9.0       rfp-corpus-v1       72     62    10    baseline  baseline set   1d ago
Contract Clause Reviewer    v2.3.7       contract-corpus-v9  248    246   2     0.4       passed         6d ago

Queue · running: 1
Incident Triage · 98%
v3.1.0-rc.4 · sev-2-corpus-v3
124/132 cases · ~14s remaining
Eval workers: 2 / 4 active

Coverage by tier
Skills with active eval suites
Tier 1: 100%
Tier 2: 94%
Tier 3: 71%

Top failure modes · 30d
Off-policy response: 42
Missing risk-flag: 31
Format / schema drift: 18
Latency p95 over budget: 14
Tone compliance: 8