Template · 5 min

AI Agent Scoping Worksheet

A structured worksheet for scoping agent projects — task taxonomy, tool inventory, autonomy levels, and the failure modes that sink pilots.

Agent projects fail differently than other GenAI projects. They fail at the seams: the tool that times out intermittently, the step where the model decided to improvise, the retry loop that burned $400 in tokens before anyone noticed. This worksheet is how we pin down an agent project before a line of code is written. It forces the conversations that, left to the prototype phase, turn into emergency Slack threads.

Agents don't fail in the LLM. They fail in the surrounding system — tool contracts, state management, human handoff. Scope the seams, not the model.

Section 1: Task taxonomy

Write the task out, step by step, as if instructing a new hire. No "and then the agent figures it out."

Q1: What is the end-to-end job?

One sentence. "Takes a customer refund request from intake to refund issued or escalated." Not "handles refunds."

Q2: Decompose the job into ordered steps.

List them. Expect 5-15 steps. If you can't decompose the job, neither can the agent. Example:

1. Read incoming ticket
2. Classify refund type (damaged, wrong item, changed mind, fraud)
3. Look up order in OMS
4. Check refund eligibility against policy
5. If eligible and <$100: issue refund automatically
6. If eligible and >=$100: queue for human approval with draft response
7. If ineligible: draft denial with reason, queue for human review
8. Send response
9. Update ticket status

Q3: At which steps does the agent make a judgment call vs execute deterministic logic?

Mark each step: J (judgment) or D (deterministic). Deterministic steps should be code, not LLM calls. Be ruthless — most steps should be D.
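As a sketch of the J/D distinction, using the refund example above (all function and field names here are hypothetical): the eligibility check is a policy lookup, so it should be plain code; classification genuinely needs the model.

```python
from dataclasses import dataclass

@dataclass
class Order:
    total: float
    days_since_delivery: int

# Step 4, marked D: eligibility is a deterministic policy check. Plain code, no LLM.
def check_eligibility(order: Order, refund_type: str) -> bool:
    if refund_type == "fraud":
        return False
    if refund_type == "changed mind":
        return order.days_since_delivery <= 30  # assumed 30-day return window
    return True  # damaged / wrong item: always eligible under this sketch policy

# Step 2, marked J: classification is a judgment call. This is where the LLM
# call would go (stubbed here; not a real API).
def classify_refund_type(ticket_text: str) -> str:
    raise NotImplementedError("LLM call goes here")
```

Every step you move from J to D is one fewer probabilistic failure point to eval.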

Q4: What's the branching factor?

How many distinct paths through the graph? More than 10 branches = high test burden. More than 50 = consider splitting into smaller agents or moving logic to code.

Section 2: Tool inventory

Q5: List every external system the agent reads from or writes to.

Name, purpose, read/write, API style (REST, GraphQL, SOAP, SDK). Include databases, queues, internal services, SaaS tools, email, and "a human in Slack" (yes, that's a tool).

Q6: For each tool: latency, rate limit, auth model, and failure mode.

Fill in a row per tool. Tools with P95 > 2s dominate the agent's latency. Tools without idempotency keys are dangerous for retries.

Tool | Read/Write | P95 latency | Rate limit | Auth | Idempotent?
e.g. Stripe API | Write | 400ms | 100/sec | API key | Yes (idempotency-key header)
e.g. Slack webhook | Write | 600ms | 1/sec/channel | Bot token | No

Q7: Which tools have destructive side effects?

Refunds, emails sent, orders cancelled, records deleted. These need explicit confirmation or a dry-run mode during development and a confirmation gate in production.

Q8: What's the tool contract for each?

Input schema, output schema, error shape. If a tool returns free-form JSON or HTML, plan a parsing/validation layer. An agent calling a tool with inconsistent output will flail.
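A minimal validation layer can be as simple as a required-fields-and-types check, run before tool output ever reaches the next prompt. A sketch, assuming a hypothetical refund tool contract:

```python
import json

# Hypothetical contract for a refund tool's output: required fields and types.
REFUND_RESULT_SCHEMA = {"refund_id": str, "status": str, "amount_cents": int}

def validate_tool_output(raw: str, schema: dict) -> dict:
    """Parse and validate a tool's JSON output before the agent sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"tool returned non-JSON output: {e}") from e
    for field, expected_type in schema.items():
        if field not in data:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(
                f"bad type for {field}: got {type(data[field]).__name__}"
            )
    return data
```

In production you would likely reach for a schema library instead, but the point stands: the contract is written down and enforced at the seam, not assumed.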

Section 3: Autonomy levels

Q9: Where does the human sit in this loop?

Pick one per step:

  • Autonomous: agent acts, no review.
  • Human-on-the-loop: agent acts, human can intervene or review after.
  • Human-in-the-loop: agent proposes, human approves before action.
  • Human-initiated: agent only acts after a human kicks off the step.

Q10: What's the autonomy dial today, and where does it go over time?

Ship with the most conservative setting that still proves value. Document what quality threshold moves a step from in-the-loop to on-the-loop. Autonomy is earned.

Q11: What's the kill switch?

How does a human stop the agent mid-task? Is there an "are you sure?" confirmation for high-impact actions? What's the max number of steps or tool calls per run?
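One way to sketch the kill switch and the per-run caps together (names are illustrative, not a prescribed API): a budget object that every step and tool call must charge against, with a flag a human or monitor can flip mid-run.

```python
class RunBudgetExceeded(Exception):
    pass

class RunBudget:
    """Hard caps per agent run: max steps, max tool calls, and a kill switch."""
    def __init__(self, max_steps: int = 25, max_tool_calls: int = 15):
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls
        self.steps = 0
        self.tool_calls = 0
        self.killed = False

    def kill(self) -> None:
        # Flipped by a human operator or an automated monitor mid-run.
        self.killed = True

    def charge_step(self) -> None:
        self.steps += 1
        if self.killed or self.steps > self.max_steps:
            raise RunBudgetExceeded("run killed or step budget exhausted")

    def charge_tool_call(self) -> None:
        self.tool_calls += 1
        if self.killed or self.tool_calls > self.max_tool_calls:
            raise RunBudgetExceeded("run killed or tool-call budget exhausted")
```

The exception propagates to a single cleanup path, which is also where your audit log entry and human escalation live.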

Section 4: Success criteria

Q12: What's the accuracy threshold at each judgment step?

Per step, not overall. If step 2 (classification) is 95% accurate and step 4 (eligibility) is 92%, your end-to-end is ~87%. Compounding is brutal.
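The compounding math is just a product over judgment steps, which makes it easy to keep in a spreadsheet or a two-line check:

```python
from math import prod

# Per-step accuracy at the two judgment steps from the example above.
step_accuracies = {"classification": 0.95, "eligibility": 0.92}

# Upper bound on end-to-end accuracy if errors are independent: ~0.874.
end_to_end_ceiling = prod(step_accuracies.values())
```

Note this is a ceiling: correlated errors and deterministic-step bugs only push the real number down.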

Q13: Latency SLO?

End-to-end P50 and P95. Budget by step. If a step dominates latency, parallelize or cache.

Q14: Cost ceiling per run?

A number. Agents with unbounded loops are the top cause of "why is our AWS bill $40K this month" incidents. See our GenAI Cost Estimation Template.

Q15: What's the end-to-end success metric?

"Refund processed correctly and customer satisfied" — how do you measure it? Ticket reopen rate? CSAT? Auditor spot-check? Define it up front.

Section 5: Failure modes

For each, write what the agent does when it happens. "It will figure it out" is not an answer.

Q16: Tool timeout or 5xx.

Retry count, backoff, circuit breaker, fallback path. Agents that retry forever are a liability.
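A bounded retry wrapper is the minimum viable answer here. A sketch (the exception type and parameters are placeholders, not a prescribed interface):

```python
import random
import time

class ToolUnavailable(Exception):
    """Raised by a tool wrapper on timeout or 5xx."""

def call_with_retries(tool_fn, *args, max_retries: int = 3, base_delay: float = 0.5):
    """Bounded retries with exponential backoff and jitter. Never retry forever."""
    for attempt in range(max_retries + 1):
        try:
            return tool_fn(*args)
        except ToolUnavailable:
            if attempt == max_retries:
                raise  # exhausted: fall through to fallback path / human escalation
            # Exponential backoff plus jitter proportional to the base delay.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Pair this with a circuit breaker per tool so a hard-down dependency fails fast instead of burning the retry budget on every run.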

Q17: Tool returns unexpected shape.

Validation, structured error to the agent, escalation to human. Never let free-form tool output drive the next prompt uncritically.

Q18: Model returns malformed tool call.

Structured outputs + repair loop with a hard max retries. After N tries, escalate.

Q19: Agent loops (same action repeatedly).

Detect loops by action fingerprint. Kill after 3 repeats. Log and alert.
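A fingerprint here can just be a hash of the tool name plus its (sorted) arguments, checked over a sliding window of recent actions. A sketch, assuming JSON-serializable tool arguments:

```python
import hashlib
import json
from collections import deque

class LoopDetected(Exception):
    pass

class LoopDetector:
    """Fingerprint each action; raise if the same one repeats max_repeats times
    within the recent window."""
    def __init__(self, max_repeats: int = 3, window: int = 10):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, tool: str, args: dict) -> None:
        fp = hashlib.sha256(
            json.dumps([tool, args], sort_keys=True).encode()
        ).hexdigest()
        self.recent.append(fp)
        if list(self.recent).count(fp) >= self.max_repeats:
            raise LoopDetected(
                f"{tool} called {self.max_repeats}x with identical args"
            )
```

The window matters: an agent that alternates between two actions is looping just as surely as one that repeats a single action back to back.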

Q20: Ambiguous input — not enough info to decide.

Does the agent ask a clarifying question, escalate, or use a default? Decide per step.

Q21: The agent succeeds but does the wrong thing.

Silent failures are the worst. What's the audit trail? Can a human replay the decision and see why?

Section 6: Eval plan

Q22: What data do you evaluate on?

Historical runs of the same task, synthetic cases, a held-out set of real cases with labeled expected final state. Bias toward real.

Q23: What do you measure?

Per-step metrics (did step 3 classify correctly?), trajectory metrics (did the agent take a reasonable path?), and outcome metrics (did the end state match expected?). All three. See our LLM Eval Starter Kit.

Q24: How do you handle non-determinism in evals?

Multiple paths can be correct. Score on outcome and trajectory-validity separately. Don't penalize a different-but-correct path.
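Keeping the two scores separate can be as simple as returning them as independent booleans; one hedged sketch (state and action representations are placeholders):

```python
def score_run(final_state: dict, expected_state: dict,
              trajectory: list, allowed_actions: set) -> dict:
    """Score outcome and trajectory validity independently. A different-but-valid
    path that reaches the right end state gets full marks on both."""
    outcome_ok = final_state == expected_state
    trajectory_ok = all(action in allowed_actions for action in trajectory)
    return {"outcome": outcome_ok, "trajectory_valid": trajectory_ok}
```

The failure signatures then separate cleanly: wrong outcome with a valid trajectory points at a judgment step, while an invalid trajectory points at tooling or prompting.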

Q25: What's in the regression lock set?

The 20 happy-path scenarios that must never regress. Run on every change.

How to use this

Fill in all sections before writing agent code. Expect unknowns. Every unknown is a sprint-zero task.

The answers to Section 5 (failure modes) will dictate 40% of your code. Do not skip it. "We'll handle errors later" is how you end up with an agent that sends 400 duplicate emails on its first production run.

A well-scoped agent is 80% a well-specified tool layer and well-defined autonomy boundaries, 20% prompting. Projects that invert this ratio don't ship.

Next step

If you're scoping an agent project and want a sparring partner on autonomy levels and failure modes, get in touch. We've seen where agents break — we'll help you avoid the usual ones.

Fillable Template

Unlock the full template.

Drop your details to see the full worksheet and worked examples.

No spam — we just want to know who's reading.