AI Implementation Playbook
Most GenAI projects don't fail at the model. They fail at scoping, evals, or the handoff from prototype to production. This playbook is the ordered sequence we run on engagements, with the specific artifacts we expect out of each phase, the failure modes we've seen kill pilots, and the red flags that tell you to stop and regroup. Treat it as a checklist, not a waterfall — phases overlap, but skipping one almost always costs you a rewrite.
If you can't describe your use case in one sentence and your success metric in one number, you're not ready for Phase 2.
Phase 1: Discovery
Goal: Confirm this is a problem GenAI is the right tool for, and build an ROI thesis you can defend.
Deliverables
- A one-sentence use case statement ("Reduce support handle time for tier-1 billing tickets by deflecting with an AI assistant in the help center.")
- Baseline metrics: current state numbers (resolution time, cost, error rate, volume).
- ROI thesis: a one-pager showing the assumption math. If the honest number is under $500K/year impact and this is a custom build, stop.
- Stakeholder map: who approves, who integrates, who owns it after launch.
- "Why GenAI" justification: what's different about this vs a rules engine, classical ML, or no automation.
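The assumption math for the deflection example above can be sketched in a few lines. Every figure here is a placeholder to replace with your baseline data, not a benchmark:

```python
# Hypothetical ROI math for a ticket-deflection assistant.
# Every number below is an assumption, to be replaced with your baselines.
tickets_per_year = 120_000   # tier-1 billing ticket volume
cost_per_ticket = 6.50       # fully loaded agent cost per ticket (USD)
deflection_rate = 0.30       # share the assistant resolves without a human
adoption_discount = 0.40     # realistic year-one adoption (30-50%)

gross_savings = tickets_per_year * cost_per_ticket * deflection_rate
honest_savings = gross_savings * adoption_discount
print(f"Gross: ${gross_savings:,.0f}  Honest year-one: ${honest_savings:,.0f}")
```

In this hypothetical, the honest year-one number lands around $94K, well under the $500K bar, so the rule above says stop or find a bigger use case.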
Common failure modes
- Solutioning before the problem is pinned. "We need a chatbot" is not a problem statement.
- Taking the first use case leadership pitched. It's often the loudest, not the highest-leverage. Inventory 5+ candidates and score them.
- ROI math that assumes 100% adoption. Discount hard: 30-50% of projected impact is a realistic year-one number.
Red flags
- No baseline metric exists and no one wants to measure one.
- The executive sponsor can't name a person who will own this post-launch.
- The use case requires the model to be correct >99% of the time with no human review.
Phase 2: Scoping
Goal: Translate the use case into a technical plan with an eval-first mindset.
Deliverables
- Eval plan (before any code). What does "good" look like? What's the target metric? What's the threshold to ship?
- Data audit. What data exists, where, in what shape, under what access controls. Sample 100 real inputs.
- Architecture sketch. RAG / fine-tune / tool use / agent. See our RAG vs Fine-tuning framework.
- Model shortlist. 2-3 candidate models with cost-per-request estimates.
- Guardrail requirements. PII handling, prompt injection, refusal policy, escalation paths.
- Integration surface. What systems does this read from and write to?
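One way to build the cost-per-request estimates for the model shortlist. The model names, token counts, and per-million-token prices below are placeholders; check your vendor's current rate card:

```python
# Hypothetical cost-per-request estimate for a model shortlist.
# Prices and token counts are placeholders, not real vendor rates.
def cost_per_request(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """USD per request given token counts and prices per 1M tokens."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

shortlist = {
    "frontier-large": (3.00, 15.00),  # (input, output) USD per 1M tokens
    "mid-tier":       (0.50, 1.50),
    "small-fast":     (0.10, 0.40),
}
for name, (p_in, p_out) in shortlist.items():
    c = cost_per_request(2_000, 500, p_in, p_out)  # 2k prompt, 500 completion
    print(f"{name}: ${c:.4f}/request")
```

Multiply by projected request volume and the shortlist often reorders itself before you've run a single eval.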
Common failure modes
- Eval plan is deferred to "after the prototype works." You will build a prototype that works on your favorite 5 examples and fails on the 95 you didn't try.
- Data audit reveals the data doesn't exist / isn't labeled / lives in a SaaS tool you can't export. Better to find out now.
- Architecture picked from a conference talk. Agents for everything, vector DBs as a default. Resist.
Red flags
- No one can produce 50 real examples of the task.
- "We'll figure out evals later."
- Security review scheduled for the week before launch.
Phase 3: Prototype
Goal: Build the simplest thing that might work, on a representative 20-50 examples.
Deliverables
- Working prototype on a single path (happy case + 2-3 edge cases).
- Initial prompt and retrieval setup in version control, with prompts as data.
- 50-case eval set (curated, not random).
- First pass on the eval. Record the number. It will be worse than you hope.
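A first-pass eval run needs no infrastructure. A sketch, assuming a JSONL case file and exact-match grading (crude, but enough to record a baseline number); `model_fn` is whatever wraps your provider call:

```python
# Minimal first-pass eval loop for the prototype.
# Exact-match grading is a deliberate simplification; it only needs to be
# good enough to produce a baseline number in week 1.
import json

def run_eval(cases, model_fn):
    """cases: list of {"input", "expected"}; model_fn: str -> str."""
    passed = sum(
        model_fn(c["input"]).strip() == c["expected"].strip() for c in cases
    )
    return passed / len(cases)

# cases = [json.loads(line) for line in open("eval_cases.jsonl")]
# print(f"Baseline: {run_eval(cases, call_model):.0%}")  # record it, however ugly
```

Write the number down where the whole team can see it. The point is the measurement habit, not the score.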
Common failure modes
- Demo-driven development. Polishing the prototype to look good in a screenshot instead of being measurable.
- Over-engineering the scaffolding. LangChain layers on LangGraph layers on a custom agent framework. Keep it flat.
- Skipping the eval run. If you don't know the baseline number after week 1, you're flying blind the rest of the project.
Red flags
- The first eval score is 40% and the team says "we'll fix it in prod."
- The prototype uses a model/API tier no one has budget-approved.
- "It's 90% there" with no number behind it.
Phase 4: Eval
Goal: Build the eval harness that tells you whether changes help or hurt. This is the phase that separates projects that ship from projects that die in pilot.
Deliverables
- Eval dataset, 200+ cases, with clear labels (correct/incorrect, or rubric scores).
- Automated eval runner that can score a prompt or model version in under 10 minutes.
- Regression suite: the top 20 cases that must never regress.
- LLM-as-judge prompts for subjective criteria (tone, helpfulness), validated against human ratings on a sample.
- Dashboard or report: score per version, broken down by category.
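A sketch of the per-version scoring with a category breakdown and a regression gate. The case schema (`id`, `category`, `input`, `expected`), the grading, and the regression ids are all assumptions to adapt:

```python
# Category-broken-down eval report plus a must-never-regress gate.
# Case schema and regression ids are illustrative, not a standard.
from collections import defaultdict

REGRESSION_IDS = {"billing-001", "billing-007"}  # your top cases (hypothetical ids)

def score(cases, model_fn):
    by_cat = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    regression_failures = []
    for case in cases:
        ok = model_fn(case["input"]).strip() == case["expected"].strip()
        by_cat[case["category"]][0] += ok
        by_cat[case["category"]][1] += 1
        if not ok and case["id"] in REGRESSION_IDS:
            regression_failures.append(case["id"])
    report = {cat: p / t for cat, (p, t) in by_cat.items()}
    return report, regression_failures
```

Gate CI on `regression_failures == []` so the top cases literally cannot regress without someone noticing.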
Common failure modes
- LLM-as-judge without calibration. You're measuring the judge, not the system.
- Eval dataset drawn from the same examples used to build prompts. You've overfit.
- One aggregate score. The per-category breakdown is where the real signal lives.
- Eval takes 3 hours to run. Engineers will stop running it.
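Calibrating the judge can start as simply as measuring agreement with human labels on a sample. The `human`/`judge` field names and the threshold in the comment are our rule of thumb, not a standard:

```python
# Calibration check for an LLM-as-judge: compare it to human ratings on a
# labeled sample before trusting its scores. Field names are assumptions.
def judge_agreement(samples):
    """samples: list of {"human": "pass"|"fail", "judge": "pass"|"fail"}."""
    agree = sum(s["human"] == s["judge"] for s in samples)
    return agree / len(samples)

# Rule of thumb (ours, not a standard): below ~85% agreement with humans,
# fix the judge prompt before using its scores to compare systems.
```

If agreement is low, you are measuring the judge, not your system, which is exactly the failure mode above.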
Red flags
- Team debates prompt changes based on vibes, not scores.
- No one on the team can answer "what's the eval score on main right now?"
- "We'll add evals after launch."
Phase 5: Production
Goal: Get to a stable, observable, cost-bounded service with guardrails.
Deliverables
- Observability: request/response logging (PII-aware), latency, token counts, cost per request, error rates. Per-user traces.
- Guardrails: input validation, prompt injection defense, output filtering, PII redaction, refusal for out-of-scope queries.
- Cost controls: per-user rate limits, prompt caching, model tier routing, token budgets.
- Fallback path: what happens when the model is down, slow, or returns garbage. Human escalation or cached response.
- On-call runbook: symptoms, triage steps, rollback procedure.
- Canary / gradual rollout plan. Never 100% on day one.
- Drift monitoring: track eval scores on production traffic samples, not just launch set.
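The fallback path and token budget can be sketched as a thin wrapper. All names here are illustrative, and a real timeout belongs in the client's request options; the elapsed-time check below is a simple stand-in:

```python
# Sketch of a fallback path: per-request token budget plus a canned answer
# when the model call fails or is too slow. All names are illustrative.
import time

TOKEN_BUDGET = 4_000
FALLBACK_ANSWER = "Sorry, I can't help with that right now - connecting you to an agent."

def answer(prompt, model_fn, count_tokens, timeout_s=10.0):
    if count_tokens(prompt) > TOKEN_BUDGET:
        return FALLBACK_ANSWER  # refuse oversized prompts before paying for them
    start = time.monotonic()
    try:
        reply = model_fn(prompt)
    except Exception:
        return FALLBACK_ANSWER  # model down or erroring: degrade gracefully
    if time.monotonic() - start > timeout_s:
        return FALLBACK_ANSWER  # too slow to be useful in the help-center UI
    return reply
```

The important property is that every branch ends somewhere defined: the user never sees a stack trace or an infinite spinner.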
Common failure modes
- Cost blows up in week 2. Someone's prompt went from 500 tokens to 5000 after an "improvement."
- No rollback plan. You ship a new prompt, quality drops, and you're scrambling through git history in an incident.
- Guardrails are regex and hope. Prompt injection will find you.
- Observability is print statements and good intentions.
Red flags
- Logs don't capture the full prompt and response.
- No one is paged when error rate spikes.
- Costs aren't tracked per-feature or per-user segment.
Phase 6: Iteration
Goal: Establish the cadence of improvement: measure, change, re-measure.
Deliverables
- Weekly or biweekly eval runs on main + any open experiments.
- Labeled production samples flowing back into the eval set (bad outputs become new test cases).
- Quarterly model refresh: re-evaluate against newer/cheaper models.
- Prompt version history with eval scores attached.
- Cost review monthly. Token usage per feature. Opportunities to cache, downsize model, or batch.
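Feeding labeled production failures back into the eval set can be a small script. The file path, label values, and JSONL field names are assumptions:

```python
# Sketch of turning labeled production failures into new eval cases.
# File path and field names ("label", "corrected_output", ...) are assumptions.
import json

def append_failures_to_eval_set(labeled_samples, eval_path="eval_cases.jsonl"):
    """labeled_samples: prod traces a human reviewed, with a corrected answer."""
    with open(eval_path, "a") as f:
        for s in labeled_samples:
            if s["label"] == "bad":
                case = {
                    "input": s["input"],
                    "expected": s["corrected_output"],
                    "category": s.get("category", "prod-regression"),
                }
                f.write(json.dumps(case) + "\n")
```

Run it on whatever cadence your labeling happens, so every bad output in production becomes a test case the next eval run has to pass.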
Common failure modes
- Eval set calcifies. Production drifts, evals don't follow.
- Never re-evaluating against newer models. You're paying frontier prices for a task a cheap model now handles.
- Prompt tweaks without tracking. "It used to work better" with no git history to prove it.
Red flags
- The team hasn't updated the eval set in 3 months.
- No one has asked "can we run this on a smaller model now?" in 6 months.
- Incidents are resolved but not added as test cases.
Phase summary
| Phase | Primary artifact | Ship gate |
|---|---|---|
| Discovery | ROI thesis | Sponsor + owner signed off |
| Scoping | Eval plan | Target metric and threshold defined |
| Prototype | 50-case eval score | Measurable baseline |
| Eval | 200+ case harness | Regression suite green |
| Production | Observability + guardrails | Canary healthy at 5% traffic |
| Iteration | Weekly eval cadence | Prod evals tracked, drift detected |
The gap between a prototype that demos well and a system that survives production is almost entirely Phases 4 and 5. Budget accordingly — it's usually 40% of the total effort, not 10%.
Next step
If you're mid-pilot and some of these red flags are lighting up, talk to us. We've pulled a lot of projects out of Phase 3 purgatory.