GenAI Cost Estimation Template
AI projects go over budget for the same three reasons: token usage is modeled at a fraction of real consumption, hidden human costs get left off the spreadsheet, and the "let's upgrade to the new model" decision later blows up runtime cost by 3-5x. This template is the cost model we walk clients through before any build starts. The numbers won't be exact, but they'll be within 30%, which is more than most projects can say at kickoff.
The tokens are usually the cheapest line item in a GenAI project, and rarely the one that blows up the budget. Model the whole stack.
The seven cost categories
1. Development cost
Line items:
- Prompt iteration: eng time on prompt writing, testing, review
- Eval harness build: one-time engineering cost + ongoing maintenance
- Integration: hooking the system into upstream data and downstream actions
- Frontend / UI if customer-facing
Rule of thumb: 6-12 engineering weeks to get a non-trivial production GenAI feature shipped. Expect a mid-to-senior engineer, not a bootcamp hire. Fully-loaded cost at $250/hr = $60K-$120K.
2. Data prep cost
Line items:
- Source data collection and cleanup
- Labeling (if needed for evals or fine-tuning)
- Chunking and indexing for RAG (one-time and incremental)
- Document parsing (PDFs, scanned docs, tables)
Rule of thumb: for a RAG system with 10K documents, budget 40-80 hours of data engineering + 20-40 hours of SME review time. If documents are PDFs with complex layout, multiply parsing effort by 3x.
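The rule of thumb above can be turned into a quick estimator. This is a sketch under two explicit assumptions: effort scales linearly with document count, and the 3x multiplier is applied to all engineering hours, not just the parsing step.

```python
# Data-prep estimate from the rule of thumb: 40-80 data-eng hours plus
# 20-40 SME-review hours per 10K documents, 3x effort for complex PDFs.
# Linear scaling with doc count is an assumption, not a guarantee.

def data_prep_hours(doc_count: int, complex_pdfs: bool = False):
    scale = doc_count / 10_000
    eng_lo, eng_hi = 40 * scale, 80 * scale
    if complex_pdfs:
        # The text scopes the 3x to parsing; applying it to all eng
        # hours is a conservative simplification.
        eng_lo, eng_hi = eng_lo * 3, eng_hi * 3
    sme_lo, sme_hi = 20 * scale, 40 * scale
    return (eng_lo, eng_hi), (sme_lo, sme_hi)

# 8K docs, messy PDF layouts
print(data_prep_hours(8_000, complex_pdfs=True))
```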
3. Runtime cost (the formula)
This is the line everyone obsesses over. Simple version:
monthly_runtime_cost =
requests_per_month
× avg_tokens_per_request
× cost_per_1M_tokens / 1_000_000
With RAG, avg_tokens_per_request should include:
- System prompt tokens
- Retrieved context tokens (often 2-6K)
- User query tokens
- Output tokens (weighted at output price — often 4x input)
A more honest formula:
cost = requests × (prompt_tokens × input_price + output_tokens × output_price)
Watch out for:
- Output tokens priced 3-5x input tokens. Verbose outputs get expensive fast.
- Retries on structured output failures — budget 1.1-1.3x base cost.
- Reasoning models bill for hidden thinking tokens.
- Tool calls that re-send the full conversation history.
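The honest formula, with the retry buffer folded in, can be sketched as a small function. The function name, the example traffic numbers, and the prices are illustrative assumptions, not a real price sheet.

```python
# Sketch of the "more honest" runtime formula above, with a retry
# buffer for structured-output failures. All inputs are illustrative.

def monthly_runtime_cost(
    requests_per_month: int,
    prompt_tokens: int,          # system prompt + retrieved context + query
    output_tokens: int,
    input_price_per_1m: float,   # $ per 1M input tokens
    output_price_per_1m: float,  # $ per 1M output tokens (often 3-5x input)
    retry_buffer: float = 1.2,   # +20% for retries / guardrails
) -> float:
    per_request = (
        prompt_tokens * input_price_per_1m / 1_000_000
        + output_tokens * output_price_per_1m / 1_000_000
    )
    return requests_per_month * per_request * retry_buffer

# 100K requests/mo, 2,000 prompt / 400 output tokens, $0.15/$0.60 per 1M
print(round(monthly_runtime_cost(100_000, 2_000, 400, 0.15, 0.60), 2))
```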
4. Infrastructure
Line items:
- Vector DB: managed (Pinecone, Weaviate Cloud) $70-$500/mo baseline, scales with dimensions × vectors. Self-hosted (pgvector, Qdrant): compute + storage.
- Embedding API calls: one-time for index build, ongoing for new docs and queries. Query-side embeddings add up — factor in.
- Caching layer: Redis or similar for prompt/response caching. $30-$200/mo for modest workloads.
- Compute for orchestration (serverless or container).
Rule of thumb: for a small-to-medium RAG system, infra runs $300-$1500/mo in addition to model API costs.
5. Observability and tooling
Line items:
- LLM-specific observability (Langfuse, Arize Phoenix, Helicone, LangSmith, Braintrust): $0-$2K/mo depending on volume and tier
- Standard infra logging (Datadog, etc.) — usually already paid for
- Eval tool subscriptions if you buy rather than build
Rule of thumb: budget $500-$2K/mo for LLM observability at mid-scale. This is non-negotiable for production — see our Eval Starter Kit.
6. Human cost
Quietly, this is often the largest line.
Line items:
- Review loop: humans reviewing outputs, especially in human-in-the-loop systems. If a reviewer spends 30 seconds per output at $40/hr and you do 1,000 reviews/day, that's roughly $6,700/mo in review labor alone (at ~20 workdays).
- Labeling for ongoing eval set growth: 2-5 hours/week of SME time.
- On-call: partial on-call rotation for the AI system.
- Prompt / eval curator: often an underspecified role that absorbs 0.5 FTE.
Rule of thumb: for a meaningful production system, budget 0.5-1 FTE of ongoing human labor across review, eval curation, and prompt maintenance.
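The review-loop arithmetic above, as a sketch; the 20-workday month is an assumption you should swap for your own calendar.

```python
# Review-labor cost from the line items above: per-output review time
# times reviewer rate times volume. Workdays per month is an assumption.

def monthly_review_cost(reviews_per_day: int, seconds_per_review: float,
                        hourly_rate: float, workdays: int = 20) -> float:
    hours_per_day = reviews_per_day * seconds_per_review / 3600
    return hours_per_day * hourly_rate * workdays

# 1,000 reviews/day at 30 seconds each, $40/hr reviewer
print(round(monthly_review_cost(1_000, 30, 40)))
```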
7. Hidden costs
The ones that show up on the invoice nobody expected.
- Model upgrades: when the provider deprecates your model, you re-run evals, re-tune prompts, re-certify. Budget 2-4 engineering weeks per major provider upgrade, 1-2x per year.
- Prompt rewrites at scale: when a feature changes, 10+ downstream prompts may need updates.
- Retraining / re-embedding: if you switch embedding models, you re-embed everything. Budget for it.
- Compliance and audit work: for regulated industries, 10-20% overhead across the board.
- Incident response: the first real AI incident costs 1-2 weeks of engineering.
Worked example: internal support bot
Internal IT support chatbot for a 5,000-person company. RAG over an 8K-article KB. Launch scope.
Assumptions:
- 50,000 queries/month (10 per user per month, modest)
- Each query: ~1,500 prompt tokens (system + retrieved context + query), ~300 output tokens
- Model: GPT-4o-mini at roughly $0.15/1M input, $0.60/1M output (illustrative)
- Embedding: text-embedding-3-small
- Vector DB: managed (Pinecone starter)
- Observability: Langfuse hobby/team tier
Runtime cost:
input_cost = 50,000 × 1,500 tokens × $0.15 / 1,000,000 = $11.25
output_cost = 50,000 × 300 tokens × $0.60 / 1,000,000 = $9.00
retries/guardrails buffer (+20%) = $4.05
MONTHLY RUNTIME (model) ≈ $24.30
Yes, under $30/mo. This is why teams assume GenAI is free. But keep going.
Embedding cost (query-side): 50,000 queries × ~200 tokens × $0.02/1M = ~$0.20/mo. Negligible.
Embedding cost (one-time index): 8,000 docs × avg 3,000 tokens = 24M tokens × $0.02/1M = ~$0.48. Negligible.
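Pulling the runtime, infra, observability, and human lines together, the steady-state subtotal can be checked in a few lines. All figures are the illustrative ones from this example.

```python
# Reassembling the worked example's steady-state monthly total.
# Every number here is the illustrative figure from the text.

model = 50_000 * (1_500 * 0.15 + 300 * 0.60) / 1_000_000 * 1.2  # runtime + 20% buffer
infra = 70 + 40 + 50        # Pinecone starter + Redis cache + compute
obs   = 200                 # Langfuse, midpoint of the quoted range
human = 2_500 + 500         # 0.25 FTE analyst + on-call share

steady_state = model + infra + obs + human
print(round(model, 2), round(steady_state))
```

The model line lands at about $24; the steady-state subtotal at about $3,384/mo, which the text rounds to ~$3,400.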
Infra:
- Pinecone starter: $70/mo
- Redis cache (ElastiCache small): $40/mo
- Compute (small container or Lambda): $50/mo
- Subtotal: ~$160/mo
Observability:
- Langfuse Team tier: ~$100-$300/mo at this volume
- Subtotal: ~$200/mo
Human:
- 0.25 FTE of an IT analyst curating eval set, reviewing flagged responses, and owning prompts. At $120K fully loaded: ~$2,500/mo
- On-call partial contribution: ~$500/mo
- Subtotal: ~$3,000/mo
Monthly total (steady state): $24 + $160 + $200 + $3,000 ≈ $3,400/mo
Development cost (one-time):
- 8 weeks senior engineer × $10K/week = $80K
- Data prep and chunking: $10K
- UI and integration: $15K
- Subtotal: ~$105K
Year-one total: $105K + 12 × $3,400 ≈ $146K.
Notice: the model API itself was about 0.2% of the year-one cost. Human labor and dev cost were over 96%.
| Category | Monthly | % of steady-state |
|---|---|---|
| Model tokens | $24 | 0.7% |
| Infra | $160 | 4.7% |
| Observability | $200 | 5.9% |
| Human | $3,000 | 88.7% |
| Hidden (buffer 10%) | $340 | — |
| Total (approx) | $3,700/mo | — |
Now imagine you "upgrade" to a frontier model without re-evaluating. Swap to GPT-4o at $2.50/$10 per 1M tokens: runtime jumps from $24 to roughly $405/mo, about a 17x increase. Still not the biggest line. But if you also drop caching ("the model is smarter now") and queries triple in cost, suddenly your "AI upgrade" is a line item.
The model bill is the smallest number on this page and the one that can 10x overnight. Model it with headroom, instrument it, and don't assume "cheap model today" means "cheap forever."
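The swap scenario is easy to recompute with the same traffic assumptions; both price pairs are the illustrative ones used in this example.

```python
# Same traffic, two illustrative price pairs: GPT-4o-mini at $0.15/$0.60
# per 1M tokens vs GPT-4o at $2.50/$10. Buffer of 20% kept in both cases.

tokens_in  = 50_000 * 1_500    # 75M prompt tokens/month
tokens_out = 50_000 * 300      # 15M output tokens/month

mini   = (tokens_in * 0.15 + tokens_out * 0.60) / 1_000_000 * 1.2
four_o = (tokens_in * 2.50 + tokens_out * 10.0) / 1_000_000 * 1.2

print(round(mini, 2), round(four_o, 2))   # ~$24 vs ~$405 per month
```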
Worksheet
Fill in your own:
| Category | Your estimate |
|---|---|
| Q1: Requests/month | |
| Q2: Avg prompt tokens | |
| Q3: Avg output tokens | |
| Q4: Model + price ($/1M in, $/1M out) | |
| Q5: Runtime cost/mo (compute Q1-Q4) | |
| Q6: Vector DB + infra/mo | |
| Q7: Observability/mo | |
| Q8: Human FTE allocated (fraction) | |
| Q9: Human $/mo (Q8 × fully-loaded) | |
| Q10: Hidden buffer (+10-20%) | |
| Steady-state total/mo | |
| Dev cost (one-time) | |
| Year 1 total |
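The worksheet rows above can be wired into a minimal calculator. Field names mirror the Q1-Q10 rows; as a simplification, per-request retries are left to the Q10 hidden buffer rather than applied inside the runtime line.

```python
# Minimal calculator for the worksheet above. Q1-Q10 map to the rows;
# the hidden buffer is applied to the whole steady-state subtotal.

def cost_worksheet(q1_requests, q2_prompt_tokens, q3_output_tokens,
                   q4_in_price, q4_out_price,       # $/1M in, $/1M out
                   q6_infra, q7_obs,                # $/mo
                   q8_fte, fully_loaded_annual,     # fraction, $/yr
                   q10_buffer=0.15,                 # +10-20% hidden buffer
                   dev_cost=0.0):                   # one-time
    q5_runtime = q1_requests * (q2_prompt_tokens * q4_in_price
                                + q3_output_tokens * q4_out_price) / 1_000_000
    q9_human = q8_fte * fully_loaded_annual / 12
    steady_state = (q5_runtime + q6_infra + q7_obs + q9_human) * (1 + q10_buffer)
    year_one = dev_cost + 12 * steady_state
    return q5_runtime, steady_state, year_one

# Placeholder values; substitute your own worksheet answers.
runtime, steady, year_one = cost_worksheet(
    50_000, 1_500, 300, 0.15, 0.60, 160, 200, 0.30, 120_000,
    q10_buffer=0.10, dev_cost=105_000)
print(round(runtime, 2), round(steady), round(year_one))
```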
Next step
If you're building a business case and want a second pair of eyes on the numbers before it goes to the CFO, reach out. We'll pressure-test the assumptions and flag the line items you're likely missing.