GenAI Cost Estimation Template
AI projects go over budget for the same three reasons: token usage is modeled at a fraction of real consumption, hidden human costs get left off the spreadsheet, and the "let's upgrade to the new model" decision later blows up runtime cost by 3-5x. This template is the cost model we walk clients through before any build starts. The numbers won't be exact, but they'll be within 30%, which is more than most projects can say at kickoff.
The tokens are usually the cheapest line item in a GenAI project, and rarely the one that blows up the budget. Model the whole stack.
The seven cost categories
1. Development cost
Line items:
- Prompt iteration: eng time on prompt writing, testing, review
- Eval harness build: one-time engineering cost + ongoing maintenance
- Integration: hooking the system into upstream data and downstream actions
- Frontend / UI if customer-facing
Rule of thumb: 6-12 engineering weeks to get a non-trivial production GenAI feature shipped. Expect a mid-to-senior engineer, not a bootcamp hire. Fully-loaded cost at $250/hr = $60K-$120K.
2. Data prep cost
Line items:
- Source data collection and cleanup
- Labeling (if needed for evals or fine-tuning)
- Chunking and indexing for RAG (one-time and incremental)
- Document parsing (PDFs, scanned docs, tables)
Rule of thumb: for a RAG system with 10K documents, budget 40-80 hours of data engineering + 20-40 hours of SME review time. If documents are PDFs with complex layout, multiply parsing effort by 3x.
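The rule of thumb above can be turned into a quick estimator. This is a sketch under two explicit assumptions: effort scales linearly with document count, and the 3x multiplier is applied to all engineering hours, not just the parsing step.

```python
# Data-prep estimate from the rule of thumb: 40-80 data-eng hours plus
# 20-40 SME-review hours per 10K documents, 3x effort for complex PDFs.
# Linear scaling with doc count is an assumption, not a guarantee.

def data_prep_hours(doc_count: int, complex_pdfs: bool = False):
    scale = doc_count / 10_000
    eng_lo, eng_hi = 40 * scale, 80 * scale
    if complex_pdfs:
        # The text scopes the 3x to parsing; applying it to all eng
        # hours is a conservative simplification.
        eng_lo, eng_hi = eng_lo * 3, eng_hi * 3
    sme_lo, sme_hi = 20 * scale, 40 * scale
    return (eng_lo, eng_hi), (sme_lo, sme_hi)

# 8K docs, messy PDF layouts
print(data_prep_hours(8_000, complex_pdfs=True))
```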
3. Runtime cost (the formula)
This is the line everyone obsesses over. Simple version:
monthly_runtime_cost =
requests_per_month
× avg_tokens_per_request
× cost_per_1M_tokens / 1_000_000
With RAG, avg_tokens_per_request should include:
- System prompt tokens
- Retrieved context tokens (often 2-6K)
- User query tokens
- Output tokens (weighted at output price — often 4x input)
A more honest formula:
cost = requests × (prompt_tokens × input_price + output_tokens × output_price)
Watch out for:
- Output tokens priced 3-5x input tokens. Verbose outputs get expensive fast.
- Retries on structured output failures — budget 1.1-1.3x base cost.
- Reasoning models bill for hidden thinking tokens.
- Tool calls that re-send the full conversation history.
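The honest formula, with the retry buffer folded in, can be sketched as a small function. The function name, the example traffic numbers, and the prices are illustrative assumptions, not a real price sheet.

```python
# Sketch of the "more honest" runtime formula above, with a retry
# buffer for structured-output failures. All inputs are illustrative.

def monthly_runtime_cost(
    requests_per_month: int,
    prompt_tokens: int,          # system prompt + retrieved context + query
    output_tokens: int,
    input_price_per_1m: float,   # $ per 1M input tokens
    output_price_per_1m: float,  # $ per 1M output tokens (often 3-5x input)
    retry_buffer: float = 1.2,   # +20% for retries / guardrails
) -> float:
    per_request = (
        prompt_tokens * input_price_per_1m / 1_000_000
        + output_tokens * output_price_per_1m / 1_000_000
    )
    return requests_per_month * per_request * retry_buffer

# 100K requests/mo, 2,000 prompt / 400 output tokens, $0.15/$0.60 per 1M
print(round(monthly_runtime_cost(100_000, 2_000, 400, 0.15, 0.60), 2))
```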
4. Infrastructure
Line items:
- Vector DB: managed (Pinecone, Weaviate Cloud) $70-$500/mo baseline, scales with dimensions × vectors. Self-hosted (pgvector, Qdrant): compute + storage.
- Embedding API calls: one-time for index build, ongoing for new docs and queries. Query-side embeddings add up — factor in.
- Caching layer: Redis or similar for prompt/response caching. $30-$200/mo for modest workloads.
- Compute for orchestration (serverless or container).
Rule of thumb: for a small-to-medium RAG system, infra runs $300-$1500/mo in addition to model API costs.
5. Observability and tooling
Line items:
- LLM-specific observability (Langfuse, Arize Phoenix, Helicone, LangSmith, Braintrust): $0-$2K/mo depending on volume and tier
- Standard infra logging (Datadog, etc.) — usually already paid for
- Eval tool subscriptions if you buy rather than build
Rule of thumb: budget $500-$2K/mo for LLM observability at mid-scale. This is non-negotiable for production — see our Eval Starter Kit.
6. Human cost
Quietly, this is often the largest line.
Line items:
- Review loop: humans reviewing outputs, especially in human-in-the-loop systems. If a reviewer spends 30 seconds per output at $40/hr and you do 1,000 reviews/day, that's roughly $6,700/mo in review labor alone (at ~20 workdays).
- Labeling for ongoing eval set growth: 2-5 hours/week of SME time.
- On-call: partial on-call rotation for the AI system.
- Prompt / eval curator: often an underspecified role that absorbs 0.5 FTE.
Rule of thumb: for a meaningful production system, budget 0.5-1 FTE of ongoing human labor across review, eval curation, and prompt maintenance.
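The review-loop arithmetic above, as a sketch; the 20-workday month is an assumption you should swap for your own calendar.

```python
# Review-labor cost from the line items above: per-output review time
# times reviewer rate times volume. Workdays per month is an assumption.

def monthly_review_cost(reviews_per_day: int, seconds_per_review: float,
                        hourly_rate: float, workdays: int = 20) -> float:
    hours_per_day = reviews_per_day * seconds_per_review / 3600
    return hours_per_day * hourly_rate * workdays

# 1,000 reviews/day at 30 seconds each, $40/hr reviewer
print(round(monthly_review_cost(1_000, 30, 40)))
```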
7. Hidden costs
The ones that show up on the invoice nobody expected.
- Model upgrades: when the provider deprecates your model, you re-run evals, re-tune prompts, re-certify. Budget 2-4 engineering weeks per major provider upgrade, 1-2x per year.
- Prompt rewrites at scale: when a feature changes, 10+ downstream prompts may need updates.
- Retraining / re-embedding: if you switch embedding models, you re-embed everything. Budget for it.
- Compliance and audit work: for regulated industries, 10-20% overhead across the board.
- Incident response: the first real AI incident costs 1-2 weeks of engineering.
Worked example: internal support bot
Internal IT support chatbot for a 5,000-person company. RAG over an 8K-article KB. Launch scope.
Assumptions:
- 50,000 queries/month (10 per user per month, modest)
- Each query: ~1,500 prompt tokens (system + retrieved context + query), ~300 output tokens
- Model: GPT-4o-mini at roughly $0.15/1M input, $0.60/1M output (illustrative)
- Embedding: text-embedding-3-small
- Vector DB: managed (Pinecone starter)
- Observability: Langfuse hobby/team tier
Runtime cost:
input_cost = 50,000 × 1,500 tokens × $0.15 / 1,000,000 = $11.25
output_cost = 50,000 × 300 tokens × $0.60 / 1,000,000 = $9.00
retries/guardrails buffer (+20%) = $4.05
MONTHLY RUNTIME (model) ≈ $24.30
Yes, under $30/mo. This is why teams assume GenAI is free. But keep going.
Embedding cost (query-side): 50,000 queries × ~200 tokens × $0.02/1M = ~$0.20/mo. Negligible.
Embedding cost (one-time index): 8,000 docs × avg 3,000 tokens = 24M tokens × $0.02/1M = ~$0.48. Negligible.
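Pulling the runtime, infra, observability, and human lines together, the steady-state subtotal can be checked in a few lines. All figures are the illustrative ones from this example.

```python
# Reassembling the worked example's steady-state monthly total.
# Every number here is the illustrative figure from the text.

model = 50_000 * (1_500 * 0.15 + 300 * 0.60) / 1_000_000 * 1.2  # runtime + 20% buffer
infra = 70 + 40 + 50        # Pinecone starter + Redis cache + compute
obs   = 200                 # Langfuse, midpoint of the quoted range
human = 2_500 + 500         # 0.25 FTE analyst + on-call share

steady_state = model + infra + obs + human
print(round(model, 2), round(steady_state))
```

The model line lands at about $24; the steady-state subtotal at about $3,384/mo, which the text rounds to ~$3,400.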
Infra:
- Pinecone starter: $70/mo
- Redis cache (ElastiCache small): $40/mo
- Compute (small container or Lambda): $50/mo
- Subtotal: ~$160/mo
Observability:
- Langfuse Team tier: ~$100-$300/mo at this volume
- Subtotal: ~$200/mo
Human:
- 0.25 FTE of an IT analyst curating eval set, reviewing flagged responses, and owning prompts. At $120K fully loaded: ~$2,500/mo
- On-call partial contribution: ~$500/mo
- Subtotal: ~$3,000/mo
Monthly total (steady state): $24 + $160 + $200 + $3,000 ≈ $3,400/mo
Development cost (one-time):
- 8 weeks senior engineer × $10K/week = $80K
- Data prep and chunking: $10K
- UI and integration: $15K
- Subtotal: ~$105K
Year-one total: $105K + 12 × $3,400 ≈ $146K.
Notice: the model API itself was about 0.2% of the year-one cost. Human labor and dev cost were over 96%.
| Category | Monthly | % of steady-state |
|---|---|---|
| Model tokens | $24 | 0.7% |
| Infra | $160 | 4.7% |
| Observability | $200 | 5.9% |
| Human | $3,000 | 88.7% |
| Hidden (buffer 10%) | $340 | — |
| Total (approx) | $3,700/mo | — |
Now imagine you "upgrade" to a frontier model without re-evaluating. Swap to GPT-4o at $2.50/$10 per 1M tokens: runtime jumps from $24 to roughly $405/mo, about a 17x increase. Still not the biggest line. But if you also drop caching ("the model is smarter now") and queries triple in cost, suddenly your "AI upgrade" is a line item.
The model bill is the smallest number on this page and the one that can 10x overnight. Model it with headroom, instrument it, and don't assume "cheap model today" means "cheap forever."
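The swap scenario is easy to recompute with the same traffic assumptions; both price pairs are the illustrative ones used in this example.

```python
# Same traffic, two illustrative price pairs: GPT-4o-mini at $0.15/$0.60
# per 1M tokens vs GPT-4o at $2.50/$10. Buffer of 20% kept in both cases.

tokens_in  = 50_000 * 1_500    # 75M prompt tokens/month
tokens_out = 50_000 * 300      # 15M output tokens/month

mini   = (tokens_in * 0.15 + tokens_out * 0.60) / 1_000_000 * 1.2
four_o = (tokens_in * 2.50 + tokens_out * 10.0) / 1_000_000 * 1.2

print(round(mini, 2), round(four_o, 2))   # ~$24 vs ~$405 per month
```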
Worksheet
Fill in your own:
| Category | Your estimate |
|---|---|
| Q1: Requests/month | |
| Q2: Avg prompt tokens | |
| Q3: Avg output tokens | |
| Q4: Model + price ($/1M in, $/1M out) | |
| Q5: Runtime cost/mo (compute Q1-Q4) | |
| Q6: Vector DB + infra/mo | |
| Q7: Observability/mo | |
| Q8: Human FTE allocated (fraction) | |
| Q9: Human $/mo (Q8 × fully-loaded) | |
| Q10: Hidden buffer (+10-20%) | |
| Steady-state total/mo | |
| Dev cost (one-time) | |
| Year 1 total |
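The worksheet rows above can be wired into a minimal calculator. Field names mirror the Q1-Q10 rows; as a simplification, per-request retries are left to the Q10 hidden buffer rather than applied inside the runtime line.

```python
# Minimal calculator for the worksheet above. Q1-Q10 map to the rows;
# the hidden buffer is applied to the whole steady-state subtotal.

def cost_worksheet(q1_requests, q2_prompt_tokens, q3_output_tokens,
                   q4_in_price, q4_out_price,       # $/1M in, $/1M out
                   q6_infra, q7_obs,                # $/mo
                   q8_fte, fully_loaded_annual,     # fraction, $/yr
                   q10_buffer=0.15,                 # +10-20% hidden buffer
                   dev_cost=0.0):                   # one-time
    q5_runtime = q1_requests * (q2_prompt_tokens * q4_in_price
                                + q3_output_tokens * q4_out_price) / 1_000_000
    q9_human = q8_fte * fully_loaded_annual / 12
    steady_state = (q5_runtime + q6_infra + q7_obs + q9_human) * (1 + q10_buffer)
    year_one = dev_cost + 12 * steady_state
    return q5_runtime, steady_state, year_one

# Placeholder values; substitute your own worksheet answers.
runtime, steady, year_one = cost_worksheet(
    50_000, 1_500, 300, 0.15, 0.60, 160, 200, 0.30, 120_000,
    q10_buffer=0.10, dev_cost=105_000)
print(round(runtime, 2), round(steady), round(year_one))
```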
Next step
If you're building a business case and want a second pair of eyes on the numbers before it goes to the CFO, reach out. We'll pressure-test the assumptions and flag the line items you're likely missing.