Enterprise Chatbot RFP Checklist

Most enterprise chatbot RFPs are written like it's still 2019 — questions about intents, NLU accuracy, and button-based flows. The vendors have moved on; the RFPs haven't. The result is procurement teams comparing incomparable answers and signing 3-year contracts based on a demo. This checklist is the 30 questions we've drafted into client RFPs that cut through the demoware. Each has a one-line "good answer looks like" so you can score the responses.

The point of the RFP is not to collect marketing copy. It's to force the vendor to describe how their system behaves on a Tuesday afternoon when something goes wrong.

Section 1: Architecture and data

Q1: Describe your retrieval approach in technical terms. What chunker, embedder, vector store, and reranker? Good answer: specifies each component by name, explains why, mentions hybrid (BM25 + dense) or reranking. Red flag: "we use AI."

Q2: How does the system stay current as our source documents change? Good answer: incremental re-indexing on change events, with a named freshness SLO (e.g., "updates propagate within 5 minutes"). Red flag: "weekly full re-index" or silence.

Q3: How do you handle document-level access control? If a user shouldn't see a doc, how is that enforced at retrieval? Good answer: filters applied at query time using the user's group memberships, with an auditable log. Red flag: "we don't retrieve restricted docs" — they mean they hope.

Q4: Do you support multiple knowledge domains with different retrieval strategies? Good answer: yes, with per-domain config for chunking, index, and prompts. Red flag: one global knob.

Q5: What's your approach to chunking? Fixed-size, semantic, or hierarchical? Good answer: semantic or hierarchical with document-type-aware splitting (FAQ vs manual vs policy). Red flag: "we chunk at 512 tokens."

Q6: How are tables, images, and PDFs with layout handled? Good answer: layout-aware parsing (e.g., specific PDF parsers, table-to-markdown), with named tools. Red flag: "we extract the text."

Section 2: Accuracy and evals

Q7: What metrics do you report to us weekly, and how are they computed? Good answer: retrieval recall, answer faithfulness, helpfulness (rubric-based), refusal rate — with computation method. Red flag: "user satisfaction score" with no definition.

Q8: Do you maintain an eval set of our specific queries with known-good answers? Good answer: yes, maintained with the customer, typically 200-500 cases, refreshed quarterly. Red flag: they don't have an eval set at all.

Q9: How do you prevent regressions when you update prompts or models? Good answer: CI-run eval suite with ship gates, regression lock set. Red flag: "our QA team tests it."

Q10: How do you detect when quality drifts in production? Good answer: sampled production scoring, alerts on rolling-average drops, per-category breakdown. Red flag: "users tell us."

Q11: When you swap underlying models (e.g., a provider releases a new version), what's your process? Good answer: side-by-side eval, staged rollout, rollback path. Red flag: "we upgrade when the provider recommends."

Section 3: Guardrails

Q12: How do you handle prompt injection attempts? Good answer: delimited user input, system prompt separation, input classifiers for injection patterns, output validation. Red flag: "our model is trained to ignore them."

Q13: How do you handle PII in user messages and in retrieved documents? Good answer: configurable redaction/tokenization pre-send, audit log of what left the customer tenant. Red flag: "we don't store PII."

Q14: What's your policy for hallucinations — under what conditions will the system refuse to answer? Good answer: explicit refusal when retrieval confidence is low or when sources contradict, with tunable threshold. Red flag: "we minimize hallucinations."

Q15: Can we configure topic and content policies? Examples of topics the bot should refuse or redirect? Good answer: yes, with a policy file we own and can update without a ticket. Red flag: "contact support to configure."

Q16: How are outputs filtered before display (profanity, confidential tokens, toxic content)? Good answer: named classifier or rules layer, per-deployment tunable. Red flag: no filtering layer.

Q17: What happens if the model generates a confidently wrong answer? How is that detected post-hoc? Good answer: thumbs-down signal triage, faithfulness scoring vs cited sources, weekly incident review. Red flag: "we rely on user feedback."

Section 4: Cost

Q18: What model tier(s) is this running on, and can we tier queries by complexity? Good answer: named models, with a routing layer that sends easy queries to cheaper models. Red flag: "we always use the best model."

Q19: What's the expected token spend for our described volume? Show the math. Good answer: per-query tokens × volume × per-token cost, with a high/low range. Red flag: a flat per-query price that hides markup.

Q20: What caching layers exist — response cache, embedding cache, retrieval cache? Good answer: all three with named TTLs and hit rates from current customers. Red flag: "we cache where possible."

Q21: Is pricing pass-through on model costs, or is there a margin? Good answer: transparent breakdown: infra fee + pass-through. Red flag: "our pricing is usage-based" with no components.

Q22: What's the cost impact if usage doubles overnight? Is there a runaway protection? Good answer: per-tenant rate limits and spend caps, alerting. Red flag: "we autoscale."

Section 5: Integration

Q23: SSO options? SAML, OIDC, Okta, Azure AD? Good answer: lists all by name with existing customers on each. Red flag: "we support SSO" with no specifics.

Q24: What APIs does the bot expose to us (analytics, admin, webhooks)? Good answer: documented REST API, rate-limited, versioned. Red flag: "contact your account manager for data."

Q25: What analytics do you provide out of the box, and can we export raw event data? Good answer: conversation-level export to our warehouse (S3, Snowflake, BigQuery), with schema docs. Red flag: dashboard only.

Q26: Can we deploy in our cloud or in a specific region? Good answer: yes, with a named list of supported regions and a deployment option matrix. Red flag: "we're multi-region" with no specifics.

Section 6: Ops

Q27: What observability do we get — logs, traces, metrics, per-conversation replay? Good answer: per-request trace with retrieved chunks, prompts, model version, latency breakdown, streamed to our SIEM. Red flag: aggregate dashboards only.

Q28: What's your SLA? Specifically, what's included in "availability" — just the API, or end-to-end quality? Good answer: uptime SLA with credits, plus a separately-measured quality SLO. Red flag: "99.9%" with no scope.

Q29: Describe your on-call and incident response. What's the time to first human on a P1? Good answer: named tiers, P1 response in minutes, post-mortem within 5 business days shared with the customer. Red flag: "email support@."

Q30: What's the process when we find a regression — quality, not availability? Good answer: intake path, triage SLA, example of a recent one resolved. Red flag: "file a ticket."

Scoring rubric

Score each answer 0-3:

0 — not answered or handwave
1 — answered generically, no specifics
2 — specific, with named tools or processes
3 — specific + evidence (customer example, metric, or documentation link)

Cut any vendor below 60 (of 90 possible). Interview the top 3.

If a vendor needs more than a paragraph to answer Q1-Q3, their system is a black box. Treat that as the pricing information it is.

Next step

If you're writing this RFP now and want a sanity check before it goes out, reach out. We review RFPs on the buyer side and know the vendor answers that don't hold up.

Fillable Template

Unlock the full template.

Drop your details to see the full worksheet and worked examples. Enterprise Chatbot RFP Checklist.