AI Customer Support
A Multi-Agent AI System for Banking Customer Support — Capstone Write-Up, Technical Deep Dive, and Everything I Learned Building It
When a bank customer fires off "My debit card hasn't arrived in three weeks!", something deceptively complex needs to happen. The system must understand that this is a complaint — not a question — empathise with the frustration, generate a unique ticket, persist it to a database, and respond with warmth and a reference number. All in under four seconds. All without a human in the loop.
That's what I set out to build for my Applied Generative AI capstone: a production-grade, multi-agent customer support system for a bank. No monolithic prompt. No single-agent loop. A directed graph of specialised AI agents, each owning exactly one job.
This post walks through what I built, every feature it has, the architectural decisions behind it, and — most importantly — why I chose RAG, why I chose MCP, and why those choices matter beyond this project.
The Problem With "Just Use an LLM"
The first instinct for an AI customer support system is to stuff everything into one prompt:
"You are a banking support assistant.
Classify the message, respond empathetically,
check the database if needed, create tickets if needed..."
This breaks in production for three reasons:
1. Reliability collapses under complexity. A single prompt trying to classify, retrieve data, write to a database, and generate a response will occasionally fail at one of those steps — and you'll have no idea which step failed or why.
2. No auditability. Banking is a regulated domain. When something goes wrong, you need to know: did the classifier misfire? Did the database write fail? Did the LLM hallucinate a policy? A monolithic prompt gives you none of that.
3. No knowledge of your bank. Claude was trained on public internet data. It has zero knowledge of your bank's KYC document requirements, SLA commitments, or card replacement procedures. Ask it, and it will confidently make something up.
The solution is a multi-agent architecture with three key technologies: LangGraph for orchestration, RAG for grounded knowledge, and MCP for tool abstraction. Let me explain each.
What I Built: AI Customer Support
AI Customer Support handles six real customer scenarios with distinct agent paths:
| Customer Says | What Happens |
|---|---|
| "Thank you, your team was amazing!" | Classifier → Positive feedback agent → Personalised warm reply |
| "My debit card hasn't arrived in 3 weeks" | Classifier → Negative feedback agent → Empathetic reply + auto-created ticket |
| "What documents do I need for KYC?" | Classifier → Query router → RAG agent → Grounded answer from policy docs |
| "What is the status of ticket TKT042?" | Classifier → Query router → Ticket lookup → Live DB response |
| "Please close ticket TKT042" | Classifier → Query router → Close ticket (with ownership validation) |
| "Why was my forex card blocked abroad?" | Classifier → Query router → RAG → Low confidence → Fallback ticket |
Every single one of these paths is handled by a different agent — or combination of agents — each with a specific system prompt, specific tools, and specific failure modes.
Tech Stack — Full Rationale
Backend Decision Matrix
| Component | Chosen | Alternatives Considered | Why This Choice |
|---|---|---|---|
| LLM | Claude claude-sonnet-4-5 | GPT-4o, Gemini 1.5 Pro, Mistral | Reliable structured JSON output (critical for classifier), strong instruction following, low hallucination rate on constrained prompts |
| Agent Orchestration | LangGraph 0.2.28 | LangChain AgentExecutor, CrewAI, AutoGen, custom DAG | Explicit stateful graph with TypedDict — full auditability required for banking; conditional edges map directly to routing logic; no "reasoning loop" ambiguity |
| LLM Client | Anthropic SDK 0.34.2 | LangChain ChatAnthropic | Direct SDK avoids LangChain abstraction overhead; simpler retry logic; cleaner streaming interface |
| Vector Store | FAISS-CPU 1.9.0 | Pinecone, Weaviate, Chroma, Qdrant | Zero infrastructure — single Python process, no network calls; in-process L2 search at <5 ms; deployable without external services; swappable via LangChain interface |
| Embedding Model | all-MiniLM-L6-v2 (sentence-transformers) | OpenAI text-embedding-ada-002, Cohere embed, bge-large-en | Local execution (no API key, zero cost, no latency); 384 dimensions sufficient for 5-doc corpus; pre-normalised vectors |
| RAG Framework | LangChain 0.3.1 | LlamaIndex, Haystack | FAISS + RecursiveCharacterTextSplitter integration is standard; only used for ingestion utilities — retrieval logic is custom |
| API Framework | FastAPI 0.115.0 | Flask, Django REST, Starlette | Auto OpenAPI docs (swagger at /docs); Pydantic v2 native; async support for future streaming; dependency injection for DB sessions |
| Database ORM | SQLAlchemy 2.0.35 | Tortoise-ORM, Peewee, raw sqlite3 | Mature, battle-tested; WAL mode support; easy migration path to PostgreSQL |
| Database | SQLite (WAL) | PostgreSQL, MySQL | Zero infrastructure for a capstone demo; WAL mode handles two processes (port 8000 and 8001) reading/writing concurrently; trivial migration path |
| HTTP Client (agent→MCP) | httpx 0.27.2 | requests, aiohttp | Sync + async in one library; connection pooling; HTTPX timeout control critical for MCP call reliability |
| Retry Logic | tenacity 8.x | custom loops, backoff library | Declarative @retry decorator with exponential backoff; used on Claude API calls |
| Validation | Pydantic v2 2.9.2 | dataclasses, attrs, marshmallow | V2 performance improvement; native FastAPI integration; used in both API schemas and MCP schemas |
Frontend Decision Matrix
| Component | Chosen | Alternatives Considered | Why This Choice |
|---|---|---|---|
| UI Framework | React 18 | Vue 3, Svelte, SolidJS | Widest ecosystem; hooks model clean for chat + streaming state; team familiarity in most settings |
| Build Tool | Vite 5 | Create React App, Webpack, Parcel | Sub-100 ms HMR; first-class TypeScript; native ESM in dev |
| State Management | Zustand 5 | Redux Toolkit, Jotai, Recoil, Context | Minimal boilerplate; no provider wrapping; perfect for chat message array + customer ID |
| HTTP | Axios 1.7.7 | Fetch API, SWR, React Query | Interceptors for error handling; automatic JSON parsing; cleaner TypeScript generics |
| Styling | Tailwind CSS 3 | styled-components, CSS Modules, Emotion | Utility-first prevents style drift; design tokens via config; no JS-in-CSS overhead |
| Icons | Lucide React | Heroicons, Feather, React Icons | Consistent line-weight; tree-shakeable; named imports |
| Router | React Router v6 | TanStack Router, Next.js | Standard SPA routing; declarative route tree |
The Agent Graph: 9 Nodes, 3 Decision Points
The system is a directed acyclic graph built with LangGraph. Think of it as a flowchart where every box is an AI agent and every arrow is a conditional routing decision.
Customer Message
│
┌────────▼─────────┐
│ CLASSIFIER │ ← Claude reads the message,
│ AGENT │ outputs a JSON label
└────────┬─────────┘
│
┌─────────────────┼──────────────────┐
▼ ▼ ▼
[Positive Feedback] [Negative Feedback] [Query Router]
Agent Agent Agent
(warm reply) (empathy + ticket) (intent flags)
│
┌───────────────────┼──────────────────┐
▼ ▼ ▼
[Close Ticket] [Ticket Lookup] [RAG Agent]
Agent Agent (FAISS search)
│
┌────────────┴────────────┐
▼ ▼
[Grounded [Fallback Ticket]
Answer] Agent
(conf ≥ 0.55) (conf < 0.55)
│
└──────► [LOG NODE] ──► END
Three things make this powerful:
Every node is replaceable. Want to swap Claude for Gemini on the classifier? Change one function. The rest of the graph is untouched.
State flows through the entire graph. Every node reads from and writes to a shared AgentState TypedDict. By the time the response reaches the user, the state carries the customer ID, classification label, RAG confidence score, ticket ID, route chain, tool called, and list of agents invoked — all assembled automatically.
Routing is explicit, not inferred. In reasoning-loop architectures (like ReAct agents), the LLM decides what to do next. Here, conditional Python functions make those decisions. This matters enormously in banking: you want deterministic, auditable routing — not an LLM that occasionally takes a creative detour.
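As a concrete sketch of what "explicit routing" means here, the conditional edges are ordinary Python functions over the shared state — no LLM involved in the decision itself. Names below mirror the node names in the diagram but are illustrative, not the project's actual code:

```python
from typing import Optional, TypedDict

class AgentState(TypedDict, total=False):
    """Subset of the shared state that flows through every node (sketch)."""
    message: str
    classification: str          # set by the classifier node
    rag_confidence: Optional[float]
    route_taken: str

def route_after_classification(state: AgentState) -> str:
    """Deterministic 3-way branch: a dict lookup, fully unit-testable."""
    return {
        "positive_feedback": "positive_feedback_node",
        "negative_feedback": "negative_feedback_node",
        "query": "query_router_node",
    }[state["classification"]]

def route_after_rag(state: AgentState) -> str:
    """The confidence gate: answer only when retrieval looked good."""
    if state.get("rag_confidence") is not None and state["rag_confidence"] >= 0.55:
        return "log_node"
    return "fallback_ticket_node"
```

In LangGraph these functions would be wired in via `add_conditional_edges`; the point is that the branch itself is plain, auditable Python.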
Feature 1: Intent Classification
Every message enters through the same gate: the Classifier Agent.
Input: "My debit card hasn't arrived in 3 weeks"
Output: { "classification": "negative_feedback" }
Input: "Thank you so much for your help!"
Output: { "classification": "positive_feedback" }
Input: "What is the status of ticket TKT042?"
Output: { "classification": "query" }
The system prompt is deliberately minimal:
"You are a classifier. Categorise the message into exactly one of: positive_feedback, negative_feedback, query. Return only valid JSON:
{"classification": "<label>"}"
The constraint to output only JSON is critical. No preamble, no explanation — just the label. This makes the classifier composable: the routing function after it reads state["classification"] and picks the next node. If the LLM outputs natural language instead, the routing breaks. The JSON-only instruction prevents that.
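Even with a JSON-only instruction, production code should parse defensively. A minimal sketch of the parsing step, assuming (my assumption, not the project's documented behaviour) that unparseable output falls back to the `query` route as the safest default:

```python
import json

VALID_LABELS = {"positive_feedback", "negative_feedback", "query"}

def parse_classification(raw: str) -> str:
    """Parse the classifier's JSON-only reply.

    Falls back to 'query' (assumed safest route) if the model drifted
    into natural language or emitted an unknown label.
    """
    try:
        data = json.loads(raw.strip())
        label = data.get("classification", "") if isinstance(data, dict) else ""
    except json.JSONDecodeError:
        return "query"
    return label if label in VALID_LABELS else "query"
```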
Feature 2: Personalised Positive Feedback
When the classifier returns positive_feedback, the system does something small but meaningful: it fetches the customer's name from the database before generating the reply.
MCP call: GET /mcp/get_customer_profile/CUST001
← { customer_name: "Priya Sharma", segment: "PREMIUM" }
Claude generates: "Thank you for your kind words, Priya! We're absolutely
delighted to hear about your positive experience with SecureBank. Your
satisfaction means the world to us..."
The difference between "Thank you for your feedback!" and "Thank you, Priya!" seems minor. In customer support, it's the entire experience.
Feature 3: Negative Feedback + Auto-Ticket
This is where the system earns its keep. When a complaint comes in, three things happen automatically:
- A unique ticket number is generated (TKT042 — 3 uppercase letters + 3 digits = 17.5 million combinations)
- The ticket is written to the database with status OPEN
- Claude generates an empathetic response that includes the ticket number
Customer: "My debit card replacement still hasn't arrived after 3 weeks!"
System:
→ MCP: generate_ticket_number() → "TKT042"
→ MCP: create_support_ticket(...) → ticket persisted, status = OPEN
→ Claude: "We sincerely apologize for the inconvenience, Priya. We've
created ticket TKT042 for your debit card replacement. Our team will
follow up within 24 hours..."
No human had to read the complaint, assess it, or create a ticket. The entire triage happened in under 3 seconds.
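The ticket-number scheme is simple enough to sketch in a few lines. This is an illustrative implementation (the real `generate_ticket_number` MCP tool presumably also checks the database for collisions):

```python
import random
import string

def generate_ticket_number(existing: set) -> str:
    """3 uppercase letters + 3 digits: 26**3 * 10**3 = 17,576,000 combos.

    Regenerates on the (rare) collision with an already-issued ticket.
    """
    while True:
        candidate = (
            "".join(random.choices(string.ascii_uppercase, k=3))
            + "".join(random.choices(string.digits, k=3))
        )
        if candidate not in existing:
            return candidate
```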
Feature 4: RAG-Grounded Policy Q&A
This is the most technically interesting feature — and the one that required the most thought.
When a customer asks "What documents do I need for KYC at SecureBank?", Claude has no idea. It was trained on generic internet data, not your bank's specific policies. If you ask it without grounding, it will either refuse or hallucinate a plausible-sounding answer that may be completely wrong.
RAG (Retrieval-Augmented Generation) solves this. Here's how it works end to end:
Step 1: Ingest Policy Documents (Run Once)
The bank's policy documents are chunked into ~100 overlapping pieces of 512 characters each, then converted to 384-dimensional vectors using a local embedding model (all-MiniLM-L6-v2). These vectors are stored in a FAISS index on disk.
debit_card_policy.txt (5.4 KB) ─┐
kyc_guidelines.txt (6.2 KB) ─┤ → ~100 text chunks
dispute_resolution.txt (6.9 KB) ─┤ → 100 × 384-dim vectors
net_banking_reset.txt (5.8 KB) ─┤ → FAISS index (saved to disk)
sla_commitments.txt (6.8 KB) ─┘
Step 2: Retrieve on Every Query
When the customer asks a question, it's embedded with the same model and compared against all 100 stored vectors using L2 distance. The four closest chunks are returned in ~5 milliseconds.
Step 3: Ground Claude's Answer
Claude never reads the original documents. It only sees the retrieved chunks, with a strict instruction:
"Base your answer STRICTLY on the provided context documents. Do NOT use external knowledge or make assumptions. If the context does not contain enough information, say so."
The result: Claude answers with the bank's actual policies, and cannot invent content it wasn't given.
Step 4: Confidence Gate
This is the part most RAG tutorials skip. Not every customer question is answerable from the knowledge base. A query about "why my forex card was blocked in Singapore" might not appear in any policy document.
The system computes a confidence score after retrieval:
confidence = 0.7 × (top match score) + 0.3 × (average of top 3 scores)
If confidence ≥ 0.55: answer using the retrieved chunks.
If confidence < 0.55: don't guess — create a fallback ticket for a human specialist.
This single gate is what prevents the system from confidently hallucinating answers to questions it doesn't know.
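The gate is small enough to show in full. A minimal sketch, assuming the L2 distances have already been converted into similarity scores in (0, 1]:

```python
def rag_confidence(similarities: list) -> float:
    """confidence = 0.7 * top-1 + 0.3 * mean(top-3), per the formula above."""
    top = sorted(similarities, reverse=True)
    top3 = top[:3]
    return 0.7 * top[0] + 0.3 * (sum(top3) / len(top3))

def should_answer(similarities: list, threshold: float = 0.55) -> bool:
    """True → ground the answer in the chunks; False → fallback ticket."""
    return rag_confidence(similarities) >= threshold
```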
Feature 5: Live Ticket Status Lookup
Customer: "What is the status of ticket TKT042?"
→ Query router extracts "TKT042" from the message
→ MCP: GET /mcp/get_ticket_status/TKT042
← { status: "IN_PROGRESS", days_open: 2, sla_breached: false }
→ Claude: "Your ticket TKT042 is currently marked as In Progress.
It was opened 2 days ago and is within our SLA commitment."
The SLA breach flag is computed in real time — not stored. If a ticket is OPEN and older than 3 days, sla_breached is true. This means it's always accurate, with no batch job needed.
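Computing the flag on read rather than storing it is a one-liner. A sketch under the stated rule (OPEN and older than 3 days), with the clock injectable for testing:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SLA_DAYS = 3  # per the "older than 3 days" rule above

def sla_breached(status: str, created_at: datetime,
                 now: Optional[datetime] = None) -> bool:
    """Computed on every read, never stored — so it can't go stale."""
    now = now or datetime.now(timezone.utc)
    return status == "OPEN" and (now - created_at) > timedelta(days=SLA_DAYS)
```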
Feature 6: Ownership-Validated Ticket Closure
A subtle but important security feature: customers can only close their own tickets.
Customer CUST001: "Please close ticket TKT042"
MCP ownership check:
ticket.customer_id == "CUST001"? ✅
→ UPDATE support_tickets SET status='CLOSED' WHERE ticket_id='TKT042'
Customer CUST002 trying to close CUST001's ticket:
ticket.customer_id == "CUST002"? ❌ HTTP 403
→ "You can only close tickets that belong to your account."
This business rule lives in exactly one place — the MCP tool layer — and is enforced regardless of which agent calls the tool.
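A sketch of how that single enforcement point might look inside the tool layer. The exception-to-403 mapping and the dict-shaped ticket are simplifications of the real FastAPI + SQLAlchemy code:

```python
class OwnershipError(Exception):
    """Raised by the tool; the MCP server layer maps it to HTTP 403 (sketch)."""

def close_ticket(ticket: dict, caller_customer_id: str) -> dict:
    """The ownership rule lives here, in the tool — not in any agent."""
    if ticket["customer_id"] != caller_customer_id:
        raise OwnershipError(
            "You can only close tickets that belong to your account."
        )
    ticket["status"] = "CLOSED"
    return ticket
```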
Feature 7: Full Agent Trace on Every Response
Every response the system generates includes a collapsible debug trace in the UI:
{
"classification": "query",
"route_taken": "classifier → query_router → ticket_lookup(TKT042:IN_PROGRESS)",
"rag_confidence": null,
"tool_called": "get_ticket_status",
"agents_invoked": ["classifier_node", "query_router_node", "ticket_lookup_node", "log_node"],
"latency_ms": 2341
}
In a banking context, this isn't optional — it's necessary. When a response is wrong, you need to know if the classifier mislabelled, if the query router extracted the wrong ticket number, or if the LLM ignored the retrieved context.
The Technology Behind It: Three Core Choices
Why LangGraph?
LangGraph treats the agent pipeline as a stateful directed graph. Each node is a function that receives the full agent state and returns an updated state. Routing decisions are explicit Python functions, not LLM inferences.
The alternative is a "reasoning loop" agent (like LangChain's AgentExecutor or AutoGen), where the LLM itself decides what to do next. For general-purpose assistants, that's fine. For banking — where every decision needs to be auditable and every routing path needs to be testable — explicit graph routing is the right choice.
LangGraph also gives you the trace for free. Because state grows as it passes through each node (appending to route_taken and agents_invoked), you get a complete audit trail without any extra instrumentation.
| Framework | Routing | State | Auditability | Banking Suitability |
|---|---|---|---|---|
| LangGraph | Explicit conditional edges | TypedDict, full propagation | Full trace | ✅ High |
| CrewAI | Agent roles with task assignment | Per-agent memory | Medium | ⚠️ Medium |
| AutoGen | Multi-agent conversation | Message history | Low | ⚠️ Low |
| LangChain AgentExecutor | ReAct loop (reason-act) | Tool call history | Low | ❌ Low |
| Custom Python | Manual if/else | Dict passing | Custom | ✅ (if well-built) |
Why MCP?
MCP (Model Context Protocol) is an architectural pattern where tools — database operations, API calls, business logic — are exposed as HTTP endpoints for agents to call. Agents never import database models or write SQL.
Here's the concrete benefit. Imagine the negative feedback agent needs to check ownership before updating a ticket. Without MCP, that ownership check logic lives inside the agent. If you add a second agent that also updates tickets, you either duplicate the check or extract it into a shared utility. Then a third agent. Then a fourth.
With MCP, the ownership check lives in one place: the update_ticket_status tool. Every agent that calls this endpoint gets the validation automatically. Change the business rule once; all agents benefit immediately.
WITHOUT MCP WITH MCP
───────────────────────────────── ──────────────────────────────────────
Agent imports db models directly Agent calls HTTP endpoint
Agent writes SQLAlchemy queries MCP server owns all DB logic
Agent knows about TicketStatus enum Agent only knows tool name + JSON shape
Change DB schema → fix every agent Change DB schema → fix MCP only
Can't reuse logic across agents Any agent can call any tool
The six tools in this system cover the complete lifecycle of a support interaction:
| Tool | What It Does |
|---|---|
| generate_ticket_number | Returns a unique alphanumeric ID |
| create_support_ticket | Persists a new ticket with deduplication |
| get_ticket_status | Fetches live status with SLA breach calculation |
| update_ticket_status | Updates status with ownership validation |
| get_customer_profile | Returns profile with email masking (PII protection) |
| log_interaction | Async audit trail — never blocks the response |
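On the agent side, the whole contract reduces to "tool name + JSON shape". A sketch of the agent-side client with the transport injected (the real project uses httpx with explicit timeouts; injecting it here just makes the shape testable without a running server — all names are illustrative):

```python
from typing import Any, Callable, Optional

MCP_BASE = "http://localhost:8001"  # assumed MCP server address

# A transport takes (method, url, json_payload) and returns the parsed JSON.
Transport = Callable[[str, str, Optional[dict]], dict]

def call_tool(transport: Transport, method: str, name: str,
              payload: Optional[dict] = None) -> dict:
    """Every DB-touching action funnels through this one narrow choke point.

    Agents know only the tool name and the JSON shape — never the schema.
    """
    return transport(method, f"{MCP_BASE}/mcp/{name}", payload)
```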
Why RAG?
The short answer: because you cannot fine-tune your way to correctness, and you cannot prompt-engineer your way to accuracy on proprietary policies.
Fine-tuning would require retraining the model every time a policy changes. RAG requires adding a .txt file and re-running an ingestion script.
Prompt-stuffing the entire policy corpus into the context window is expensive, slow, and hits token limits. RAG retrieves only the relevant 4 chunks — around 2,000 characters — per query.
The key insight is that RAG separates what the model knows how to do (generate fluent, empathetic text) from what it knows (your specific bank's policies). Claude provides the language skill; the FAISS index provides the knowledge.
System-Level Design
Process Topology
┌─────────────────────────────────────────────────────────────────────────────┐
│ Client Browser │
│ React SPA (Vite dev server :5173 / Nginx :80 in prod) │
│ │
│ Pages: Chat │ Tickets │ Logs │ Evaluation │
└──────────────────────────────┬──────────────────────────────────────────────┘
│ HTTP/JSON (REST over localhost or CDN)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ FastAPI Application Server (:8000) │
│ │
│ POST /api/query → routers/query.py → run_graph() │
│ GET /api/tickets → routers/tickets.py │
│ GET /api/logs → routers/logs.py │
│ GET /api/evaluation → routers/evaluation.py │
│ GET /health │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ LangGraph Orchestrator │ │
│ │ (compiled StateGraph, singleton at module import) │ │
│ │ │ │
│ │ classifier_node → route_after_classification │ │
│ │ ├── positive_feedback_node │ │
│ │ ├── negative_feedback_node │ │
│ │ └── query_router_node → route_after_query │ │
│ │ ├── ticket_lookup_node │ │
│ │ ├── close_ticket_node │ │
│ │ └── rag_node → route_after_rag │ │
│ │ ├── log_node → END │ │
│ │ └── fallback_ticket_node → log_node │ │
│ └───────────────────┬────────────────────────┬───────────────────┘ │
│ │ httpx │ local │
│ ▼ ▼ │
│ MCP Server calls FAISS Index │
│ (localhost:8001) (rag/faiss_index/) │
└──────────────────────┬─────────────────────────────────────────────────────┘
│ HTTP/JSON (inter-process, localhost)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ MCP Tool Server (:8001) — FastAPI app (mcp_app) │
│ │
│ POST /mcp/generate_ticket_number │
│ POST /mcp/create_support_ticket │
│ GET /mcp/get_ticket_status/{ticket_id} │
│ POST /mcp/update_ticket_status │
│ GET /mcp/get_customer_profile/{customer_id} │
│ POST /mcp/log_interaction │
│ │
│ mcp/server.py → mcp/tools.py → db/crud.py │
└──────────────────────────────────────┬──────────────────────────────────────┘
│ SQLAlchemy ORM
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ SQLite (banking_support.db, WAL mode) │
│ │
│ Tables: customers · support_tickets · interaction_logs │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ Anthropic API (external, HTTPS) │
│ claude-sonnet-4-5 — called from llm_client.py by every agent node │
└─────────────────────────────────────────────────────────────────────────────┘
Request Lifecycle — Sequence Diagram
Browser FastAPI(:8000) LangGraph MCP(:8001) Anthropic API
│ │ │ │ │
│ POST /api/query │ │ │ │
│ ─────────────────►│ │ │ │
│ │ run_graph() │ │ │
│ │──────────────►│ │ │
│ │ │ classify_message │ │
│ │ │─────────────────────────────────►│
│ │ │◄─────────────────────────────────│
│ │ │ (label: "negative_feedback") │
│ │ │ │ │
│ │ │ POST /mcp/generate_ticket_number │
│ │ │─────────────────►│ │
│ │ │◄─────────────────│ │
│ │ │ ("TKT042") │ │
│ │ │ │ │
│ │ │ POST /mcp/create_support_ticket │
│ │ │─────────────────►│ │
│ │ │◄─────────────────│ │
│ │ │ (ticket created) │ │
│ │ │ │ │
│ │ │ generate empathy reply │
│ │ │─────────────────────────────────►│
│ │ │◄─────────────────────────────────│
│ │ │ │ │
│ │ │ POST /mcp/log_interaction (async) │
│ │ │─────────────────►│ │
│ │ │ (fire-and-forget) │ │
│ │ │ │ │
│ │◄──────────────│ final state │ │
│ │ build response│ │ │
│◄──────────────────│ │ │ │
│ { response, trace } │ │ │
Thread Safety Model
Port 8000 (FastAPI) Port 8001 (MCP Server)
│ │
│ Both processes share │
└──────────────┬───────────────┘
│
banking_support.db
(SQLite, WAL mode)
WAL (Write-Ahead Log) mode allows:
- Multiple concurrent READERS
- One WRITER at a time
- Readers don't block the writer
- busy_timeout = 5000 ms makes a blocked writer wait up to 5 s instead of failing immediately with "database is locked"
PRAGMA journal_mode = WAL; ← set at connection time in db/database.py
PRAGMA busy_timeout = 5000; ← 5s timeout before "database locked" error
FastAPI + MCP Server each use scoped_session per request thread.
Connection returned to pool after each request — no cross-process leaks.
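The PRAGMAs above can be applied at connection time. A stdlib-sqlite3 sketch of what db/database.py does (the project itself sets these through SQLAlchemy, so treat the function below as illustrative):

```python
import sqlite3

def connect(db_path: str) -> sqlite3.Connection:
    """Open a SQLite connection with WAL mode and a 5 s busy timeout."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode = WAL;")   # concurrent readers + one writer
    conn.execute("PRAGMA busy_timeout = 5000;")  # wait before 'database is locked'
    return conn
```

Note: in-memory databases ignore WAL (they report `memory`); the setting only takes effect on file-backed databases like banking_support.db.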
Module Dependency Graph
main.py
└── routers/query.py
└── agents/orchestrator.py ← compiled LangGraph DAG
├── agents/classifier_agent.py
│ └── agents/llm_client.py
├── agents/feedback_agent.py
│ ├── agents/llm_client.py
│ └── httpx → MCP :8001
├── agents/query_router_agent.py
│ └── agents/llm_client.py
├── agents/rag_agent.py
│ ├── rag/retriever.py
│ │ └── rag/faiss_index/ (disk)
│ ├── agents/llm_client.py
│ └── httpx → MCP :8001
├── agents/ticket_agent.py
│ ├── agents/llm_client.py
│ └── httpx → MCP :8001
└── (log_node inline in orchestrator.py)
└── httpx → MCP :8001
mcp/server.py (mcp_app)
└── mcp/tools.py
└── db/crud.py
└── db/models.py
└── db/database.py (SQLite engine)
rag/ingest.py (one-time script)
└── rag/documents/*.txt
└── LangChain loaders + splitter
└── sentence-transformers (HuggingFaceEmbeddings)
└── FAISS → rag/faiss_index/ (disk)
Architecture Diagram
High-Level Architecture
╔═══════════════════════════════════════════════════════════════════════════╗
║ AI Customer Support PLATFORM ║
║ ║
║ ┌─────────────────────────────────────────────────────────────────────┐ ║
║ │ PRESENTATION LAYER │ ║
║ │ │ ║
║ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐ │ ║
║ │ │ Chat │ │ Tickets │ │ Logs │ │ Evaluation │ │ ║
║ │ │ Page │ │ Page │ │ Page │ │ Dashboard │ │ ║
║ │ └────┬─────┘ └─────┬────┘ └────┬─────┘ └──────┬────────┘ │ ║
║ │ └───────────────┴─────────────┴─────────────────┘ │ ║
║ │ Axios + Zustand │ ║
║ └──────────────────────────────────┬───────────────────────────────────┘ ║
║ │ REST/JSON ║
║ ┌──────────────────────────────────▼───────────────────────────────────┐ ║
║ │ API GATEWAY LAYER │ ║
║ │ FastAPI :8000 │ ║
║ │ POST /api/query │ GET /api/tickets │ GET /api/logs │ ║
║ └──────────────────────────────────┬───────────────────────────────────┘ ║
║ │ ║
║ ┌──────────────────────────────────▼───────────────────────────────────┐ ║
║ │ ORCHESTRATION LAYER │ ║
║ │ LangGraph StateGraph │ ║
║ │ │ ║
║ │ ┌─────────────┐ │ ║
║ │ │ CLASSIFIER │─── positive ──►┌──────────────────────┐ │ ║
║ │ │ NODE │─── negative ──►│ POSITIVE FEEDBACK │ │ ║
║ │ │ (Claude) │ │ NODE (Claude) │ │ ║
║ │ └──────┬──────┘ └──────────────────────┘ │ ║
║ │ │ query │ ║
║ │ │ ┌──────────────────────┐ │ ║
║ │ ▼ │ NEGATIVE FEEDBACK │ │ ║
║ │ ┌─────────────┐ │ NODE (Claude + MCP) │ │ ║
║ │ │ QUERY │── close+num ──►└──────────────────────┘ │ ║
║ │ │ ROUTER │ ┌──────────────────────┐ │ ║
║ │ │ NODE │── num only ───►│ CLOSE TICKET NODE │ │ ║
║ │ │ (Claude) │── no number ──►│ (Claude + MCP) │ │ ║
║ │ └─────────────┘ ├──────────────────────┤ │ ║
║ │ │ TICKET LOOKUP NODE │ │ ║
║ │ │ (Claude + MCP) │ │ ║
║ │ └──────────────────────┘ │ ║
║ │ │ ║
║ │ ┌──────────────────────┐ │ ║
║ │ │ RAG NODE │──conf≥0.55──► │ ║
║ │ │ (FAISS + Claude) │ │ ║
║ │ └──────────┬───────────┘ │ ║
║ │ │ conf<0.55 │ ║
║ │ ▼ │ ║
║ │ ┌──────────────────────┐ │ ║
║ │ │ FALLBACK TICKET │ │ ║
║ │ │ NODE (Claude + MCP) │ │ ║
║ │ └──────────────────────┘ │ ║
║ │ │ all paths │ ║
║ │ ▼ │ ║
║ │ ┌──────────────────────┐ │ ║
║ │ │ LOG NODE │──────► END │ ║
║ │ │ (MCP async) │ │ ║
║ │ └──────────────────────┘ │ ║
║ └───────────────────────────────────────┬───────────────────────────-┘ ║
║ ┌────────────────────────┤ ║
║ │ │ ║
║ ┌───────────────▼───────┐ ┌─────────────▼──────────────────────────┐ ║
║ │ KNOWLEDGE LAYER │ │ TOOL LAYER (MCP :8001) │ ║
║ │ │ │ │ ║
║ │ FAISS Vector Index │ │ generate_ticket_number │ ║
║ │ all-MiniLM-L6-v2 │ │ create_support_ticket │ ║
║ │ 384-dim embeddings │ │ get_ticket_status │ ║
║ │ │ │ update_ticket_status │ ║
║ │ 5 policy documents: │ │ get_customer_profile │ ║
║ │ · debit_card_policy │ │ log_interaction │ ║
║ │ · kyc_guidelines │ │ │ ║
║ │ · dispute_resolution │ │ Business Rules: │ ║
║ │ · net_banking_reset │ │ · Ownership validation │ ║
║ │ · sla_commitments │ │ · SLA breach detection │ ║
║ │ │ │ · Email masking (PII) │ ║
║ └───────────────────────┘ │ · Ticket deduplication │ ║
║ └──────────────────────┬────────────────-─┘ ║
║ │ SQLAlchemy ORM ║
║ ┌──────────────────────▼────────────────────┐ ║
║ │ PERSISTENCE LAYER │ ║
║ │ SQLite (WAL mode) │ ║
║ │ customers │ support_tickets │ logs │ ║
║ └────────────────────────────────────────────┘ ║
║ ║
║ ┌────────────────────────────────────────────────────────────────────┐ ║
║ │ EXTERNAL SERVICES │ ║
║ │ Anthropic API (claude-sonnet-4-5) │ ║
║ │ Called by: classifier, feedback, rag, ticket agents │ ║
║ └────────────────────────────────────────────────────────────────────┘ ║
╚═══════════════════════════════════════════════════════════════════════════╝
Data Layer Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA ARCHITECTURE │
│ │
│ STRUCTURED DATA (SQLite) VECTOR DATA (FAISS on disk) │
│ ───────────────────── ───────────────────────────── │
│ │
│ customers rag/faiss_index/ │
│ ┌──────────────────────┐ ┌─────────────────────────────┐ │
│ │ customer_id (PK) │ │ index.faiss │ │
│ │ customer_name │ │ (FAISS FlatL2 index) │ │
│ │ segment │ │ ~100 384-dim float vectors │ │
│ │ email (maskable) │ └─────────────────────────────┘ │
│ │ account_since │ │
│ │ preferred_lang │ ┌─────────────────────────────┐ │
│ │ created_at │ │ index.pkl │ │
│ └──────────┬───────────┘ │ (LangChain FAISS wrapper) │ │
│ │ 1:N │ doc metadata + content │ │
│ support_tickets └─────────────────────────────┘ │
│ ┌──────────────────────┐ │
│ │ ticket_id (PK) │ SOURCE DOCUMENTS │
│ │ customer_id (FK) │ 5 × .txt policy files (~31 KB) │
│ │ issue_text │ Chunked: ~100 × 512-char pieces │
│ │ status │ Embedded once, stored in FAISS │
│ │ sla_breached │ │
│ │ created_at │ │
│ │ updated_at │ │
│ └──────────┬───────────┘ │
│ │ 1:N (nullable) │
│ interaction_logs │
│ ┌──────────────────────┐ │
│ │ id (PK auto) │ │
│ │ customer_id (FK) │ │
│ │ message │ │
│ │ classification │ │
│ │ route_taken │ │
│ │ response_text │ │
│ │ tool_called │ │
│ │ rag_confidence │ │
│ │ ticket_id (FK null) │ │
│ │ created_at │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
RAG Pipeline — Deep Dive
The Two Phases
RAG has two completely separate phases. Ingestion happens once. Retrieval happens on every query.
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 1 — INGESTION │
│ Run once: python -m rag.ingest │
│ │
│ Policy documents (.txt) │
│ │ │
│ ▼ Split into chunks │
│ │ RecursiveCharacterTextSplitter │
│ │ chunk_size = 512 chars · overlap = 128 chars │
│ │ │
│ ▼ ~100 overlapping text chunks │
│ │ │
│ ▼ Convert each chunk to a vector │
│ │ HuggingFaceEmbeddings (all-MiniLM-L6-v2) │
│ │ → 384 floating-point numbers per chunk │
│ │ │
│ ▼ Build the FAISS index │
│ │ FAISS.from_documents(chunks, embeddings) │
│ │ │
│ └──▶ Saved to disk │
│ rag/faiss_index/index.faiss │
│ rag/faiss_index/index.pkl │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 2 — RETRIEVAL │
│ Runs on every customer query │
│ │
│ Customer question: "What documents do I need for KYC?" │
│ │ │
│ ▼ Embed the question with the same model │
│ │ all-MiniLM-L6-v2 → [0.21, -0.44, 0.87, ...] │
│ │ │
│ ▼ Search the FAISS index │
│ │ similarity_search_with_score(query_vector, k=4) │
│ │ → returns top 4 closest chunks + their L2 distances │
│ │ │
│ ▼ Convert distances → similarity scores │
│ │ score = 1 / (1 + L2_distance) → range (0, 1] │
│ │ │
│ ▼ Compute confidence │
│ │ conf = 0.7 × top1_score + 0.3 × avg(top3_scores) │
│ │ │
│ ├── conf ≥ 0.55 ──▶ Send chunks to Claude → answer │
│ │ │
│ └── conf < 0.55 ──▶ Fallback ticket created │
└─────────────────────────────────────────────────────────────────────┘
How Embeddings Work
An embedding model converts text into a list of numbers (a vector) that captures the meaning of the text. Texts with similar meaning produce numerically similar vectors.
Text Vector (384 numbers)
─────────────────────────────────────────────────────────────────────
"What documents for KYC?" → [0.21, -0.44, 0.87, 0.13, ...]
"KYC requires Aadhaar or PAN..." → [0.23, -0.41, 0.85, 0.15, ...]
↑ very similar numbers = same topic ✅
"Cards dispatched via post..." → [-0.31, 0.72, -0.12, -0.55, ...]
↑ very different numbers = different topic ❌
FAISS finds the KYC chunk as the closest match because its vector is numerically close to the query vector. This is pure math — no LLM involved at this step.
Why We Chunk Documents
Full document (150 lines) — too large
┌───────────────────────────────────────────────────────────────┐
│ Debit Card Policy │
│ Section 1: Issuance... │ ← about issuance
│ Section 2: Replacement... │ ← about replacement
│ Section 3: Blocking... │ ← about blocking
└───────────────────────────────────────────────────────────────┘
One vector for all topics → weak signal for any specific question
After chunking (512 chars each, 128 overlap)
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │
│ (Issuance) │ │ (Replacement) │ │ (Blocking) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
→ high score for → high score for → high score for
issuance query replacement query blocking query
Why overlap of 128 characters? Sentences that span a chunk boundary are not lost. The last 128 characters of each chunk are repeated as the first 128 of the next, preserving context continuity.
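The overlap mechanics can be sketched as a sliding window that steps `size - overlap` characters at a time. This is a simplification — RecursiveCharacterTextSplitter additionally prefers to break at paragraph and sentence boundaries — but the overlap guarantee is the same:

```python
def chunk(text: str, size: int = 512, overlap: int = 128) -> list:
    """Fixed-size sliding window: each chunk repeats the previous
    chunk's last `overlap` characters, so boundary sentences survive."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```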
Confidence Scoring — The Formula
Step 1: Convert L2 distance to similarity score
similarity = 1 / (1 + L2_distance)
L2_distance = 0 → similarity = 1.0 (identical)
L2_distance = 1 → similarity = 0.5
L2_distance → ∞ → similarity → 0.0 (completely different)
Step 2: Compute weighted confidence
confidence = 0.7 × top1_similarity
+ 0.3 × average(top1, top2, top3 similarities)
Using only top1 can be misleading — if the best chunk scores 0.70 but chunks 2 and 3 score only 0.30, the question may be covered only partially. Blending in the top-3 average (30% weight) penalises that poor broader coverage.
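In code, the weighted confidence is a one-liner over the sorted score list (a sketch of the formula above, not the project's exact function):

```python
def rag_confidence(scores: list[float]) -> float:
    """scores: similarity scores sorted best-first (e.g. FAISS top-k)."""
    top3 = scores[:3]
    return 0.7 * scores[0] + 0.3 * (sum(top3) / len(top3))

rag_confidence([0.69, 0.64, 0.61, 0.48])  # ≈ 0.677 → answered (≥ 0.55)
rag_confidence([0.53, 0.47, 0.41, 0.38])  # ≈ 0.512 → fallback (< 0.55)
```

These two score lists are the ones from the high-confidence and fallback data flows later in this post, so you can verify the arithmetic directly.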
Threshold Calibration
Score distribution for this corpus (all-MiniLM-L6-v2):
0.40 0.50 0.55 0.60 0.70
│ │ │ │ │
──┼─────────────┼───────┼───────┼─────────────┼──
│ off-topic │ gap │ │ on-topic │
│ queries │ │ THRESHOLD │
↑
0.55
| Query | Confidence | Result |
|---|---|---|
| Debit card replacement procedures | 0.69 | Answered by RAG |
| Net banking password reset | 0.67 | Answered by RAG |
| KYC document requirements | 0.64 | Answered by RAG |
| SLA response time commitments | 0.61 | Answered by RAG |
| Forex card blocked internationally | 0.52 | Fallback ticket created |
| Completely off-topic question | 0.40 | Fallback ticket created |
The original threshold of 0.65 was too high — KYC queries scored 0.635 and were incorrectly sent to fallback. After empirical testing, 0.55 was the natural split for this corpus.
What Claude Actually Receives
Claude never reads the raw .txt files. It only sees the retrieved chunks:
System:
"Base your answer strictly on the provided context.
Do NOT use external knowledge or make assumptions.
If the context does not contain enough information, say exactly:
'I don't have enough information in our knowledge base to answer this.'"
User:
CONTEXT DOCUMENTS:
[Document 1 — Source: kyc guidelines]
To apply for KYC, customers must submit a valid government-issued
photo ID such as Aadhaar, PAN, or Passport. Address proof must be
less than 3 months old...
[Document 2 — Source: kyc guidelines]
Digital KYC is available via the SecureBank mobile app for
Aadhaar-linked accounts...
────────────────────────────────────────────────────────────
CUSTOMER QUESTION:
What documents do I need for KYC?
Answer using ONLY the context above.
This grounding constraint is what prevents hallucination. Claude cannot say "you need a utility bill" unless that is in the retrieved chunks.
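A sketch of how such a grounded prompt can be assembled (the function name and chunk format here are illustrative, not the actual rag_agent.py code):

```python
def build_grounded_prompt(chunks: list[tuple[str, str]], question: str) -> str:
    """chunks: (source_name, chunk_text) pairs retrieved from the vector index."""
    context = "\n".join(
        f"[Document {i} — Source: {source}]\n{text}"
        for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "CONTEXT DOCUMENTS:\n"
        f"{context}\n"
        f"{'-' * 60}\n"
        "CUSTOMER QUESTION:\n"
        f"{question}\n"
        "Answer using ONLY the context above."
    )

prompt = build_grounded_prompt(
    [("kyc guidelines", "Customers must submit Aadhaar, PAN, or Passport...")],
    "What documents do I need for KYC?",
)
```

Keeping the context assembly in ordinary Python means the grounding contract is enforced by code, not by hoping the LLM remembers an instruction buried in history.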
MCP — Deep Dive
Architecture: Three Layers
┌──────────────────────────────────────────────────────────────────────┐
│ AGENT LAYER (orchestrator.py · ticket_agent.py · rag_agent.py) │
│ │
│ Agents know: tool name + JSON input shape │
│ Agents do: httpx.post("http://localhost:8001/mcp/tool_name") │
│ Agents don't: know SQL, ORM models, or business rules │
└──────────────────────────────┬───────────────────────────────────────┘
│ HTTP / JSON (:8001)
┌──────────────────────────────▼───────────────────────────────────────┐
│ MCP SERVER LAYER (mcp/server.py) │
│ │
│ Receives HTTP requests, validates with Pydantic │
│ Calls tool functions in mcp/tools.py │
│ Returns typed JSON responses │
│ Handles HTTP errors (404 ticket not found, 403 ownership error) │
└──────────────────────────────┬───────────────────────────────────────┘
│ Python function calls
┌──────────────────────────────▼───────────────────────────────────────┐
│ TOOL LOGIC LAYER (mcp/tools.py · db/crud.py) │
│ │
│ All business rules live here: │
│ - Ownership checks before ticket updates │
│ - SLA breach calculation (OPEN > 3 days) │
│ - Email masking before returning customer profile │
│ - Ticket ID deduplication on creation │
│ - Auto-create anonymous customer for demo flows │
└──────────────────────────────┬───────────────────────────────────────┘
│ SQLAlchemy ORM
┌──────────────────────────────▼───────────────────────────────────────┐
│ DATABASE LAYER (SQLite) │
└──────────────────────────────────────────────────────────────────────┘
The 6 MCP Tools — Full Specification
Tool 1: generate_ticket_number
POST /mcp/generate_ticket_number
Input: none
Output: { "ticket_number": "TKT042" }
Logic: 3 random uppercase letters + 3 random digits
26³ × 10³ = 17,576,000 possible combinations
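A sketch of that generation logic as specified (the real tool may differ; sample outputs like "TKT042" are just one of the 17.5M possibilities):

```python
import random
import string

def generate_ticket_number() -> str:
    """3 random uppercase letters + 3 random digits, e.g. 'TKT042'."""
    letters = "".join(random.choices(string.ascii_uppercase, k=3))
    digits = "".join(random.choices(string.digits, k=3))
    return letters + digits

assert 26 ** 3 * 10 ** 3 == 17_576_000  # total possible combinations
```

Collisions are unlikely but possible, which is why create_support_ticket (Tool 2) regenerates on a duplicate ID.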
Tool 2: create_support_ticket
POST /mcp/create_support_ticket
Input: { "customer_id": "CUST001", "issue_text": "...", "ticket_id": "TKT042" }
Output: { "ticket_id": "TKT042", "status": "OPEN", "created_at": "..." }
Logic: 1. Ticket ID collision → regenerate
2. Unknown customer_id → auto-create customer record
3. Persist ticket with status OPEN
Tool 3: get_ticket_status
GET /mcp/get_ticket_status/{ticket_id}
Output: { "status": "IN_PROGRESS", "days_open": 3, "sla_breached": false, ... }
Logic: days_open = (now - created_at).days
sla_breached = status == OPEN AND days_open > 3
Returns 404 if ticket not found
Tool 4: update_ticket_status
POST /mcp/update_ticket_status
Input: { "ticket_id": "TKT042", "customer_id": "CUST001", "new_status": "CLOSED" }
Output: { "old_status": "OPEN", "new_status": "CLOSED", "updated_at": "..." }
Errors: 404 TICKET_NOT_FOUND | 403 TICKET_OWNERSHIP_ERROR
Tool 5: get_customer_profile
GET /mcp/get_customer_profile/{customer_id}
Output: { "customer_name": "Priya Sharma", "segment": "PREMIUM",
"email": "p***@gmail.com" } ← masked for PII
Logic: Never fails — returns "Valued Customer" / "RETAIL" if unknown
Tool 6: log_interaction
POST /mcp/log_interaction
Input: { customer_id, message, classification, route_taken,
response_text, tool_called, rag_confidence, ticket_id }
Output: { "log_id": 47, "status": "LOGGED" }
Design: Fire-and-forget — failure silently swallowed, never blocks user response
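The fire-and-forget contract can be sketched independently of any HTTP client; here `post` is a stand-in for any callable (e.g. httpx.post), not the project's actual helper:

```python
def log_interaction_safely(post, payload: dict) -> None:
    """Best-effort audit logging: any failure is swallowed so the
    customer-facing response is never delayed or blocked."""
    try:
        post("http://localhost:8001/mcp/log_interaction", json=payload, timeout=2.0)
    except Exception:
        pass  # deliberately ignored; logging must never break the reply
```

The trade-off is explicit: a lost log row is acceptable, a blocked customer response is not.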
End-to-End: How an Agent Calls MCP
Here is the full journey for "Please close ticket TKT042":
1. AGENT CODE (close_ticket_node in ticket_agent.py)
result = httpx.post(
"http://localhost:8001/mcp/update_ticket_status",
json={"ticket_id": "TKT042", "customer_id": "CUST001", "new_status": "CLOSED"},
timeout=10.0
)
2. MCP SERVER (mcp/server.py)
@mcp_app.post("/mcp/update_ticket_status")
def update_ticket_status(data: UpdateTicketStatusInput, db: Session):
try:
return _update_ticket_status(db, data)
except ValueError as e:
if "TICKET_NOT_FOUND" in str(e): raise HTTPException(404, ...)
if "TICKET_OWNERSHIP_ERROR" in str(e): raise HTTPException(403, ...)
3. TOOL LOGIC (mcp/tools.py)
ticket = get_ticket(db, "TKT042")
if ticket.customer_id != "CUST001":
raise ValueError("TICKET_OWNERSHIP_ERROR:TKT042")
updated = db_update_ticket_status(db, "TKT042", TicketStatus.CLOSED)
return UpdateTicketStatusOutput(old_status="OPEN", new_status="CLOSED", ...)
4. BACK IN THE AGENT
→ Claude call: "Confirm closure of TKT042 for Priya..."
← "Dear Priya, your ticket TKT042 has been successfully closed..."
5. USER SEES
"Dear Priya, your ticket TKT042 has been successfully closed.
Thank you for banking with SecureBank."
Database Design
Schema
CREATE TABLE customers (
customer_id VARCHAR(50) PRIMARY KEY,
customer_name VARCHAR(100) NOT NULL,
segment VARCHAR(20) DEFAULT 'RETAIL'
CHECK (segment IN ('RETAIL', 'PREMIUM', 'CORPORATE', 'STUDENT')),
email VARCHAR(150),
account_since DATETIME,
preferred_lang VARCHAR(10) DEFAULT 'en',
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE support_tickets (
ticket_id VARCHAR(6) PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL
REFERENCES customers(customer_id) ON DELETE RESTRICT,
issue_text TEXT NOT NULL,
status VARCHAR(20) DEFAULT 'OPEN'
CHECK (status IN ('OPEN', 'IN_PROGRESS', 'RESOLVED', 'CLOSED')),
sla_breached BOOLEAN DEFAULT FALSE,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE interaction_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
customer_id VARCHAR(50) REFERENCES customers(customer_id),
message TEXT NOT NULL,
classification VARCHAR(30) CHECK (classification IN
('positive_feedback', 'negative_feedback', 'query', 'unknown')),
route_taken VARCHAR(300),
response_text TEXT,
tool_called VARCHAR(100),
rag_confidence FLOAT CHECK (rag_confidence IS NULL OR
(rag_confidence >= 0.0 AND rag_confidence <= 1.0)),
ticket_id VARCHAR(6) REFERENCES support_tickets(ticket_id),
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
SLA Breach Logic — Why It's Computed, Not Stored
sla_breached is never precomputed by a batch job and written back as a stored value. It's computed at query time:
# In mcp/tools.py — get_ticket_status
days_open = (datetime.now() - ticket.created_at).days
sla_breached = ticket.status == TicketStatus.OPEN and days_open > 3
If stored, it would be stale the moment a ticket ages past day 3. Computed at query time from created_at, it is always accurate with zero maintenance.
WAL Mode — Why Two Processes Need It
Standard SQLite: Writer locks entire file → readers BLOCKED during write
FastAPI + MCP Server sharing the file = intermittent lock errors
WAL Mode: Writer appends to .wal file
Readers see last committed snapshot
Reader and writer proceed simultaneously
→ No lock contention between the two services
PRAGMA journal_mode = WAL; ← set at connection time
PRAGMA busy_timeout = 5000; ← 5s before "database locked" error
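With the stdlib sqlite3 driver, the two pragmas look like this (SQLAlchemy users would issue the same pragmas from a connect event listener). Note that WAL requires a file-backed database, not :memory::

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "support.db")
conn = sqlite3.connect(db_path)

# journal_mode is persistent: once set, later connections also see WAL
mode = conn.execute("PRAGMA journal_mode = WAL;").fetchone()[0]
conn.execute("PRAGMA busy_timeout = 5000;")  # wait up to 5s on a locked db
print(mode)  # "wal"
```

Both the FastAPI process and the MCP server open the same file; with WAL plus the busy timeout, concurrent reads and a single write proceed without "database is locked" errors.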
Six Complete Data Flows
Path 1: Positive Feedback
Input: "Thanks for sorting out my account issue so quickly!"
classifier_node → "positive_feedback"
positive_feedback → MCP: get_customer_profile/CUST001
→ Claude: warm reply using "Priya Sharma"
log_node → MCP: log_interaction (async)
Latency: ~2s | Side effects: none (read-only)
Path 2: Negative Feedback + Auto-Ticket
Input: "My debit card replacement still hasn't arrived after 3 weeks!"
classifier_node → "negative_feedback"
negative_feedback → MCP: generate_ticket_number() → "TKT042"
→ MCP: create_support_ticket(...) → OPEN ticket
→ MCP: get_customer_profile/CUST001 → "Priya Sharma"
→ Claude: empathy reply with TKT042
log_node → MCP: log_interaction (async)
Latency: ~3s | Side effects: new row in support_tickets
Path 3: RAG Query — High Confidence
Input: "What documents do I need for KYC at SecureBank?"
classifier_node → "query"
query_router_node → { has_ticket_number: false, close_intent: false }
rag_node → FAISS search, scores: [0.69, 0.64, 0.61, 0.48]
confidence = 0.7×0.69 + 0.3×avg(0.69, 0.64, 0.61) = 0.677 ✅
→ Claude: grounded answer from kyc_guidelines chunks
log_node → MCP: log_interaction(rag_confidence=0.677)
Latency: ~3-4s | Side effects: none
Path 4: RAG Fallback — Low Confidence
Input: "Why was my forex card declined in Singapore?"
classifier_node → "query"
query_router_node → { has_ticket_number: false }
rag_node → FAISS scores: [0.53, 0.47, 0.41, 0.38]
confidence = 0.512 < 0.55 ❌ → fallback
fallback_ticket → MCP: generate_ticket_number() → "TKT043"
→ MCP: create_support_ticket(...) → OPEN ticket
→ Claude: apology + specialist escalation message
log_node → MCP: log_interaction(rag_confidence=0.512)
Latency: ~4s | Side effects: new fallback ticket for specialist
Path 5: Ticket Status Lookup
Input: "What is the status of ticket TKT042?"
classifier_node → "query"
query_router_node → { has_ticket_number: true, ticket_number: "TKT042", close_intent: false }
ticket_lookup_node → MCP: get_ticket_status/TKT042
← { status: "IN_PROGRESS", days_open: 2, sla_breached: false }
→ Claude: formats status reply
log_node → MCP: log_interaction
Latency: ~2s | Side effects: none (read-only)
Path 6: Ticket Closure (Ownership-Validated)
Input: "Please close my ticket TKT042"
classifier_node → "query"
query_router_node → { has_ticket_number: true, ticket_number: "TKT042", close_intent: true }
close_ticket_node → MCP: update_ticket_status(TKT042, CUST001, CLOSED)
Ownership check: ticket.customer_id == "CUST001" ✅
← { old_status: "OPEN", new_status: "CLOSED" }
→ Claude: confirmation reply
log_node → MCP: log_interaction
Failure: CUST002 tries to close CUST001's ticket → HTTP 403 → "You can only close your own tickets."
Latency: ~2s | Side effects: ticket status updated to CLOSED
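The routing decisions in Paths 3–6 above hinge on two deterministic checks. A sketch of that logic (the regex and intent keywords are assumptions about query_router_node, not its actual code):

```python
import re

TICKET_RE = re.compile(r"\b[A-Z]{3}\d{3}\b")   # e.g. TKT042
CLOSE_WORDS = ("close", "cancel", "resolve")   # assumed intent keywords

def route_query(message: str) -> dict:
    match = TICKET_RE.search(message.upper())
    return {
        "has_ticket_number": match is not None,
        "ticket_number": match.group(0) if match else None,
        "close_intent": any(w in message.lower() for w in CLOSE_WORDS),
    }

route_query("Please close my ticket TKT042")
# {'has_ticket_number': True, 'ticket_number': 'TKT042', 'close_intent': True}
```

Because these checks are plain string operations rather than LLM calls, this routing step is both free and fully deterministic, which is why routing errors always trace back to the classifier.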
The Dashboard
The React frontend has four pages:
Chat — A standard chat interface where you can switch between customer IDs and see the agent trace panel on every response. The RAG confidence bar is colour-coded: green if the system answered from the knowledge base, orange if it created a fallback ticket.
Tickets — A filterable list of all support tickets with status badges, days open, and SLA indicators.
Logs — A paginated table of every interaction: who sent what, how the system classified it, what route it took, what tool it called, and the final response.
Evaluation — Aggregated metrics over the last N days: classification breakdown, RAG answer rate vs fallback rate, average confidence score, ticket counts by status.
Evaluation: How Do You Know If It Works?
For the LLMOps component of the capstone, the system is evaluated across four dimensions:
1. Classification accuracy — Hold-out set of 30 labelled messages (10 per class). Target: ≥ 90% F1 on all three labels. Ambiguous messages default to query, which routes to the safest path.
2. RAG answer quality — 5 in-domain questions (one per policy document) + 5 deliberately out-of-scope questions. In-domain queries should score ≥ 0.55; out-of-domain should fall below the threshold.
3. Agent routing accuracy — 25 test messages (5 per route). Because routing is deterministic after classification, routing errors are always traceable to classification errors.
4. Response quality — For negative feedback: does the reply acknowledge the issue? Is there an apology? Is the ticket number included? Is the tone professional? All verifiable from the logs table.
Metrics Collected
Every interaction logged to interaction_logs enables:
Classification: total_interactions, count per label, % breakdown
RAG: rag_queries_total, rag_answered, rag_fallback,
avg_rag_confidence, rag_answer_rate
Tickets: tickets_created, tickets_by_status, sla_breached_count
Routing: route_distribution, tool_usage counts
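Given rows shaped like the interaction_logs schema, most of these aggregates fall out of a few lines (field names follow the schema above; splitting "answered" from "fallback" by the 0.55 threshold is an assumption, as the real system records the route taken):

```python
from collections import Counter

def summarise(logs: list[dict]) -> dict:
    by_label = Counter(row["classification"] for row in logs)
    rag_rows = [row for row in logs if row.get("rag_confidence") is not None]
    answered = sum(1 for row in rag_rows if row["rag_confidence"] >= 0.55)
    return {
        "total_interactions": len(logs),
        "by_label": dict(by_label),
        "rag_queries_total": len(rag_rows),
        "rag_answer_rate": answered / len(rag_rows) if rag_rows else None,
    }
```

The same query can of course run as SQL GROUP BYs against interaction_logs; the point is that every dashboard number is recomputable from the raw log rows.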
Logs and Debugging View
Per Interaction:
customer_id → who sent the message
classification → label assigned by classifier
route_taken → "classifier→query_router→rag(conf=0.68)"
tool_called → last MCP tool used
rag_confidence → float or null
ticket_id → created or referenced ticket
Debug use cases:
Classification wrong? → check 'classification', note which messages misclassify
RAG not answering? → filter by route containing 'rag', check rag_confidence
Ticket not created? → filter classification = 'negative_feedback', check tool_called
Routing error? → compare route_taken to expected path for message type
Non-Functional Properties
Latency Budget
| Path | Breakdown | Total |
|---|---|---|
| Positive feedback | 1×Claude + 1×MCP | ~1.5s |
| Negative feedback + ticket | 1×Claude + 3×MCP | ~2.5s |
| RAG query (high confidence) | 2×Claude + FAISS(5ms) | ~3–4s |
| RAG fallback + ticket | 2×Claude + FAISS + 2×MCP | ~4–5s |
| Ticket lookup | 1×Claude + 1×MCP | ~2s |
| Ticket closure | 1×Claude + 1×MCP | ~2s |
Bottleneck: Claude API (1–2s per call). FAISS: ~5ms. MCP: ~5–10ms.
Scalability Path
Step 1: SQLite → PostgreSQL — change DATABASE_URL in .env only
Step 2: FAISS → Pinecone — change rag/retriever.py only (LangChain interface)
Step 3: MCP Server → separate — change MCP_SERVER_URL in .env only
Step 4: FastAPI → multi-worker — uvicorn --workers 4 (graph compiled at import, stateless)
Security Properties
PII Protection: email masked in get_customer_profile (p***@gmail.com)
Ownership checks: ticket updates require customer_id to match ticket owner
HTTP 403 on mismatch
Input validation: Pydantic on all MCP inputs; SQLAlchemy parameterised queries
Missing (production): JWT auth, rate limiting per customer_id,
audit log immutability, encryption at rest
Tech Stack at a Glance
| Layer | Technology | Why |
|---|---|---|
| LLM | Claude (claude-sonnet-4-5) | Reliable JSON output, strong instruction following |
| Agent graph | LangGraph | Explicit routing, typed state, full auditability |
| Tool layer | MCP (FastAPI :8001) | Decoupled business logic, one place for rules |
| Vector search | FAISS + all-MiniLM-L6-v2 | Zero infra, ~5ms local search, swappable |
| API | FastAPI | Auto docs, Pydantic v2, async-ready |
| Database | SQLite (WAL mode) | Zero infra, two-process safe, Postgres-migratable |
| Frontend | React + Zustand + Tailwind | Minimal boilerplate, clean state management |
Extension Points
Adding a New Agent (Example: Escalation)
Step 1: Add label to classifier_agent.py
labels = ["positive_feedback", "negative_feedback", "query", "escalation"]
Step 2: Create escalation_agent.py
def escalation_node(state: AgentState) -> AgentState:
# MCP: notify supervisor system
# Claude: generate handoff message
return { ...state, "final_response": "...", "route_taken": "...→escalation" }
Step 3: Add to orchestrator.py
g.add_node("escalation_node", escalation_node)
g.add_edge("escalation_node", "log_node")
# update route_after_classification to include "escalation"
Step 4: Add MCP tool if needed
POST /mcp/notify_supervisor in mcp/server.py + mcp/tools.py
No other files change.
Adding a New RAG Document
Step 1: Add .txt file to backend/rag/documents/
Step 2: python -m rag.ingest (rebuilds the FAISS index)
Step 3: Restart backend
No code changes needed.
Upgrading to Streaming Responses
@router.post("/api/query/stream")
async def stream_query(request: QueryRequest):
async def generate():
async for token in claude.stream_tokens(state):
yield f"data: {token}\n\n"
return StreamingResponse(generate(), media_type="text/event-stream")
Upgrading to Native Claude Tool Use
response = client.messages.create(
model="claude-sonnet-4-5",
messages=[{"role": "user", "content": customer_message}],
tools=[
{"name": "create_support_ticket", "description": "...", "input_schema": {...}},
{"name": "get_ticket_status", "description": "...", "input_schema": {...}},
{"name": "update_ticket_status", "description": "...", "input_schema": {...}},
{"name": "search_knowledge_base", "description": "...", "input_schema": {...}},
]
)
# Claude decides which tool to call
# Tool execution calls the same MCP HTTP endpoints — MCP Server unchanged
The Bigger Lesson
The most important thing I learned from this project is that architecture matters more than model quality for production AI systems.
A single GPT-4 call with a perfect prompt would answer most of these customer messages correctly most of the time. But "most of the time" isn't acceptable in banking. When it fails — and it will fail — you need to know exactly where and why.
The multi-agent architecture with explicit graph routing, MCP tool abstraction, RAG confidence gating, and async audit logging isn't over-engineering. Each piece solves a specific, concrete failure mode:
- LangGraph solves the "I don't know what the system did" problem
- MCP solves the "business logic is scattered across agents" problem
- RAG + confidence gate solves the "LLM hallucinating bank policies" problem
- Async logging solves the "adding observability slows down responses" problem
Build the simplest thing that works. Then identify the failure modes. Then add exactly the architecture you need to prevent them. That's the process this project followed — and it's the process I'll carry into every AI system I build after this.
What I Would Do Differently
Streaming responses. The current system waits for the full Claude response before sending anything to the browser. For a 4-second response, that's a blank screen for 4 seconds. FastAPI supports SSE; LangGraph supports streaming nodes. This is the highest-impact UX improvement.
Native Claude tool use. Right now, routing decisions are made by explicit Python conditionals. The more scalable approach is to pass all MCP tools to Claude directly and let the model decide which tool to call. The MCP server stays completely unchanged — only the orchestration logic moves from Python to Claude's reasoning. For 6 tools, explicit routing is cleaner. At 50 tools, native tool use becomes necessary.
PostgreSQL for production. SQLite with WAL mode handles this demo well. At real banking scale — concurrent agents, millions of tickets, audit requirements — you'd migrate to PostgreSQL. The only change is DATABASE_URL in the environment config; SQLAlchemy handles the rest.
Authentication layer. Right now, any client can call POST /api/query with any customer_id. A production deployment needs JWT authentication and per-customer authorisation before any agent runs.
Try It Yourself
# 1. Set your Anthropic API key
cp .env.example .env
# Add ANTHROPIC_API_KEY to .env
# 2. Install and build
cd backend && pip install -r requirements.txt
python -m rag.ingest # Build the FAISS index from policy docs
python seed_data.py # Seed demo customers and tickets
# 3. Start three processes
uvicorn mcp.server:mcp_app --port 8001 # Tool layer
uvicorn main:app --port 8000 # API + agents
cd ../ui && npm install && npm run dev # React UI at localhost:5173
Then open localhost:5173, pick a customer ID, and try:
- "Thanks for sorting out my net banking issue!" — positive feedback path
- "My debit card hasn't arrived in 3 weeks" — auto-ticket path
- "What documents do I need for KYC?" — RAG path
- "What is the status of ticket TKT042?" — live lookup path
- "Why was my forex card blocked abroad?" — RAG confidence gate → fallback path
Each message shows a different agent path, different tools called, and different confidence scores in the trace panel.
Live Demo
The video below walks through AI Customer Support end-to-end — sending customer messages, watching the agent graph route in real time, seeing tickets created automatically, and observing the RAG confidence score on policy queries.
The full source code for AI Customer Support is available on GitHub. Refer to the repository for implementation details, setup instructions, and the complete codebase. github.com/nselvar/AICustomerSupport