Julio Sousa

The Problem

Most “RAG demos” you find online are wrappers around ChatGPT. They prove that retrieval-augmented generation works, but they skip everything that makes a real product: tenant isolation, cost observability, async ingestion, hybrid retrieval, citations the user can actually verify. The other extreme — Pinecone + dedicated workers + custom auth + Langfuse cloud — takes months and costs hundreds per month before you have a single user.

I wanted to design and build something in the middle: an actual multi-tenant SaaS architecture, with the patterns a senior engineer would defend in review, running entirely on free tiers and costing literal cents per month for demo traffic. The constraint forces every decision to be deliberate.

My Approach

The architecture is built around four non-negotiables: tenant isolation must be enforced at the database, not in the application; retrieval must be hybrid, because pure vector search loses keyword precision; answers must be verifiable; and every LLM call must be tracked for cost from day one, not bolted on later.

Multi-tenancy via Postgres Row Level Security. Every tenant-scoped table carries org_id. The JWT issued by Supabase Auth contains app_metadata.org_id. RLS policies of the form org_id = (auth.jwt() -> 'app_metadata' ->> 'org_id')::uuid are enforced by the database itself, so application code running under a user’s JWT cannot leak data across tenants: rows from other orgs are filtered out of reads, and writes that violate the policy are rejected. The one caveat is that Supabase’s service-role key bypasses RLS, so it has to stay out of request-handling code. This is a stronger guarantee than app-level filtering, and it survives the kind of bugs that ship to production unnoticed.
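One way to keep the policy identical across every tenant-scoped table is to generate the DDL from a single helper. This is a minimal sketch, not the project's actual migration: the helper name tenantPolicySql, the policy naming scheme, and the table names are assumptions for illustration. Only the predicate is taken from the text above.

```typescript
// Hypothetical helper: emits the RLS DDL for one tenant-scoped table.
// The predicate is the one quoted in the text; everything else is illustrative.
const ORG_PREDICATE =
  "org_id = (auth.jwt() -> 'app_metadata' ->> 'org_id')::uuid";

function tenantPolicySql(table: string): string {
  // The table name is interpolated into SQL, so only accept plain identifiers.
  if (!/^[a-z_][a-z0-9_]*$/.test(table)) {
    throw new Error(`unsafe table name: ${table}`);
  }
  return [
    `alter table ${table} enable row level security;`,
    `create policy ${table}_tenant_isolation on ${table}`,
    `  for all`,
    `  using (${ORG_PREDICATE})`,
    `  with check (${ORG_PREDICATE});`,
  ].join("\n");
}

// Table names below are assumptions, not the project's schema.
const migrationSql = ["documents", "chunks", "usage_events"]
  .map(tenantPolicySql)
  .join("\n\n");
console.log(migrationSql);
```

The using clause governs which existing rows a request can see or modify, and with check applies the same predicate to rows being inserted or updated, so a request cannot write a row into another tenant's org_id.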

Hybrid retrieval with Reciprocal Rank Fusion. Vector search alone misses queries where the user remembers the exact phrasing from a document. Keyword search alone misses semantic paraphrases. The retriever runs both — pgvector cosine similarity (HNSW index) for semantic, Postgres tsvector for full-text — and fuses the two ranked lists via RRF before sending the top-K chunks to the LLM. No re-ranker in the MVP because the cost is not justified yet, but the abstraction is in place to swap one in.
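The fusion step is small enough to show in full. The sketch below assumes each retriever returns an ordered list of chunk ids with no duplicates, and uses k = 60, the constant commonly used with RRF; neither the function name nor the default topK comes from the project.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
// where rank starts at 1 and a chunk missing from a list contributes nothing.
// k = 60 and topK = 8 are assumed defaults, not values from the project.
function reciprocalRankFusion(
  rankedLists: string[][],
  k = 60,
  topK = 8,
): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    // Highest fused score first; ties broken by id so the order is deterministic.
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, topK)
    .map(([id]) => id);
}

const vectorHits = ["c3", "c1", "c7"]; // semantic ranking
const keywordHits = ["c1", "c9", "c3"]; // full-text ranking
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
console.log(fused); // ["c1", "c3", "c9", "c7"]
```

RRF only looks at ranks, never at raw scores, which is why it can combine cosine distances and full-text rank values without any normalization step.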

Streaming answers with verifiable citations. The chat endpoint streams tokens from gpt-4o-mini directly to the browser via the Vercel AI SDK. Inline [N] markers in the model output are parsed in flight; clicking one opens a popover with the exact source chunk and page number. There is no tool-use overhead and no JSON-mode latency penalty. The system prompt is both stable and long enough (≥1024 tokens) that OpenAI’s automatic prompt caching kicks in and gives a 50% discount on the cached portion of the input tokens.
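The awkward part of parsing markers in flight is that a marker can be split across two stream chunks ("[1" then "]"). A minimal sketch of a parser that handles this is below; createCitationParser and the Segment type are hypothetical names, and wiring it into the Vercel AI SDK stream is left out.

```typescript
type Segment =
  | { type: "text"; value: string }
  | { type: "citation"; index: number };

// Stateful parser: feed it stream chunks, get back text and citation segments.
function createCitationParser() {
  let pending = ""; // held-back tail that might be the start of a marker

  function parse(input: string, flushAll: boolean): Segment[] {
    const out: Segment[] = [];
    let text = pending + input;
    pending = "";
    if (!flushAll) {
      // Hold back a trailing "[", "[1", "[12", ... until the next chunk arrives.
      const partial = /\[\d*$/.exec(text);
      if (partial) {
        pending = partial[0];
        text = text.slice(0, text.length - pending.length);
      }
    }
    const marker = /\[(\d+)\]/g;
    let last = 0;
    let m: RegExpExecArray | null;
    while ((m = marker.exec(text)) !== null) {
      if (m.index > last) {
        out.push({ type: "text", value: text.slice(last, m.index) });
      }
      out.push({ type: "citation", index: Number(m[1]) });
      last = m.index + m[0].length;
    }
    if (last < text.length) out.push({ type: "text", value: text.slice(last) });
    return out;
  }

  return {
    push: (chunk: string) => parse(chunk, false),
    flush: () => parse("", true), // call once when the stream ends
  };
}

const parser = createCitationParser();
const segments: Segment[] = [
  ...parser.push("Revenue grew 12% [1"),
  ...parser.push("] and churn fell [2]."),
  ...parser.flush(),
];
console.log(segments);
```

The cost of holding back a possible partial marker is a few characters of display latency at most, and flush guarantees that a stray "[" at the very end of the answer is still emitted as text.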

Async ingestion as a separate concern. Uploads go directly from the browser to Supabase Storage via signed URLs (no file ever touches the Next.js server). The upload triggers an Inngest workflow that parses the PDF with unpdf (no native dependencies, runs in serverless), chunks it with RecursiveCharacterTextSplitter (800 tokens, 120 overlap, page-aware), batches embeddings (100 chunks per OpenAI call, concurrency limited to 3), and updates the document status. The UI tracks progress over Supabase Realtime — no polling.
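The batching and concurrency limits described above can be sketched with two small helpers. This is an illustration under assumptions, not the project's worker: toBatches, mapWithConcurrency, and fakeEmbedBatch are hypothetical names, and fakeEmbedBatch stands in for the real embeddings call so the example runs offline.

```typescript
// Split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `worker` over `inputs` with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  inputs: T[],
  limit: number,
  worker: (input: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new Error("limit must be >= 1");
  const results: R[] = new Array(inputs.length);
  let next = 0;
  async function runner(): Promise<void> {
    while (next < inputs.length) {
      const i = next++; // no race: nothing else runs between the check and the increment
      results[i] = await worker(inputs[i] as T, i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, inputs.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

// Stand-in for the real embeddings request (hypothetical): one vector per text.
async function fakeEmbedBatch(texts: string[]): Promise<number[][]> {
  return texts.map((t) => [t.length]);
}

async function embedAll(chunks: string[]): Promise<number[][]> {
  const batches = toBatches(chunks, 100); // 100 chunks per call, as in the text
  const perBatch = await mapWithConcurrency(batches, 3, fakeEmbedBatch); // 3 in flight
  return ([] as number[][]).concat(...perBatch);
}
```

Inside an Inngest workflow each batch would more likely be its own step so that a failed call retries without re-embedding the batches that already succeeded; the helper above only shows the throttling logic.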

Cost tracking as a first-class table. Every embedding call and every chat completion writes a row to usage_events with tokens in/out and cost in cents. This is not “we’ll add observability later” — it ships in the first migration. You can query cost per tenant, per kind, per day from SQL.
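As a rough sketch of what producing one of those rows involves, the function below turns token counts into a cost in cents. The field names, the buildUsageEvent helper, and the price table are all assumptions: the prices are list prices I believe were current for gpt-4o-mini and text-embedding-3-small, the text does not say which embedding model is used, and any real implementation should read prices from config.

```typescript
type UsageKind = "embedding" | "chat";

interface UsageEvent {
  orgId: string;
  kind: UsageKind;
  tokensIn: number;
  tokensOut: number;
  costCents: number; // fractional cents, so not an integer column
}

// USD per 1M tokens. Assumed values, not taken from the project.
const PRICES_USD_PER_MTOK: Record<
  UsageKind,
  { input: number; cachedInput: number; output: number }
> = {
  embedding: { input: 0.02, cachedInput: 0.02, output: 0 }, // assumed: text-embedding-3-small
  chat: { input: 0.15, cachedInput: 0.075, output: 0.6 }, // assumed: gpt-4o-mini
};

// Hypothetical helper that builds the row to insert into usage_events.
function buildUsageEvent(
  orgId: string,
  kind: UsageKind,
  tokensIn: number,
  tokensOut: number,
  cachedTokensIn = 0,
): UsageEvent {
  if (cachedTokensIn > tokensIn) {
    throw new Error("cached tokens cannot exceed input tokens");
  }
  const price = PRICES_USD_PER_MTOK[kind];
  const usd =
    ((tokensIn - cachedTokensIn) * price.input +
      cachedTokensIn * price.cachedInput +
      tokensOut * price.output) /
    1e6;
  return { orgId, kind, tokensIn, tokensOut, costCents: usd * 100 };
}

// A chat call with 4,000 input tokens (1,024 of them cached) and 170 output tokens.
console.log(buildUsageEvent("org-demo", "chat", 4000, 170, 1024));
```

Because individual calls cost small fractions of a cent, the cost column has to hold fractional values; storing whole cents would round almost every row to zero.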

Architecture

Browser
  └── Next.js 16 (App Router, Server Actions, Streaming)
        ├── @supabase/ssr cookie session (auth)
        ├── /api/chat    → hybrid search → streamText (gpt-4o-mini)
        └── /api/inngest → ingestion worker (steps with retries)

OpenAI                Supabase                          Inngest
  ├ embeddings        ├ Postgres + pgvector (HNSW)      └ async queue
  └ chat completions  ├ tsvector full-text                + retries
                      ├ Storage (private bucket,
                      │   tenant-prefixed paths)
                      ├ Auth (JWT with org_id)
                      └ Realtime (status updates)

Upstash Redis
  └ rate limit per org_id

Architecture & Patterns

The patterns I picked over the obvious alternatives are what I would defend in any architecture review:

RLS as the isolation primitive, not the application layer — guarantees survive code bugs.
Reciprocal Rank Fusion instead of a heavier re-ranker — works well below the volume that would justify Cohere or a local cross-encoder, and the abstraction lets you swap one in without rewriting the retriever.
Inline [N] citation format parsed from the stream instead of OpenAI tool-use — works with token-by-token streaming and avoids the latency penalty of structured-output mode.
Signed URLs for direct browser → Storage upload — Next.js never sees the file, eliminating a memory bottleneck and a source of cold-start lag.
Vendor abstraction at the right level — VectorStore and IngestionQueue interfaces wrap pgvector and Inngest. Migrating to Pinecone or pg-boss is a one-file change, not a rewrite.
Page-aware chunking — page_number is preserved in chunk metadata, so citations can deep-link to the exact page when the PDF viewer ships in v1.2.
Prompt caching by stable prompt design — system prompt deliberately exceeds OpenAI’s 1024-token threshold so the input is cached automatically (50% discount on the cached portion).
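To make the vendor-abstraction item concrete, here is a sketch of what a VectorStore interface could look like, with an in-memory implementation of the kind that is handy in unit tests. The method names, record shapes, and InMemoryVectorStore are assumptions; the source only states that such an interface wraps pgvector.

```typescript
interface ChunkRecord {
  id: string;
  orgId: string;
  pageNumber: number;
  content: string;
  embedding: number[];
}

interface SearchHit {
  id: string;
  pageNumber: number;
  score: number; // cosine similarity, higher is better
}

// Hypothetical shape of the abstraction; a pgvector or Pinecone adapter
// would implement the same two methods.
interface VectorStore {
  upsert(chunks: ChunkRecord[]): Promise<void>;
  search(orgId: string, queryEmbedding: number[], topK: number): Promise<SearchHit[]>;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Brute-force implementation for tests; it filters by orgId in code because
// there is no database here to enforce RLS.
class InMemoryVectorStore implements VectorStore {
  private rows = new Map<string, ChunkRecord>();

  async upsert(chunks: ChunkRecord[]): Promise<void> {
    for (const chunk of chunks) this.rows.set(chunk.id, chunk);
  }

  async search(orgId: string, queryEmbedding: number[], topK: number): Promise<SearchHit[]> {
    return Array.from(this.rows.values())
      .filter((row) => row.orgId === orgId)
      .map((row) => ({
        id: row.id,
        pageNumber: row.pageNumber,
        score: cosineSimilarity(row.embedding, queryEmbedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}
```

Keeping pageNumber on the hit is what lets the citation popover, and later the PDF deep link, work without a second lookup.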

Stack Decisions

Every choice has an explicit alternative I rejected. This is what separates “I used X” from “I picked X over Y for these reasons.”

pgvector over Pinecone / Weaviate / Qdrant. One database to operate, native joins between vectors and metadata, RLS extends to vectors automatically, HNSW recall ~95% at this scale. The migration trigger is documented (>10M chunks per tenant), not implicit.

Shared schema + RLS over schema-per-tenant. Single migration story, single connection pool, single point of operational concern. The trade-off is that a noisy tenant could degrade neighbors — mitigated with per-tenant rate limiting and partial indexes on org_id.

Inngest over Supabase Edge Functions and pg-boss. Inngest gives durable steps with automatic retries and a visual dashboard for free, runs as a Next.js route handler so there’s no extra infrastructure, and is wrapped behind an IngestionQueue interface for portability.

gpt-4o-mini over gpt-4o / Claude Sonnet 4.6. This is a deliberate cost decision. The quality delta is small for grounded RAG (the retrieval does the heavy lifting), and the cost delta is 30×. The retriever is the lever, not the model. The model can be swapped per-workspace in v1.1 if a tenant wants to pay for it.

Inline [N] citation parsing over OpenAI tool-use mode. Tool-use blocks streaming, costs more per call, and is harder to recover from when the model deviates. A regex parser over the stream + few-shot examples in the prompt is more robust at this complexity.

Cost engineering

Demo traffic (50 PDFs ingested at ~10 pages each, plus 200 chat queries per month) runs the entire stack for well under a dollar:

Component	Cost / month
Supabase (Free)	$0.00
Vercel (Hobby)	$0.00
Inngest (Free)	$0.00
Upstash (Free)	$0.00
OpenAI embeddings	~$0.005
OpenAI chat	~$0.14
Total	~$0.15

A four-page test PDF goes from upload to query-ready in ~5 seconds, costing about $0.000004 in embeddings.

These numbers are not projections: they are pulled from the usage_events table that ships with the first migration.

Testing

The project follows a four-layer test pyramid that maps to where bugs actually live:

Unit (Vitest) — chunker, RRF, citation parser, prompt builder, key-scoping helpers
Integration (Vitest + Supabase test container) — full ingestion pipeline against a real PDF, retriever hybrid mode end-to-end
E2E (Playwright) — sign up → upload → wait for ready → ask a question → click a citation
LLM eval (custom) — 20-question golden set scored on groundedness, citation accuracy, answer relevance

Plus seven check-*.ts scripts (check-env, check-db, check-storage, check-vector, check-signup-trigger, check-ingestion, check-chat) that smoke-test the entire stack from a fresh clone — fail fast in setup, before someone wastes a day debugging a missing env var.

What I’d Do Differently

The biggest miss is that I shipped without a re-ranker. RRF produces a good top-K, but a cross-encoder (or Cohere) re-ranking the top 20 would tighten precision noticeably for questions that depend on a single specific chunk. I left the abstraction but skipped the implementation; that order should have been reversed.

The chunking is recursive character-based, which is fine for prose but loses structure on documents with headings, tables, and lists. Semantic chunking (split by paragraph similarity) or structure-aware chunking (split by markdown headings) would be a better default. The migration is mechanical because chunking is wrapped behind a single function.
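A minimal sketch of the structure-aware option, assuming the document has already been converted to markdown (the current pipeline extracts plain text from PDFs, so that conversion is an extra step this example does not cover). The function name and Section type are hypothetical, and "#" lines inside fenced code blocks are not handled.

```typescript
interface Section {
  heading: string; // empty string for text that precedes the first heading
  body: string;
}

// Split markdown into one section per ATX heading ("#" through "######").
function splitByMarkdownHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: "", body: "" };

  const pushCurrent = () => {
    // Skip the implicit leading section if it turned out to be empty.
    if (current.heading !== "" || current.body.trim() !== "") {
      sections.push({ heading: current.heading, body: current.body.trim() });
    }
  };

  for (const line of markdown.split(/\r?\n/)) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      pushCurrent();
      current = { heading: (match[1] ?? "").trim(), body: "" };
    } else {
      current.body += line + "\n";
    }
  }
  pushCurrent();
  return sections;
}

const sample = "Intro text\n# Pricing\nPlans start free.\n## Limits\n- 50 docs\n- 200 queries";
console.log(splitByMarkdownHeadings(sample));
```

Sections that exceed the chunk size would still need a second pass through the existing character-based splitter; the gain is that a chunk never straddles two headings.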

Citations are inline [N] markers parsed from the stream. This works well in 95% of cases but the model occasionally generates a [7] when there are only six chunks in context. A post-stream validation step would catch this, and it would be cheap to add.
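The validation step could be as small as the sketch below. It assumes markers are 1-based and that the number of chunks sent to the model is known when the stream finishes; findInvalidCitations is a hypothetical name.

```typescript
// Return the distinct citation numbers in `answer` that do not correspond
// to any chunk in context. Markers are assumed to be 1-based: [1]..[contextSize].
function findInvalidCitations(answer: string, contextSize: number): number[] {
  const invalid = new Set<number>();
  const marker = /\[(\d+)\]/g;
  let m: RegExpExecArray | null;
  while ((m = marker.exec(answer)) !== null) {
    const n = Number(m[1]);
    if (n < 1 || n > contextSize) invalid.add(n);
  }
  return Array.from(invalid).sort((a, b) => a - b);
}

const invalidMarkers = findInvalidCitations("Margins improved [2][7], see also [0].", 6);
console.log(invalidMarkers); // [0, 7]
```

What to do with the result is a product decision: the UI could strip the invalid markers, render them as plain text, or log them as a quality signal for the eval set.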

Still Proud Of

The fact that the very first migration creates the usage_events table. Cost observability was not an afterthought — every LLM call writes its tokens and estimated cents to Postgres before the response returns to the user. You can answer “what does tenant X cost me per month” with one SQL query, on day one. That kind of decision compounds: every feature added afterward inherits cost visibility for free, and the operational discipline it imposes (“if I can’t measure it, I shouldn’t ship it”) shaped every subsequent design choice in the project.