Next.js 15 + FastAPI: a pragmatic architecture for AI-first products

Building a compliance intelligence platform at traze.ai — and later a regulatory BI tool at zertia.ai — pushed me toward a clean split: TypeScript on the client and server edge, Python in the analytical backend. After shipping this pairing to over 50 enterprises I can say the friction points are well-understood. This article is the architecture document I wish existed when I started.

I'll cover why the split works, how the two services communicate, where auth lives, how to stream AI responses across the boundary, how to keep types consistent without manual duplication, and how to deploy the whole thing without surprising cold-start bills.

Why this pairing works

Next.js 15 and FastAPI solve different problems well. The mistake is treating them as interchangeable and picking one for everything.

Next.js 15 App Router is excellent at UI composition, server-side data fetching close to the database, caching at the CDN edge, and the React component model. With Server Components you get zero-JS by default, rendering HTML at the server with direct Prisma calls when the query is simple. With tRPC you get end-to-end type safety for mutations. Vercel's edge network handles deployment, preview environments, and ISR without any operational overhead.

FastAPI is excellent at everything Python owns: LangChain, LlamaIndex, NumPy, Pandas, scikit-learn, Hugging Face transformers, Azure OpenAI SDK. Python's async story is mature enough for I/O-bound AI workloads. Pydantic v2 gives you a validation layer that is just as serious as Zod. And when you need to drop into a C extension for tokenisation or a GPU kernel for inference, Python is the right host language.

The rule is simple: UI and edge logic in Next.js, AI and data-science logic in FastAPI. Never put a LangChain chain in a Next.js API route. Never put React state management in Python.

Architecture overview

The diagram below shows a typical request for an AI-powered compliance check: the user submits a document, the UI posts to a Next.js Server Action (for plain mutations) or a Route Handler (for streaming), that handler calls the FastAPI service, and a streaming response flows back through the Route Handler.

┌─────────────────────────────────────────────────────────────┐
│                           Browser                           │
│       React Client Component (useChat / EventSource)        │
└───────────────────────┬─────────────────────────────────────┘
                        │ HTTPS (stream or fetch)
┌───────────────────────▼─────────────────────────────────────┐
│                  Next.js 15 (Vercel Edge)                   │
│  ┌────────────────┐   ┌─────────────────────────────────┐   │
│  │ Server Actions │   │ Route Handlers (/api/stream)    │   │
│  │ (mutations)    │   │ (proxy SSE to the client)       │   │
│  └────────┬───────┘   └────────────────┬────────────────┘   │
│           │                            │                    │
│  ┌────────▼────────────────────────────▼────────────────┐   │
│  │         Internal fetch() / tRPC HTTP client          │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │ Bearer JWT                      │
└───────────────────────────┼─────────────────────────────────┘
                            │ Private network (VNet / Railway)
┌───────────────────────────▼─────────────────────────────────┐
│                    FastAPI (Python 3.12)                    │
│ ┌─────────────────────┐  ┌───────────────────────────────┐  │
│ │ /v1/analyse POST    │  │ /v1/analyse/stream POST SSE   │  │
│ └──────────┬──────────┘  └──────────────┬────────────────┘  │
│            └──────────────┬─────────────┘                   │
│                   ┌───────▼────────┐                        │
│                   │ LangChain /    │                        │
│                   │ Azure OpenAI   │                        │
│                   └───────┬────────┘                        │
│                   ┌───────▼────────┐                        │
│                   │ PostgreSQL /   │                        │
│                   │ Azure AI Srch  │                        │
│                   └────────────────┘                        │
└─────────────────────────────────────────────────────────────┘

Two things are worth noting. First, the FastAPI service is not publicly routable: it sits inside an Azure Virtual Network or a Railway private network. Only the Next.js deployment can reach it, which reduces the attack surface considerably. Second, the Next.js Route Handler at /api/stream acts as a thin proxy: it forwards the SSE stream from FastAPI directly to the browser without buffering. Streaming does not remove the Vercel function timeout, but the user sees tokens as soon as they arrive and the limit applies to the total stream duration, which is configurable.

Auth boundary: where JWT flows

Auth lives in Next.js. We use NextAuth (Auth.js v5) backed by the same PostgreSQL database Prisma reads from. On sign-in, Auth.js issues a session cookie. For requests that need to reach FastAPI we mint a short-lived JWT from the session data:

// lib/fastapi-token.ts
import { SignJWT } from 'jose';
// Auth.js v5 replaces getServerSession(authOptions) with auth().
// Assumes '@/lib/auth' exports `auth` from its NextAuth(...) call.
import { auth } from '@/lib/auth';

const secret = new TextEncoder().encode(process.env.FASTAPI_JWT_SECRET);

export async function getFastAPIToken(): Promise<string> {
  const session = await auth();
  if (!session?.user?.id) throw new Error('Unauthenticated');

  return new SignJWT({
    sub: session.user.id,
    org: session.user.orgId,
    role: session.user.role,
  })
    .setProtectedHeader({ alg: 'HS256' })
    .setIssuedAt()
    .setExpirationTime('2m') // short-lived: one request
    .sign(secret);
}

The FastAPI side validates the JWT with the same secret on every request. The token is never stored in a cookie or local storage — it is generated server-side, used once for the outgoing fetch, and discarded.

On the FastAPI side, a reusable dependency handles validation:

# auth.py
from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from jose import JWTError, jwt  # python-jose
from pydantic import BaseModel, ValidationError
from app.config import settings

bearer_scheme = HTTPBearer()

class TokenPayload(BaseModel):
    sub: str
    org: str
    role: str

def get_current_user(
    credentials: HTTPAuthorizationCredentials = Depends(bearer_scheme),
) -> TokenPayload:
    try:
        # decode() verifies the signature and, by default, the exp claim
        payload = jwt.decode(
            credentials.credentials,
            settings.fastapi_jwt_secret,
            algorithms=["HS256"],
        )
        return TokenPayload(**payload)
    except (JWTError, ValidationError):
        # a correctly signed token with missing claims is a 401, not a 500
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid token",
        ) from None
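
To make the contract between the two snippets above concrete, here is a minimal stdlib-only sketch of what an HS256 check involves: recompute the HMAC over header.payload, compare it in constant time, then enforce exp. The helper names (sign_hs256, verify_hs256) are hypothetical and exist only for illustration; the real services should keep using jose and python-jose as shown.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    # JWTs use URL-safe base64 with the trailing '=' padding stripped.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the encoder stripped.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature (hypothetical helper, illustration only)."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        _b64url_encode(json.dumps(obj, separators=(",", ":")).encode("utf-8"))
        for obj in (header, claims)
    ]
    signing_input = ".".join(parts)
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if signature and expiry are valid, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError as exc:  # wrong segment count, bad base64, or bad JSON
        raise ValueError("malformed token") from exc

    # Pin the algorithm: never let the token header choose it.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    if not isinstance(claims, dict):
        raise ValueError("malformed claims")

    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")

    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The `now` parameter is there only so expiry can be tested deterministically. Note that a two-minute expiry limits the replay window but does not make the token single-use.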

Streaming AI responses from FastAPI to Next.js

This is the part that catches people out. LangChain's .astream() yields tokens as they arrive from the model. FastAPI's StreamingResponse wraps that generator in an HTTP response. But getting those tokens to the browser in real time requires careful handling on the Next.js side.

FastAPI: the SSE endpoint

# routers/analyse.py
import json
from collections.abc import AsyncIterator
from fastapi import APIRouter, Depends
from fastapi.responses import StreamingResponse
from langchain_openai import AzureChatOpenAI
from langchain_core.messages import HumanMessage
from app.auth import TokenPayload, get_current_user
from app.schemas import AnalyseRequest

router = APIRouter(prefix="/v1", tags=["analyse"])

llm = AzureChatOpenAI(
    azure_deployment="gpt-4o",
    api_version="2024-10-01-preview",
    streaming=True,
)

async def _token_stream(prompt: str) -> AsyncIterator[str]:
    """Yield SSE-formatted chunks from the LLM."""
    async for chunk in llm.astream([HumanMessage(content=prompt)]):
        token = chunk.content
        if token:
            # SSE format: data: {json}\n\n
            yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"

@router.post("/analyse/stream")
async def analyse_stream(
    body: AnalyseRequest,
    user: TokenPayload = Depends(get_current_user),
) -> StreamingResponse:
    prompt = _build_prompt(body, user.org)
    return StreamingResponse(
        _token_stream(prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable Nginx buffering
        },
    )

def _build_prompt(body: AnalyseRequest, org_id: str) -> str:
    return (
        f"Analyse the following document for regulatory compliance.\n"
        f"Organisation: {org_id}\n\n"
        f"Document:\n{body.content}"
    )

The X-Accel-Buffering: no header is important when there is an Nginx reverse proxy in front of FastAPI — without it, Nginx buffers the entire response before forwarding it, killing the streaming effect. Azure Container Apps does not add Nginx by default, but Railway does.
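
The SSE framing used by the endpoint can be exercised without a model or a web server. A minimal sketch, assuming a hypothetical fake_llm_tokens() generator standing in for llm.astream(); sse_frames() mirrors the framing logic of _token_stream above.

```python
import asyncio
import json
from collections.abc import AsyncIterator


async def fake_llm_tokens() -> AsyncIterator[str]:
    # Hypothetical stand-in for llm.astream(). It includes an empty chunk,
    # which the framer should skip rather than send as an empty event.
    for token in ["Clause 4 ", "", "looks risky.\n"]:
        await asyncio.sleep(0)  # hand control back to the loop, like a network read
        yield token


async def sse_frames(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each non-empty token in one SSE frame, then send a terminator."""
    async for token in tokens:
        if token:
            # json.dumps escapes newlines, so a token cannot split a frame.
            yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"


async def collect() -> list[str]:
    return [frame async for frame in sse_frames(fake_llm_tokens())]


frames = asyncio.run(collect())
for frame in frames:
    print(repr(frame))
```

Encoding each token as JSON is what keeps the stream parseable: a raw newline inside a data: line would otherwise end the frame early.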

Next.js: consuming the stream in a Route Handler

A Next.js Route Handler proxies the FastAPI SSE stream to the browser. The key is to pipe the readable stream directly without awaiting the full body.

// app/api/analyse/stream/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { getFastAPIToken } from '@/lib/fastapi-token';

export async function POST(req: NextRequest) {
  const body = await req.json();
  const token = await getFastAPIToken();

  const upstream = await fetch(
    `${process.env.FASTAPI_BASE_URL}/v1/analyse/stream`,
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(body),
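      // `duplex` is only required when the request body is itself a stream;
      // it is a no-op for this JSON string, and the response streams either way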
      duplex: 'half',
    } as RequestInit
  );

  if (!upstream.ok) {
    return NextResponse.json(
      { error: 'Upstream error' },
      { status: upstream.status }
    );
  }

  // Pipe the ReadableStream directly to the client response
  return new Response(upstream.body, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}

On the client, a React component reads the stream using the EventSource API or by manually reading the response body as a ReadableStream. For most use cases the latter is simpler because EventSource does not support POST requests.
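
Reading the body manually means handling network chunks that do not line up with SSE frames. The sketch below shows the buffering logic in Python, to match the backend examples; a browser reader would follow the same steps with getReader() and TextDecoder. SSEBuffer is a hypothetical helper, and it assumes the frame format emitted by the FastAPI endpoint above (a data: line, \n\n terminators, a [DONE] sentinel).

```python
import json


class SSEBuffer:
    """Accumulate raw text chunks and emit decoded tokens frame by frame."""

    def __init__(self) -> None:
        self._pending = ""
        self.done = False

    def feed(self, chunk: str) -> list[str]:
        """Return the tokens completed by this chunk (possibly none)."""
        self._pending += chunk
        tokens: list[str] = []
        # A frame is complete only once its blank-line terminator has arrived.
        while "\n\n" in self._pending and not self.done:
            frame, self._pending = self._pending.split("\n\n", 1)
            for line in frame.split("\n"):
                if not line.startswith("data:"):
                    continue  # ignore comments and other SSE fields
                payload = line[len("data:"):].strip()
                if payload == "[DONE]":
                    self.done = True
                    break
                tokens.append(json.loads(payload)["token"])
        return tokens


buffer = SSEBuffer()
received: list[str] = []
# The chunks deliberately split one frame mid-payload and another mid-terminator.
chunks = [
    'data: {"tok',
    'en": "Hello"}\n\ndata: {"token": " world"}\n',
    "\ndata: [DONE]\n\n",
]
for chunk in chunks:
    received.extend(buffer.feed(chunk))

print("".join(received))
```

Parsing only after the terminator arrives is the important design choice: calling json.loads on each raw chunk works in local testing and then fails intermittently once a proxy re-chunks the stream.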

Type safety across the boundary

The single biggest source of bugs in a polyglot stack is the gap between what the Python service sends and what the TypeScript consumer expects. The answer is a build step that generates Zod schemas from Pydantic models.

We use pydantic-to-typescript to emit a TypeScript interface file, then wrap the interfaces in Zod schemas for runtime validation. An alternative is to export the JSON Schema that Pydantic v2 produces (model_json_schema()) and feed it to a JSON Schema to Zod converter. The pipeline runs in CI on every change to the Python schemas directory.

# schemas.py — single source of truth for the API contract
from pydantic import BaseModel, Field
from enum import StrEnum

class RiskLevel(StrEnum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class AnalyseRequest(BaseModel):
    content: str = Field(min_length=1, max_length=50_000)
    document_type: str = Field(default="contract")

class Finding(BaseModel):
    clause: str
    risk: RiskLevel
    explanation: str
    confidence: float = Field(ge=0.0, le=1.0)

class AnalyseResponse(BaseModel):
    findings: list[Finding]
    summary: str
    processed_at: str  # ISO 8601

// generated — do not edit by hand
// run: make generate-types
import { z } from 'zod';

export const RiskLevel = z.enum(['low', 'medium', 'high']);
export type RiskLevel = z.infer<typeof RiskLevel>;

export const Finding = z.object({
  clause: z.string(),
  risk: RiskLevel,
  explanation: z.string(),
  confidence: z.number().min(0).max(1),
});
export type Finding = z.infer<typeof Finding>;

export const AnalyseResponse = z.object({
  findings: z.array(Finding),
  summary: z.string(),
  processed_at: z.string().datetime(),
});
export type AnalyseResponse = z.infer<typeof AnalyseResponse>;

export const AnalyseRequest = z.object({
  content: z.string().min(1).max(50_000),
  document_type: z.string().default('contract'),
});
export type AnalyseRequest = z.infer<typeof AnalyseRequest>;

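The generated file is committed to the repo. CI regenerates it on every run and fails if the output differs from the committed copy (for example, make generate-types followed by git diff --exit-code), so a Pydantic change cannot merge without a regenerated schema. Once the schemas are regenerated, any TypeScript code that no longer matches them fails the type-check step. This is a mechanical guarantee that the two sides stay in sync, and parsing responses with the Zod schemas catches whatever still slips through at runtime.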
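
The drift check itself is small. A minimal sketch, assuming the CI job already has the committed file contents and the freshly generated output as strings; schema_drift() is a hypothetical helper, and how the regenerated text is produced is outside the sketch.

```python
import difflib


def schema_drift(committed: str, regenerated: str, name: str = "schemas.ts") -> str:
    """Return a unified diff, or an empty string when the two texts match."""
    diff = difflib.unified_diff(
        committed.splitlines(keepends=True),
        regenerated.splitlines(keepends=True),
        fromfile=f"committed/{name}",
        tofile=f"regenerated/{name}",
    )
    return "".join(diff)


# Example: a new enum member was added in Python but the file was not regenerated.
committed = "export const RiskLevel = z.enum(['low', 'medium', 'high']);\n"
regenerated = "export const RiskLevel = z.enum(['low', 'medium', 'high', 'critical']);\n"

drift = schema_drift(committed, regenerated)
exit_code = 1 if drift else 0  # a CI script would pass this to sys.exit()
print(drift or "schemas in sync")
```

Printing the diff rather than just failing tells the reviewer which model changed, which is usually all they need to fix the build.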

Deployment: Next.js on Vercel, FastAPI on Azure Container Apps

Next.js deploys to Vercel — the zero-config path is correct here. Preview environments, ISR invalidation, Edge Middleware, and the global CDN are all handled. There is nothing to configure beyond environment variables.

FastAPI deploys to Azure Container Apps. We chose this over AKS because it abstracts cluster management while still supporting custom VNet integration, Dapr, and scale-to-zero with configurable minimum replicas. Our Azure data-science workloads (Azure ML, Azure OpenAI) sit in the same subscription, so managed identity and role assignments are straightforward.

# Dockerfile
FROM python:3.12-slim AS base

WORKDIR /app

# install uv for deterministic, fast dependency resolution
RUN pip install uv==0.4.10

COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

# runtime stage: application code on top of the dependency layer
FROM base AS runtime

COPY app/ ./app/
COPY alembic/ ./alembic/
COPY alembic.ini ./

# non-root user
RUN adduser --disabled-password --gecos "" appuser
USER appuser

EXPOSE 8000

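# call the venv's uvicorn directly: `uv run` may try to re-sync the
# root-owned .venv at start-up, which the non-root user cannot write to
CMD ["/app/.venv/bin/uvicorn", "app.main:app", \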
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--workers", "2", \
     "--loop", "uvloop"]

Two workers is intentional. Azure Container Apps scales horizontally by adding replicas, not vertically within a replica. More than two workers per replica creates memory pressure without throughput benefit on a standard consumption workload.

The Container Apps configuration sets minimum replicas to 1 for the production environment. This eliminates scale-from-zero cold starts for paying users; replicas added under load still pay the start-up cost, but the existing replica keeps serving in the meantime. For staging, minimum replicas is 0: the cold start there is acceptable and keeps costs near zero.

What to watch out for

Cold starts

Python containers take longer to start than Node containers. A fresh FastAPI container with LangChain and the Azure OpenAI SDK imported typically takes 4–8 seconds to reach a healthy state. Two mitigations: keep minimum replicas at 1 in production (as above), and use a /health endpoint that returns immediately without touching the database so the load balancer marks the container ready quickly. Do not do any blocking I/O in the module-level scope — defer database connection pool creation to the startup lifecycle event.
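
The deferral pattern can be sketched with the standard library alone. The lifespan function below has the shape FastAPI expects for its lifespan argument; FakePool is a hypothetical stand-in for a real connection pool, and the SimpleNamespace object stands in for the FastAPI app so the sketch runs without the framework installed.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


class FakePool:
    """Hypothetical stand-in for a real async connection pool."""

    def __init__(self) -> None:
        self.open = False

    async def connect(self) -> None:
        await asyncio.sleep(0)  # a real pool would do network I/O here
        self.open = True

    async def close(self) -> None:
        self.open = False


@asynccontextmanager
async def lifespan(app):
    # Runs once at startup, not at import time, so importing the module stays cheap.
    app.state.pool = FakePool()
    await app.state.pool.connect()
    try:
        yield
    finally:
        # Runs once at shutdown.
        await app.state.pool.close()


async def health() -> dict:
    # Readiness probe: no database or model calls, so it answers immediately.
    return {"status": "ok"}


async def demo() -> tuple:
    # With FastAPI you would instead write app = FastAPI(lifespan=lifespan)
    # and register health() with @app.get("/health").
    app = SimpleNamespace(state=SimpleNamespace())
    async with lifespan(app):
        during = app.state.pool.open
        probe = await health()
    return during, app.state.pool.open, probe


during, after, probe = asyncio.run(demo())
```

Keeping the pool on app.state rather than in a module-level global is what makes the import side-effect free, and it also makes the pool easy to replace in tests.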

CORS in production

Because FastAPI is not directly browser-accessible (all requests are proxied through Next.js), CORS barely applies: it is a browser mechanism keyed on the Origin header, and server-to-server fetch() calls from Next.js are not subject to it. The FastAPI service therefore needs no permissive CORS configuration at all. Restricting who can connect is a network concern instead: allow-list the Vercel outbound IP range or, better, rely on VNet integration so the FastAPI service is unreachable from the public internet entirely. Never set allow_origins=["*"] on a service that accepts authenticated JWTs.

Streaming gotchas

Three things break streaming in ways that are non-obvious to debug:

Vercel function timeout. Streaming functions have a different timeout limit than standard functions. On the Pro plan the streaming limit is 300 seconds; limits vary by plan and runtime, so check the current Vercel documentation. If your AI responses routinely exceed two minutes, add export const maxDuration = 300 to the route segment config.
Nginx buffering. Already mentioned: X-Accel-Buffering: no on the FastAPI response headers. Azure Application Gateway is a separate hazard: its backend request timeout closes connections that outlive it, which cuts slow streams short. Set it to 300 seconds to match.
Fetch duplex: 'half'. Node 18+ requires this option when the request body you pass to fetch() is itself a stream; omitting it in that case throws a TypeError. It is not needed to read a streaming response: upstream.body is a ReadableStream either way. The route handler above sends a JSON string, so the option is harmless there but not required. Older TypeScript DOM typings do not include duplex, which is why code that sets it casts the init object to RequestInit.

tRPC vs direct fetch

tRPC is the right choice for calls between the Next.js client and Next.js API routes: it gives you end-to-end type safety with no schema file. It is not the right choice for Next.js ↔ FastAPI, because tRPC infers its types from a router written in TypeScript, and the FastAPI service is Python. Use direct fetch() with the generated Zod schemas for validation at the boundary. Reserve tRPC for internal Next.js client-to-server data loading where you want the React Query integration.

Closing thoughts

The architecture described here is running in production at two companies handling sensitive enterprise data. The split is not about trendy polyglot engineering — it is about using the right tool for each job. TypeScript is better at the UI and edge layer. Python is better at the AI and data-science layer. The boundary between them is narrow, well-defined, and type-safe.

The parts that need the most care are the auth boundary (short-lived JWTs minted per request, no cookies crossing services), the streaming pipeline (proxy the ReadableStream, set the right headers, configure timeouts), and the type generation step (run it in CI, fail loudly on drift). Get those three right and the rest is straightforward.