LangGraph for compliance: building deterministic AI pipelines

Why compliance is a graph problem

Most AI application frameworks optimise for flexibility — you give the model a prompt, it decides what to do next, it calls tools, it reasons. That works well for open-ended assistants. It is a liability in compliance management.

At Zertia and at traze.ai, we process audit evidence for ISO 27001, ISO 9001, and GDPR across more than fifty enterprise clients. The defining characteristic of an audit workflow is that it has a fixed, auditable sequence of steps. An auditor does not want an AI deciding whether to skip the evidence-collection step because it feels confident enough. An auditor wants a system that can prove, with a log, that every step ran, in order, with specific inputs and outputs attached.

This is the core insight that makes LangGraph the right tool for compliance automation. LangGraph models execution as a directed graph of typed state transitions. You define nodes (discrete processing steps), you define edges (the routing logic between them), and you define a shared state schema that each node reads and writes. The graph is compiled once, and its topology is fixed from then on. Execution can only follow edges you declared, and a node is bypassed only when a routing function you wrote sends execution elsewhere. It cannot invent edges. The execution trace is a first-class object.

That last point matters enormously in regulated contexts. ISO 19011, the guideline standard for auditing management systems, expects audit activities to be documented. Under GDPR Article 5(2), the accountability principle, controllers must be able to demonstrate compliance. A LangGraph run with a checkpointer attached supports both: the state after each step is persisted with a timestamp, every state transition is recorded, and the final state carries a complete provenance chain.

Mapping ISO audit workflows to graph nodes

An ISO 27001 audit, simplified, looks like this: plan (define scope, objectives, criteria) → collect evidence (documents, interviews, system exports) → evaluate controls (gap analysis against the Annex A control set) → generate findings (nonconformities, observations) → produce report (audit report with action plan).

Each of those is a graph node. The state schema carries everything the graph knows at any point in time. Here is the state definition we use in production:

Python — state schema

from typing import Annotated, TypedDict, Literal
from datetime import datetime
import operator


class ControlFinding(TypedDict):
    control_id: str            # e.g. "A.9.2.3"
    status: Literal["conformant", "minor", "major", "na"]
    evidence_refs: list[str]
    justification: str
    generated_at: datetime


class AuditState(TypedDict):
    # Immutable inputs — set once, never mutated
    audit_id: str
    standard: Literal["ISO27001", "ISO9001", "GDPR"]
    scope: str
    criteria: list[str]

    # Accumulated across nodes — operator.add reducer merges parallel writes
    evidence: Annotated[list[dict], operator.add]
    findings: Annotated[list[ControlFinding], operator.add]
    trace: Annotated[list[str], operator.add]

    # Mutable workflow state
    current_stage: str
    report_url: str | None
    error: str | None

The Annotated[list, operator.add] pattern is important. LangGraph state updates are partial dicts — a node returns only the keys it modified. The operator.add reducer means that when two parallel nodes both append to findings, the list is merged correctly rather than one result overwriting the other. This is the mechanism that makes parallel evidence-collection safe.
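To make the reducer behaviour concrete, here is a minimal standard-library sketch that mimics how a partial update might be merged into state. It is not LangGraph's implementation: DemoState and apply_update are hypothetical names introduced only for illustration, and the real library does considerably more.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class DemoState(TypedDict):
    findings: Annotated[list[str], operator.add]  # accumulated via the reducer
    current_stage: str                            # plain key: last write wins


def apply_update(state: dict, update: dict) -> dict:
    """Merge a partial update into state, imitating reducer semantics."""
    hints = get_type_hints(DemoState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated[...] types expose their extra arguments as __metadata__.
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)  # list + list
        else:
            merged[key] = value  # no reducer declared: overwrite
    return merged


state = {"findings": [], "current_stage": "start"}
# Two nodes each return only the keys they touched.
state = apply_update(state, {"findings": ["A.9.2.3: minor"]})
state = apply_update(
    state,
    {"findings": ["A.12.4.1: conformant"], "current_stage": "evaluated"},
)
```

Both findings survive because the annotated key is combined with operator.add, while current_stage is simply replaced by the most recent write.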

Node definitions and edge routing

Each node is a plain Python function that receives the current state and returns a partial update. We keep nodes thin: they call a service layer, append to the trace, and return. No business logic inside the node itself.

Python — nodes and graph compilation

from langgraph.graph import StateGraph, END
from datetime import datetime, timezone

from .services import evidence_service, evaluation_service, report_service
from .state import AuditState


def collect_evidence(state: AuditState) -> dict:
    ts = datetime.now(timezone.utc).isoformat()
    docs = evidence_service.fetch_for_scope(
        audit_id=state["audit_id"],
        criteria=state["criteria"],
    )
    return {
        "evidence": docs,
        "current_stage": "evidence_collected",
        "trace": [f"collect_evidence completed at {ts}, {len(docs)} documents"],
    }


def evaluate_controls(state: AuditState) -> dict:
    ts = datetime.now(timezone.utc).isoformat()
    findings = evaluation_service.evaluate(
        evidence=state["evidence"],
        standard=state["standard"],
    )
    return {
        "findings": findings,
        "current_stage": "controls_evaluated",
        "trace": [f"evaluate_controls completed at {ts}, {len(findings)} findings"],
    }


def route_after_evaluation(state: AuditState) -> str:
    # Deterministic routing: if any major nonconformity, escalate.
    has_major = any(f["status"] == "major" for f in state["findings"])
    return "escalate" if has_major else "generate_report"


def generate_report(state: AuditState) -> dict:
    ts = datetime.now(timezone.utc).isoformat()
    url = report_service.generate(state)
    return {
        "report_url": url,
        "current_stage": "report_generated",
        "trace": [f"generate_report completed at {ts}"],
    }


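def escalate_to_reviewer(state: AuditState) -> dict:
    # Placeholder: this node is registered below but was never defined.
    # A real implementation would notify a human reviewer; this version
    # only records the hand-off in the state and the trace.
    ts = datetime.now(timezone.utc).isoformat()
    return {
        "current_stage": "escalated",
        "trace": [f"escalate_to_reviewer completed at {ts}"],
    }


# Graph compilation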
builder = StateGraph(AuditState)

builder.add_node("collect_evidence", collect_evidence)
builder.add_node("evaluate_controls", evaluate_controls)
builder.add_node("generate_report", generate_report)
builder.add_node("escalate", escalate_to_reviewer)

builder.set_entry_point("collect_evidence")
builder.add_edge("collect_evidence", "evaluate_controls")
builder.add_conditional_edges(
    "evaluate_controls",
    route_after_evaluation,
    {"generate_report": "generate_report", "escalate": "escalate"},
)
builder.add_edge("generate_report", END)
builder.add_edge("escalate", END)

audit_graph = builder.compile()

The routing function route_after_evaluation is a plain Python predicate — no LLM involved. This is the key discipline: LLM calls live inside nodes; routing decisions are deterministic Python. The LLM told you status: "major" in the finding, but it is Python that decides where to go next based on that value. The two concerns stay separate and testable independently.
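Because the router is a pure function of state, it can be tested without a graph, a model, or any network access. The sketch below repeats the predicate so it runs standalone; make_state is a hypothetical helper that builds only the key the router reads.

```python
def route_after_evaluation(state: dict) -> str:
    # Same predicate as in the graph definition, repeated here so the
    # example is self-contained.
    has_major = any(f["status"] == "major" for f in state["findings"])
    return "escalate" if has_major else "generate_report"


def make_state(*statuses: str) -> dict:
    """Build a minimal state holding one finding per status given."""
    return {
        "findings": [
            {"control_id": f"C{i}", "status": status}
            for i, status in enumerate(statuses)
        ]
    }


def test_no_major_goes_to_report():
    state = make_state("conformant", "minor", "na")
    assert route_after_evaluation(state) == "generate_report"


def test_any_major_escalates():
    state = make_state("conformant", "major", "minor")
    assert route_after_evaluation(state) == "escalate"


def test_empty_findings_go_to_report():
    assert route_after_evaluation(make_state()) == "generate_report"
```

The third case is worth noticing: an audit with no findings at all is routed straight to the report, which may or may not be what you want. A test like this makes that behaviour an explicit decision.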

Tracing and audit trail patterns

In regulated environments you need more than LangGraph's built-in checkpointing. You need an immutable, append-only trail that maps to your compliance evidence records. The trace field in AuditState is one layer — every node stamps what it did and when. But we add two more layers on top.

LangSmith for operational observability

LangSmith traces give you a full view of every LLM call: input prompt, output, latency, token usage, model, and cost. For a compliance audit you want to be able to show an auditor exactly what the model was asked and what it answered when it classified control A.9.2.3. Set LANGCHAIN_TRACING_V2=true together with a LANGCHAIN_API_KEY and traces are sent to LangSmith automatically. In production at traze.ai we attach {"audit_id": "...", "client_id": "..."} to each run via the metadata key of the run config, so we can retrieve the full trace for any specific audit from the LangSmith UI without grepping logs.
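As a rough sketch of the metadata tagging described above: the run config is an ordinary dict, so it can be built and checked without any tracing backend. build_run_config, the tag value, and the example IDs are hypothetical; the metadata, tags, and configurable keys are the standard run-config keys.

```python
def build_run_config(audit_id: str, client_id: str) -> dict:
    """Return a run config carrying searchable metadata for one audit."""
    return {
        # Key/value pairs attached to the trace and filterable in the UI.
        "metadata": {"audit_id": audit_id, "client_id": client_id},
        # Free-form labels; the value here is only an example.
        "tags": ["compliance-audit"],
        # thread_id matters only when a checkpointer is attached.
        "configurable": {"thread_id": audit_id},
    }


run_config = build_run_config("audit-2024-001", "client-42")

# With a compiled graph available, the config is passed at invocation:
#   audit_graph.invoke(initial_state, config=run_config)
```

Using the audit ID as the thread ID is a design choice in this sketch, not a requirement. It keeps the trace and the checkpointed state addressable by the same key.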

Structured event log with PostgreSQL

Python — append-only audit event logging

import functools
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

from .state import AuditState


@dataclass
class AuditEvent:
    audit_id: str
    node: str
    input_snapshot: dict[str, Any]
    output_snapshot: dict[str, Any]
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    model_version: str | None = None


def make_audit_callback(db, audit_id: str):
    """
    Returns a node wrapper that captures an input snapshot before the node
    runs and writes it, together with the node's output, to audit_events
    once the node returns. The table has an append-only trigger
    enforced at the database level: rows cannot be updated or deleted.

    Usage: collect_evidence = make_audit_callback(db, audit_id)(collect_evidence)
    """
    def wrapper(node_fn):
        @functools.wraps(node_fn)  # preserve the node's name and docstring
        def traced(state: AuditState) -> dict:
            # .get() because current_stage may be unset before the first node
            input_snap = {
                k: state.get(k)
                for k in ("current_stage", "standard", "audit_id")
            }
            # If node_fn raises, no event is written for this invocation.
            result = node_fn(state)
            event = AuditEvent(
                audit_id=audit_id,
                node=node_fn.__name__,
                input_snapshot=input_snap,
                output_snapshot=result,
            )
            # db is assumed to expose an insert-only handle for the table
            db.audit_events.insert(event)
            return result
        return traced
    return wrapper

We store input and output snapshots, not diffs. Storage is cheap; re-constructing what happened when a major client disputes a finding is not. The audit_events table has an append-only trigger enforced at the database level — application code cannot update or delete rows.
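The append-only guarantee can be illustrated with a small runnable sketch. It uses SQLite from the Python standard library as a stand-in for PostgreSQL, so the trigger syntax differs from what a production database would use. The table layout and trigger names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE audit_events (
        id       INTEGER PRIMARY KEY,
        audit_id TEXT NOT NULL,
        node     TEXT NOT NULL,
        payload  TEXT NOT NULL
    );

    -- Reject every UPDATE and DELETE at the database level.
    CREATE TRIGGER audit_events_no_update
    BEFORE UPDATE ON audit_events
    BEGIN
        SELECT RAISE(ABORT, 'audit_events is append-only');
    END;

    CREATE TRIGGER audit_events_no_delete
    BEFORE DELETE ON audit_events
    BEGIN
        SELECT RAISE(ABORT, 'audit_events is append-only');
    END;
    """
)

conn.execute(
    "INSERT INTO audit_events (audit_id, node, payload) VALUES (?, ?, ?)",
    ("audit-1", "collect_evidence", "{}"),
)


def try_mutation(sql: str) -> bool:
    """Return True if the statement was allowed, False if the trigger blocked it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.DatabaseError:
        return False


update_allowed = try_mutation("UPDATE audit_events SET node = 'tampered'")
delete_allowed = try_mutation("DELETE FROM audit_events")
row_count = conn.execute("SELECT COUNT(*) FROM audit_events").fetchone()[0]
```

Inserts succeed while both mutation attempts are rejected. Enforcing this in the database rather than in application code means a bug or a compromised service account with ordinary write access still cannot rewrite history.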

Lessons from production at traze.ai

After running this architecture across 50+ enterprises for a year, the things that surprised us most were not technical.

Human-in-the-loop placement matters more than you expect. We initially put a human review step only at the final report stage. Clients with strict internal sign-off processes needed review after the control evaluation step, before findings were locked. Adding a human_review interrupt node with LangGraph's interrupt_before mechanism was straightforward, but it required us to rethink how we serialised and resumed state across HTTP requests. Use LangGraph's Postgres checkpointer from day one — retrofitting it is painful.

Token cost compounds at scale. An ISO 27001 audit against the 2013 edition covers 114 Annex A controls (the 2022 revision has 93), each requiring evidence retrieval and an LLM evaluation call. At 50 clients per month that is about 5,700 LLM calls per month. We batched controls into groups of 10 per LLM call (structured output with a list return type), which cut costs by ~73% with no measurable quality drop. The batching logic lives in the evaluation node, not in the graph structure, so the graph topology stayed clean.
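The grouping step itself is small. A minimal sketch, assuming controls are identified by plain strings; batch_controls and the CTRL-nnn identifiers are hypothetical, and the real evaluation node would pass each batch to one structured-output LLM call.

```python
from typing import Iterator, Sequence


def batch_controls(
    control_ids: Sequence[str], size: int = 10
) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` control IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(control_ids), size):
        yield list(control_ids[start:start + size])


controls = [f"CTRL-{n:03d}" for n in range(1, 26)]  # 25 example IDs
batches = list(batch_controls(controls))
batch_sizes = [len(batch) for batch in batches]
```

Twenty-five controls become three calls instead of twenty-five, with the last batch holding the remainder. Order is preserved, so each result can be mapped back to its control by position as well as by ID.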

Versioning the graph is a compliance requirement. If you update your control evaluation prompt, every audit run after that change may produce different classifications than runs before it. You must be able to reproduce historical results. We version the compiled graph alongside the model version and store both in the AuditEvent. Deployments only affect new audits; in-flight audits run to completion on the graph version they started with, using LangGraph's checkpointed state.
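One way to derive a version identifier is to hash everything that can change a classification. The sketch below is an assumption about how such a fingerprint could be built, not a description of how it is done at traze.ai. graph_fingerprint and its inputs are hypothetical.

```python
import hashlib
import json


def graph_fingerprint(
    edges: list[tuple[str, str]],
    prompt_template: str,
    model_version: str,
) -> str:
    """Return a short, stable hash of the graph topology, prompt, and model."""
    payload = json.dumps(
        {
            "edges": sorted(edges),  # sort so declaration order is irrelevant
            "prompt": prompt_template,
            "model": model_version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


edges = [
    ("collect_evidence", "evaluate_controls"),
    ("evaluate_controls", "generate_report"),
    ("evaluate_controls", "escalate"),
]
v1 = graph_fingerprint(edges, "Classify control {control_id}.", "model-a")
v2 = graph_fingerprint(edges, "Classify the control {control_id}.", "model-a")
```

Any change to the prompt wording, the model identifier, or the edges produces a different fingerprint, which can be stored with each audit event so that a historical run can be matched to the exact configuration that produced it.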

Pitfalls: non-determinism, hallucinations, and test strategies

The graph structure is deterministic. The LLM calls inside nodes are not. This is the central tension you must manage.

Temperature is not enough

Setting temperature=0 reduces variance but does not eliminate it. OpenAI and Anthropic both document that even at temperature zero, outputs can vary across API versions, model updates, and parallel execution. For compliance purposes, "reduce variance" is not sufficient — you need to validate every LLM output against an explicit schema before allowing it to mutate state.

We use Pydantic models as the contract for every LLM call. The evaluation_service invokes the model through with_structured_output(ControlFindingList), and a ValidationError is raised if the model returns a status value outside the Literal["conformant", "minor", "major", "na"] set. The node catches that error and marks the finding as requiring manual review rather than silently propagating a bad value.
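The fallback behaviour can be shown without any dependencies. The sketch below is a standard-library stand-in for the Pydantic contract, not a replacement for it. validate_finding and the needs_manual_review and rejected_status fields are hypothetical names introduced for the example.

```python
from typing import Literal, get_args

Status = Literal["conformant", "minor", "major", "na"]
ALLOWED_STATUSES = frozenset(get_args(Status))


def validate_finding(raw: dict) -> dict:
    """Accept a finding with a known status, otherwise flag it for review."""
    status = raw.get("status")
    if isinstance(status, str) and status in ALLOWED_STATUSES:
        return {**raw, "needs_manual_review": False}
    # Never let an unknown value reach state: blank it and keep the
    # rejected value so a reviewer can see what the model produced.
    return {
        **raw,
        "status": None,
        "needs_manual_review": True,
        "rejected_status": status,
    }


good = validate_finding({"control_id": "A.9.2.3", "status": "minor"})
bad = validate_finding({"control_id": "A.9.2.3", "status": "mostly fine"})
```

The important property is that a malformed output degrades to a human task instead of an exception that aborts the audit or a value that flows into routing.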

Hallucination in evidence citations

The most damaging failure mode is a model citing evidence that does not exist — fabricating a document reference that supports a "conformant" finding. We mitigate this with grounded generation: evidence passed to the evaluation node is chunked and embedded at ingestion time; the LLM only receives the top-k retrieved chunks as context. Every citation in the output is validated against the set of chunk IDs passed into the prompt. Citations to IDs not in that set are rejected and flagged for manual review.
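The citation check reduces to set membership. A minimal sketch, assuming chunk IDs are strings; split_citations and the example IDs are hypothetical.

```python
def split_citations(
    cited_ids: list[str], provided_chunk_ids: list[str]
) -> tuple[list[str], list[str]]:
    """Separate citations into those present in the prompt and those not."""
    provided = set(provided_chunk_ids)
    valid = [cid for cid in cited_ids if cid in provided]
    fabricated = [cid for cid in cited_ids if cid not in provided]
    return valid, fabricated


# Chunk IDs that were actually passed to the model as context.
context_ids = ["chunk-101", "chunk-102", "chunk-103"]
# Citations the model returned in its finding.
model_citations = ["chunk-102", "chunk-999"]

valid, fabricated = split_citations(model_citations, context_ids)
needs_manual_review = bool(fabricated)
```

Here chunk-999 was never in the prompt, so the finding is flagged. The check only proves a cited chunk existed in context. It does not prove the chunk supports the claim, which is why the evaluation harness still measures citation validity separately.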

Test strategy

We maintain three test layers:

Node unit tests — mock the service layer, assert on the returned state dict. No LLM calls. Run in CI on every push, complete in under 10 seconds.
Graph integration tests — run the full compiled graph against a synthetic audit fixture with mocked LLM responses (deterministic JSON fixtures). Assert on final state and trace contents. These validate routing logic and reducer behaviour.
Evaluation harness — weekly, against a set of 30 real historical audits with known ground-truth findings. Measures finding accuracy, citation validity, and hallucination rate. Results are stored in LangSmith evaluations and compared against the previous week's run before any model upgrade is approved for production.
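A node unit test from the first layer might look like the sketch below. It departs from the listing earlier in the article in one way: the service is injected through a hypothetical make_collect_evidence factory instead of being imported at module level, which makes mocking straightforward. The timestamp is omitted to keep the assertion simple.

```python
from unittest.mock import Mock


def make_collect_evidence(evidence_service):
    """Build the node with its service injected so tests can pass a mock."""
    def collect_evidence(state: dict) -> dict:
        docs = evidence_service.fetch_for_scope(
            audit_id=state["audit_id"],
            criteria=state["criteria"],
        )
        return {
            "evidence": docs,
            "current_stage": "evidence_collected",
            "trace": [f"collect_evidence completed, {len(docs)} documents"],
        }
    return collect_evidence


def test_collect_evidence_returns_partial_update():
    service = Mock()
    service.fetch_for_scope.return_value = [{"id": "doc-1"}, {"id": "doc-2"}]
    node = make_collect_evidence(service)

    update = node({"audit_id": "audit-1", "criteria": ["A.9"]})

    # The node returns only the keys it changed.
    assert set(update) == {"evidence", "current_stage", "trace"}
    assert update["evidence"] == [{"id": "doc-1"}, {"id": "doc-2"}]
    assert update["current_stage"] == "evidence_collected"
    assert "2 documents" in update["trace"][0]
    service.fetch_for_scope.assert_called_once_with(
        audit_id="audit-1", criteria=["A.9"]
    )
```

No LLM and no database are involved, so a test like this is fast enough to run on every push.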

The evaluation harness is the most valuable investment. It took two weeks to build and has caught three regressions in twelve months — each time a model provider silently updated their weights and our classification accuracy dropped by 4–8 points on edge-case controls.

The graph controls the workflow. The model does the reasoning. Keep those responsibilities separate and you can test, audit, and explain each one independently.

That separation is what makes LangGraph the right substrate for compliance automation. The model does not decide what to do next. It informs a decision that deterministic Python makes. Auditors understand Python conditionals. They do not understand transformer attention weights. Build to that constraint and compliance AI becomes something you can actually certify.