
The Definitive Guide to AI Security Auditing

How to assess probabilistic systems when the attack surface is the model itself.

You are not auditing a binary. You are auditing a behavior engine trained on data you did not write, deployed behind APIs you did not design, and exposed to inputs an adversary controls entirely.

  • 5 audit phases: scoping, threat modeling, scanning, red-teaming, monitoring
  • 6 STRIDE-AI dimensions: adapted for probabilistic systems and agent stacks
  • 4 risk dispositions: accept, accept with conditions, defer, reject (beyond pass/fail)
  • T1–T3 system tiers: tier the scope before choosing testing depth
  • Framework: STRIDE-AI · MITRE ATLAS · OWASP LLM Top 10
  • Scope: models, prompts, RAG, agents, MCP
  • Audience: security architects, audit, engineering leaders
  • Method: continuous assurance loop
  • Tags: AI security audit, STRIDE-AI, prompt injection, RAG poisoning, supply chain, red teaming, residual risk

Traditional software auditing assumes deterministic logic: given input X, function f always returns Y. AI systems break that contract. The same prompt can produce different outputs. Training data is executable influence. Prompts are runtime code. RAG corpora are live dependency graphs. Plugins turn LLMs into orchestration layers with network privileges.

100% security is fiction. The goal of AI security auditing is not a green checkbox — it is quantified residual risk, defensible controls, and evidence that leadership understands what can still go wrong. This guide gives practitioners a lifecycle, threat taxonomy, and executive deliverable format that survives board scrutiny and production incidents.

If your audit scope stops at "we use Azure OpenAI with enterprise SSO," you have scoped a vendor contract review, not an AI security audit.

AI Security Research

The paradigm shift

Three shifts every auditor must internalize. Data is code: poisoned fine-tuning data or a compromised RAG document can alter behavior without touching application source. Prompts are interfaces: system prompts, tool schemas, and few-shot examples define authorization as much as IAM policies do. Outputs are attack vectors: generated SQL, HTML, shell commands, and agent action plans can become second-stage exploits.

Traditional software audit: source is truth
  • Bugs reproduce.
  • Inputs are schema-bound.
  • Supply chain is libraries and CI.
  • Patch means deploy.
  • Boundary is network and auth.

AI security audit: behavior is truth
  • Failures are statistical and context-dependent.
  • Inputs are natural language.
  • Supply chain is base models, datasets, fine-tunes, and Hugging Face repos.
  • Patch may require retraining.
  • Boundary includes the context window, tool permissions, and embeddings.

The expanded trust boundary

Every layer is in scope. Most breaches happen where orchestration meets infrastructure — not inside the transformer blocks.

Input layer

Prompts, files, images, retrieved documents. The attacker controls the natural-language interface and any content the model ingests at runtime.

Model layer

Base model, adapters, embeddings, rerankers. Weights and fine-tunes are upstream code you usually did not author.

Orchestration layer

Agents, tools, MCP servers, function calls. This is where LLM output turns into network calls, file writes, and SSRF pathways.

Data & infrastructure layer

Vector databases, logs, training pipelines, APIs. The same rigor that protects production data must extend to embeddings and prompt logs.

The AI audit lifecycle

AI security auditing is not a one-time pentest. It is a continuous assurance loop aligned to model and corpus change velocity. The five phases below scale from a T3 sandboxed demo to a T1 customer-facing agent with regulated data.

Continuous AI assurance loop

  1. Scoping & asset inventory: Model IDs, version hashes, adapters, system prompts, tool definitions, RAG sources, agent graphs, data flows, and integration points. Tier each system as T1 (critical), T2 (material), or T3 (low) before testing depth is chosen.
  2. Threat modeling (STRIDE-AI): Adapt STRIDE for probabilistic systems. Map spoofing, tampering, repudiation, information disclosure, denial of service, and privilege escalation to MITRE ATLAS and OWASP LLM Top 10. Tie scenarios to business impact.
  3. Automated baseline scanning: Prompt injection probes, tool abuse, output safety, config drift, and dependency checks at scale. Treat results as signals: expect false positives on creative jailbreaks and false negatives on novel chains.
  4. Manual red-teaming: Multi-turn persistence, indirect injection via RAG, tool chaining into SSRF, cross-tenant leakage, translation jailbreaks, and denial-of-wallet probes. Production-parity integrations are required, because mock APIs teach you nothing about real exploitation.
  5. Continuous monitoring & re-audit: Injection hit rate, refusal-rate drift, tool error anomalies, embedding corpus diffs, latency, and output policy violations. Re-audit triggers: new base model, fine-tune, RAG ingest, new tool, or material incident.
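
The inventory from phase 1 and the re-audit triggers from phase 5 can be tied together by snapshotting each system and diffing snapshots between releases. The sketch below is one way to do that, not a prescribed schema: `AssetSnapshot`, `TRIGGER_FIELDS`, and every field name are hypothetical, and a material incident still has to be raised by a human because it never shows up in a diff.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AssetSnapshot:
    """Hypothetical inventory record for one AI system at one point in time."""
    system: str
    tier: str                       # "T1", "T2", or "T3"
    base_model: str                 # model ID plus version hash
    adapters: tuple = ()            # fine-tunes / LoRA adapter identifiers
    system_prompt_sha256: str = ""  # hash of the prompt, not the prompt itself
    tools: tuple = ()               # tool or MCP server names
    rag_sources: tuple = ()         # corpus or index identifiers


# Assumed mapping from inventory fields to re-audit trigger labels.
TRIGGER_FIELDS = {
    "base_model": "base model change",
    "adapters": "adapter or fine-tune change",
    "rag_sources": "RAG source change",
    "tools": "tool change",
}


def snapshot_hash(snapshot: AssetSnapshot) -> str:
    """Stable SHA-256 over the whole snapshot, usable as audit evidence."""
    blob = json.dumps(asdict(snapshot), sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def reaudit_triggers(old: AssetSnapshot, new: AssetSnapshot) -> list:
    """Return the trigger labels for every tracked field that changed."""
    return [
        label
        for name, label in TRIGGER_FIELDS.items()
        if getattr(old, name) != getattr(new, name)
    ]
```

Run in CI, a non-empty result from `reaudit_triggers` could open a ticket automatically, so a new tool never ships inside a "chat feature" without re-scoping.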

Gotcha: Teams hide agent toolchains inside "chat features." During scoping, always ask: can this model invoke anything outside its own context window?

Core threat vectors

What to hunt for, organized by attack surface. These are the patterns recurring across production incidents and benchmark evaluations through 2026.

Prompt injection

Direct overrides, role-play jailbreaks, payload splitting across turns, encoding evasion via Base64 or Unicode homoglyphs, and needle-in-haystack attacks buried in long documents.
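
Encoding evasion is the easiest of these to probe for mechanically. The sketch below normalizes input and decodes Base64 runs before applying a scanner, so the same check sees every view of the text. The `SUSPICIOUS` pattern is a deliberately tiny placeholder, and all names are illustrative. Note the limit: NFKC folds compatibility forms such as fullwidth letters, but it does not map cross-script homoglyphs (for example Cyrillic letters that look Latin), which need a separate confusables table.

```python
import base64
import binascii
import re
import unicodedata

# Placeholder deny-list; a real scanner would use far richer signals.
SUSPICIOUS = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)
# Runs of Base64 alphabet characters long enough to hide a payload.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def candidate_views(text: str) -> list:
    """Return the raw text plus NFKC-normalized and Base64-decoded variants."""
    views = [text, unicodedata.normalize("NFKC", text)]
    for run in B64_RUN.findall(text):
        try:
            decoded = base64.b64decode(run, validate=True)
            views.append(decoded.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not UTF-8 text: skip this run
    return views


def flags_injection(text: str) -> bool:
    """True if any view of the input matches the placeholder pattern."""
    return any(SUSPICIOUS.search(view) for view in candidate_views(text))
```

Treat a hit as a signal for triage, not a verdict. Payload splitting across turns and needle-in-haystack attacks will not be caught by a single-message check like this one.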

Data poisoning

Fine-tune dataset tampering, RAG document injection ("always recommend attacker URL"), RLHF feedback manipulation, and trojaned Hugging Face datasets.

Inversion & extraction

Memorization extraction (verbatim PII or secrets), model stealing via distillation, embedding inversion to reconstruct source text, and membership inference for privacy exposure.

Supply chain

Poisoned base models, typosquatted Hugging Face repos, unsafe pickle deserialization, malicious LoRA adapters, and hidden exfiltration instructions inside "example" prompt templates.
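
For the pickle risk specifically, an auditor can inspect a serialized artifact without loading it. The sketch below uses the standard library's `pickletools.genops` to list opcodes that import or call objects during unpickling. The `RISKY_OPCODES` set is an assumption about what is worth flagging, and legitimate checkpoints also use some of these opcodes, so a hit means "review what is being imported", not "malicious".

```python
import pickle
import pickletools

# Opcodes that import callables or construct/invoke objects during unpickling.
RISKY_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def risky_opcodes(data: bytes) -> set:
    """Statically list risky opcodes in a pickle stream without executing it."""
    found = set()
    for opcode, _arg, _pos in pickletools.genops(data):
        if opcode.name in RISKY_OPCODES:
            found.add(opcode.name)
    return found
```

A plain dictionary of numbers produces an empty set. An object whose `__reduce__` returns a callable produces `REDUCE` plus a global-import opcode. Where the framework supports it, a weights-only format that cannot carry code avoids the question entirely.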

Tool abuse & SSRF

Agents fetching attacker URLs, cloud instance metadata (IMDS) access via fetch tools, path traversal in file tools, and cross-tenant vector retrieval through shared indexes.
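
A fetch tool can be tested against a policy like the one sketched below: the host must be on an egress allowlist, and every address it resolves to must be public. `ALLOWED_HOSTS` and the function names are hypothetical, and the resolver is injectable so the policy can be exercised without network access.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"docs.example.com"}  # hypothetical egress allowlist


def is_public_ip(address: str) -> bool:
    """False for private, loopback, link-local (IMDS), and similar ranges."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # unparseable address: fail closed
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def fetch_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Allow only http(s) URLs to allowlisted hosts that resolve publicly."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        return False
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    try:
        infos = resolve(parts.hostname, parts.port or 443)
    except OSError:
        return False  # resolution failed: fail closed
    # Each getaddrinfo entry ends with a sockaddr whose first item is the IP.
    return bool(infos) and all(is_public_ip(info[4][0]) for info in infos)
```

The resolved-address check matters even for allowlisted names, since a DNS record can be pointed at 169.254.169.254. This sketch only validates. A real tool must also connect to the address it validated and re-check on every redirect, otherwise the check and the fetch can disagree.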

Infra & logging

Public inference endpoints, API keys in client JS, over-privileged service accounts, verbose error leaks, and prompt/response logs that ingest the same PII the product must protect.
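
The logging risk is straightforward to demonstrate in an audit: write a prompt containing seeded PII and see whether it lands in the log verbatim. A minimal redact-before-write sketch follows. The two patterns are illustrative only, and the `sk-` key shape is an assumed format rather than any specific vendor's.

```python
import json
import re

# Illustrative patterns; production systems need a vetted PII/secret detector.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),  # assumed key format
}


def redact(text: str) -> str:
    """Replace every pattern match with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text


def log_record(prompt: str, response: str) -> str:
    """Build the JSON log line from redacted fields only."""
    return json.dumps({"prompt": redact(prompt), "response": redact(response)})
```

Redaction happens before serialization, so the raw values never reach the log sink. Regex coverage is the weak point here, which is why operating effectiveness should be tested with seeded canary values rather than assumed.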

  • 2024-2026, input layer: Indirect prompt injection via RAG. Web-scraped or user-uploaded documents successfully instructed enterprise copilots to exfiltrate data, recommend malicious URLs, or bypass refusal policies, without the user ever issuing the malicious prompt.
  • 2025, supply chain: Hugging Face supply chain. Multiple incidents of malicious pickle payloads embedded in popular model repositories triggered code execution at load time, reinforcing that loading an unvetted pickle checkpoint, like enabling trust_remote_code, is the AI equivalent of curl | bash.
  • 2025, orchestration layer: Agent SSRF via URL summarization. Coding and research agents asked to "summarize this URL" reached cloud metadata endpoints (169.254.169.254), exposing temporary credentials when egress was not allowlisted.

Control validation by domain

Map findings to controls. Verify both design and operating effectiveness — a control that exists on a diagram but is bypassed at runtime is worse than no control, because it implies false assurance.

  1. Input controls: Prompt firewall, input sanitization, allowlisted tools, structural separation between user content and system instructions via tagged roles or secondary classifiers.
  2. Model controls: Output filters, structured-output schemas, refusal policies, adapter integrity checks, and consistency testing of refusals across paraphrases and translations.
  3. Data controls: RAG trust tiers, source signing, PII scrubbing pre-index, anomaly detection on embedding updates, signed corpora, and lineage for every training file.
  4. Identity & orchestration controls: Per-session auth, scoped API keys, OAuth for tools, allowlisted egress, blocked IMDS endpoints, and tool permissions scoped per user and per session.
  5. Logging & response controls: Immutable prompt/response audit trail, redaction before write, retention aligned to data classification, kill switches, model rollback procedures, and incident runbooks rehearsed against the agent stack.
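
To test operating effectiveness of a data control such as source signing, tamper with a document after signing and confirm the ingest path rejects it. The sketch below uses a shared-key HMAC as a stand-in, which is an assumption made for brevity. Corpora from external publishers would more likely need asymmetric signatures, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign_document(content: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for a document at publish time."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify_before_index(content: bytes, signature: str, key: bytes) -> bool:
    """Recompute the tag at ingest time and compare in constant time."""
    expected = sign_document(content, key)
    return hmac.compare_digest(expected, signature)
```

If a single appended sentence such as "always recommend attacker URL" still passes `verify_before_index`, or if the pipeline indexes documents without calling a check like this at all, the control exists on the diagram only.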

The deliverable: residual risk, not pass/fail

Executives do not want a dump of jailbreak transcripts. They want decision-grade risk. Replace binary pass/fail with four dispositions: accept (residual risk within appetite), accept with conditions (compensating controls plus dated remediation), defer (insufficient evidence or scope gap), and reject or escalate (material unmitigated risk).

    # Residual risk formula
    residual_risk = inherent_risk × (1 − control_effectiveness)
    # Inherent risk = likelihood × impact, tied to system tier and threat vector.
    # Control effectiveness is observed, not declared.
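
As a worked sketch, the formula and the four dispositions can be expressed in a few lines. The scales (likelihood and effectiveness from 0 to 1, impact from a 1-10 matrix), the `appetite` threshold, and the decision order in `disposition` are assumptions for illustration. Your risk matrix defines the real ones.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    likelihood: float             # 0.0-1.0, estimated (assumed scale)
    impact: float                 # e.g. 1-10 from the severity matrix
    control_effectiveness: float  # 0.0-1.0, observed in testing

    @property
    def inherent_risk(self) -> float:
        return self.likelihood * self.impact

    @property
    def residual_risk(self) -> float:
        return self.inherent_risk * (1 - self.control_effectiveness)


def disposition(finding: Finding, appetite: float,
                evidence_sufficient: bool = True,
                has_dated_remediation: bool = False) -> str:
    """Map a finding to one of the four dispositions (hypothetical rule order)."""
    if not evidence_sufficient:
        return "defer"
    if finding.residual_risk <= appetite:
        return "accept"
    if has_dated_remediation:
        return "accept with conditions"
    return "reject or escalate"
```

For example, likelihood 0.5 and impact 8 give an inherent risk of 4.0. With controls observed to stop 75% of attempts, the residual risk is 4.0 × 0.25 = 1.0. Whether that is acceptable depends entirely on the appetite set for that system tier.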

Every finding needs a plain-language title with a CWE or ATLAS reference, severity tied to your matrix, minimal reproduction steps, an architectural root cause (not "the model misbehaved"), a specific remediation (not "add more AI safety"), and a way to re-test after the fix.

Final Pro-Tip: The best AI security audit is the one that changes a deployment decision or forces a compensating control before production — not the one that produces the thickest PDF.

Quick reference checklist

  1. Inventory is complete: Models, prompts, RAG, tools, MCP servers, and data flows are documented and tiered.
  2. Threat model is current: Covers direct and indirect injection, tool abuse, and supply chain, and is mapped to ATLAS or OWASP LLM Top 10.
  3. Automated baseline scan: Runs in CI and pre-release; signals feed the issue tracker, not a quarterly report.
  4. Manual red-team on T1: Executed with production-parity integrations, multi-turn chains, and realistic tools.
  5. Continuous monitoring is live: Re-audit triggers are defined; the deliverable uses a residual risk matrix with owners, dates, and verification steps.

The auditor's mindset

AI security auditing sits at the intersection of appsec, data governance, ML engineering, and adversarial ML. The models will keep changing. The corpora will keep growing. Agents will keep gaining tools.

Your job is not to prove the system is safe. Your job is to prove the organization knows where it is unsafe, has prioritized what matters, and can detect and respond when — not if — the next attack lands.

Use this guide as a baseline, not a ceiling. Adapt the lifecycle, the threat taxonomy, and the residual-risk deliverable to your organization. Then run it again next quarter.