What We Learned Building Reliable AI for Security Analysis
Lessons from building a production AI system for threat modeling: why specialization matters, how to verify AI output, and principles for reliable security analysis at scale.
Dave Barton
Co-founder
Why multi-agent?
When we started building ThreatKrew, the obvious first approach was a single large language model with a detailed prompt: “Here’s an architecture, give me a threat model.” The output was… fine. It hit some real threats, missed others, and occasionally invented things that sounded plausible but weren’t grounded in the actual architecture.
The problem is scope. Comprehensive security analysis isn’t a single task. It requires multiple distinct types of thinking: understanding architecture at a structural level, reasoning about security assumptions, applying threat methodologies systematically, and mapping recommendations to control frameworks. Asking a single model to do all of that in one pass is like asking one person to be an architect, a security analyst, an auditor, and a technical writer simultaneously. Each role demands different expertise and a different focus.
The breakthrough came when we stopped thinking of threat modeling as one problem and started thinking of it as multiple specialized problems. We needed agents that each excel at their specific task, with independence to focus entirely on doing one thing well. That independence forced rigor — each agent couldn’t hide behind vague reasoning or borrow assumptions from other stages.
A philosophy, not a blueprint
The core architecture follows a principle we discovered through iteration: specialization works, but only if the specialists genuinely check each other’s work.
The pipeline orchestrates multiple types of analysis in sequence. First, it builds a deep understanding of the architecture itself — the systems, the connections, the trust boundaries. Then it surfaces implicit assumptions, because security assumptions are where most systems fail. It applies threat methodologies systematically to identify risks, and it maps those risks to control frameworks that actually apply to your infrastructure.
What matters more than the mechanics is how these pieces fit together. Each stage reads the outputs of previous stages and produces structured output for the next. Each stage is independent enough to be evaluated on its own quality. And crucially, stages that generate findings are separate from stages that verify those findings. The analysis isn’t just flowing from one agent to the next — there are genuine checkpoints along the way.
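The shape of that flow can be sketched in a handful of lines. This is illustrative only: `StageResult`, `run_pipeline`, and the stage signature are assumptions for the sketch, not ThreatKrew’s actual code.

```python
from dataclasses import dataclass

@dataclass
class StageResult:
    name: str        # stage identifier, becomes the artifact key downstream
    artifact: dict   # structured output that later stages read
    valid: bool      # did the output meet this stage's requirements?

def run_pipeline(stages, architecture):
    """Run stages in order. Each stage sees only prior artifacts, and the
    pipeline stops at the first stage that fails its own quality bar."""
    artifacts = {"architecture": architecture}
    for stage in stages:
        result = stage(artifacts)
        if not result.valid:
            raise RuntimeError(f"stage '{result.name}' produced invalid output")
        artifacts[result.name] = result.artifact
    return artifacts
```

In this shape, a verification stage is just another stage: it reads the findings artifact and writes its own critique, which is what keeps generation and checking structurally separate.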
This architecture emerged because we learned through painful iteration that generalized AI output can’t be trusted without verification, and self-verification doesn’t work. You need another pair of eyes — even if those eyes are artificial.
Independent verification is not optional
One of our most important discoveries: AI output has to be checked, and the checking must be genuinely independent of whatever generated it.
In early versions, we had the same agents that generated findings also review them. The results were predictable and disappointing. When an agent identifies a threat and is then asked “is this a real threat?”, it naturally tends to confirm its own reasoning. The bias isn’t malicious — it’s structural: the agent has already committed to a conclusion, and explaining why it was wrong would mean admitting a mistake.
We separated generation from verification completely. Different instructions. Different role framing. An explicit mandate to be critical rather than confirmatory. The second agent wasn’t asked to validate the first agent’s work — it was asked to challenge it, to look for weaknesses, to identify gaps and false positives.
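As a sketch, the separation is mostly about role framing. The prompts below are illustrative, and `call_model` is a stand-in for whatever model client is in use; none of this is our production code.

```python
# Hypothetical role framings: the point is that the verifier's mandate is
# adversarial, not confirmatory, and it never sees itself as the author.
GENERATOR_SYSTEM = (
    "You are a threat analyst. Given an architecture description, list "
    "concrete threats grounded in its components and trust boundaries."
)

VERIFIER_SYSTEM = (
    "You are an adversarial reviewer. You did not write these findings. "
    "Challenge each one: flag anything not grounded in the architecture, "
    "call out likely false positives, and list gaps the analyst missed."
)

def review_findings(call_model, architecture, findings):
    """Send another agent's findings to an independent, critical reviewer."""
    prompt = (
        f"Architecture:\n{architecture}\n\n"
        f"Proposed findings:\n{findings}\n\n"
        "For each finding: is it grounded in the architecture, and why?"
    )
    return call_model(system=VERIFIER_SYSTEM, user=prompt)
```

The generator never participates in the review call, so it has no opportunity to defend its own conclusions.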
This simple change had an outsized impact on quality. Findings that survive genuine adversarial review are more reliable. They’ve been stress-tested from a different angle. The consistency of output improved dramatically.
This isn’t specific to our implementation. It’s a principle: any AI system making consequential decisions needs independent verification built in. The verification can’t be done by the same reasoning that generated the original decision.
Fail clearly, not silently
One of the hardest lessons: AI systems will produce subtly malformed output, and if you don’t catch it, it propagates downstream and becomes indistinguishable from real analysis.
Large language models can produce output that looks correct but is structurally wrong: fields that look right but don’t parse, values that make sense in isolation but violate constraints, or data that’s present but inconsistent. In a multi-stage pipeline, these errors compound. If one stage produces output that the next stage can’t properly parse, the analysis becomes corrupted.
We built strict quality controls so the system fails clearly rather than producing unreliable output.
Every stage produces structured output with defined requirements. If that output doesn’t meet those requirements, the system says so explicitly — it doesn’t silently downgrade to best-guess interpretation. We keep the raw output so we can see what actually happened. We attempt constrained repair when feasible, but we don’t invent data or retroactively change what the model generated.
Most importantly: if a stage can’t produce valid output after repair attempts, the stage fails. We don’t retry endlessly or silently accept degraded quality. A clearly failed assessment is better than an assessment that looks complete but contains corrupted analysis.
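A minimal sketch of such a gate, assuming findings arrive as JSON. The field names, the severity scale, and the one repair we allow here (case normalization) are illustrative, not our actual schema.

```python
import json

# Illustrative schema: these names and values are assumptions for the sketch.
REQUIRED_FIELDS = {"threat", "severity", "affected_component"}
SEVERITIES = {"low", "medium", "high", "critical"}

def validate_stage_output(raw_text):
    """Validate one stage's raw model output.

    Returns (findings, raw_text) so the raw output is always preserved for
    debugging. Raises ValueError instead of guessing: a failure here should
    stop the stage, not silently degrade the analysis.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as e:
        raise ValueError(f"unparseable stage output: {e}") from e

    findings = data.get("findings")
    if not isinstance(findings, list):
        raise ValueError("output is missing a 'findings' list")

    for i, finding in enumerate(findings):
        missing = REQUIRED_FIELDS - finding.keys()
        if missing:
            raise ValueError(f"finding {i} is missing fields: {sorted(missing)}")
        # Constrained repair: normalize case, but never invent a value.
        finding["severity"] = str(finding["severity"]).lower()
        if finding["severity"] not in SEVERITIES:
            raise ValueError(f"finding {i} has invalid severity: {finding['severity']!r}")
    return findings, raw_text
```

Note that every failure path raises with a specific reason: the caller can report “this stage failed because field X was missing,” which is exactly the clear failure mode described above.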
The principle scales beyond our implementation: any system using LLMs for consequential work needs to distinguish between “I didn’t find anything” and “I couldn’t find anything because the output was malformed.” Users need to understand what worked and what didn’t.
The infrastructure
ThreatKrew is built on AWS serverless infrastructure using Amazon Bedrock, Lambda, DynamoDB, and S3. We chose serverless because it scales automatically with demand and we only pay for compute when assessments are actually running.
The pipeline architecture prioritizes resilience and traceability. Each stage is independent — it reads the outputs of previous stages and writes its own artifacts. This independence means a failure in one stage doesn’t cascade to the entire pipeline. It also means we can debug and replay individual stages without rebuilding the entire analysis.
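One way to sketch that artifact layout: each stage reads its dependencies from S3 and writes exactly one artifact of its own. The bucket layout, key scheme, and `run_stage` helper are assumptions for illustration; the calls themselves are the standard S3 `get_object`/`put_object` operations, so any boto3-compatible client works.

```python
import json

def artifact_key(assessment_id, stage):
    """One artifact per stage, under a common per-assessment prefix."""
    return f"assessments/{assessment_id}/{stage}.json"

def run_stage(s3, bucket, assessment_id, stage, depends_on, analyze):
    """Read upstream artifacts, run this stage's analysis, persist its artifact.

    Because inputs and outputs live in S3 rather than in-process state, a
    single failed stage can be inspected and replayed in isolation.
    """
    inputs = {}
    for dep in depends_on:
        obj = s3.get_object(Bucket=bucket, Key=artifact_key(assessment_id, dep))
        inputs[dep] = json.loads(obj["Body"].read())
    artifact = analyze(inputs)
    s3.put_object(
        Bucket=bucket,
        Key=artifact_key(assessment_id, stage),
        Body=json.dumps(artifact).encode("utf-8"),
    )
    return artifact
```

Replaying a stage is then just calling `run_stage` again with the same dependencies; nothing upstream has to be recomputed.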
The foundation models come from Amazon Bedrock. Our model selection strategy optimizes for both quality and cost, configured through infrastructure-as-code so we can update models without redeploying the pipeline.
Broader lessons
Building a reliable AI system for security analysis across multiple organizations and architectures taught us several principles that extend beyond threat modeling.
Specialization works, but requires independence. A single AI system trying to excel at multiple distinct tasks produces mediocre results across the board. Specialized systems produce higher quality, but only if each specialist has genuine autonomy. When analysis flows from one stage to the next without independent checkpoints, errors compound and quality degradation is hard to detect.
Verification must be independent of generation. This is straightforward once you learn it, but it’s fundamental. An AI system that generates findings and then reviews them will not catch its own errors effectively. Building verification as a separate, distinct process with different instructions and an explicit critical mandate changes the dynamics entirely. This principle applies to any consequential AI work, not just threat modeling.
Structured constraints prevent silent failure. LLM output that looks correct but is structurally wrong is worse than obviously wrong output because it propagates downstream and becomes indistinguishable from real findings. Enforce structure — not as a nice-to-have, but as a core part of the pipeline. When structure is violated, fail clearly rather than attempting to work with degraded data.
Observability of failure is essential. When something goes wrong, you need to know what actually happened. Save raw outputs before validation. Preserve error traces. Make it possible to debug individual stages without having to re-run the entire pipeline. Users need confidence that they understand what the system did and what it didn’t do.
Degradation must be defensible. Systems that try to recover silently from errors end up being unreliable in subtle ways. It’s better to clearly fail at a stage and explain why than to produce output that looks complete but is partially corrupted. A clearly failed assessment is more trustworthy than one that appears to succeed but may contain invalid data.
The road ahead
The architecture we’ve built is the foundation, but the mission is bigger. We’re working toward interactive refinement so users can provide feedback that improves analysis in real time, continuous monitoring so threat models update as architecture evolves, and deeper integration with development workflows so security analysis is part of how teams build.
Security analysis that happens once and sits on a shelf isn’t very useful. The principles we’ve learned — specialization, independent verification, clear failure modes, observable reasoning — apply whether you’re analyzing an architecture once or continuously monitoring it over years. That’s where we’re heading.
Want to understand the mission behind ThreatKrew? Read why we built ThreatKrew, explore why threat modeling matters, or see how it works.
Ready to see these principles in action? Try the automated threat modeling tool or join the Founders Program and get your first AI threat model in minutes.
Dave Barton
Co-founder
Co-founder of ThreatKrew. Former AWS security specialist with years of experience securing enterprise infrastructure. Passionate about making professional security analysis accessible to every team.