A Spy’s Guide to Running AI

Three Former Intelligence Officers on What Intelligence Tradecraft Teaches Us About Generative AI

By John O’Neil, Jim Lawler, and Mike Mears

Generative AI should be managed like a human source: useful, fast, sometimes brilliant, sometimes wrong, and never a substitute for disciplined questioning and human judgment.

The three of us spent our careers in an environment where bad information costs lives. We learned early that the most dangerous source isn’t someone who lies to you. It’s someone who tells you what you want to hear—and does it convincingly. As we watch organizations race to adopt generative AI, we keep seeing the same mistake: treating these tools like oracle machines rather than sources that need to be run.

We are not AI experts. We are not here to debate model architectures or training data. What we know is how to extract reliable insights from sources whose motivations can’t be fully verified, whose outputs may be biased or based on incomplete information, and whose reliability must be continuously earned. That is exactly the problem organizations face with AI today.

This is what HUMINT tradecraft has taught us—and what it has to teach anyone who wants to get honest, useful work from a generative AI system.

The Source Who Was Never Wrong

Early in our careers, two of us ran sources who were brilliant, well-placed, articulate, and deeply motivated. They produced detailed, confident, and consistent reporting. Senior analysts loved them. Their product sailed through review. For months, everything they said checked out—until it didn’t.

The problem wasn’t that they were lying, exactly. In both cases, they filled gaps with inference. They’d learned what we wanted to hear, and their natural intelligence and experience let them produce it fluently. The reporting wasn’t fabricated—it was confabulated. Coherent and plausible, but in key places, wrong.

We’ve all seen this pattern in the early months of AI adoption. The tool is fast. It’s articulate. It never pauses, never says “I’m not sure,” and it formats its answers with the confident authority of a briefing document. A recent Science study found that across eleven state-of-the-art AI models, sycophantic behavior—affirming users’ views even when inaccurate—was widespread and measurable. Stanford researchers found that AI systems trained on human preference feedback are systematically rewarded for being agreeable rather than correct, because agreeable outputs receive higher ratings. The models learn to please.

We’ve seen that source before. We know how the story ends.

Selection: Not All Sources Are Equal

Before you run a source, you select one. That’s a discipline in itself, and one where AI tools may in fact add value: identifying and sorting the stressors in a prospect’s life that can be exploited, then outlining for case officers which levers to pull in a recruitment. You don’t recruit someone simply because they have access. You also generally don’t recruit happy people. You have to evaluate reliability, motivation, and susceptibility to manipulation. A source with wide access and poor judgment can be more dangerous than no source at all.

The same applies to AI. Not all AI systems are created equal for every task or mission. Each must be evaluated on access, expertise, responsiveness, and the quality of reporting—and the last criterion is harder to assess than it appears.

A few selection questions worth building into any AI adoption process:

  • What is this model’s known track record on this specific type of task, not in general but specifically?
  • Where does it tend to confabulate? What are its known failure modes?
  • Is it current? A model with a training cutoff is like a source who’s been out of the field for a year—still useful, but with blind spots.
  • How does it behave when it doesn’t know something? Does it admit it, or does it keep talking?

Choosing an AI because it’s fast or because leadership read about it in a business magazine isn’t source selection. It’s the equivalent of recruiting the first walk-in who shows up at the door.

Elicitation, Not Interrogation

One of the first lessons a new case officer learns is that interrogation and elicitation are not the same. Interrogation demands. Elicitation draws out. A blunt question produces a guarded answer. A layered conversation yields insight the source didn’t realize they were sharing.

Most people using AI are interrogating it. “What’s the answer?” “Summarize this.” “Give me options.” That approach works, up to a point, but it caps the quality of what you get.

Effective elicitation with AI means:

  • Never ask a direct question when an indirect one is better. Instead of “What should we do?” try “What factors would a skeptic weigh against this recommendation?”
  • Compartmentalize your tasking. Don’t dump the entire problem into a single prompt. Break it into discrete, well-scoped questions. Discrete tasking yields more verifiable output.
  • Build layered follow-ups. Ask: “What are you assuming?” “What would change your conclusion?” “Give me the strongest argument against this.”
  • Probe for alternatives before you settle on an answer. A source that only ever confirms your hypothesis deserves more scrutiny, not less.

This turns AI from a content generator into something closer to a thinking partner. But it requires the same discipline as running a source well: preparation, precision, and the intellectual humility to recognize that your framing shapes what you get back.
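As a rough illustration of compartmentalized tasking with layered follow-ups, here is a minimal Python sketch. It calls no AI system; it only structures what you plan to ask. The names TaskingPlan, compartmentalize, and as_prompts are hypothetical and exist only for this example. The follow-up probes are the ones quoted in the list above.

```python
from dataclasses import dataclass

# Follow-up probes taken from the elicitation list in this article.
FOLLOW_UPS = (
    "What are you assuming?",
    "What would change your conclusion?",
    "Give me the strongest argument against this.",
)


@dataclass
class TaskingPlan:
    """One well-scoped question plus the probes to ask after it is answered."""
    question: str
    follow_ups: tuple = FOLLOW_UPS


def compartmentalize(sub_questions):
    """Turn discrete sub-questions into separate tasking plans.

    Blank entries are dropped so every plan carries a real question.
    """
    return [TaskingPlan(q.strip()) for q in sub_questions if q.strip()]


def as_prompts(plan):
    """Flatten one plan into the ordered prompts a user would send."""
    return [plan.question, *plan.follow_ups]
```

Calling compartmentalize on a list of three sub-questions yields three independent plans, each producing four prompts. The point of the structure is that every answer can be checked on its own before the next question is asked.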

The Hostile Source Problem

There is a risk the standard AI adoption literature doesn’t spend enough time on. In intelligence work, we worry not just about sources who are wrong—we worry about sources who have been co-opted or doubled, or who are feeding us what we want to hear because they’ve learned our preferences and decided that’s what keeps the relationship alive.

AI systems have structural analogs to all three failure modes:

  • Sycophancy as a design artifact. Because models are trained on human preference feedback, they are incentivized to produce outputs that feel satisfying. Researchers at Carnegie Mellon and Stanford have documented an “artificial hivemind” effect in which outputs from multiple AI models converge—reducing epistemic diversity at the very moment organizations need independent judgment.
  • Training data as a contamination channel. A source’s worldview is shaped by their environment. An AI model’s worldview is shaped by its training corpus. That corpus reflects the biases, omissions, and assumptions of the material it was built on. You may not know where those biases are, and the model won’t volunteer them.
  • Automation bias as a user vulnerability. A series of recent studies confirms what experienced case officers know: people grant far more credibility to confident, fluent reporting than the underlying evidence warrants. Research published in 2025 found that even users with high “AI literacy” were not significantly protected against automation bias—the tendency to accept AI output without critical evaluation.

The practical implication: approach your AI system with the same structured skepticism you’d bring to a well-placed source who has given you no reason to doubt them. That’s when discipline matters most.

Debrief Discipline: The Protocol That Makes It Real

After every source meeting, a case officer writes up not only what the source said but also their assessment of reliability—what was corroborated, what was assumed, and what needs follow-up. That habit is the difference between a professional intelligence organization and a rumor factory.

Most organizations using AI lack an equivalent discipline. Someone prompts the model, takes the output, and puts it in a slide. No one records what was asked, what caveats the model offered, or whether the output was independently verified. The result is institutional memory built on unexamined reporting.

A working AI reporting protocol should mirror the post-meeting debrief:

  • Requirement—What question are we actually trying to answer?
  • Prompt—What, precisely, did we ask? (Save it.)
  • Output—What did the AI say?
  • Source check—What in this output is reliable? What is uncertain? What is unsupported?
  • Human judgment—What do we actually believe, independent of the AI?
  • Action—What will we do?
  • Review—What happened after we acted? Did the AI’s analysis hold up?

The review step is the one that organizations most consistently skip. But it’s where calibration happens. A source you never debrief after the fact is one whose reliability you can never actually assess.

A useful team habit before closing out any AI-assisted analysis: “Before we accept this answer, what would disconfirm it?” That question alone will catch more errors than any amount of AI governance policy.
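One way to make the seven-step protocol hard to skip is to record it as a structured entry and check which steps are still empty. The sketch below is an assumption about how a team might do that, not a description of any existing tool. DebriefRecord, SourceCheck, and missing_steps are hypothetical names, and the field names simply follow the steps listed above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SourceCheck:
    """What in the output is reliable, uncertain, or unsupported."""
    reliable: list = field(default_factory=list)
    uncertain: list = field(default_factory=list)
    unsupported: list = field(default_factory=list)


@dataclass
class DebriefRecord:
    """One AI-assisted analysis, recorded step by step."""
    requirement: str = ""      # the question we are actually trying to answer
    prompt: str = ""           # exactly what we asked
    output: str = ""           # what the AI said
    source_check: SourceCheck = field(default_factory=SourceCheck)
    human_judgment: str = ""   # what we believe, independent of the AI
    action: str = ""           # what we will do
    review: str = ""           # what happened after we acted

    def missing_steps(self):
        """Return the names of protocol steps that are still empty."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, SourceCheck):
                empty = not (value.reliable or value.uncertain or value.unsupported)
            else:
                empty = not value.strip()
            if empty:
                missing.append(f.name)
        return missing
```

A record filled in through the action step but never revisited reports ["review"] as its only missing step, which is the gap the protocol is designed to expose.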

Separating Collection from Analysis

This is a fundamental discipline in intelligence work, and it translates directly. AI is a tool for collecting and synthesizing. It can ingest, summarize, organize, and compare. What it cannot reliably do is interpret—to ask what the information means here, in this context, for this organization, with these constraints.

The error organizations make is treating AI as if it collapses the divide between collection and analysis. It doesn’t. It accelerates collection. The analytical function—applying judgment, context, institutional knowledge, and accountability—remains human.

Teams that hand over analytical responsibility to AI are not just making an efficiency error. They are making an accountability error. Someone has to own the conclusion. AI cannot.

Burning a Source: When to Stop Trusting the AI

This is the part of the tradecraft literature on AI that doesn’t exist yet, and it needs to.

Every experienced case officer has had to decide to terminate a source relationship. Not because the source was obviously lying—if that were clear, the decision would be easy. You terminate when the source’s reliability has fallen below a threshold, when you have reason to believe the source has been compromised, or when the cost of continuing to run them outweighs the value of their reporting.

The equivalent decisions will come for AI systems, and organizations should prepare for them:

  • When a model’s known failure modes consistently overlap with your mission-critical questions, it is time to stop relying on it for those questions—regardless of how it performs elsewhere.
  • When an AI system has been demonstrably wrong in a consequential context and the organization has not developed a clear explanation for why, continuing to use it at the same level of trust is an operational error.
  • When a model is updated or retrained by its provider, treat it as a new source and revalidate. Prior reliability does not transfer automatically.
  • When you discover that the model has been systematically producing outputs shaped by the framing of your prompts rather than by evidence—that you have been leading the witness without realizing it—you may need to reset the relationship.

Burning a source is not a failure of the source-handling relationship. It is often the proof that the relationship was being handled well.
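The four triggers above can be written down as an explicit checklist so the decision does not depend on memory or mood. This is a minimal sketch under stated assumptions: SourceStatus and reasons_to_stop_trusting are hypothetical names, the flags are set by human reviewers rather than measured automatically, and the version strings are whatever identifier your provider exposes.

```python
from dataclasses import dataclass


@dataclass
class SourceStatus:
    """Reviewer-supplied facts about one AI system used for one mission."""
    # Model version last validated in-house.
    validated_version: str
    # Version the provider is serving today.
    current_version: str
    # Known failure modes overlap mission-critical questions.
    failure_modes_hit_mission: bool = False
    # Wrong in a consequential context, with no clear explanation yet.
    unexplained_consequential_error: bool = False
    # Outputs tracked the framing of our prompts rather than evidence.
    prompt_led_outputs: bool = False


def reasons_to_stop_trusting(status):
    """Return every trigger from the list above that currently applies."""
    reasons = []
    if status.failure_modes_hit_mission:
        reasons.append("failure modes overlap mission-critical questions")
    if status.unexplained_consequential_error:
        reasons.append("consequential error with no clear explanation")
    if status.current_version != status.validated_version:
        reasons.append("model changed since last validation: revalidate")
    if status.prompt_led_outputs:
        reasons.append("outputs shaped by prompt framing: reset")
    return reasons
```

An empty list does not prove the system is reliable. It only means none of the named triggers has fired, so the normal debrief discipline still applies.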

What This Means for How You Lead

The three of us came to this issue through intelligence work, but the problem is not limited to intelligence organizations. Any leadership environment where AI tools are proliferating faces the same structural challenge: the tools are fast, fluent, and confident, and organizational incentives often reward those who use them most rather than those who use them best.

The research bears this out. INSEAD’s 2025 analysis of firm-level AI adoption found that generative AI shifts value toward higher-order human judgment—not away from it. Microsoft’s research confirms that organizations with a well-calibrated understanding of AI perform better across missions than those that simply maximize usage. The tool is the easy part. The discipline is the hard part.

For leaders, the implications are practical:

  • Build the debriefing habit before you scale AI adoption. The protocol above should be standard practice, not optional.
  • Create psychological safety so people can flag AI errors. The greatest risk in any source-handling operation is the team member who saw the problem but didn’t say anything because the source had too much credibility.
  • Distinguish between AI as a collection tool and as an analytical tool. Automate the former aggressively. Guard the latter carefully.
  • Evaluate AI systems with the same rigor you would apply to any source—including periodic reviews of whether the relationship continues to produce reliable value.

Used with discipline, generative AI can be a genuinely powerful analytical partner—the kind of well-placed, high-access source that an experienced handler learns to work with carefully and derive real value from. Used without discipline, it becomes a certainty-destroyer—introducing noise, eroding judgment, and producing false confidence at scale.

The HUMINT model doesn’t make AI safer by limiting what it does. It makes AI safer by raising the standard for what we do with what it gives us.

AI doesn’t give you answers. It gives you reports. And reporting always requires a handler’s skeptical, trained eye.

About the Authors

The authors are former national security officers with combined experience across human intelligence operations, national laboratories, and management. Their views are their own and do not represent the position of the Central Intelligence Agency or the United States government.

Mike Mears is a leadership expert, bestselling author, creator of LeadCultureChange.com, and former CIA Chief of Human Capital. As the founder of the CIA Leadership Academy, he trained managers and senior executives in practical leadership strategies grounded in neuroscience and human behavior. Mears holds an MBA from the Harvard Business School and a BS degree from the US Military Academy at West Point.

James “Jim” Lawler served for 25 years as a CIA operations officer in various international posts and as Chief of the Counterproliferation Division’s Special Activities Unit. He was a member of the CIA’s Senior Intelligence Service from 1998 to 2005. Lawler was a specialist in the recruitment of foreign spies, and he spent over half of his CIA career battling the proliferation of weapons of mass destruction, including serving as the chief of the A. Q. Khan Nuclear Takedown team, which resulted in the disruption of the most dangerous nuclear weapons network in history.

John O’Neil, Ph.D., has served extensively in leadership roles in academia and at Oak Ridge National Laboratory, where his work lay at the intersection of critical science and technology development, intelligence, disruptive technical and WMD proliferation threats, and national security. He and his distinguished teams delivered numerous mission-critical insights and solutions for intelligence, defense, and homeland security.
