Running Claude Code in CI sounds simple. Swap the interactive session for the -p flag, point it at a pull request, post the result, ship it. The catch is that a workflow that runs fine on your laptop behaves differently on a GitHub Actions runner, where no one is watching for a runaway loop, there is no Ctrl+C to reach for, and the API bill shows up at the end of the month.
That gap between terminal Claude Code and CI Claude Code is where most first attempts go wrong. Timeouts nobody set. Tools enabled that should not be. Output that no downstream script validates. Costs that nobody tracks until they blow past the team's AI budget.
This guide covers the patterns that close that gap: headless invocation, JSON output parsing, cost control, multi-turn validation, and the security footguns that are easy to miss when you copy a workflow from a blog post and push it to main.
## The flag triangle
Everything in headless Claude Code comes down to three flags and one choice.
`-p` (or `--print`) drops into non-interactive mode. Claude reads your prompt from the command line, executes, prints the result, exits. No REPL, no waiting for keyboard input. This is the baseline for any automation.
`--output-format json` wraps Claude's response in structured JSON instead of plain text. You get the assistant message, the tool calls it made, token usage, cost estimate, and session metadata. In automation you almost always want this, because you almost always need to check something before acting.
`--bare` is the flag most CI setups skip and then regret. By default, Claude Code boots with the full user environment: auto-discovered hooks, skills, plugins, MCP servers, your CLAUDE.md chain, OAuth pulled from the keychain. In CI, none of that loads the way you expect, and silent misconfiguration becomes impossible to debug. `--bare` strips all of it. It forces authentication through `ANTHROPIC_API_KEY` only, skips keychain reads, and loads no ambient config. You get a deterministic startup that behaves the same on every runner.
The choice: use the CLI, the Claude Agent SDK (Python or TypeScript), or the official anthropics/claude-code-action GitHub Action. CLI is best for simple one-shot jobs. The SDK gives you the same agent loop in library form, useful when you are embedding Claude inside a larger program. The GitHub Action wraps CLI invocation with GitHub-specific plumbing like PR diffs and inline comments. Pick the lowest-level tool that solves your problem.
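When you do embed the CLI inside a larger program instead of reaching for the SDK, it helps to keep the flag plumbing in one place. A minimal sketch, assuming the flag spellings used throughout this guide; `headlessArgs` is an illustrative helper, not part of any API:

```typescript
// Build the argv for a deterministic headless run. Flag names are the ones
// from the flag triangle above; treat exact spellings as assumptions to
// verify against your installed CLI version.
function headlessArgs(prompt: string, tools: string[], maxTurns: number): string[] {
  return [
    '-p',                               // non-interactive: read prompt, print, exit
    '--bare',                           // no ambient config, API-key auth only
    '--output-format', 'json',          // structured result for downstream parsing
    '--allowedTools', tools.join(','),  // explicit whitelist
    '--max-turns', String(maxTurns),    // bound the agent loop
    prompt,
  ];
}

const args = headlessArgs('Summarize the diff on stdin.', ['Read', 'Grep', 'Glob'], 5);
console.log(args.join(' '));
// To actually run it (requires claude on PATH):
// spawnSync('claude', args, { input: diff, encoding: 'utf8' })
```

Centralizing the argv this way means the review job, the codegen job, and any future job share one definition of "safe defaults" instead of copy-pasted flag lists.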
## A GitHub Actions workflow that does not embarrass you
Here is a stripped-down workflow that runs Claude Code against a pull request and fails gracefully:
```yaml
name: Claude PR review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code

      - name: Run review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          BASE_SHA: ${{ github.event.pull_request.base.sha }}
          HEAD_SHA: ${{ github.event.pull_request.head.sha }}
        run: |
          DIFF=$(git diff --unified=3 "$BASE_SHA"..."$HEAD_SHA")
          echo "$DIFF" | claude -p \
            --bare \
            --output-format json \
            --allowedTools "Read,Grep,Glob" \
            --max-turns 5 \
            "Review this diff. Flag only issues that would cause bugs or regressions. If there are none, say so. Keep the response under 400 words." \
            > review.json

      - name: Post review
        run: node .github/scripts/post-review.js
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
Five things worth calling out:
- `timeout-minutes: 10` is not optional. Without it, a stuck Claude process will run until the default six-hour Actions timeout. That is how $180 bills happen.
- `--bare` is there. Deterministic startup, no ambient config.
- `--allowedTools "Read,Grep,Glob"` restricts Claude to read-only operations. In a review job it should never touch `Bash` or `Edit`.
- The diff goes in via stdin, and the prompt is explicit about scope and length. Bounded output means bounded tokens means bounded cost.
- JSON is written to a file, and a separate step parses and posts. Never let Claude talk to your APIs directly. Parse its output, validate it, then act.
## Parsing Claude's output without trusting it
The JSON output format is stable, but the content inside is not. Assume Claude will occasionally return empty responses, malformed markdown, polite refusals, or an explanation that it could not find anything to review. Your automation has to tolerate all of them.
A minimal TypeScript parser:
```typescript
import fs from 'node:fs';

type ClaudeResult = {
  type: 'result';
  subtype: 'success' | 'error_max_turns' | 'error_during_execution';
  result?: string;
  total_cost_usd?: number;
  num_turns?: number;
  session_id?: string;
};

const raw = fs.readFileSync('review.json', 'utf8');
const parsed: ClaudeResult = JSON.parse(raw);

if (parsed.subtype !== 'success') {
  console.error(`Claude did not finish cleanly: ${parsed.subtype}`);
  process.exit(0);
}

let review = (parsed.result ?? '').trim();

if (review.length < 50) {
  console.log('Review too short to post, skipping.');
  process.exit(0);
}

if (review.length > 4000) {
  console.warn('Review exceeded length budget, truncating.');
  review = review.slice(0, 4000); // actually truncate, not just warn
}
```
The pattern: assume the output might be missing, truncated, or error-flagged, and decide what to do in each case before you ever post anything to a pull request. Exiting zero on a non-success subtype is intentional. A failed Claude run should not fail the pipeline unless you want every flaky API call to block merges.
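The posting step itself can stay equally defensive. A minimal sketch of what `post-review.js` might do with the validated text, assuming GitHub's issue-comments endpoint; `repo` and `pr` are hypothetical values that a real job would pull from workflow env vars:

```typescript
// Wrap the validated review in a comment payload. Returning null instead of
// posting an empty comment keeps the "decide before acting" rule intact.
function buildComment(review: string): { body: string } | null {
  const text = review.trim();
  if (text.length === 0) return null; // nothing worth posting
  return { body: `## Claude review\n\n${text}` };
}

const payload = buildComment('No blocking issues. One note: the retry loop never resets its counter.');
console.log(payload !== null);

// In CI (sketch, not executed here):
// await fetch(`https://api.github.com/repos/${repo}/issues/${pr}/comments`, {
//   method: 'POST',
//   headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}`, 'Content-Type': 'application/json' },
//   body: JSON.stringify(payload),
// });
```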
## The cost trap
Cost is the thing that bites you in week two.
Anthropic's prompt caching helps enormously. The official anthropics/claude-code-action reports costs around $0.0015 per review run with caching on. A team running 10-15 pull requests a week lands somewhere between $15 and $25 a month. Without caching, multiply that by roughly ten, because every review sends the full context fresh.
Three concrete controls:
Bound the input. Pass the diff, not the full repo. If you need wider context, pass specific files explicitly instead of letting Claude walk the tree. Claude is very good at filling context when you let it, and context has a per-token cost.
Bound the output. Tell Claude how long the response should be. "Under 400 words" in the prompt actually works. Open-ended review requests produce open-ended bills.
Bound the turns. Use --max-turns to cap agent loop iterations. Default is effectively unbounded. A reasonable starting point: 5 for review jobs, 15 for jobs that need to run tests and iterate on the result.
The discipline matters more the moment you move to the SDK: the Agent SDK's loop will happily spin through 40 tool calls if you let it. A good habit is logging num_turns on every run, setting a per-job budget, and alerting when a run blows past it. Most runaway loops trace back to Claude re-reading a file it already had, because the prompt lacked a concrete stopping condition.
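That logging habit is a few lines of code. A sketch of a per-job budget check using the `total_cost_usd` and `num_turns` fields from the `ClaudeResult` shape in the parser above; the thresholds are illustrative, not recommendations:

```typescript
// Illustrative budgets: tune per job type.
const MAX_COST_USD = 0.25;
const MAX_TURNS = 10;

function budgetViolations(r: { total_cost_usd?: number; num_turns?: number }): string[] {
  const out: string[] = [];
  if ((r.total_cost_usd ?? 0) > MAX_COST_USD) out.push(`cost ${r.total_cost_usd} > ${MAX_COST_USD}`);
  if ((r.num_turns ?? 0) > MAX_TURNS) out.push(`turns ${r.num_turns} > ${MAX_TURNS}`);
  return out;
}

// In CI this object would be JSON.parse of review.json; inlined for the sketch.
const result = { total_cost_usd: 0.0023, num_turns: 4 };
console.log(`turns=${result.num_turns} cost=${result.total_cost_usd}`);
for (const v of budgetViolations(result)) console.error(`BUDGET: ${v}`);
```

Whether a violation fails the step or just pages someone is a policy choice; the point is that the numbers get looked at on every run, not discovered on the invoice.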
## Multi-turn loops, with validation
The most useful thing headless Claude Code does in CI is iterate. You give it a task, let it run, run your tests against its output, and feed the failures back into the next turn.
The pattern in pseudo-shell:
```bash
attempts=0
max_attempts=3
PROMPT="Fix the failing tests in src/auth. Changes must pass npm test and npm run lint."

while [ "$attempts" -lt "$max_attempts" ]; do
  claude -p --bare --output-format json --max-turns 10 \
    "$PROMPT" > out.json

  # Capture validator output so failures can feed the next turn's prompt.
  if npm test > test-output.txt 2>&1 && npm run lint > lint-output.txt 2>&1; then
    echo "Passed on attempt $((attempts+1))"
    break
  fi

  FAIL_OUTPUT=$(cat test-output.txt lint-output.txt 2>/dev/null)
  PROMPT="Your previous attempt failed. Failures below. Fix only what the failures point to. $FAIL_OUTPUT"
  attempts=$((attempts+1))
done
```
This is close to how the Claude Code team itself uses the tool internally: let the model attempt, validate externally, iterate with concrete failure messages. It is also the pattern the Claude Agent SDK exposes in its query() function, with retry semantics and tool-call tracking built in.
Two warnings.
The feed-back prompt matters more than you think. "Please try again" produces worse results than pasting the actual failing test output and the specific line the linter complained about. Specificity in the loop prompt is the difference between convergence in three iterations and spinning out on the third with an answer worse than the first.
Cap the loop. Infinite retry on a bad task is worse than failing cleanly and handing it to a human. Three attempts is a sensible default. Past that, the problem is almost always the prompt, not the model. This is the same shape of loop used in autonomous AI agents for CI/CD pipeline automation, just applied to a narrower tool. Validation-in-the-loop is the reason either approach works at all.
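Both warnings come down to how the feedback prompt is built. A sketch of that step, assuming the captured test and lint output is available as a string; `retryPrompt` and its character budget are illustrative choices, not a prescribed API:

```typescript
// Build the next-turn prompt from real failure output. Truncating from the
// front keeps the tail, where test summaries and exit lines usually live,
// and bounds the loop's token cost.
function retryPrompt(task: string, failures: string, maxChars = 8000): string {
  const tail = failures.length > maxChars
    ? '…' + failures.slice(-maxChars)
    : failures;
  return [
    task,
    'Your previous attempt did not pass. Exact failures below.',
    'Fix only what these failures point to; change nothing else.',
    '---',
    tail,
  ].join('\n');
}

const p = retryPrompt(
  'Fix the failing tests in src/auth.',
  'FAIL auth.test.ts:42 expected 401, got 500',
);
console.log(p.split('\n').length);
```

Passing the exact failing line ("expected 401, got 500") rather than "tests failed, try again" is what makes the loop converge instead of wander.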
## When to graduate from the CLI
The CLI is the right tool until it is not. Signs you have outgrown it:
- Your workflow runs Claude Code, parses output, triggers another workflow, parses that output, and so on. You are building an orchestrator in bash and it is fragile.
- You need to run concurrent Claude jobs sharing a rate-limit budget.
- You want real-time streaming output into a dashboard, not a retrospective log.
- You need to plug Claude into an existing job queue system.
Two options when you get there.
Claude Agent SDK. Install with `pip install claude-agent-sdk` for Python 3.10+, or `npm install @anthropic-ai/claude-agent-sdk` for TypeScript. You get the full agent loop, tool use, subagents, and context management as a library. Use this when you want to embed Claude inside a service, customize the tool set, or run multiple sessions from one process. The SDK was renamed from Claude Code SDK in September 2025, so older tutorials referencing `claude_code_sdk` or `@anthropic-ai/claude-code` are out of date.
Claude Code Dispatch. Dispatch, which shipped in Q1 2026, treats Claude Code as a queueable job. You submit a task to the API, Claude Code picks it up, runs it, and streams events through Channels. This is the right layer when you need observability, concurrent execution, and retry semantics across many jobs. It is also where you want to be if someone on your team will watch a dashboard instead of reading GitHub Actions logs.
Rule of thumb: CLI for one-shot per-PR jobs, SDK for bespoke integrations, Dispatch for production multi-tenant workflows. If you are using subagents to run parallel AI work inside your CI jobs, graduate early. The CLI is fine for a single agent, painful for orchestrating several.
## Security and trust boundaries
In CI, Claude Code has more capability than you probably want. `Bash` lets it run arbitrary shell commands. `Edit` lets it modify your repo. `WebFetch` lets it hit external URLs. None of these belong in a review job.
`--allowedTools` is a whitelist. Pass exactly the tools the job needs. For a review job: `Read,Grep,Glob`. For a code-generation job that commits back: `Read,Grep,Glob,Edit,Write`. Never use `Bash` in CI without a very specific reason and a sandboxed runner.
`--disallowedTools` is the inverse, useful when you want most tools but need to block one or two.
Secrets. Claude Code in CI will read whatever is on disk. If your repo contains a `.env` or build artifacts with credentials, Claude can see them. Two mitigations: keep credentials out of the checkout entirely (a committed `.env` is visible regardless of `.gitignore`), and use `--bare` so the runner does not auto-load a user MCP config that might include connectors you do not want running in CI. The full permission model that `--bare` bypasses is covered in the Claude Code power-user guide on skills, hooks, and subagents.
Prompt injection. The diff in a pull request is user-controlled content. A malicious contributor can put instructions in a code comment that tell Claude to leak secrets or rewrite files. Mitigations: read-only tools in review jobs, never give Claude write access on untrusted PRs (only on PRs from your own branches), and log every tool call so you can audit what actually happened. If you are comparing review quality across tools, the AI coding assistants comparison has more on how Claude handles this versus Copilot and Cursor.
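Logging every tool call is straightforward if you capture the streaming output. A hedged sketch, assuming `--output-format stream-json` emits one JSON object per line with assistant messages carrying `tool_use` content blocks; verify the exact shape against your CLI version before relying on this for audits:

```typescript
// Extract an audit trail of tool calls from a stream-json log.
// The message shape here is an assumption about the stream format.
function auditToolCalls(streamLog: string): { name: string; input: unknown }[] {
  const calls: { name: string; input: unknown }[] = [];
  for (const line of streamLog.split('\n')) {
    if (!line.trim()) continue;
    let msg: any;
    try { msg = JSON.parse(line); } catch { continue; } // tolerate partial lines
    const content = msg?.message?.content;
    if (!Array.isArray(content)) continue;
    for (const block of content) {
      if (block?.type === 'tool_use') calls.push({ name: block.name, input: block.input });
    }
  }
  return calls;
}

// Synthetic two-line log standing in for a real run's output.
const log = [
  JSON.stringify({ type: 'assistant', message: { content: [
    { type: 'tool_use', name: 'Read', input: { file_path: 'src/auth.ts' } },
  ] } }),
  JSON.stringify({ type: 'result', subtype: 'success' }),
].join('\n');
console.log(auditToolCalls(log).map(c => c.name));
```

An audit log like this is what lets you answer "did Claude actually read the file the injected comment pointed at?" after the fact, instead of guessing.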
## Where this breaks
Honest limits, because nobody posts about these.
Cold start on every CI run. Claude Code has to load, authenticate, and initialize on every job. That is 3-6 seconds before your prompt even starts. For high-frequency triggers, the overhead dominates. For those, use the SDK or Dispatch and keep a warm session.
Non-deterministic output. Even with temperature locked, Claude's output varies run to run. If your CI makes hard decisions based on Claude's answer (approve, block, merge), wrap the decision in explicit validation. Do not trust a single review as a quality gate. The same principle applies to any zero-downtime deployment pipeline: never let a non-deterministic check be the only gate.
Repository-scale context. Headless Claude Code does not have the interactive session's ability to build up context across many prompts. It sees what you give it, plus whatever tools you let it call. For large refactors or cross-module changes, use the SDK with a persistent session, not the CLI.
Debugging failed runs. When a CI Claude run does something weird, you have only the JSON log. Interactive sessions are far easier to debug because you can inspect state live. In CI, logging every tool call with --verbose is the only way to reconstruct what happened.
Cost at scale. The $15-25/month figure is for moderate pull request review. Generative tasks, large refactors, and multi-file changes can run ten to twenty times more expensive per job. Budget before you roll out, and track total_cost_usd in your logs per job.
## Start small
A sensible rollout: one job, one repo, one week. Pick pull request review because it is read-only, contained, and easy to evaluate. Put the workflow behind a label like `claude-review` so it only runs on PRs you explicitly opt in. Watch the cost dashboard every day for the first week. Once the invocation is stable and the cost is bounded, expand scope.
Do not start with code-generation jobs in CI. They are more useful but they are also where every cost and security footgun lives. Generation works best interactively in a reviewed session, not fired from an automated trigger against user-controlled content.
Claude Code in CI is not magic. It is a fast, capable coding model with sharp edges when you run it unattended. Handle the edges up front, using the flags, validation, and cost controls above, and the automated review workflow earns its place in the pipeline. Skip them, and unattended Claude Code will find creative ways to surprise your API bill.