Stripe's AI 'Minions' Now Ship 1,300 Pull Requests Per Week With Zero Human-Written Code

Stripe reveals its autonomous coding agents, called Minions, now generate over 1,300 merged pull requests weekly - all reviewed by humans but written entirely by AI.

Stripe just published the second installment of its engineering deep-dive into Minions, the company's internal autonomous coding agents. The numbers have jumped since Part 1 landed earlier this month: over 1,300 pull requests merged per week now carry zero human-written code. Engineers review everything before it ships, but from Slack message to merged PR, no human writes a line of code.

For a company processing over $1 trillion in annual payment volume across hundreds of millions of lines of mostly Ruby code, that is not a toy demo. It is production infrastructure.

How Minions Work: The Full Stack

The architecture is a five-layer pipeline that turns a chat message into a production-ready pull request.

Invocation

Engineers trigger Minions through Slack (the most common path), a CLI, a web interface, or automated systems like flaky-test detectors. A typical run starts with someone tagging the Minions app in a Slack thread and ends with a PR that has already passed CI.
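
Whatever the entry point, each run presumably collapses into a single normalized task before a Minion picks it up. A minimal sketch of that normalization in Python - TaskSpec and both handler names are hypothetical, not Stripe's internals:

  from dataclasses import dataclass

  @dataclass
  class TaskSpec:
      description: str               # what the Minion should do
      repo: str                      # target repository
      source: str                    # "slack", "cli", "web", or "automation"
      thread_ts: str | None = None   # Slack thread to report the PR back to

  def handle_slack_mention(event: dict) -> TaskSpec:
      """An @-mention of the Minions app becomes a normalized task."""
      return TaskSpec(description=event["text"],
                      repo=event.get("repo", "monorepo"),
                      source="slack",
                      thread_ts=event.get("thread_ts"))

  def handle_flaky_test_alert(test_name: str) -> TaskSpec:
      """Automated trigger: no human in the loop at all."""
      return TaskSpec(description=f"Investigate and fix flaky test {test_name}",
                      repo="monorepo",
                      source="automation")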

Devboxes: The Execution Sandbox

Each Minion runs on a "devbox" - a standardized AWS EC2 instance pre-loaded with Stripe's full source tree, warmed Bazel and type-checking caches, and code generation services. Stripe provisions these from a warm pool in under 10 seconds.
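
The sub-10-second figure makes sense if provisioning is really just claiming a pre-booted instance rather than launching one. A rough sketch of the warm-pool idea - WarmPool and Devbox are illustrative names, not Stripe's API:

  import queue
  import time
  from dataclasses import dataclass

  @dataclass
  class Devbox:
      instance_id: str              # pre-booted EC2 instance, caches warmed
      claimed_at: float | None = None

  class WarmPool:
      def __init__(self, size: int):
          self._pool: "queue.Queue[Devbox]" = queue.Queue()
          for i in range(size):
              self._pool.put(Devbox(f"i-{i:08x}"))  # provisioned ahead of demand

      def claim(self, timeout: float = 10.0) -> Devbox:
          """Hand out an idle devbox; fast because it already exists."""
          box = self._pool.get(timeout=timeout)
          box.claimed_at = time.time()
          return box

      def discard(self, box: Devbox) -> None:
          """Cattle, not pets: terminate `box` out of band, never reuse it,
          and backfill the pool with a fresh instance."""
          self._pool.put(Devbox(f"i-{int(time.time() * 1000):08x}"))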

The philosophy is "cattle, not pets." Every devbox is identical and disposable. Engineers already use these same environments via remote SSH through their IDEs, so the infrastructure existed before Minions did. The isolation means agents get full shell permissions without confirmation prompts - any mistake stays confined to one throwaway instance.

The Agent Core: A Goose Fork

Stripe's agent harness is a heavily modified fork of Block's open-source Goose coding agent, adapted for fully unattended operation. Where tools like Cursor or Claude Code are designed for interactive pair-programming, Minions strip out everything meant for humans - interruptibility, confirmation dialogs, human-triggered commands - and optimize for one-shot task completion.
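
The practical difference shows up in the agent loop itself: nothing ever blocks on a human. A hedged sketch of a one-shot loop, where every name is a stand-in rather than Goose's actual API:

  from dataclasses import dataclass, field
  from typing import Callable

  @dataclass
  class Action:
      kind: str                 # "tool" or "done"
      tool: str = ""
      args: dict = field(default_factory=dict)
      summary: str = ""

  def run_unattended(task: str,
                     step: Callable[[list], Action],
                     run_tool: Callable[[str, dict], str],
                     max_steps: int = 50) -> str:
      """One-shot loop: no pauses, no confirmations, one final answer."""
      history = [{"role": "user", "content": task}]
      for _ in range(max_steps):
          action = step(history)        # model proposes the next action
          if action.kind == "done":
              return action.summary     # finished without ever asking a human
          # An interactive harness would prompt "run this? [y/N]" here;
          # inside a throwaway devbox, the tool call simply executes.
          history.append({"role": "tool",
                          "content": run_tool(action.tool, action.args)})
      raise RuntimeError("step budget exhausted; escalate to a human")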

Blueprints: The Orchestration Layer

This is the most interesting design decision. Blueprints are hybrid workflows that mix deterministic code nodes with free-flowing agentic subtasks:

Slack trigger
  → [Deterministic] Clone repo, set up environment
  → [Agentic] Understand task, plan implementation
  → [Agentic] Write code
  → [Deterministic] Run configured linters (<5 sec)
  → [Deterministic] Push branch
  → [Agentic] Fix CI failures (if any)
  → [Deterministic] Push final branch
  → PR ready for review

The rationale is pragmatic: encoding small, predictable decisions deterministically "saves tokens (and CI costs) at scale and gives the agent a little less opportunity to get things wrong." Teams can also build custom blueprints for specialized needs like large-scale codebase migrations.
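
A minimal sketch of how such a hybrid pipeline could be wired, assuming a linear node list where deterministic nodes are plain functions and agentic nodes wrap an LLM subtask (none of this is Stripe's actual Blueprint API):

  from typing import Callable

  Context = dict  # shared state threaded through the pipeline

  def deterministic(fn: Callable[[Context], Context]):
      """Plain code: cheap, predictable, spends no tokens."""
      return ("deterministic", fn)

  def agentic(prompt: str, agent: Callable[[str, Context], Context]):
      """Free-form subtask: the agent decides how to satisfy the prompt."""
      return ("agentic", lambda ctx: agent(prompt, ctx))

  def run_blueprint(nodes: list, ctx: Context) -> Context:
      for _kind, fn in nodes:
          ctx = fn(ctx)  # both node kinds share one call signature
      return ctx

  # Wiring that mirrors the Slack-to-PR flow diagrammed above:
  # blueprint = [
  #     deterministic(clone_and_setup),
  #     agentic("Understand the task and plan the implementation", agent),
  #     agentic("Write the code", agent),
  #     deterministic(run_linters),
  #     deterministic(push_branch),
  # ]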

Toolshed: 500 MCP Tools

Minions connect to Toolshed, Stripe's centralized internal MCP server, which exposes nearly 500 tools spanning internal systems and third-party SaaS platforms. Different agents request task-relevant tool subsets rather than loading the full catalog.
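
Subset loading matters because every tool schema the model sees costs prompt tokens. A sketch of how the filtering might work - fetch_tools and the tag scheme are assumptions, not Toolshed's real interface:

  def fetch_tools(catalog: dict[str, dict], tags: set[str]) -> list[dict]:
      """Return only the tool schemas whose tags intersect the task's tags."""
      return [schema for schema in catalog.values()
              if tags & set(schema.get("tags", []))]

  # A flaky-test Minion might request only the CI-adjacent slice:
  # tools = fetch_tools(toolshed_catalog, {"ci", "test-history", "source-control"})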

The security model is straightforward: devboxes operate in QA environments with no access to production services, real user data, or arbitrary network egress.

Context Without Overflow

One challenge with a codebase this large is context management. Stripe's solution is directory-scoped rule files that attach automatically as the agent traverses the filesystem, rather than a single global context dump that would overflow any model's window.
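
A sketch of how directory-scoped attachment could work, assuming Cursor's convention of .mdc rule files under .cursor/rules directories (the walk itself is an illustration, not Stripe's implementation):

  from pathlib import Path

  def rules_for(file_path: Path, repo_root: Path) -> list[str]:
      """Collect rule files from the repo root down to the file's directory."""
      directories = [repo_root]
      for part in file_path.parent.relative_to(repo_root).parts:
          directories.append(directories[-1] / part)
      chunks = []
      for d in directories:  # broad rules first, local rules last
          rules_dir = d / ".cursor" / "rules"
          if rules_dir.is_dir():
              chunks.extend(r.read_text() for r in sorted(rules_dir.glob("*.mdc")))
      return chunks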

In a clever interoperability move, Stripe adopted Cursor's rule file format and synchronized it across three agent systems - Minions, Cursor, and Claude Code - so any guidance written for one works with all three. Engineers maintaining rule files get triple the return on their effort.

The CI Feedback Loop

Stripe runs a two-attempt CI policy:

  1. Minion pushes changes and the full CI suite runs against Stripe's 3+ million existing tests
  2. Auto-fixers handle trivially failing tests automatically
  3. Remaining failures go back to the agent for one retry
  4. After two CI rounds, unresolved failures require human intervention

  Component                   Detail
  -------------------------   -----------------------------------------
  Agent framework             Fork of Block's Goose
  Execution environment       AWS EC2 devboxes, 10-second provisioning
  MCP tools available         ~500 via Toolshed
  CI test suite               3+ million tests
  Max CI retry rounds         2
  PRs merged per week         1,300+
  Human-written code in PRs   0%

The two-round limit is deliberate. As Stripe's engineers put it, "there are diminishing marginal returns if an LLM is running against indefinitely many rounds of a full CI loop." Each CI run costs tokens, compute, and time. Two shots and a human handoff is the sweet spot they landed on.
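
A compact sketch of that policy as a loop - run_ci, auto_fix, and agent_fix are stand-ins for Stripe's internals:

  from typing import Callable

  def ci_loop(branch: str,
              run_ci: Callable[[str], list[str]],
              auto_fix: Callable[[str, list[str]], list[str]],
              agent_fix: Callable[[str, list[str]], None],
              max_rounds: int = 2) -> str:
      """Mirror of the policy above: two CI rounds, then hand off to a human."""
      for round_no in range(1, max_rounds + 1):
          failures = run_ci(branch)              # full suite, 3M+ tests
          if not failures:
              return "ready-for-review"
          failures = auto_fix(branch, failures)  # trivial fixes, no LLM cost
          if not failures:
              continue                           # next round re-verifies the fix
          if round_no == max_rounds:
              return "needs-human"               # diminishing returns past here
          agent_fix(branch, failures)            # one agentic retry, then re-run CI
      return "ready-for-review"                  # auto-fix cleared the final round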

Why Stripe Built In-House

Stripe did not reach for an off-the-shelf solution. The reasons are structural: hundreds of millions of lines of code, an uncommon stack (Ruby with Sorbet typing), extensive homegrown libraries that no general-purpose LLM has seen in training, and compliance requirements that come with handling global payment infrastructure.

Generic coding assistants struggle with large, mature codebases precisely because the most important context - internal APIs, team conventions, deployment constraints - is not in the training data. Stripe's approach puts that context directly into the agent's environment through rule files and MCP tools.

Where It Falls Short

The 1,300 weekly PRs are impressive, but context matters. Stripe employs thousands of engineers. This is not replacing engineering teams - it is automating the predictable, repetitive slice of work: fixing flaky tests, applying straightforward migrations, implementing well-specified features.

The two-round CI limit is telling. When tasks require complex debugging or architectural judgment, Minions bail out to humans. Stripe is explicit that every PR gets human review before merging. The agents handle execution, not decision-making.

There is also the question of transferability. Stripe's system works because of a decade of investment in standardized developer environments, comprehensive test suites, and internal tooling. The devbox infrastructure, the 3 million tests, the 500 MCP tools - that is not something a startup can replicate overnight. The lesson is less "use AI agents for coding" and more "if your developer platform is already excellent, agents can leverage it."

The Goose fork also means Stripe is maintaining its own agent runtime rather than staying on the open-source upgrade path. That is a maintenance burden that only makes sense at Stripe's scale.

What to Watch

Stripe's Minion output grew roughly 30% in under two weeks (from 1,000 to 1,300 weekly PRs), which suggests this is still the steep part of the curve. The company is framing its existing developer productivity investments - devboxes, CI infrastructure, linting daemons - as a foundation that pays dividends for both human and AI developers.

The real signal here is not the PR count. It is that a company processing a trillion dollars in payments trusts autonomous agents to write production code. The guard rails are heavy (sandboxed environments, human review, limited CI retries), but the direction is clear: unattended coding is moving from experiment to infrastructure.


The engineering details are worth reading in full. Stripe's transparency about what works and what doesn't - particularly the CI retry limits and context management trade-offs - is more useful than most "we shipped an AI agent" announcements.
