LangChain Releases Deep Agents for Long-Horizon Tasks

LangChain's open-source Deep Agents framework brings planning, subagents, and persistent context to autonomous agents tackling complex multi-step work.


TL;DR

  • LangChain shipped Deep Agents, an open-source agent harness for long-running, complex tasks
  • It's built on LangGraph and includes planning, subagents, persistent memory, and a virtual filesystem
  • The Deep Agents CLI scored ~42.65% on Terminal Bench 2.0 using Claude Sonnet 4.5, on par with Claude Code at the same model tier
  • Available now via pip install deepagents; MIT licensed

LangChain dropped Deep Agents today - a standalone open-source framework designed for the class of tasks that simple LLM tool loops consistently fail on: long-horizon, multi-step work where context overflows, artifacts accumulate, and no single call finishes anything meaningful.

The release formalizes a pattern LangChain has been building toward for months. It's positioned explicitly as an agent harness rather than an agent framework - meaning it wraps LangGraph's runtime with a set of opinionated defaults that get you a working autonomous agent out of the box, rather than requiring you to assemble components from scratch.

What Problem This Solves

The standard tool-calling loop is fine for short tasks. Ask an agent to rename a file or summarize a document - it works. Ask it to refactor a codebase, write a research report, or run a test suite and fix every failure it finds, and naive architectures fall apart. Context windows overflow, the model loses track of earlier decisions, and artifacts from subtasks have nowhere to live.

Deep Agents addresses this by pulling in four architectural choices that characterize the most capable coding agents - Claude Code being the most studied example - and making them available without tying the developer to any specific model or cloud stack.

Planning

A built-in write_todos tool gives the agent a way to decompose its goal and track progress explicitly. Instead of trying to hold the full task in memory, it writes down what it needs to do and marks items complete as it goes. This reduces context pressure and gives long-running runs a recoverable state.
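The mechanics are easy to picture. Below is a minimal, hypothetical sketch of what a `write_todos`-style planning tool maintains: the agent overwrites its whole plan on each call, and the plan lives outside the model's context so it can be re-read at any point. The class and method names here are illustrative stand-ins, not the actual deepagents API.

```python
from dataclasses import dataclass, field

@dataclass
class TodoList:
    # Each item: {"task": str, "status": "pending" | "in_progress" | "completed"}
    items: list = field(default_factory=list)

    def write_todos(self, items):
        """Tool the agent calls to replace its current plan wholesale."""
        allowed = {"pending", "in_progress", "completed"}
        assert all(i["status"] in allowed for i in items)
        self.items = items

    def remaining(self):
        # The agent re-reads this instead of holding the plan in its context window.
        return [i["task"] for i in self.items if i["status"] != "completed"]

plan = TodoList()
plan.write_todos([
    {"task": "locate failing tests", "status": "completed"},
    {"task": "fix import error", "status": "in_progress"},
    {"task": "re-run suite", "status": "pending"},
])
print(plan.remaining())
```

Because the plan is externalized state rather than prompt text, a long run that restarts can recover where it left off by reading the list back.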

Subagents

Deep Agents can spawn isolated subagents for individual subtasks using a built-in task tool. Each subagent gets its own context window, which means a parent agent can delegate parallel work without burning tokens on unrelated history. The subagent model can be configured separately - you can run a cheaper model for routine subtasks and a more capable one for the top-level planner.
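The payoff of isolation is that only the subagent's final answer flows back to the parent, not its working history. A rough sketch of that pattern, under the assumption of a generic `call_model` stand-in for any provider call (all names here are hypothetical, not deepagents internals):

```python
def call_model(model, messages):
    # Stub standing in for a real LLM provider call.
    return f"[{model}] done: {messages[-1]['content']}"

def run_subagent(subtask, model="cheap-model"):
    # The subagent starts with a *fresh* message history: the parent's
    # unrelated context never enters this call, and vice versa.
    messages = [{"role": "user", "content": subtask}]
    return call_model(model, messages)

parent_history = [{"role": "user", "content": "refactor the whole repo"}]
result = run_subagent("rename util.py helpers", model="cheap-model")
# Only the compact result re-enters the parent's context.
parent_history.append({"role": "tool", "content": result})
print(len(parent_history), result)
```

The `model` parameter is where the cost split described above happens: the parent planner and its subagents do not need to share a model.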

Virtual Filesystem

Rather than stuffing every artifact into the prompt, the framework provides a virtual filesystem backed by LangGraph's state store. Tool results, intermediate files, and long outputs get offloaded to storage instead of staying in the active context. In the 0.2 release last October, LangChain added pluggable backends - S3, local disk, or in-memory - and composite routing, so you can map /memories/ to persistent cross-thread storage while keeping working files on local disk.
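Composite routing is simple to model: pick a backend by path prefix, with a default for everything unmatched. The sketch below is an illustration of the routing idea only; the backend classes and method names are invented for this example, not the deepagents plugin interface.

```python
class DictBackend:
    """Toy storage backend; a real one might wrap S3 or local disk."""
    def __init__(self):
        self.files = {}
    def write(self, path, data):
        self.files[path] = data
    def read(self, path):
        return self.files[path]

class CompositeFS:
    def __init__(self, routes, default):
        # routes: list of (prefix, backend); longest prefix wins.
        self.routes = sorted(routes, key=lambda r: -len(r[0]))
        self.default = default
    def _backend(self, path):
        for prefix, backend in self.routes:
            if path.startswith(prefix):
                return backend
        return self.default
    def write(self, path, data):
        self._backend(path).write(path, data)
    def read(self, path):
        return self._backend(path).read(path)

persistent, scratch = DictBackend(), DictBackend()
fs = CompositeFS(routes=[("/memories/", persistent)], default=scratch)
fs.write("/memories/style_guide.md", "prefer pure functions")  # survives the thread
fs.write("/tmp/test_output.log", "3 failures")                 # working file
print(sorted(persistent.files), sorted(scratch.files))
```

The agent sees one filesystem; the routing decides what outlives the session.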

System Prompt Design

Deep Agents ships with a detailed, example-heavy system prompt modeled on Claude Code's approach. LangChain describes this as one of the key engineering choices separating agents that work reliably from agents that don't. It's customizable.

[Image: a developer typing code in a terminal interface. Source: unsplash.com] The Deep Agents CLI runs in a terminal UI built with Textual. It supports interactive and headless modes for CI/CD pipelines.

The CLI

The SDK ships with deepagents-cli, a terminal coding agent similar to Codex CLI or Claude Code. It launched in its current form on March 11 and reached version 0.4.11 by March 13. Recent releases added a token breakdown command, a /reload command for refreshing config mid-session, an --acp flag to run the agent as an ACP server, and a subagent model parameter.

The CLI runs a Textual TUI for interactive use or headless for automation. It includes web search (requires a Tavily API key), HTTP requests, shell execution with optional sandboxing, and persistent memory across sessions via LangGraph's Memory Store.

Installation is pip install deepagents or pip install deepagents-cli for the CLI only.

Benchmark Numbers

LangChain ran the Deep Agents CLI against Terminal Bench 2.0 - an 89-task benchmark covering software engineering, biology, security, and gaming scenarios. Using Claude Sonnet 4.5 as the underlying model, Deep Agents scored 42.65% (44.9% and 40.4% across two runs). LangChain describes this as on par with Claude Code running the same model.

The current Terminal Bench 2.0 leaderboard tells a different story at the top. ForgeCode paired with either GPT-5.4 or Claude Opus 4.6 sits at 81.8%. The highest Deep Agents submission on the public leaderboard - using GPT-5.2-Codex - sits at 66.5%, at rank 20. That's a meaningful gap to the leaders, though those runs also use entirely different model tiers. It's worth keeping in mind that the top of the leaderboard fills up fast with closed-source, purpose-built agents running proprietary scaffolding.

The honest summary: Deep Agents is competitive with Claude Code at matched model tiers. It doesn't lead the leaderboard, and it's not trying to.

[Image: server infrastructure and data center hardware. Source: unsplash.com] Terminal Bench 2.0 tests agents on 89 tasks ranging from chess optimization to COBOL modernization. Complex tasks can require over 100 tool calls.

How It Fits the LangChain Ecosystem

LangChain now has three distinct products for agent builders:

  • LangGraph - low-level runtime with fine-grained state control. Choose it when you need precise orchestration, custom retry logic, or complex branching.
  • LangChain - standard framework for rapid agent setup. Choose it for quick starts, RAG pipelines, and one-shot agents.
  • Deep Agents - agent harness for complex, long-running tasks. Choose it for autonomous work over extended horizons: research, coding, analysis.

The documentation includes a comparison against Claude Agent SDK and OpenAI's Codex SDK. Deep Agents is model-agnostic and provider-agnostic - it works with Anthropic, OpenAI, Google, or any provider LangChain supports. Claude Agent SDK offers first-class Claude support across Anthropic, Azure, Bedrock, and Vertex, with customizable hooks, but has no long-term memory and no observability tooling. Codex SDK has OS-level sandboxing and MCP server mode but limited deployment options. Deep Agents integrates LangSmith natively for tracing.
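In practice, model-agnosticism means the harness only ever sees a role-to-model mapping, so swapping providers - or mixing a cheap subagent model with a stronger planner - is a configuration change. A minimal sketch of that idea, using the provider-prefix convention as an assumption and placeholder model identifiers (this is not the deepagents configuration schema):

```python
# Hypothetical role -> "provider:model" mapping; identifiers are placeholders.
MODEL_CONFIG = {
    "planner": "anthropic:claude-sonnet-4-5",  # top-level agent
    "subagent": "openai:gpt-4o-mini",          # cheaper model for routine subtasks
}

def resolve(role):
    """Split a provider-prefixed model id into (provider, model)."""
    provider, _, model = MODEL_CONFIG[role].partition(":")
    return provider, model

print(resolve("planner"), resolve("subagent"))
```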

Where It Falls Short

The documentation is honest about gaps. Deep Agents has no OS-level sandbox modes - it uses a "trust the LLM" model with tool-level boundaries. That's a risk if you're running untrusted inputs in production. There's also no cloud execution option: everything runs locally or on whatever infrastructure you bring.

The benchmark numbers, while decent, also reflect that this is a general-purpose harness, not a specialized coding agent. ForgeCode and others at the top of Terminal Bench are purpose-built for terminal tasks. Deep Agents is built to be applied anywhere.

For comparisons to existing open-source agent work, it's worth looking at Nous Research's Hermes agent, which takes a different architectural approach to persistent memory. Princeton's OpenClaw-RL framework - covered earlier this month - addresses a different problem: continuous training from conversation feedback rather than task completion.

The best AI agent frameworks article has a broader comparison if you're trying to figure out where Deep Agents fits in the current landscape.


Deep Agents is MIT-licensed and available on GitHub now. The LangChain Academy is offering a Deep Agents curriculum with the release.

Sources:

  • LangChain Releases Deep Agents for Long-Horizon Tasks
About the author

Sophie, AI Infrastructure & Open Source Reporter, is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.