JBDistill Generates Its Own Jailbreaks - 81.8% Attack Rate

Johns Hopkins and Microsoft's JBDistill achieves 81.8% attack success rate across 13 LLMs by auto-generating fresh adversarial prompts on demand.


Every time a major lab ships a new model, safety benchmarks built on the last generation quietly become unreliable. The adversarial prompts that tripped GPT-4 don't work on GPT-4o. Attacks that exposed Llama 2 were patched away before Llama 3 even launched. A research team from Johns Hopkins University and Microsoft has now published a framework that treats this not as a baked-in limitation but as a pipeline problem - and the numbers are difficult to dismiss.

JBDistill, published in Findings of EMNLP 2025 and drawing wider attention this week, doesn't curate a static library of known jailbreaks. It creates a fresh one. Run a battery of existing attack algorithms against cheap open-source development models, collect the outputs, select the prompts that transfer best, then assess against your actual target. On 13 evaluation LLMs spanning recent chat models, reasoning systems, and specialized coding and healthcare variants, JBDistill's best configuration hit an 81.8% attack success rate.

| Benchmark | Approach | Attack Success Rate |
| --- | --- | --- |
| JBDistill (RankBySuccess) | Auto-created, renewable | 81.8% |
| WildJailbreaks | Static, curated collection | 63.2% |
| Best individual dynamic attack | Per-model, static | 51.3% - 69.9% |
| CoSafe | Static | 32.5% |
| DAN prompts | Static | 27.4% |
| HarmBench | Static | 18.4% |

How It Works

The framework runs in three stages. The design is intentionally wasteful - and that's the point.

Stage 1 - Generating the candidate pool

JBDistill starts by running eight off-the-shelf attack algorithms against four open-source development models: Llama 2-7B-Chat, Llama 3.1-8B-Instruct, Gemma 2-9B-IT, and OLMo2-7B-Instruct. Four of those algorithms operate in a single turn (Tree of Attacks with Pruning, Persuasive Adversarial Prompts, AutoDAN-Turbo, and Adversarial Reasoning), and four operate across multiple turns (ActorAttack, Red Queen, Context Compliance Attack, Speak Easy). The seed goals come from HarmBench: 200 goals across seven categories of harmful content.

The output is a large pool of adversarial prompts, most of which won't be useful. Over-generation is the whole strategy.
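Stage 1 can be pictured as a simple cross product. The sketch below is illustrative, not the paper's actual code: the attack and model names mirror the setup described above, while `run_attack` is a hypothetical callable wrapping one attack algorithm against one development model.

```python
from itertools import product

# Attack algorithms and development models named in the paper (labels only;
# the real implementations live in each attack's own codebase).
ATTACKS = ["TAP", "PAP", "AutoDAN-Turbo", "AdvReasoning",   # single-turn
           "ActorAttack", "RedQueen", "CCA", "SpeakEasy"]   # multi-turn
DEV_MODELS = ["llama2-7b-chat", "llama3.1-8b-instruct",
              "gemma2-9b-it", "olmo2-7b-instruct"]

def generate_candidate_pool(seed_goals, run_attack):
    """Over-generate: one candidate prompt per (attack, dev model, goal)."""
    return [
        {"goal": goal, "attack": attack, "dev_model": model,
         "prompt": run_attack(attack, model, goal)}
        for attack, model, goal in product(ATTACKS, DEV_MODELS, seed_goals)
    ]
```

With the paper's 200 seed goals, this yields on the order of 8 x 4 x 200 candidates before any filtering, which is exactly the over-generation the authors advocate.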

Stage 2 - Selecting what transfers

Four selection algorithms filter that pool down to a working subset: RandomSelection as a baseline, RankBySuccess (rank prompts by observed success on the development models), BestPerGoal (pick the single best-performing prompt per seed goal), and CombinedSelection as a hybrid. In the paper's evaluations, RankBySuccess consistently beat the others.
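Two of those selectors are easy to sketch. The code below is an assumption-laden reading of the paper's descriptions, not its released implementation: it assumes each candidate carries its prompt and seed goal, and that `dev_success` maps a prompt to its per-development-model success flags.

```python
def rank_by_success(pool, dev_success, k):
    """RankBySuccess (sketch): sort candidates by how many development
    models each prompt jailbroke, then keep the top k."""
    return sorted(pool, key=lambda c: sum(dev_success[c["prompt"]]),
                  reverse=True)[:k]

def best_per_goal(pool, dev_success):
    """BestPerGoal (sketch): keep the single highest-scoring prompt
    for each seed goal."""
    best = {}
    for cand in pool:
        score = sum(dev_success[cand["prompt"]])
        if cand["goal"] not in best or score > best[cand["goal"]][0]:
            best[cand["goal"]] = (score, cand)
    return [cand for _, cand in best.values()]
```

The key design difference: RankBySuccess optimizes raw transfer strength and may drop entire goals, while BestPerGoal guarantees coverage of every seed goal at the cost of admitting weaker prompts.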

Stage 3 - Benchmark evaluation

The selected subset becomes a reproducible benchmark you can run against any target model. Because stages 1 and 2 can be repeated any time you need to evaluate a new model generation, the benchmark doesn't go stale. That's where the "renewable" framing comes from.
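The final evaluation step reduces to a single metric. A minimal sketch, assuming hypothetical `target_model` and `judge` callables (the paper uses an LLM judge to flag harmful completions):

```python
def attack_success_rate(benchmark, target_model, judge):
    """Stage 3 (sketch): run each selected prompt against the target and
    let a judge flag harmful completions. ASR is the harmful fraction."""
    hits = sum(bool(judge(p, target_model(p))) for p in benchmark)
    return hits / len(benchmark)
```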

Figure: JBDistill framework overview from the research paper. The pipeline produces a large candidate pool from cheap development models, selects the highest-transferring prompts, and assesses them against held-out target LLMs. Source: arxiv.org

Why Static Benchmarks Keep Failing

The standard approach to safety benchmarking is to build a fixed dataset of harmful prompts, run the model against it, and report a refusal rate. The problem is straightforward once you say it: the benchmark leaks.

Curated jailbreak datasets end up in public repositories. From there they get scraped into training runs. Within months, the benchmark you built to measure safety is teaching models which prompts to refuse. Labs are aware of this cycle - it's part of why several maintain internal red-team tooling outside the public commons.

JBDistill's answer isn't to lock the benchmark behind a wall. It's to make regeneration cheap enough that you don't need to.

"While previous work mainly focused on generating more transferable attack prompts, we demonstrate that over-generating attack prompts and then selecting a highly effective subset of them is a simple and effective method."

Lead author Jingyu Zhang, a Johns Hopkins PhD candidate who built the framework during a Microsoft internship, also highlighted a scaling property: adding more development models and attack algorithms to the candidate pool consistently improved benchmark strength. The researchers don't have data showing a ceiling.

Figure 2 from the paper: attack success rate comparisons across benchmark methods and model families. RankBySuccess and CombinedSelection consistently beat static benchmarks across both single-turn and multi-turn settings. Source: arxiv.org

The press coverage this week isn't coincidental. Safety benchmarking is under pressure from multiple directions at once. OBLITERATUS demonstrated earlier this month that safety fine-tuning can be stripped from open models in a matter of minutes. Research published in February found AI models generating jailbreaks against other AI models autonomously, with 97% success rates across tested systems. JBDistill doesn't address those vectors directly. What it offers instead is something the field has struggled to maintain: a way to test production models against adversarial prompts that haven't already been optimized away.

What It Does Not Tell You

The paper is technically careful, and that carefulness also exposes the gaps.

English only. The current framework handles English text. Multilingual attack transfer, image-based jailbreaks, and audio or video modalities are listed as future work. Some of the most consistent jailbreak evasion in the wild uses non-English phrasing that slips past English-trained safety classifiers - JBDistill doesn't touch that.

A narrow seed set. Two hundred harmful goals across seven categories is enough to demonstrate proof of concept. Real-world adversarial evaluation covers much longer tails: nuanced social engineering, jurisdiction-specific harms, subtly misleading outputs that don't look harmful to a judge model. The 81.8% figure was measured on a constrained problem definition.

Development models are all small and open. All four models used to produce the candidate pool are sub-10B open-source systems. Attacks that transfer from Llama 2-7B may or may not generalize to architectures with very different training pipelines. The researchers acknowledge this but don't have controlled data on the gap.

No severity grading. JBDistill measures whether a target model produces output that a judge model rates as harmful. It doesn't assess real-world severity or practical exploitability. A benchmark can achieve high attack success rate against outputs that would never cause material damage, and still look like a strong result.

The GitHub release at microsoft/jailbreak-distillation includes the benchmark dataset in HarmBench format. Full pipeline code is listed as pending.


The 81.8% figure warrants real attention. But the more important question is whether labs treat this as another static artifact to patch against - in which case the benchmark will decay like every one before it - or actually run the regeneration workflow as part of their release process. The framework exists. Whether the incentive structure exists to use it correctly is a different problem entirely.

About the author: Elena, Senior AI Editor and Investigative Journalist, is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.