We pointed Harkonnen — my self-learning agentic orchestrator — at Hack The Box's new MCP pilot program, and it earned a top-25 finish out of ~300 autonomous AI agents, clearing 36 of 37 challenges (97%) across every category on the board.

This wasn't a human on a keyboard with an AI sidekick. The MCP pilot runs in a strict agent-only mode: every action (discovering challenges, spawning targets, pulling files, exploiting, and submitting flags) happens through the Model Context Protocol, with no human hands on the target. The entire leaderboard is AI agents competing head-to-head. Harkonnen showed up and finished in the top 25 of that field.

The Scoreboard

Competing as [AI] ai-agent-of-jakemagi — see the live leaderboard.

  • Placement: #25 of 279 teams (301 agents), top ~8%
  • Challenges solved: 36 / 37 (97%)
  • Score: 30,125 points
  • Categories with solves: Pwn · Web · Crypto · Reversing · Forensics · Hardware · ICS · Blockchain
  • Difficulty range: Sanity-check to Hard (solves at every tier)

Harkonnen captured the overwhelming majority of flags in a single sustained run, moving from category to category without retooling between domains — binary exploitation one moment, cryptographic forgery the next, hardware signal analysis after that.

A Few Highlights (No Spoilers)

The breadth is the story. A self-learning orchestrator that's equally comfortable across the whole offensive-security stack is rare. A handful of moments that stood out:

  • Hardware / signal analysis. One challenge shipped a logic-analyzer capture in a proprietary, undocumented binary format with no public parser. Rather than burn hours reverse-engineering the container, Harkonnen recognized the dead end, stood up the vendor's own tooling headless, and let the right tool do the decoding, then read the flag straight out of the recovered signal. Knowing when not to brute-force a problem is a hallmark of the self-learning loop.
  • Embedded / firmware. A router-firmware target was re-hosted in full emulation, booted, and driven under a live debugger — a complete embedded lab spun up on demand to analyze the target exactly as it runs in production.
  • Cryptography. A hard challenge required a rogue-key signature forgery against a BLS aggregate scheme, including recovering predictable randomness from a leaked state. Harkonnen worked the math, adapted to a live protocol that differed from the shipped source, and forged its way through.
  • Web chains. Multi-stage exploitation — SSRF pivots, request smuggling, and a software supply-chain takeover — was chained end-to-end to reach internal services and code execution.
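
For readers unfamiliar with the rogue-key idea mentioned in the cryptography highlight, here is a minimal toy sketch of its key-cancellation step. It is not the challenge's scheme and not Harkonnen's solution: it assumes a plain multiplicative group modulo a prime instead of a pairing-friendly curve, so it shows only why the aggregate key ends up under the attacker's control, not a full BLS forgery. The names `keygen`, `rogue_key`, and `aggregate` are hypothetical.

```python
import secrets

P = 2**127 - 1  # toy prime modulus (assumption); real BLS uses elliptic-curve groups
G = 3           # toy base element (assumption)


def keygen():
    """Return a (secret, public) pair with public = G^secret mod P."""
    sk = secrets.randbelow(P - 2) + 1
    return sk, pow(G, sk, P)


def rogue_key(victim_pk: int, x: int) -> int:
    """Return G^x * victim_pk^-1 so the victim's key cancels on aggregation."""
    return (pow(G, x, P) * pow(victim_pk, -1, P)) % P


def aggregate(*pks: int) -> int:
    """Aggregate public keys by multiplying them in the group."""
    out = 1
    for pk in pks:
        out = (out * pk) % P
    return out


_, victim_pk = keygen()            # the attacker never learns the victim's secret
x, _ = keygen()                    # attacker-chosen secret
attacker_pk = rogue_key(victim_pk, x)
apk = aggregate(victim_pk, attacker_pk)

# The aggregate key equals G^x, so the attacker alone knows its secret exponent.
assert apk == pow(G, x, P)
```

Deployed aggregate schemes typically counter this by requiring each signer to prove possession of the secret key behind the public key they register.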

A recurring theme: when an assumption turned out to be wrong, the orchestrator re-tested it rather than trusting a stale verdict — several "blocked" challenges fell once Harkonnen re-examined them from a fresh angle. That self-correction is exactly what we built it for.
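
One way to picture that re-testing behavior is a verdict that goes stale once new evidence arrives. The following is a minimal sketch under that assumption, not Harkonnen's actual code; `Verdict`, `Challenge`, and `should_retry` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Verdict:
    """A conclusion plus how much evidence existed when it was reached."""
    blocked: bool
    evidence_seen: int


@dataclass
class Challenge:
    name: str
    observations: list = field(default_factory=list)
    verdict: Optional[Verdict] = None

    def observe(self, note: str) -> None:
        """Record a new piece of evidence about the target."""
        self.observations.append(note)

    def judge(self, blocked: bool) -> None:
        """Store a verdict stamped with the current evidence count."""
        self.verdict = Verdict(blocked, len(self.observations))


def should_retry(ch: Challenge) -> bool:
    """Retry if unjudged, or if a 'blocked' verdict predates newer evidence."""
    if ch.verdict is None:
        return True
    stale = len(ch.observations) > ch.verdict.evidence_seen
    return ch.verdict.blocked and stale


ch = Challenge("demo-target")
ch.observe("service rejects oversized input")
ch.judge(blocked=True)
print(should_retry(ch))   # verdict is current, so no retry yet
ch.observe("live protocol differs from shipped source")
print(should_retry(ch))   # new evidence makes the old verdict stale
```

The design choice in this sketch is that a "blocked" label is never final: it only holds for the evidence it was based on.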

What It Means

The MCP pilot is a clean, like-for-like arena: same targets, same access, autonomous agents only. A top-25 placement in a field of ~300, with a near-complete sweep across every discipline, is a strong, independent signal that Harkonnen produces real results, not demo-grade ones.

The single remaining challenge is a hard binary-exploitation target that comes down to bespoke heap exploitation. We've fully mapped it and documented the path; it's a depth problem, not a breadth one — and a fitting next milestone.

The Engineering Behind It

Building Harkonnen is my work as an AI engineer, and this event is a useful window into the kinds of systems I design and ship:

  • Autonomous agent orchestration. I architect long-horizon agents that take an open-ended objective, decompose it into steps, execute those steps against real systems, and drive the work to completion with no human in the loop: durable multi-stage workflows, not one-shot prompts.
  • Tool-use and MCP integration. The entire competition ran over the Model Context Protocol — my agent discovers the available tools at runtime and drives external systems (containers, debuggers, network services) through structured tool-calls as first-class actions, not text suggestions.
  • Grounded reasoning and self-correction. The system recovers from dead ends and doesn't get stuck on a wrong assumption: it re-examines what it earlier judged impossible rather than trusting a stale conclusion. Several challenges fell only because it went back and reconsidered.
  • Agent-provisioned environments. When a task needed a capability the model didn't have out of the box, the agent stood up its own tooling on demand — emulation, live debugging, headless automation of third-party software — and operated it autonomously.
  • Generalization across domains. A single system handled binary exploitation, cryptography, web, and hardware without per-domain hand-holding — the real test of a general agent rather than a collection of narrow scripts.
  • Memory and knowledge capture. Techniques it discovers carry forward to later work, so each engagement compounds into the next instead of starting cold.
  • Evaluation and benchmarking. I measure capability against objective, like-for-like public arenas — same targets, autonomous-only — rather than cherry-picked demos.
  • Responsible scope. Everything runs strictly against sanctioned, authorized targets.

In stack terms, this lives at the intersection of LLM agent design, tool/runtime integration, systems and security engineering, and evaluation — the full loop from model behavior to verified real-world action.
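
To make the runtime tool discovery described above concrete, here is a minimal sketch of the message shapes involved. MCP frames requests as JSON-RPC 2.0 and exposes `tools/list` and `tools/call` methods; everything else here is an assumption for illustration: the `submit_flag` tool, its schema, and the helper names `rpc`, `find_tool`, and `call_tool` are hypothetical and do not describe Hack The Box's actual server or Harkonnen's internals.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # monotonically increasing JSON-RPC request ids


def rpc(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


def find_tool(list_result: dict, name: str) -> Optional[dict]:
    """Look up a tool by name in a tools/list result."""
    for tool in list_result.get("tools", []):
        if tool.get("name") == name:
            return tool
    return None


def call_tool(tool: dict, arguments: dict) -> dict:
    """Build a tools/call request, refusing calls that omit required arguments."""
    required = tool.get("inputSchema", {}).get("required", [])
    missing = [key for key in required if key not in arguments]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return rpc("tools/call", {"name": tool["name"], "arguments": arguments})


# Hypothetical tools/list result; the tool name and schema are made up.
discovered = {
    "tools": [{
        "name": "submit_flag",
        "description": "Submit a flag for a challenge",
        "inputSchema": {
            "type": "object",
            "properties": {
                "challenge_id": {"type": "integer"},
                "flag": {"type": "string"},
            },
            "required": ["challenge_id", "flag"],
        },
    }]
}

discover = rpc("tools/list")                 # what the agent sends to discover tools
tool = find_tool(discovered, "submit_flag")  # pick a tool from the server's answer
request = call_tool(tool, {"challenge_id": 1, "flag": "HTB{example}"})
print(json.dumps(request, indent=2))
```

Checking arguments against the advertised schema before sending is one simple way to keep tool calls structured rather than free-form text; a real client would also handle responses, errors, and transport.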

Where We're Headed

Capture-the-flag is a proving ground, not the destination. The same orchestration that swept this board is what we're taking into live bug-bounty programs and real engagements — finding and responsibly reporting genuine vulnerabilities at machine speed. The HTB MCP pilot was a benchmark. The real targets are next.

Verify my results. Don't take my word for it — see Harkonnen's standing live on the Hack The Box MCP pilot leaderboard, competing as [AI] ai-agent-of-jakemagi.

Harkonnen is a self-learning agentic orchestrator for offensive security, competing in the HTB MCP pilot as [AI] ai-agent-of-jakemagi. Results above are from Hack The Box's authorized MCP pilot program (event "MCP TryOut"), where all activity is conducted against sanctioned, isolated practice targets.