AI Agents for Pen Testing and Cybersecurity

Posted by Tom

Over the weekend, I read an interesting paper[1] by researchers at Stanford, Carnegie Mellon, and Gray Swan AI about an AI agent they built that placed second in a capture-the-flag competition. It got me thinking about AI tooling for pen testing, and I learned about a neat tool called Strix[2]. Strix provides autonomous AI agents that act like real hackers - they run your code dynamically, find vulnerabilities, and validate them with actual proofs-of-concept. Strix works with many models, including Claude.

I’ve been using Claude Code for a little while and have really enjoyed my time with it building applications, troubleshooting code, and using it as a sounding board for the new curriculum I plan to use this semester. As I dove into Strix with Claude for pen testing, I found it extremely capable. I had it run a pen test against one of my applications, and it returned a 65-page write-up of findings, one of which had a CVSS score of 9.8 (oof!). Strix validated the findings, walked me through the methodology, and recommended fixes. All I did was give it the URL of my application, access to my GitHub repo, and login credentials; Strix did the rest. Strix interacts with the models it supports through their APIs, which can get expensive during deep analysis. I wanted a way to use my existing Claude Code Max subscription instead of paying for the more expensive API usage.

I paired with Claude to modify the Strix code base to work with Claude Code in a local environment. I exposed the Strix agents via an MCP server so the Strix tooling would be available to Claude Code under the subscription we already had. I also wanted the tests to keep running after my session on my EC2 instance ended, so we built the app[3] to run each test inside a screen session we could detach from and come back to later. Finally, I needed a good way to manage concurrent scans: which scans were running, an easy way to attach to one and view its status, and a way to clean them up quickly once they finished. So we built a terminal UI wrapper around the app that runs the tests. We can now run pen tests using our Claude Code subscription on remote Linux servers, and they persist between sessions.
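To give a feel for the screen-session side of this, here is a minimal sketch of the pattern. It is not the actual app or its TUI; it assumes each scan is launched in a detached GNU screen session named with a `strix-` prefix (the prefix and session layout are my assumptions for illustration, not anything Strix defines).

```shell
#!/bin/sh
# Sketch: manage concurrent pen-test scans running in GNU screen sessions.
# Assumes each scan was started detached with a recognizable name, e.g.:
#   screen -dmS strix-myapp <scan command>

list_scans() {
    # `screen -ls` prints lines like "  12345.strix-myapp  (Detached)";
    # keep only the pid.name tokens whose name starts with "strix-".
    screen -ls 2>/dev/null | grep -o '[0-9][0-9]*\.strix-[^[:space:]]*' || true
}

attach_scan() {
    # Reattach to a running scan to watch its status (Ctrl-a d to detach again).
    screen -r "strix-$1"
}

clean_scans() {
    # Tell every remaining scan session to quit, in one pass.
    for s in $(list_scans); do
        screen -S "$s" -X quit
    done
}
```

Because the sessions live in the screen server rather than the SSH session, the scans survive a dropped connection; the wrapper only has to enumerate, attach to, and kill sessions by name.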

If you have not looked at AI for pen testing, I think you should. I don’t think it will replace the pros anytime soon, but it makes a pretty good sidekick for a developer.
