Senior Software Engineer, Agentic Systems

Horizon3 AI

$169,000 - $208,000 • Remote

About The Position

Horizon3.ai is seeking a Senior Software Engineer for its Agentic Systems team. This role focuses on building an autonomous, black-box web application penetration tester that mimics skilled human pentesters. The engineer will translate offensive expertise into autonomous agent capabilities, including reasoning, orchestration, tooling, and evaluation. The position emphasizes engineering reliability and production safety, using LLMs as a tool rather than the primary focus. The team is composed of people with deep offensive expertise and needs an engineer to build the system that allows an LLM-driven agent to perform penetration testing reliably, at scale, and unattended. The engineer will own and evolve the attack-agent layer, which decides what to probe, forms and tests hypotheses, exploits vulnerabilities, and verifies findings without false positives or unintended system impact.

Requirements

5+ years building production software, with strong Python.
Hands-on experience building LLM-powered applications or agents: tool use / function calling, structured outputs, multi-step orchestration, and the glue that makes it all hold together.
A track record of making LLMs reliable in production: you've wrestled with nondeterminism, designed around model limitations, and shipped something that worked when it mattered.
Real experience with evaluation: you've built or owned the harness that tells you whether a model or agent change is an improvement, not just a vibe.
Strong instincts for prompt and context engineering, and the judgment to keep the model's job small and well-scoped.
Solid software fundamentals — testing, observability, and the discipline to keep a complex agent debuggable.
Ownership mentality: comfortable owning a critical, fast-moving subsystem end to end.

Nice To Haves

Working knowledge of web application security (broken access control, IDOR/BOLA, SQLi, XSS, SSRF, SSTI), enough to collaborate fluently with offensive engineers.
Experience building eval harnesses or benchmarks specifically for agents (synthetic environments, CVE-based test targets, capture-the-flag-style scoring).
Experience with agent frameworks, and strong opinions about when not to reach for one.
Familiarity with graph data models (e.g., Neo4j) for representing application state and attack context.
You've shipped an autonomous agent that did real, valuable work unattended in production, and you have scar tissue from making it trustworthy.
You've designed evaluation systems that actually drove improvement: you closed the loop between "we changed something" and "it measurably got better."
You pair an offensive-security mindset (CTF, bug bounty, pentesting, or research background) with the engineering chops to turn that intuition into a reliable system.
You have hands-on experience with agent fine-tuning or RL (SFT, GRPO, reward design for tool-using agents) and a grounded view of when it's worth it versus improving the harness.
You've published or spoken on agent reliability, evaluation, or autonomous security tooling.

Responsibilities

Build and evolve the agent harness and orchestration that turns an LLM into a reliable autonomous pentester: the loop that reasons over an application, forms attack hypotheses, acts, and verifies results.
Design the tools and tool-shaped feedback the agent uses to probe and exploit, and the structured-output and validation layers that keep it reliable (e.g., hook-enforced mandatory validation, schema-constrained outputs).
Translate the team's offensive expertise into repeatable agent capabilities — partnering directly with our attackers to encode how they think into something the agent can do consistently.
Own and grow our evaluation infrastructure: benchmark suites, a failure-mode taxonomy across the pipeline (discovery → hypothesis → exploitation → verification), and regression detection, so we actually know whether the agent is getting better.
Manage LLM inference in production: model selection, prompt and context engineering, and keeping cost and latency under control (we run on AWS Bedrock with centralized cost tracking).
Hold the line on production safety and no false positives: every finding the agent reports has to be real and reproducible.