Principal Engineer, AI Platform

Epic Games • Cary, NC

2d • Remote

About The Position

Epic Games is seeking a Principal Engineer for its AI Platform team. The role architects and builds production systems from the ground up, focusing on an enterprise-grade stack of agentic AI systems that will automate engineering workflows, accelerate developer productivity, and enable new forms of collaboration across Epic's teams. This is foundational infrastructure that will define AI usage at Epic for the next decade, operating at massive scale with significant architectural impact for every engineer on the team. The Principal Engineer will own the technical direction of the agent infrastructure stack end-to-end: setting architecture, driving alignment, and solving complex distributed systems and security problems. This is a hands-on role that involves writing production code, designing protocols, and being accountable for system reliability.

Requirements

12+ years of software engineering experience, with at least 4 years at staff or principal scope
Deep expertise in distributed systems: event-driven architectures, durable execution, service mesh, and multi-tenant platform design
Production experience with authentication and authorization infrastructure — OAuth 2.0, OIDC, SPIFFE/SPIRE or equivalent workload identity, token exchange (RFC 8693), and policy engines (OPA, OpenFGA, or comparable)
Strong security engineering fundamentals: credential vaulting, secrets management (OpenBao/Vault), audit trail design, and least-privilege access at scale
Fluency in at least one compiled, systems-capable language (Go preferred, Rust or C++ acceptable); comfort reading and writing Go microservices is essential given the stack
Track record of owning multi-service platform architecture across a full product lifecycle — from design through sustained production operation
Exceptional written communication: design documents and architecture reviews that are clear, precise, and influence without authority
Hands-on experience building LLM-integrated systems: agent orchestration, tool-use frameworks, MCP (Model Context Protocol), or equivalent agent-to-tool middleware
Experience with plugin or extension runtime design — WASM sandboxing, gRPC sidecar patterns, subprocess isolation, or comparable capability security models
Familiarity with knowledge graph systems (Neo4j or comparable), vector databases, and hybrid retrieval (semantic + keyword + graph)
Experience operating Kubernetes-based platforms: scheduling, workload identity, sidecar injection, and multi-tenancy isolation

Responsibilities

Own the end-to-end technical architecture across Geppetto, EMA, Hodor, Multipass, Vektor, and Roost — ensuring each platform is coherent with the others and that the integration seams are well-defined
Drive architectural decisions for agent identity and workload authorization (SPIFFE/SPIRE, OIDC, token exchange, policy planes), translating security requirements into implementable designs
Establish the patterns for how AI agents authenticate, receive credentials, execute tools, and are audited — and hold the bar for correctness across the stack
Lead design reviews for new capabilities, evaluate build vs. buy decisions, and surface technical risk before it becomes production risk
Design and implement the Cluster API and provider abstractions for EMA — the layer that orchestrators depend on to launch, drive, and recover headless agent runs across Kubernetes, EC2, and other compute backends
Evolve Hodor's plugin runtime (WASM, gRPC sidecar, subprocess multiplexer) and its gateway security posture as external tool surface area grows
Architect Vektor's knowledge graph, vector search, and memory consolidation pipeline for org-wide scale — across teams, scopes, and retention horizons
Define durability, consistency, and isolation requirements across event-driven architectures (NATS JetStream, Redis) shared by multiple agent platforms
Lead the Multipass proposal from strategy into staffed execution — defining the separation of Hodor (human/tool) and Multipass (agent identity) and migrating the existing credential vault
Hold the standard for credential security across the stack: AES-256-GCM vault, AAD binding, scope isolation, default-deny policy, and audit completeness
Work with Epic's security organization to ensure agent-to-service trust models meet enterprise standards
Partner with product, ML, and enterprise platform teams to shape how agent capabilities are exposed to Epic's broader engineering organization
Mentor senior and staff engineers across the team; conduct technical interviews and raise the hiring bar
Write design documents that become the reference architecture for future work, not just approvals for current work

Benefits

Medical insurance
Dental insurance
Vision HRA
Long Term Disability
Life Insurance
401k with competitive match
Robust mental well-being program through Modern Health (free therapy and coaching for employees & dependents)
Events and company-wide paid breaks
Unlimited PTO and sick time
Paid sabbatical after 7 years of employment