Senior Backend Engineer, AI Platform

Epic Games • Cary, NC

Remote

About The Position

Epic Games is building an enterprise-grade stack of agentic AI systems to automate engineering workflows, accelerate developer productivity, and enable new collaboration across Epic's teams. This involves building production systems from the ground up across six interconnected platforms: Geppetto (team AI agents in Slack), EMA (compute and workspace infrastructure for agent harness runs), Hodor (OAuth gateway, plugin runtime, and governance layer), Multipass (agent identity, credential vault, and authorization), Vektor (org-wide memory plane with knowledge graph), and Roost (software distribution and plugin marketplace). This is foundational work with real production usage and consequences, touching every corner of Epic's engineering organization.

Requirements

7+ years of software engineering experience, with a track record of owning and shipping complex backend systems.
Strong distributed systems fundamentals: service design, event-driven architecture, failure handling, and multi-tenant isolation.
Solid understanding of authentication and authorization — OAuth 2.0, OIDC, RBAC — and the ability to implement them correctly in production services.
Experience with security fundamentals in code: secrets handling, credential storage, least-privilege access patterns, and audit logging.
Production Go experience, or strong fluency in a comparable systems language with demonstrated ability to ramp quickly.
Ability to take a loosely-specified problem, write a design doc, get alignment, and execute — without requiring step-by-step direction.
Clear written communication: design docs and code reviews that are precise, readable, and actionable.

Nice To Haves

Hands-on experience building LLM-integrated systems: tool use, agent orchestration, MCP server or client development, or streaming LLM output pipelines.
Experience with plugin or extension runtime patterns — WASM sandboxing, gRPC sidecar, subprocess management, or capability-based security models.
Familiarity with knowledge graph or vector database systems and hybrid search (semantic + keyword + graph).
Kubernetes experience: pod lifecycle, sidecar injection, workload identity, or multi-tenant namespace isolation.
Exposure to software signing or secure distribution: TUF, Sigstore, KMS-backed pipelines, or artifact integrity verification.
Experience in the games industry or with high-concurrency, low-latency consumer-facing backend platforms.
Familiarity with Anthropic Claude APIs, Claude Code, or the Model Context Protocol.
Background contributing to developer tooling or internal platforms used by large engineering organizations.

Responsibilities

Own the design and implementation of major features and subsystems across the AI Platform stack, including Hodor's plugin runtimes and credential manager, and Geppetto's agent-service LLM dispatch and session lifecycle.
Build and harden EMA's worker layer: workspace materialization, harness lifecycle management, normalized event streaming, and mid-run input handling across compute backends.
Implement production-grade components for Vektor's memory pipeline: ingestion workers, knowledge graph writes, semantic search, and nightly consolidation jobs.
Contribute to Roost's publish and consume pipelines: TUF signing, artifact storage, marketplace generation, and plugin signature verification.
Implement credential manager components with rigor: AES-256-GCM encryption, AAD binding, scope isolation, and audit trail completeness.
Write services that operate correctly under failure: circuit breakers, rate limiters, DLQ handling, and idempotent replay patterns.
Contribute to Multipass as it moves from strategy to implementation (workload identity, token broker, and policy plane), under the guidance of the principal engineer.
Participate in on-call and incident response for Hodor and other production systems, building operational intuition alongside engineering depth.
Write design documents for the features you own — clear enough for async review, precise enough to serve as the implementation spec.
Collaborate across the team on cross-cutting concerns: NATS JetStream event bus patterns, multi-tenant isolation, RBAC enforcement, and observability.
Review code from peers with the goal of raising quality and spreading knowledge, not just catching bugs.
Surface architectural concerns early and engage constructively with the principal engineer and team lead when your implementation work reveals design gaps.