Engineer, Production Engineering

Guild.ai • San Francisco, CA

Hybrid

About The Position

We are building the control plane for AI agents in teams and companies. As a Production Engineer, you will own the infrastructure, security, and compliance systems that allow our platform to ship fast and run reliably at scale. This is not a traditional ops role: you will write real code, contribute directly to the product, and own the full security and compliance surface of an early-stage company. You'll work across Kubernetes infrastructure, cloud delivery, agent sandboxing, SOC2 compliance, IT systems, and production observability, while also building security-sensitive features and auditing application code for vulnerabilities. If you want to own the production backbone for the agent-native era, from a Terraform module to a pentest to an API key implementation, we want to talk.

Requirements

5+ years in Production Engineering, Platform Engineering, or a security-focused infrastructure role, ideally at a fast-growing startup or SaaS company.
Strong hands-on experience with Kubernetes and GCP in production; comfortable with Terraform for managing real infrastructure.
Strong programming skills (Python, Go, TypeScript, etc.) with a passion for automating away toil.
Hands-on experience with compliance frameworks (SOC2), vulnerability management, and secure system design.

Nice To Haves

Background with multi-tenant SaaS or enterprise security and procurement requirements.
Exposure to AI/ML infrastructure, particularly agent runtimes.
Experience building security-sensitive product features alongside infrastructure work.
Experience supporting pentests and bug bounties.
Experience deploying and operating in customer VPCs or other external cloud environments across AWS, Azure, and/or GCP — navigating enterprise networking, security, and access constraints.

Responsibilities

Manage and evolve our production and staging infrastructure on GCP (GKE) using Terraform. Own DNS, networking, and environment configuration end-to-end.
Deploy and operate within customer VPCs across AWS, Azure, and GCP — adapting to varied infrastructure constraints, security requirements, and enterprise networking configurations.
Build and maintain Kubernetes-based sandboxing for agent execution, ensuring agents operate within strict network boundaries and route their traffic through our API gateway rather than having unfettered internet access.
Own our observability stack, including OpenTelemetry instrumentation and integrations with New Relic and Splunk, to give the team deep visibility into system performance and agent runtime behavior.
Lead infrastructure and operational work to support SOC2 compliance, including audit preparation, evidence collection, and control implementation.
Manage our HackerOne engagement — coordinating pentests, triaging incoming bug bounty reports, and driving remediation.
Audit application code for security vulnerabilities, contribute security-sensitive product features (e.g., API key management), and ensure product and infrastructure security are coherent end-to-end.
Own our IT stack — Okta, device management, and access controls — keeping the company secure as we scale.
Design and maintain safe, automated CI/CD workflows supporting rollout strategies like canary and blue-green deployments.
Make shipping to production a routine, boring, highly automated non-event.