DevOps Engineer

Luminary Cloud • San Mateo, CA

About The Position

We’re building out a cloud platform team and looking for a Senior DevOps Engineer to own the developer infrastructure that powers our products. You will be responsible for how we deploy, scale, observe, and secure systems across GCP and AWS, with Kubernetes at the core. This isn’t a ticket-queue role. You’ll work directly with engineers building services in Go and TypeScript, researchers training PyTorch models, and leadership defining the roadmap. You’ll have real ownership and the latitude to build things the right way from the start.

Requirements

5–8 years of experience in DevOps, SRE, or platform engineering roles
Production Kubernetes experience — cluster management, not just deploying workloads
Hands-on experience with GCP or AWS; solid conceptual understanding of both
End-to-end ownership of CI/CD pipelines and GitOps workflows
Proficiency in Go or Python for writing infrastructure tooling and automation
Infrastructure as Code expertise with Terraform or Pulumi
Experience with observability stacks: Prometheus, Grafana, and a log aggregation platform
Strong grasp of cloud security fundamentals: IAM, secrets management, network policies

Nice To Haves

Experience supporting ML training infrastructure, GPU node pools, or model serving (TorchServe, Triton)
Familiarity with TypeScript for build tooling or internal developer platforms
Background in a fast-moving startup or product engineering environment
Contributions to open-source infrastructure tooling

Responsibilities

Design, build, and operate cloud infrastructure on GCP with an emphasis on reliability, security, and cost efficiency
Own and evolve our Kubernetes platform — cluster architecture, RBAC, networking, autoscaling, and workload scheduling
Build and maintain automated CI/CD pipelines using GitHub Actions and ArgoCD, supporting GitOps workflows for all services
Write Go and Python tooling to automate infrastructure tasks, improve developer experience, and extend internal platform capabilities
Establish observability practices — metrics (Prometheus/Grafana), distributed tracing (OpenTelemetry), and centralized logging
Define and enforce security best practices: secrets management (Vault/KMS), image scanning, least-privilege IAM, and network policies
Support GPU-based ML workloads, working with researchers to provision and optimize node pools for PyTorch training and inference
Respond to incidents and lead blameless postmortems to drive continuous improvement in system reliability
Write clear documentation and champion a culture of engineering excellence across the team