Manager, Platform Engineering

Endor Labs • Palo Alto, CA

2d • $200,000 - $275,000

About The Position

We're looking for a Manager of Platform Engineering to lead and grow our small but mighty Platform Engineering team. You'll step in as a player-coach - someone who isn't afraid to get hands-on in the infrastructure, write Terraform, debug a Kubernetes issue, or triage an incident - while also building the team and the systems we'll need at the next stage of Endor Labs' growth. This role is ideal for a technical leader who thrives in a scrappy, fast-moving startup environment, can operate with ambiguity, and is excited by the challenge of building both a team and a platform from the ground up.

Requirements

People Management Experience: You've managed engineers before - you've hired, onboarded, and grown them, and you've had the hard conversations when they were needed. You know what good looks like and can develop talent, not just direct it.
Hands-On Platform Engineering Chops: 8+ years of SRE or Platform Engineering experience, with demonstrated ability to get technical when needed. You can move fluidly between a 1:1 and a terminal window.
Kubernetes and Cloud Expertise: Deep, production-grade experience with Kubernetes and at least one major cloud provider (Azure, GCP, or AWS). You know the sharp edges and can debug complex multi-cluster issues without a runbook.
Infrastructure as Code Fluency: Strong experience with Terraform, OpenTofu, or similar IaC tools. You've built reusable modules, managed state at scale, and set patterns for teams to follow.
Startup Mindset: Comfortable with ambiguity, resource constraints, and wearing multiple hats. You know the difference between the right long-term architecture and the right thing to ship this week - and when to choose each.
Operational Rigor: You've been on-call, led incident retrospectives, and know how to build systems and processes that reduce toil over time. You care about reliability without becoming a bottleneck.
Genuine Curiosity About Agentic AI: You are actively thinking about how AI agents change the way software gets built and operated. You do not need to have all the answers, but you should have a point of view, be energized by the ambiguity, and be ready to drive a real roadmap from it.
Strong Communicator: Able to communicate clearly with engineers, cross-functional partners, and leadership. You document decisions, advocate for your team, and represent platform work with clarity and conviction.

Nice To Haves

Experience Scaling a Team: You've grown an engineering team from small to mid-size, and you've learned what to standardize, what to keep flexible, and what gets harder before it gets easier.
Security Domain Familiarity: Bonus if you've worked in or alongside a security product company. Understanding of compliance requirements (SOC 2, GDPR) and secure-by-default infrastructure patterns is a plus.
Observability Depth: Hands-on experience with open source observability tooling - Prometheus, Grafana, Mimir, Pyroscope - and a history of building dashboards and SLIs/SLOs that teams actually use.
GitOps Experience: Hands-on with ArgoCD or Flux in production - progressive delivery, sync waves, and the real-world complexity of declarative infrastructure.
Cost Optimization Track Record: You've reduced cloud spend meaningfully while maintaining or improving service quality. You understand FinOps isn't just a dashboard - it's an engineering discipline.

Responsibilities

Lead and Grow the Team: Manage a team of 3 engineers today, with a clear mandate to grow. You'll hire, mentor, and develop engineers while fostering a culture of ownership, quality, and continuous improvement - all without losing the startup speed that makes us who we are.
Be the Technical Anchor: Roll up your sleeves and contribute directly - whether that's reviewing infrastructure PRs, stepping in during an outage, or pairing with an engineer on a gnarly Kubernetes networking problem. You'll set the technical bar through your own work, not just directives.
Own Platform Strategy: Define and execute the roadmap for our cloud infrastructure across Azure, Google Cloud, and AWS. Balance near-term reliability and operational needs with longer-term scalability investments as we grow rapidly.
Drive Reliability and Scale: Ensure the platform that processes billions of security events daily remains highly available and performant. Establish SLOs, incident response practices, and on-call processes that scale with the team.
Elevate Developer Experience: Partner with Backend and Product Engineering to build self-service tooling, CI/CD pipelines, and golden-path abstractions that help every engineer at Endor Labs ship faster and more safely.
Advance Infrastructure as Code: Champion GitOps and IaC best practices across Terraform/OpenTofu deployments. Build reusable patterns and raise the bar for how we manage our entire cloud footprint.
Collaborate Across the Company: Work closely with Security, Product, and Engineering leadership to align platform investments with business priorities. Translate technical constraints into clear tradeoffs for non-technical stakeholders.
Lead Our Agentic Engineering Transformation: Serve as the internal thought leader for how Endor Labs builds and ships software in an AI-native world. The Platform team is at the center of this shift - defining the infrastructure, tooling, and practices that let engineering teams harness AI agents effectively and safely. This is a greenfield space where you will set the direction, build conviction across the company, and make it real.
Build for the Future: Make architectural decisions today that won't box us in tomorrow. Think ahead to 10x growth, but pragmatically sequence the work so we keep shipping.
