Principal Site Reliability Engineer

Zefr | Marina del Rey, CA
4d | $210,000 - $235,000 | Hybrid

About The Position

As a Principal Site Reliability Engineer at Zefr, you'll serve as a technical leader and subject matter expert, helping define the technical vision and shape the direction of our reliability practices across the organization. You'll leverage deep expertise in observability, core SRE principles, cloud infrastructure, CI/CD, and DevSecOps to solve our most complex challenges and set the standard for engineering excellence. This role requires a blend of hands-on technical expertise and strategic thinking. You'll drive cross-functional initiatives, mentor engineers across teams, and partner with leadership to ensure our AI-powered platform is robust, efficient, and scalable. We're looking for someone who combines technical expertise with strong leadership and a passion for continuous improvement and innovation. Zefr wants a candidate who champions reliability as a product feature and can translate complex technical concepts into strategy. This is a role where you'll shape how we build and operate systems at scale.

Requirements

  • 10+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment using major public cloud providers (GCP experience is a huge bonus)
  • Demonstrated technical leadership experience, including mentoring engineers, driving cross-functional projects, and influencing architectural decisions at an organizational level.
  • Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)
  • Advanced proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi)
  • Deep production experience architecting, managing, deploying, and supporting container-based workloads on Kubernetes clusters
  • Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning.
  • Extensive production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry); ability to design monitoring strategies for complex distributed systems.
  • Strong knowledge of cloud networking (mesh, NAT, load balancers, API gateways, proxies, etc.), cloud security, and cost optimization strategies.
  • Exceptional written and verbal communication skills; ability to translate complex technical concepts for diverse audiences and build consensus across teams.
  • Experience authoring technical strategy documents, RFCs, and architectural proposals.

Nice To Haves

  • Experience in Advertising or AdTech

Responsibilities

  • Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.
  • Deploy and support a multi-cloud, microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.
  • Collaborate with other engineers to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP.
  • Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.
  • Proactively maintain the health of production environments, including monitoring application performance and resource utilization.
  • Participate in a 24/7 on-call rotation and respond to system performance issues and outages.
  • Debug code at the application and infrastructure level.
  • Mature our CI/CD workflows and release process.
  • Maintain a forward-thinking approach, actively researching and proposing new solutions.
  • Propose and review Engineering Requests for Comments (RFCs) to drive Engineering architecture and practices.

Benefits

  • Flexible PTO
  • Medical, dental, and vision insurance with FSA options
  • Company-paid life insurance
  • Paid parental leave
  • 401(k) with company match
  • Professional development opportunities
  • 13 paid holidays
  • Summer Fridays (we leave early)
  • In-office, hybrid, and fully-remote work options available
  • In-office lunches and lots of free food
  • Optional in-person and virtual events (we like to celebrate!)