Kandji, posted 5 months ago
Full-time • Senior
Miami, FL
251-500 employees
Publishing Industries

As a Principal Site Reliability Engineer at Kandji, you will play a critical role in ensuring the reliability, scalability, and performance of our platform. In this strategic position, you'll work cross-functionally to build and evolve the systems, tools, and processes that keep our services resilient and performant, especially as we scale to meet the demands of a growing customer base. You'll bring a deep understanding of distributed systems, incident management, observability, and automation. Your experience with AWS, Kubernetes, and Infrastructure-as-Code (Terraform preferred) will help you proactively identify and eliminate reliability risks, reduce toil through automation, and establish engineering best practices across teams. This role provides the opportunity to shape the culture and architecture of reliability at Kandji, partnering closely with engineering, infrastructure, and product teams to build systems that are not only functional but also fault-tolerant and maintainable.

  • Design and implement fault-tolerant, scalable, and highly available systems across our AWS-hosted platform to ensure reliability under load and failure conditions.
  • Partner with engineering teams to define and uphold SLIs/SLOs, perform root cause analyses, and drive post-incident reviews with a focus on long-term systemic improvements.
  • Run recurring reliability reviews and mature incident response practices, including alert quality, runbooks, and failure simulations.
  • Build and maintain automation for deployment, incident response, and remediation workflows to reduce manual toil and increase operational efficiency.
  • Implement DevSecOps practices hands-on, including secure IaC, policy-as-code, and embedding controls in pipelines or platform abstractions.
  • Champion the development of comprehensive observability solutions (metrics, logging, tracing, and alerting) to enable proactive detection and resolution of issues.
  • Contribute to and improve our Terraform-based infrastructure management, enabling consistent, auditable, and repeatable infrastructure deployments.
  • Lead efforts in system tuning, load testing, and capacity forecasting to support our scaling platform and avoid bottlenecks before they occur.
  • Lead efforts to monitor and optimize cloud costs across environments.
  • Design and advocate for architectural trade-offs that balance cost, performance, and reliability.
  • Embed reliability thinking into engineering and product workflows.
  • Run architecture reviews, failure simulations, and training to elevate operational discipline.
  • Mentor engineers across the organization in SRE best practices, incident response, and reliability design patterns.
  • 10+ years in Site Reliability Engineering, DevOps, Infrastructure, or related roles, with a proven track record of improving system reliability and scaling distributed systems in cloud environments (preferably AWS).
  • Deep expertise in Infrastructure as Code (Terraform strongly preferred), Kubernetes, and container orchestration at scale.
  • Strong background in automation, scripting (e.g., Python, Go, or Bash), and CI/CD pipelines.
  • Experience defining and maintaining SLOs/SLIs, leading incident response and postmortems, and applying SRE principles to reduce toil and improve system reliability.
  • Deep familiarity with chaos engineering, failure mode analysis, and designing systems for graceful degradation under partial failure.
  • Strong understanding of modern observability stacks (e.g., Datadog, Prometheus, Grafana, OpenTelemetry) and performance tuning for distributed systems.
  • Solid understanding of security and compliance in cloud environments, with experience implementing secure-by-default infrastructure patterns.
  • Familiarity with secure infrastructure design, cloud compliance requirements (SOC 2, ISO 27001, ISO 42001), and embedding DevSecOps into delivery workflows.
  • Skilled in diagnosing complex, multi-layered production issues and implementing pragmatic, long-term solutions.
  • Excellent written and verbal communication skills with the ability to clearly articulate reliability trade-offs and influence engineering teams toward better operational outcomes.
  • Competitive salary
  • 100% individual and dependent medical + dental + vision coverage
  • 401(k) with a 4% company match
  • 20 days PTO
  • Kandji Wellness Week the first week in July
  • Equity for full-time employees
  • Up to 16 weeks of paid leave for new parents
  • Paid Family and Medical Leave
  • Modern Health mental health benefits for individuals and dependents
  • Fertility Benefits
  • Working Advantage Employee Discounts
  • Free onsite fitness center
  • Free parking
  • Lunch 5 days/week
  • Exciting opportunities for career growth
  • An outstanding, inclusive culture