Celonis - posted 3 days ago
Full-time • Mid Level
Redwood City, CA
1,001-5,000 employees

We're Celonis, the global leader in Process Intelligence technology and one of the world's fastest-growing SaaS firms. We believe there is a massive opportunity to unlock productivity by placing AI, data and intelligence at the core of business processes - and for that, we need your help. Care to join us?

The Team

As a member of our Reliability Engineering team, you will play a critical role in ensuring the health, performance, and resilience of our platform. The team applies advanced software engineering and Site Reliability Engineering (SRE) principles to drive system reliability, scalability, and operational excellence across the organization.

The Role

Join a highly technical, collaborative, and innovation-driven team that blends Site Reliability Engineering with modern Software Engineering practices to build resilient and scalable systems.

  • Lead reliability efforts for a fleet of 80+ FedRAMP-compliant microservices running on Kubernetes, applying SRE principles to drive observability, automation, and incident prevention.
  • Develop and enforce SLOs, SLAs, and error budgets to drive reliability-focused development.
  • Provide mentorship and technical leadership across the SRE and engineering teams.
  • Own high-priority application incident escalations, performing deep technical analysis and restoration within defined SLOs, while continuously improving detection and response mechanisms.
  • Engineer solutions to enhance the availability, latency, and performance of production services—automating manual processes to eliminate toil and scale operational efficiency.
  • Collaborate closely with platform and application engineering teams to conduct post-incident reviews, extract insights, and implement systemic changes that improve overall reliability.
  • Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field (or equivalent hands-on experience).
  • Minimum of 8 years of experience in software engineering or SRE roles.
  • Deep experience with cloud platforms (AWS, GCP, or Azure).
  • Proficiency in Java, the Spring framework, and Python (or a similar scripting language) in a Linux environment.
  • Prior experience contributing to Site Reliability Engineering initiatives or similar operational roles.
  • Demonstrated ability to lead projects and influence engineering culture.
  • Knowledge of SRE principles, including SLI/SLO design, error budgets, and toil reduction strategies.
  • Excellent written and verbal communication skills in English.
  • Please note: This position is not eligible for immigration visa sponsorship, now or in the future.
  • Experience with observability and monitoring tools (e.g., Datadog).
  • Experience in developing and operating production-grade, scalable services using Kubernetes and elastic cloud architectures.
  • Experience with CI/CD pipelines and tools such as ArgoCD, GitHub Actions, or similar.
  • Experience with Infrastructure as Code (IaC) tools such as Terraform and Kustomize.
  • Exposure to incident management practices, on-call rotations, and postmortem culture.
  • Generous PTO, hybrid working options, company equity (RSUs), comprehensive benefits, extensive parental leave, dedicated volunteer days, and much more.
  • Accelerate Your Growth: Benefit from clear career paths, internal mobility, a dedicated learning program, and mentorship opportunities.
  • Prioritize Your Well-being: Access to resources such as gym subsidies, counseling, and well-being programs.
  • Connect and Belong: Find community and support through dedicated inclusion and belonging programs.