About The Position

The Production Engineering team is responsible for building, scaling, and operating the cloud platform for CyberArk’s machine identity security products. Our solutions are trusted by the world’s largest organizations to protect and manage TLS machine identities, SSH machine identities, and code signing identities.

As a Staff Production Engineer at CyberArk, you will play a key role in designing and evolving the reliability, scalability, and operational excellence of our cloud platform. You will work across infrastructure, services, and engineering teams to ensure systems are resilient, observable, and able to operate at scale. This role is ideal for engineers who combine strong infrastructure expertise with a systems mindset, and who are comfortable driving improvements across production environments, tooling, and engineering practices.

Requirements

  • 8+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering (SRE)
  • Strong experience designing and operating cloud infrastructure on AWS, Azure, or GCP
  • Deep expertise managing and scaling Kubernetes environments (EKS, AKS, or GKE)
  • Strong experience with Infrastructure as Code tools (Terraform, Ansible, or Pulumi)
  • Proven experience designing and maintaining complex CI/CD systems (Jenkins, GitLab CI, Argo CD, GitHub Actions)
  • Strong programming/scripting skills (Python, Go, or similar) for automation and tooling
  • Experience operating in high-scale, 24/7 production environments with ownership of incident response and reliability
  • Solid understanding of Linux systems and networking fundamentals (DNS, TCP/IP, load balancing, VPC, mTLS)
  • Strong problem-solving skills and ability to work across teams

Nice To Haves

  • Experience implementing DevSecOps practices in cloud environments
  • Experience building or improving observability platforms and tooling
  • Professional certifications (CKA/CKAD, AWS Solutions Architect, Azure Administrator)
  • Experience using AI-assisted development tools to improve operational workflows and automation

Responsibilities

  • Design, build, and evolve highly available cloud infrastructure platforms with a focus on scalability, resilience, and reliability
  • Lead improvements across production systems, including performance, availability, and incident response
  • Drive and standardize Infrastructure as Code (IaC) practices to improve consistency and reduce operational overhead
  • Design and optimize CI/CD pipelines to support fast, secure, and reliable software delivery at scale
  • Partner with development teams to improve system reliability, observability, and cloud-native design patterns
  • Define and implement monitoring, alerting, and observability strategies across distributed systems
  • Lead incident response efforts, including root cause analysis and long-term remediation strategies
  • Identify and eliminate operational toil through automation and system improvements
  • Mentor engineers and contribute to raising the bar for production engineering practices

Benefits

  • Medical
  • Dental
  • Vision
  • Financial
  • Equity
  • Discretionary bonus