About the position
Arize is seeking a Site Reliability Engineer to join its On-Prem engineering team, which is responsible for deploying Arize in customer environments. The ideal candidate has 1-2+ years of experience in site reliability engineering, DevOps, and system administration, along with experience using DevOps tools such as Kubernetes, Terraform, Ansible, Puppet, and Chef. The role involves working hands-on with the infrastructure that supports Arize's distributed and highly scalable services in both SaaS and on-prem offerings, gathering requirements from customers, and adapting manifests and software to support new environments. The candidate will also use and augment monitoring tools to observe platform health and ensure performance and reliability, work with the product team to test new features and package new on-prem releases, automate and optimize the release pipeline to make it as frictionless as possible, and show continuous curiosity for emerging technology that could solve the team's challenges.
Responsibilities
- Work hands-on with the infrastructure that supports distributed and highly scalable services in both SaaS and on-prem offerings
- Gather requirements from customers and adapt manifests and software to support new environments
- Use and augment monitoring tools to observe platform health and ensure performance and reliability
- Interact with the product team to test new features and package new on-prem releases
- Automate and optimize the release pipeline to make it as frictionless as possible
- Exhibit continuous curiosity for emerging technology that could solve our challenges
Requirements
- 1-2+ years of experience in site reliability engineering, DevOps, and system administration
- CS (preferred) or other technical degree, or equivalent practical experience
- Experience working with DevOps tools such as Kubernetes, Terraform, Ansible, Puppet, and Chef
- Proficiency with scripting languages such as Python and Bash
- Experience managing cloud infrastructure in AWS, GCP, and/or Azure
- Expertise in Linux administration, configuration, and networking protocols
- Bonus points for experience with on-prem deployment architectures, experience running a 24x7 SaaS platform with defined SLIs, SLOs, and SLAs, and familiarity with operating machine learning and AI applications