Sr. Site Reliability Engineer - 11444

Coupa Software•Ann Arbor, MI

1d•Remote

About The Position

Coupa makes margins multiply through its community-generated AI and industry-leading total spend management platform for businesses large and small. Coupa AI is informed by trillions of dollars of direct and indirect spend data across a global network of 10M+ buyers and suppliers, empowering businesses to predict, prescribe, and automate smarter, more profitable decisions that improve operating margins. Coupa values pioneering technology, a collaborative culture driven by transparency and openness, and global impact.

The Site Reliability Engineers at Coupa are part of the Cloud Operations team, responsible for the end-to-end availability and performance of mission-critical services and for building automation to prevent problem recurrence. They also administer Linux machines, web servers, and application servers, and provide infrastructure support for customer environments.

Requirements

Bachelor’s degree in Computer Science, Information Systems, or related field, with 5+ years of experience in system administration and large-scale web operations
Strong programming skills (PowerShell, Python, Bash, or OOP languages) and experience with automation and configuration management tools (Chef, Puppet, Ansible, etc.)
Hands-on experience managing cloud infrastructure (AWS, GCP) and container platforms (EKS, GKE), plus Infrastructure as Code tools like Terraform
Proficiency in CI/CD pipelines, source control (Git with complex branching), and deployment/automation tools (Jenkins, Octopus, Rundeck)
Solid understanding of networking and operations concepts (DNS, load balancing), monitoring tools (Datadog, Splunk, New Relic), and database administration (MS SQL Server)
Strong Agile/Scrum experience (JIRA), ITIL practices (incident/change management, RCA), and excellent communication, problem-solving, and ownership skills

Responsibilities

Own end-to-end availability and performance of critical services, including building automation to prevent recurring issues
Administer Linux and Windows systems across web, application, and database servers
Develop and automate solutions using various programming languages
Provide application and infrastructure support, including participating in on-call rotations for emergencies
Enhance monitoring, alerting, and observability to ensure reliability and performance
Collaborate with cross-functional teams on releases, infrastructure, and troubleshooting, and maintain documentation such as RCAs
