Sr. Site Reliability Engineer - 11444

Coupa Software
Ann Arbor, MI
Remote

About The Position

Coupa makes margins multiply through its community-generated AI and industry-leading total spend management platform for businesses large and small. Coupa AI is informed by trillions of dollars of direct and indirect spend data across a global network of 10M+ buyers and suppliers. It gives businesses the ability to predict, prescribe, and automate smarter, more profitable business decisions to improve operating margins. Coupa values pioneering technology, a collaborative culture driven by transparency and openness, and global impact.

The Site Reliability Engineers at Coupa are part of the Cloud Operations team. They are responsible for the end-to-end availability and performance of mission-critical services and for building automation to prevent problem recurrence. They also administer Linux machines, web servers, and application servers, and provide infrastructure support for customer environments.

Requirements

  • Bachelor’s degree in Computer Science, Information Systems, or a related field, with 5+ years of experience in system administration and large-scale web operations
  • Strong programming skills (PowerShell, Python, Bash, or OOP languages) and experience with automation and configuration management tools (Chef, Puppet, Ansible, etc.)
  • Hands-on experience managing cloud infrastructure (AWS, GCP) and container platforms (EKS, GKE), plus Infrastructure as Code tools like Terraform
  • Proficiency in CI/CD pipelines, source control (Git with complex branching), and deployment/automation tools (Jenkins, Octopus, Rundeck)
  • Solid understanding of networking and operations concepts (DNS, load balancing), monitoring tools (Datadog, Splunk, New Relic), and database administration (MS SQL Server)
  • Strong Agile/Scrum experience (JIRA), ITIL practices (incident/change management, RCA), and excellent communication, problem-solving, and ownership skills

Responsibilities

  • Own end-to-end availability and performance of critical services, including building automation to prevent recurring issues
  • Administer Linux and Windows systems across web, application, and database servers
  • Develop and automate solutions using various programming languages
  • Provide application and infrastructure support, including participating in on-call rotations for emergencies
  • Enhance monitoring, alerting, and observability to ensure reliability and performance
  • Collaborate with cross-functional teams on releases, infrastructure, and troubleshooting, and maintain documentation such as RCAs