Staff Site Reliability Engineer - Eng

UKG, Lowell, MA
Onsite

About The Position

Staff Site Reliability Engineers (SREs) at UKG are senior individual contributors who ensure the reliability, availability, scalability, and performance of our production services by applying software engineering practices to operational challenges. They bring a breadth of knowledge across service delivery. In this role, you will proactively monitor system health, manage risk through SLOs and error budgets, lead incident response, and enable safe, rapid change while balancing reliability and delivery velocity. Staff SREs are passionate about learning and evolving with modern technologies. They strive to innovate and relentlessly pursue an excellent customer experience, with an “automate everything” mindset that enables services to be delivered with speed, consistency, and high availability. The role focuses on technical leadership, influence, and reliability impact.

Requirements

  • 5+ years of hands-on experience in software engineering, systems engineering, or cloud-based environments.
  • 5+ years of experience working with public cloud platforms such as GCP (preferred), AWS, or Azure.
  • 5+ years of experience configuring, operating, and maintaining applications and/or systems infrastructure in a large-scale, customer-facing environment.
  • Demonstrated understanding of observability best practices, including metric generation and collection, log aggregation pipelines, time-series databases, and distributed tracing.
  • Experience coding in one or more higher-level programming languages (e.g., Python, Java, or C++).
  • Strong working knowledge of Linux systems, including troubleshooting, performance analysis, and scripting in production environments.
  • Experience with GitHub Actions and modern CI/CD practices.
  • Experience building operational dashboards and alerts using observability tools such as Splunk or Grafana.
  • Excellent communication and collaboration skills, including experience mentoring and guiding engineers.

Nice To Haves

  • Experience with distributed system design and architecture.
  • Hands-on experience with cloud-native applications and containerization technologies (Kubernetes, containers).
  • Experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible).
  • Experience operating production workloads in Google Cloud Platform (GCP).
  • Solid grounding in at least two of the following areas: Computer Science fundamentals, Cloud Architecture, Security, or Network Design.

Responsibilities

  • Engage in and improve the lifecycle of services from conception to end-of-life, including system design reviews, capacity planning, and production readiness.
  • Define and implement standards and best practices for system architecture, service delivery, reliability, and automation, including the definition and monitoring of service health indicators (latency, traffic, error rates, and resource saturation), service level objectives (SLOs), and the use of error budgets to guide operational and delivery decisions.
  • Support service, product, and engineering teams by providing common tooling and frameworks to increase availability and improve incident detection and response.
  • Improve system performance, availability, and efficiency through automation, process refinement, post-incident reviews, and in-depth configuration analysis.
  • Collaborate closely with engineering teams across the organization to deliver and operate reliable services.
  • Increase operational efficiency, effectiveness, and service quality by treating operational challenges as software engineering problems (reducing toil).
  • Guide junior team members and serve as a champion for Site Reliability Engineering best practices.
  • Actively participate in incident responses, including on-call rotations and post-incident reviews, collaborating with engineering teams to restore service and reduce recurrence.
  • Partner with stakeholders to influence and help drive the best possible technical and business outcomes.

Benefits

  • Employees may be eligible to participate in a performance-based bonus plan and to receive restricted stock unit awards as part of total compensation.
  • Learn more about UKG’s benefits and rewards at https://www.ukg.com/about-us/careers/benefits