Manager of Site Reliability Engineering

UKG • Alpharetta, GA

12d

About The Position

Site Reliability Managers at UKG have a breadth of knowledge encompassing all aspects of service delivery and management. This SRE role is primarily responsible for application reliability, performance, and operability as software runs on the underlying platform. The team focuses on how applications behave in production (scalability, stability, resource usage, and failure recovery) rather than on feature development. Site Reliability Managers lead and grow teams that develop solutions to increase resiliency and support our Cloud Engineering and Infrastructure. This work can include building and managing CI/CD deployment pipelines, automated testing, capacity planning, performance analysis, monitoring, alerting, chaos engineering, and automation. Site Reliability Managers are passionate about learning and evolving with current technology trends and enabling their teams to do the same. They strive to innovate and are relentless in pursuing a flawless customer experience. They have an "automate everything" mindset and help us bring value to our customers by leading their teams to deploy services with incredible speed, consistency, and availability.

Requirements

Degree in engineering or a related technical discipline, or equivalent work experience
Knowledge of public cloud-based applications and containerization technologies
Demonstrated understanding of best practices in metric generation and collection, log aggregation pipelines, time-series databases, and distributed tracing
Experience transforming teams and successfully leading them through change
5+ years of people management experience leading a technical team
Deep understanding of Windows Server internals (memory management, threading, I/O, services)
Experience with .NET runtime behavior (GC, memory leaks, thread pools, IIS)
Performance tuning of monolithic .NET applications in production environments

Nice To Haves

Experience working in a GCP Cloud environment
Experience hiring for SRE, DevOps, or similar engineering teams

Responsibilities

Be a technology leader by driving roadmap execution and running current projects while planning new ones
Help drive change across the company, working towards a common methodology based on Site Reliability Engineering and solid systems engineering practices
Lead the team in driving further adoption of Site Reliability practices such as chaos engineering, SLOs, error budgets, release safety, load testing, and disaster recovery strategies
Build teams through hiring and people growth, balance your own workload through delegation, and define and review individual and team goals (OKRs)
Guide and encourage the personal and technical development, engagement, and growth of your direct reports
Own application performance, scalability, and availability in production environments
Diagnose and resolve systemic reliability issues across application, OS, and infrastructure layers
Lead major incident response and act as the escalation point for platform-related reliability issues
Ensure post-incident reviews result in measurable improvements to platform stability and application performance
Partner with application teams to influence design decisions that impact runtime reliability
Collaborate across the organization to ensure successful delivery with the wider functions, including but not limited to Security, Architecture, Operations, and Product Management