Staff Site Reliability Engineer

Checkr • Denver, CO

About The Position

As a Staff Site Reliability Engineer within the Platform group at Checkr, you will play a crucial role in advancing the reliability of our products in our mission to create the data platform of the future. Our SRE team focuses our engineering expertise on fostering service resiliency, scalability, and efficiency. The person in this role will identify engineering challenges across the organization and its services, lead the development of innovative solutions to resolve them, and drive their adoption.

Requirements

Bachelor's degree in Computer Science or related field, or equivalent education/training
10+ years of industry experience in software engineering, including 5+ years of engineering for systems reliability, scalability, and efficiency
Strong proficiency in developing solutions in Python (preferred), Go, or Ruby in Linux environments
Deep understanding of the fundamental infrastructure and platform concepts behind microservice architectures, asynchronous systems, and remote APIs
Strong collaboration, documentation, communication, and project management skills
Proficiency in developing and operating production, customer-facing environments in AWS or Azure using solutions such as Kubernetes, Docker, and Terraform
Demonstrated ability to establish service observability standards and leverage platforms and frameworks like Datadog, Splunk, Grafana, Prometheus, and OpenTelemetry
Proven track record of advancing incident management practices and driving continuous improvement
Leadership skills and a passion for mentoring more junior engineers
History of leading platform adoption across engineering teams, guided by a self-service and product-first approach

Responsibilities

Drive architectural discussions in collaboration with a wide array of customers, including operations, developers, technical architects, and executives
Lead reliability roadmap planning and cross-team project execution to enable engineering and help Checkr customers
Proactively engage across teams to foster service reliability, efficiency, and scalability
Participate in a cross-organization incident response team, driving continuous improvement
Troubleshoot complex production issues across the stack, with respect to performance, availability, and data quality
Design, build, ship, and maintain the core observability libraries, tools, and patterns used by all of Checkr's engineering teams
