Staff Site Reliability Engineer

Checkr, Denver, CO

About The Position

As a Staff Site Reliability Engineer within the Platform group at Checkr, you will play a crucial role in advancing the reliability of our products in support of our mission to create the data platform of the future. Our SRE team focuses its engineering expertise on fostering service resiliency, scalability, and efficiency. The person in this role will identify engineering challenges across the organization and its services, lead the development of innovative solutions to resolve them, and drive their adoption.

Requirements

  • Bachelor's degree in Computer Science or related field, or equivalent education/training
  • 10+ years of industry experience in software engineering, including 5+ years of engineering for systems reliability, scalability, and efficiency
  • Strong proficiency in developing solutions in Python (preferred), Go, or Ruby in Linux environments
  • Deep understanding of the fundamental infrastructure and platform concepts behind microservice architectures, asynchronous systems, and remote APIs
  • Strong collaboration, documentation, communication, and project management skills
  • Proficiency in developing and operating production, customer-facing environments in AWS or Azure using solutions such as Kubernetes, Docker, and Terraform
  • Demonstrated ability to establish service observability standards and leverage platforms and frameworks like Datadog, Splunk, Grafana, Prometheus, and OpenTelemetry
  • Proven track record of advancing incident management practices and driving continuous improvement
  • Leadership skills and a passion for mentoring more junior engineers
  • History of leading platform adoption across engineering teams, guided by a self-service and product-first approach

Responsibilities

  • Collaborate on and drive architectural discussions with a wide array of customers, including operations, developers, technical architects, and executives
  • Lead reliability roadmap planning and cross-team project execution to enable engineering and help Checkr customers
  • Proactively engage across teams to foster service reliability, efficiency, and scalability
  • Participate in a cross-organization incident response team, driving continuous improvement
  • Troubleshoot complex production issues across the stack, with respect to performance, availability, and data quality
  • Design, build, ship, and maintain the core observability libraries, tools, and patterns used by all of Checkr's engineering teams