Senior Site Reliability Engineer

Koch Industries
Atlanta, KS

About The Position

Koch Global Services is on a mission to transform how we deliver reliable and scalable services to Koch. We are building an SRE capability from the ground up, modernizing legacy monitoring tools and practices to drive a culture of reliability, accountability, and automation. This role is more than just engineering: it is about changing how we deliver reliable, scalable, and observable services. If you are passionate about designing resilient systems, influencing strategic decisions, and mentoring the next generation of SREs, we want to hear from you!

Requirements

  • Expertise with modern observability platforms and standards (e.g., Prometheus, Grafana, OpenTelemetry)
  • Strong understanding of service reliability metrics, including SLIs, SLOs, and SLAs
  • Hands-on experience with Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible)
  • Familiarity with the AWS ecosystem and cloud-native architectures
  • A passion for mentoring and developing engineers
  • Excellent communication skills, with the ability to engage technical and non-technical stakeholders
  • Experience leading incident response and driving post-incident analysis for continuous improvement

Nice To Haves

  • Hands-on experience with OpenTelemetry for distributed tracing and telemetry collection
  • Expertise in deploying and managing Grafana, Loki, Tempo, and Mimir
  • Experience migrating solutions from Splunk, LogicMonitor, etc. to modern observability technologies
  • Experience with Kubernetes deployments and management
  • Knowledge of synthetic transaction monitoring to proactively detect reliability issues
  • Cross-domain expertise (e.g., networking, finance, leadership) that enhances your ability to drive impact
  • Experience with GitHub Enterprise for CI/CD and infrastructure automation
  • Multi-cloud experience (Azure, GCP, etc.)
  • Proven ability to drive organizational change and influence engineering culture

Responsibilities

  • Design and implement modern observability solutions to enhance service reliability and accountability
  • Define and measure service performance through SLIs, SLOs, and SLAs to drive intentional service reliability strategies
  • Partner with stakeholders to advocate for and drive reliability best practices, ensuring alignment with business objectives
  • Mentor and develop engineers, fostering a culture of continuous learning and growth

Benefits

  • Medical, dental, vision insurance
  • Flexible spending and health savings accounts
  • Life insurance, AD&D, disability insurance
  • Retirement plan
  • Paid vacation/time off
  • Educational assistance
  • Infertility assistance
  • Paid parental leave
  • Adoption assistance