NetSmart • Posted 3 months ago
Full-time • Senior
Overland Park, KS
1,001-5,000 employees
Professional, Scientific, and Technical Services

The Senior Site Reliability Engineer (SRE) will serve as a senior technical contributor, responsible for advancing observability and operational maturity across hundreds of application teams. This is not a product deployment or configuration role. The SRE will work directly with application engineers and external infrastructure partners to implement distributed tracing, profiling, structured logging, and metrics collection strategies that support reliability at scale. This role requires strong software engineering fundamentals, deep knowledge of observability tooling, and the ability to work across a wide range of technology stacks and organizational boundaries. The ideal candidate is comfortable with high ambiguity, varied application environments, and time-sensitive incident response involving external stakeholders.

  • Partner with application teams to implement observability best practices: distributed tracing, profiling, structured logging, and metrics collection
  • Support instrumentation and telemetry integrations across legacy and modern architectures
  • Implement and support enterprise observability platforms, including Grafana, Zabbix, Splunk, and related tooling
  • Build and maintain centralized dashboards and alerts to improve monitoring quality and reduce operational noise
  • Collaborate with development teams and vendors to define SLIs, SLOs, and alert thresholds for key services
  • Participate in on-call rotations and serve as an escalation point during complex incidents involving external partners
  • Lead and contribute to post-incident reviews with a focus on observability gaps, telemetry accuracy, and long-term remediation
  • Create and maintain documentation, templates, and onboarding materials for standardized observability integration
  • Provide mentorship to mid-level engineers and guide application teams through complex observability challenges
  • 5+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles supporting production systems
  • Strong software development experience in Python, Go, Java, or C#
  • Demonstrated success implementing observability solutions in production environments
  • Hands-on experience with Grafana, Zabbix, Splunk, OpenTelemetry, or comparable tools
  • Deep understanding of telemetry data structures (logs, metrics, traces) and their use in troubleshooting distributed systems
  • Experience participating in incident response and remediation
  • Strong communication skills and ability to work directly with third-party vendors and managed service providers
  • Experience supporting observability in mixed technology environments (.NET, Linux, Windows Server, Kubernetes, monoliths and microservices)
  • Familiarity with CI/CD systems and Git-based workflows
  • Familiarity with OpenTelemetry Collector and custom instrumentation patterns
  • Experience onboarding large application portfolios into centralized observability platforms
  • Understanding of operational SLIs/SLOs and alerting strategies across heterogeneous systems