Sr. Observability Engineer

EverBank
Jacksonville, FL

About The Position

The Sr. Observability Engineer plays a critical role in ensuring the reliability, availability, and performance of enterprise systems by designing and implementing observability solutions. This position supports the IT Incident Management Team (IMT) by providing actionable insights, telemetry, and automation to reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR). The role combines deep technical expertise in monitoring, logging, and tracing with a strong understanding of SRE principles.

Requirements

  • 3 years of technical experience supporting enterprise systems
  • Previous experience with observability tools or site reliability engineering

Nice To Haves

  • 5 years of experience with observability tools (Splunk, ELK, Prometheus, Grafana, OpenTelemetry)
  • Proficiency in scripting languages (Python, Bash) and automation frameworks
  • Certifications in SRE, ITIL, or cloud technologies
  • Familiarity with cloud platforms (Azure, AWS, or GCP) and container orchestration (Kubernetes)
  • Experience with AIOps or machine learning for anomaly detection
  • Experience with CI/CD tools (GitHub, Jenkins, Azure DevOps)

Responsibilities

  • Designs, implements, and maintains observability tools (e.g., Splunk, Prometheus, Grafana, OpenTelemetry).
  • Develops dashboards, alerts, and automated workflows to support proactive incident detection.
  • Partners with IMT to provide real-time telemetry during major incidents.
  • Conducts root cause analysis using logs, metrics, and traces.
  • Improves incident response processes through automation and data-driven insights.
  • Defines and monitors Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
  • Collaborates with application and infrastructure teams to embed observability into CI/CD pipelines.
  • Identifies gaps in monitoring coverage and implements solutions.
  • Drives adoption of observability best practices across engineering teams.
  • Supports Incident Management (IMT) by providing incident analysis and runbooks, suggesting improvements, and collaborating with the wider group.
  • Builds and publishes operational KPIs: Sev1/Sev2, MTTR, MTTD, incident volume, and application performance.