About The Position

Lead the definition and implementation of a consistent, enterprise-wide approach to application monitoring, observability, and operational support. This role drives standardized practices, tooling, and proactive operations to improve system reliability, visibility, and performance across all teams.

Requirements

  • Deep expertise in application support, monitoring, observability, and operations
  • Proven experience implementing enterprise-wide standards and cross-team practices
  • Strong understanding of cloud platforms, infrastructure, and enterprise applications
  • Experience with automation, scripting, and operational optimization
  • Strong leadership and influencing skills without direct authority
  • Ability to drive change, challenge existing practices, and improve operational maturity
  • Results-oriented mindset focused on reliability, performance, and continuous improvement
  • Degree or equivalent, typically with 10+ years of relevant experience. Fewer years may be required with a relevant Master’s or Doctorate qualification.

Nice To Haves

  • Experience with modern observability platforms (e.g., Datadog, Dynatrace, Splunk, New Relic)
  • Knowledge of SRE practices, SLIs/SLOs, and reliability engineering principles
  • Experience in large-scale enterprise or highly regulated environments
  • Familiarity with CI/CD, DevOps, and cloud-native architectures
  • Prior experience driving enterprise transformations or platform standardization

Responsibilities

  • Define and implement a unified monitoring and observability strategy across all applications
  • Standardize tools, metrics, alerting thresholds, and practices across engineering and operations teams
  • Establish centralized dashboards to provide full visibility into system health and performance
  • Identify gaps in monitoring coverage and implement improvements
  • Develop proactive detection mechanisms to identify anomalies and performance degradation
  • Reduce incidents through preventive, data-driven practices
  • Define and track key operational metrics such as availability, performance, reliability, and recovery time
  • Lead root cause analysis and ensure sustainable corrective actions
  • Drive automation across monitoring, alerting, and operational workflows
  • Guide development of scripts, automated runbooks, and self-healing capabilities
  • Improve efficiency by reducing manual intervention
  • Define and enforce operational standards, governance models, and accountability frameworks
  • Ensure compliance with security, audit, and performance requirements
  • Align teams to consistent, scalable operational practices
  • Act as a cross-functional leader across development, QA, operations, and support teams
  • Partner with service managers and delivery teams to drive adoption of best practices
  • Lead organizational change and promote a culture of operational excellence
  • Deliver executive-level dashboards and insights on system performance
  • Analyze trends and recommend strategic improvements
  • Lead continuous improvement initiatives and ensure long-term adoption of best practices

Benefits

  • Competitive compensation package
  • Annual bonus
  • Long-term incentive opportunities