Senior Manager, Monitoring & Observability

Conduent
East Honolulu, HI
Remote

About The Position

The Senior Manager of Monitoring & Observability is responsible for the strategic direction, engineering evolution, and operational oversight of the organization’s enterprise observability capabilities. This role provides leadership across multiple monitoring and alerting platforms and teams, ensuring consistent standards, intelligent event correlation, and service-level visibility that enable faster incident detection, reduced MTTR, and scalable operations. This leader will combine deep engineering experience with people and vendor leadership to mature the organization from traditional monitoring to modern, outcome-driven observability. The role partners closely with infrastructure, application, cloud, SRE, and service management teams to ensure monitoring data drives actionable insights—not just alerts.

Requirements

  • 10+ years of experience in IT operations, monitoring, observability, or reliability engineering, including several years in a leadership or management role.
  • Strong hands-on engineering background in enterprise monitoring and event management systems.
  • Demonstrated experience transforming organizations from reactive monitoring to proactive, correlated event management or observability models.
  • Experience leading teams responsible for large-scale, multi-tool monitoring environments.
  • Strong understanding of incident management, ITSM integration, and service health models.
  • Excellent communication skills, with the ability to translate technical data into executive-level insights.

Nice To Haves

  • Experience with large, complex enterprise environments (hybrid, cloud, regulated, or mission-critical).
  • Hands-on experience implementing or operating AIOps platforms for event correlation and noise reduction.
  • Familiarity with cloud-native observability, Kubernetes, and modern application architectures.
  • Experience establishing monitoring standards and governance across distributed teams.
  • Background working closely with SRE, platform engineering, or reliability-focused teams.

Responsibilities

  • Define and own the enterprise observability and event management strategy, roadmap, and success metrics (e.g., MTTR reduction, alert quality, incidents caught by monitoring).
  • Lead and mentor multiple monitoring and alerting teams spanning infrastructure, application, network, and cloud observability platforms.
  • Establish governance, standards, and operating models for monitoring, alerting, and event management across the enterprise.
  • Serve as the senior escalation point for observability-related challenges during major incidents and platform outages.
  • Provide oversight and strategic direction for enterprise monitoring and observability platforms, including but not limited to:
      • SolarWinds (infrastructure and network monitoring)
      • AppDynamics / APM platforms
      • Splunk / log and telemetry platforms
      • Netcool / event management and correlation
      • AIOps platforms (event correlation, noise reduction, topology awareness)
      • ThousandEyes / digital experience monitoring
  • Drive integration and alignment across tools to reduce silos, eliminate duplicate alerts, and enable unified visibility.
  • Ensure effective lifecycle management of observability tools, maximizing return on existing investments.
  • Lead engineering initiatives to advance event correlation, enrichment, and service-level context (moving from alert-based monitoring to outcome-based observability).
  • Champion automation and intelligence in detection, correlation, and triage, including AIOps-driven capabilities.
  • Partner with architecture and engineering teams to embed observability standards into application, cloud, and platform designs.
  • Improve the quality of telemetry (metrics, logs, traces, events) and ensure data is usable for troubleshooting, trend analysis, and leadership reporting.
  • Improve alert signal-to-noise ratio and eliminate chronic alert fatigue through standards, tuning, and correlation strategies.
  • Ensure monitoring supports incident response effectively, enabling faster root cause identification and resolution.
  • Define and track KPIs and operational health metrics for observability platforms and teams.
  • Support continuous improvement through post-incident reviews, trend analysis, and proactive gap identification.
  • Act as a trusted partner to infrastructure, application, cloud, and operations leadership.
  • Align observability priorities with business outcomes and service reliability goals.
  • Manage vendor relationships and influence product roadmaps based on enterprise needs.

Benefits

  • Health and Welfare Benefits: Our health and welfare benefits can be tailored to fit you and your family's needs and start on the first day of employment.
  • Retirement Savings: We will support you as you save for your future.
  • Employee Discounts: We offer you access to a vast selection of global, national, and local discounts on merchandise, services, travel, and more.
  • Career Growth Opportunities: We help you thrive, so together, we can grow. We provide opportunities to advance your career with a vast portfolio of businesses and a global footprint.
  • Paid Training: Earn while you learn and continue to grow with access to award-winning learning platforms throughout your Conduent career.
  • Paid Time Off: We provide attractive paid time off packages designed for you to enjoy your life away from work.
  • Great Work Environment: We are proud of our award-winning culture and the recognition we’ve received for our diversity efforts.
  • Health insurance coverage
  • Voluntary dental and vision programs
  • Life and disability insurance
  • A retirement savings plan
  • Paid holidays
  • Paid time off (PTO) or vacation and/or sick time