Senior Manager, Observability

CoreWeave
Sunnyvale, CA
Hybrid

About The Position

The Observability Engineering organization at CoreWeave is responsible for the platforms and practices that help engineers understand, operate, and improve production systems at scale. This team owns and evolves the foundations for metrics, logs, traces, telemetry pipelines, and observability reliability, enabling teams to detect issues quickly, troubleshoot complex distributed systems, and operate AI infrastructure with confidence. As CoreWeave continues to scale, observability plays a critical role in delivering reliable platform experiences, improving engineering velocity, and maintaining operational excellence across a rapidly growing cloud environment.

Requirements

  • 8+ years of software engineering experience with production systems at scale
  • 4+ years of engineering management experience leading senior engineers and technical leads
  • Experience building and operating observability platforms across logs, metrics, traces, and alerting in distributed systems
  • Knowledge of reliability engineering concepts including SLOs, SLIs, incident management, error budgets, and fault-tolerant design
  • Experience scaling telemetry systems including collection pipelines, storage backends, and query layers
  • Experience with distributed systems, performance engineering, and trade-offs involving scale, resilience, and cost
  • Experience partnering with infrastructure, security, and application engineering teams to drive platform adoption
  • Experience hiring and managing engineering teams
  • Must be a U.S. person (U.S. citizen or national, U.S. lawful permanent resident, refugee, or asylee) or eligible to access export-controlled information without authorization, or eligible and likely to obtain required export authorization.

Nice To Haves

  • Experience with OpenTelemetry, Grafana, Prometheus-compatible systems, log aggregation, and distributed tracing tools
  • Experience operating cloud-native infrastructure, including Kubernetes environments
  • Experience supporting large-scale cloud, developer platforms, or AI/ML infrastructure
  • Familiarity with capacity planning for high-ingest telemetry systems
  • Experience scaling platforms in high-growth environments

Responsibilities

  • Lead a team responsible for building, scaling, and operating observability systems across metrics, logs, traces, and telemetry pipelines.
  • Define strategy and roadmap for observability systems.
  • Drive platform reliability and performance improvements.
  • Guide architectural decisions across observability infrastructure.
  • Partner closely with infrastructure, platform, security, and application engineering teams to improve instrumentation and production visibility.
  • Combine technical leadership, operational ownership, and team management to ensure observability platforms scale with business and customer needs.

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short- and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to participate in the Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption