Principal Observability Platform Engineer

Nscale
$150,000 - $215,000

About The Position

As a Principal/Staff Observability Platform Engineer, you'll own the technical direction of Nscale's observability platform: the systems that give us deep visibility into GPU clusters, AI workloads, and the infrastructure running them. You treat observability as a product and a discipline, not a tooling exercise. You'll set the architectural roadmap, raise the engineering bar across teams, and ensure our platform scales ahead of the business, not behind it. You understand that complexity is a cost. Solutions that require constant babysitting don't scale, and neither does operational burden. The platforms you build should be simple to operate, easy to understand, and self-evidently correct when something goes wrong. This isn't a "maintain and operate" role. It's a "define, build, and lead" role.

Requirements

  • 8+ years in SRE, infrastructure engineering, platform engineering, or observability-focused roles.
  • You've operated observability infrastructure at serious scale. You know what breaks at 10x and you design for it.
  • You have a strong bias toward simplicity. You've seen over-engineered observability stacks collapse under their own weight and you build accordingly.
  • Deep hands-on experience with a significant subset of: Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic.
  • Strong engineering fundamentals: proficient in Python, Go, or a similar language, and comfortable owning complex systems end to end.
  • Experience with Kubernetes at scale; familiarity with GPU infrastructure or HPC environments (Slurm) is a strong plus.
  • You can architect systems, write the code, review others' work, and explain the tradeoffs clearly, all in the same week.
  • Infrastructure-as-Code is default, not optional (Terraform, Ansible, or equivalent).

Nice To Haves

  • Experience with high-volume streaming pipelines for observability data (Kafka, Vector, Fluent Bit, etc.).
  • Background in AI/ML infrastructure observability: GPU utilisation, training job visibility, inference latency.
  • Prior experience defining observability strategy at an organisation level.

Responsibilities

  • Own the technical strategy and architecture for observability across metrics, logs, traces, and alerting at scale.
  • Drive platform decisions that have multi-year impact: tooling, data models, ingestion patterns, retention, cardinality management.
  • Identify systemic gaps before they become incidents; design platforms that make failure visible and fast to diagnose.
  • Partner with SRE, infrastructure, and AI/ML teams to embed observability natively into how Nscale builds and operates.
  • Define standards and patterns that other engineers adopt, not by mandate, but because they're clearly better.
  • Mentor and technically grow the observability team; raise the ceiling on what the team can build and own.
  • Lead incident postmortems and use them to drive durable platform improvements.
  • Evaluate and introduce tooling that meaningfully improves signal quality, operational efficiency, or scalability, and retire what doesn't.

Benefits

  • Medical
  • Dental
  • Vision
  • Flexible paid time off
  • Parental leave
  • Retirement plan participation