Principal Observability Platform Engineer

Nscale

$150,000 - $215,000

About The Position

As a Principal/Staff Observability Platform Engineer, you'll own the technical direction of Nscale's observability platform: the systems that give us deep visibility into GPU clusters, AI workloads, and the infrastructure running them. You treat observability as a product and a discipline, not a tooling exercise. You'll set the architectural roadmap, raise the engineering bar across teams, and ensure our platform scales ahead of the business, not behind it. You understand that complexity is a cost. Solutions that require constant babysitting don't scale, and neither does operational burden. The platforms you build should be simple to operate, easy to understand, and self-evidently correct when something goes wrong. This isn't a "maintain and operate" role. It's a "define, build, and lead" role.

Requirements

8+ years in SRE, infrastructure engineering, platform engineering, or observability-focused roles.
You've operated observability infrastructure at serious scale. You know what breaks at 10x and you design for it.
You have a strong bias toward simplicity. You've seen over-engineered observability stacks collapse under their own weight and you build accordingly.
Deep hands-on experience with a significant subset of: Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic.
Strong engineering fundamentals: proficient in Python, Go, or similar, and comfortable owning complex systems end to end.
Experience with Kubernetes at scale; familiarity with GPU infrastructure or HPC environments (Slurm) is a strong plus.
You can architect systems, write the code, review others' work, and explain the tradeoffs clearly, all in the same week.
Infrastructure-as-Code is the default, not optional (Terraform, Ansible, or equivalent).

Nice To Haves

Experience with high-volume streaming pipelines for observability data (Kafka, Vector, Fluent Bit, etc.).
Background in AI/ML infrastructure observability: GPU utilisation, training job visibility, inference latency.
Prior experience defining observability strategy at an organisation level.

Responsibilities

Own the technical strategy and architecture for observability across metrics, logs, traces, and alerting at scale.
Drive platform decisions that have multi-year impact: tooling, data models, ingestion patterns, retention, cardinality management.
Identify systemic gaps before they become incidents; design platforms that make failure visible and fast to diagnose.
Partner with SRE, infrastructure, and AI/ML teams to embed observability natively into how Nscale builds and operates.
Define standards and patterns that other engineers adopt, not by mandate, but because they're clearly better.
Mentor the observability team and grow its technical depth; raise the ceiling on what the team can build and own.
Lead incident postmortems and use them to drive durable platform improvements.
Evaluate and introduce tooling that meaningfully improves signal quality, operational efficiency, or scalability, and retire what doesn't.