Staff Observability Platform Engineer

Nscale

About The Position

As a Staff Observability Platform Engineer, you'll play a critical role in building and evolving Nscale's observability platform, enabling deep visibility into GPU clusters, AI workloads, and the infrastructure that powers them. You view observability as a product, not simply a collection of tools. You'll help define and implement scalable, reliable observability solutions that empower engineering teams to understand system behavior, diagnose issues quickly, and operate complex distributed systems with confidence.

You'll combine technical leadership with hands-on engineering, partnering across SRE, infrastructure, platform, and AI/ML teams to improve reliability, operational efficiency, and developer experience. You'll influence architectural decisions, establish engineering best practices, and help drive the evolution of observability capabilities across the organization.

This is a role for someone who enjoys solving difficult infrastructure problems, building platforms that scale, and helping engineering teams succeed through better visibility and operational insight.

Requirements

6+ years of experience in SRE, platform engineering, infrastructure engineering, observability engineering, or related disciplines.
Strong experience building and operating observability platforms in cloud-native, distributed environments.
Deep hands-on experience with several of the following technologies: Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic, or similar platforms.
Strong software engineering skills with proficiency in Go, Python, or equivalent languages.
Experience operating and troubleshooting Kubernetes-based platforms at scale.
Strong understanding of monitoring, logging, tracing, telemetry pipelines, and modern observability practices.
Experience designing systems with scalability, reliability, performance, and operational simplicity in mind.
Proficiency with Infrastructure-as-Code tools such as Terraform, Ansible, or equivalent.
Ability to lead technical initiatives and influence engineering decisions across multiple teams.
Excellent communication skills with the ability to explain technical tradeoffs and align stakeholders around pragmatic solutions.

Nice To Haves

Experience operating observability systems in GPU, AI/ML, HPC, or large-scale compute environments.
Familiarity with Slurm, Kubernetes GPU scheduling, or AI infrastructure platforms.
Experience with high-volume telemetry pipelines and streaming technologies such as Kafka, Vector, or Fluent Bit.
Knowledge of observability challenges related to model training, inference workloads, GPU utilization, and distributed AI systems.
Experience mentoring engineers and helping grow technical capability across teams.

Responsibilities

Design, build, and evolve observability platforms across metrics, logs, traces, alerting, and telemetry pipelines.
Lead the implementation of scalable observability solutions that support Nscale's growing GPU and AI infrastructure.
Partner with SRE, infrastructure, platform, and AI/ML teams to ensure observability is embedded throughout the software and infrastructure lifecycle.
Drive improvements in monitoring coverage, alert quality, service health visibility, and incident response effectiveness.
Develop standards, frameworks, and reusable patterns that simplify observability adoption across engineering teams.
Identify reliability risks and operational blind spots, helping teams proactively address them before they impact customers.
Contribute to architectural decisions around telemetry collection, storage, retention, cardinality management, and performance optimization.
Lead technical initiatives and projects that improve platform scalability, reliability, and operational efficiency.
Mentor engineers and provide technical guidance through design reviews, code reviews, and knowledge sharing.
Participate in incident investigations and postmortems, translating operational learnings into durable platform improvements.
Evaluate new observability technologies and practices, balancing innovation with operational simplicity and long-term maintainability.

Benefits

We strongly encourage applications from people of color, the LGBTQ+ community, people with disabilities, neurodivergent individuals, parents, carers, and people from lower socio-economic backgrounds.
If there's anything we can do to accommodate your specific situation, please let us know.
