Staff Infrastructure Engineer — Observability

SentinelOne

$132,000 - $215,000

About The Position

As a Staff Infrastructure Engineer, you'll be a pivotal technical leader and architect within our Observability team, driving strategic initiatives and shaping the future of our critical systems. You will leverage your deep expertise to design, implement, and optimize solutions that underpin SentinelOne's global platform, directly empowering engineering teams across the organization. We are seeking a candidate who is driven by a deep passion for observability and technical leadership. Imagine architecting the core systems that provide SentinelOne with real-time, global visibility, delivering actionable platform insights precisely when they are needed. In this high-impact role, you'll design and implement robust, secure solutions for high-volume data ingestion, storage, and analysis, fundamentally shaping how we understand and optimize our platform health. This is your chance to take end-to-end ownership of critical infrastructure, mentor talented engineers, and profoundly accelerate software delivery across our entire engineering organization.

Due to Federal Government contract requirements, U.S. citizenship is required for this position. FedRAMP staff may be subject to customer or third-party background checks up to and including Secret Clearance if required by their role at SentinelOne.

Requirements

8+ years of experience in Infrastructure Engineering, Site Reliability Engineering (SRE), or a related systems-focused field.
8+ years of experience architecting, scaling, and managing enterprise-grade observability stacks built on Prometheus, Grafana, Thanos (or Mimir/Cortex), and OpenTelemetry (OTEL).
Experience designing and engineering cloud-native infrastructure within major cloud providers (AWS or GCP) and managing production Kubernetes environments (EKS, GKE).
Advanced proficiency with IaC and automation tools, specifically Terraform and Ansible, to manage immutable infrastructure.
Experience maintaining and optimizing high-throughput, large-scale distributed systems with a focus on cost-efficiency, scalability, and disaster recovery.
Demonstrated ability to lead complex technical designs, mentor other engineers, and collaborate cross-functionally with product and application teams.
U.S. citizenship and the ability to work in a government-regulated environment.

Nice To Haves

8+ years of production-level programming experience in GoLang (highly desirable) or another mainstream language (e.g., Python, Java) with a strong willingness to adopt GoLang.
Experience working with high-security compliance frameworks, specifically FedRAMP or other sovereign cloud requirements.
Familiarity with the unique operational challenges of on-premises, hybrid, or air-gapped Kubernetes deployments.
Experience designing advanced CI/CD pipelines (e.g., GitHub Actions) and implementing sophisticated deployment strategies (canary, blue-green, rolling updates).

Responsibilities

Architect and implement robust, scalable telemetry platforms that empower SentinelOne engineers to deploy and monitor features with speed, safety, and reliability.
Act as the primary Subject Matter Expert (SME) and administrator for our core observability stack, including Grafana, Prometheus, Thanos/Mimir/Cortex, and OpenTelemetry (OTEL) pipelines.
Partner strategically with diverse engineering teams across the organization to define platform requirements, ensuring the observability ecosystem evolves ahead of stakeholder needs.
Take complete ownership of critical features, from initial architectural design and requirements refinement through to production deployment and operational maturity.
Drive operational efficiency for critical observability services across AWS and GCP, balancing system reliability with cloud cost optimization.
Build robust automation and self-service tooling to drastically reduce operational toil, optimize resource utilization, and minimize pager fatigue.
Drive the deployment, maintenance, and compliance of observability systems in critical, high-security environments, including FedRAMP and air-gapped deployments.
Cultivate platform transparency and reliability by rigorously implementing IaC (Terraform/Ansible) and standardizing industry best practices.
Elevate engineering quality by mentoring team members, leading comprehensive technical design and code reviews, and providing constructive feedback that fosters growth.
Lead the swift resolution of highly complex production incidents, perform thorough root-cause analyses, and participate in on-call rotations to ensure peak system integrity.