DevOps/Observability Engineer

Quantiphi

Remote

About The Position

We are seeking a Senior DevOps/Observability Engineer with more than 8 years of experience to lead the design and implementation of our next-generation, unified observability platform. The role focuses on architecting an observability pipeline from the ground up on a modern, open-source-centric stack on Amazon Web Services (AWS). The ideal candidate has deep expertise in designing and deploying observability solutions, with a strong emphasis on OpenTelemetry (OTel) and Kubernetes observability. You will deploy, configure, and integrate tools including Prometheus, Grafana, and Splunk to provide comprehensive insight into our complex, distributed systems. This is a hands-on role for a technical leader who is passionate about building scalable, reliable, and efficient monitoring and logging systems.

Requirements

Over 8 years of experience as a DevOps/Observability Engineer.
Proven ability to design and implement end-to-end observability pipelines using OpenTelemetry, Prometheus, and Grafana on centralized infrastructure.
Deep expertise in centralizing AWS telemetry, including multi-account CloudTrail organization trails, cross-account CloudWatch metrics/logs, and VPC Flow Logs.
Strong experience designing log aggregation strategies, implementing noise reduction/filtering at the collector level, and configuring Splunk HTTP Event Collector (HEC) integrations.
Hands-on experience building comprehensive alerting frameworks using Alertmanager and CloudWatch Alarms.
Hands-on experience with advanced dashboard engineering in Grafana (using PromQL).
Advanced proficiency in writing Terraform modules specifically for deploying and managing observability stacks and EC2 infrastructure.
Demonstrated experience managing, routing, and optimizing log pipelines at massive scale (TB/day).
Experience deploying Prometheus and OTel within Kubernetes (EKS) or containerized (ECS) environments.
Proven track record of reducing observability spend through strategic metric dropping, log filtering, and efficient storage tiering.

Responsibilities

Design and implement end-to-end observability pipelines using OpenTelemetry, Prometheus, and Grafana on centralized infrastructure.
Centralize AWS telemetry, including multi-account CloudTrail organization trails, cross-account CloudWatch metrics/logs, and VPC Flow Logs.
Design log aggregation strategies, implement noise reduction/filtering at the collector level, and configure Splunk HTTP Event Collector (HEC) integrations.
Build comprehensive alerting frameworks using Alertmanager and CloudWatch Alarms.
Engineer advanced dashboards in Grafana using PromQL.
Write Terraform modules specifically for deploying and managing observability stacks and EC2 infrastructure.
Manage, route, and optimize log pipelines at massive scale (TB/day).
Deploy Prometheus and OTel within Kubernetes (EKS) or containerized (ECS) environments.
Reduce observability spend through strategic metric dropping, log filtering, and efficient storage tiering.

Benefits

Opportunity to join one of the world’s fastest-growing AI-first digital engineering companies.
Make a real impact at scale.
Lead and collaborate with a high-energy team of talented, driven individuals solving complex, meaningful challenges.
Work with Fortune 500 companies and disruptive innovators in a research-driven environment with 60+ patents.
Gain hands-on experience with cutting-edge AI, ML, data, and cloud technologies.
Continuous upskilling opportunities.
Fun, diverse and hybrid work culture.
Ample opportunities to learn, grow, and interact with colleagues of varied experience and backgrounds from around the globe.
