Senior Software Engineer, Observability

Okta • Washington, DC

13h • $147,000 - $202,000 • Hybrid

About The Position

The Auth0 Platform Observability team owns the observability tooling that monitors the Auth0 Platform. We are looking for an Observability Engineer to help ensure that Product and Platform Engineers can monitor and observe the platform while continuing to rapidly ship software that customers love. The team maintains and automates observability tooling for the entire platform, including metrics, logs, and traces. The ideal candidate is passionate about monitoring, measuring uptime and availability, and ensuring platform stability. Experience in Site Reliability Engineering (SRE) or as a DevOps engineer, combined with a passion for observability tooling, is highly valued. As a Senior Engineer on this team, you will act as a core technical leader. You will work cross-functionally to help integrate services with our instrumentation libraries, support product teams, and actively investigate incidents to identify gaps in our observability.

Requirements

5+ years of platform engineering, SRE, or DevOps experience.
Experience with cloud infrastructure such as AWS, Google Cloud, or Azure.
Expertise in the Datadog ecosystem (Metrics, Logs, Traces, and Error Tracking), including establishing alerting standards, implementing tagging taxonomies, and managing Datadog configurations via Terraform.
Strong coding skills in Node.js or Golang.
Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
A data-driven approach to debugging complex, cross-service performance bottlenecks.
Deep understanding of microservice architecture and best practices.
Experience in coaching and mentoring more junior engineers.
Proven ability to lead cross-functional technical initiatives and collaborate seamlessly with multiple engineering teams.
Hands-on experience with OpenTelemetry (OTel), Vector, or similar frameworks for instrumenting applications.

Responsibilities

Champion observability best practices, acting as an educator who can effectively correct anti-patterns and teach other engineering teams how to build robust, standardized instrumentation.
Be an expert in running services in production environments.
Contribute to the process of designing services for high growth and high availability.
Provision, configure, and monitor cloud-native infrastructure and services.
Design, build, and maintain scalable observability infrastructure using tools like Terraform.
Troubleshoot performance and operational issues.
Automate operational tasks and improve scripts.
Assist with and provide feedback for performance testing and automation.
Actively participate in major incident response to diagnose root causes and identify critical gaps in our current telemetry tooling.
Act as a technical leader, driving cross-team initiatives to improve instrumentation and observability standards across the broader engineering organization.