Senior Software Engineer, Observability

Okta•Chicago, WA

7d•Hybrid

About The Position

Okta is securing AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

The Auth0 Platform Observability team owns the observability tooling that monitors the Auth0 Platform, and we are looking for an Observability Engineer to help ensure that our Product and Platform Engineers can monitor and observe our platform while continuing to rapidly ship software that our customers love. Our engineers maintain and automate observability tooling for our entire platform, including metrics, logs, and traces.

We are looking for engineers passionate about monitoring, observing, measuring uptime and availability, and ensuring platform stability. If you have experience within the Site Reliability Engineering (SRE) field or working as a Development Operations (DevOps) engineer, and you have a passion for Observability tooling, this position will allow you to further your learning and development in these areas.

As a Senior Engineer on this team, you will act as a core technical leader. You will work cross-functionally to help integrate services with our instrumentation libraries, support product teams, and actively investigate incidents to identify our observability gaps.

Requirements

5+ years of platform engineering, SRE, or DevOps experience.
Experience with cloud infrastructure such as AWS, Google Cloud, or Azure.
Expertise in the Datadog ecosystem (Metrics, Logs, Traces, and Error Tracking), including establishing alerting standards, implementing tagging taxonomies, and managing Datadog configurations via Terraform.
Strong coding skills in Node.js or Golang.
Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
A data-driven approach to debugging complex, cross-service performance bottlenecks.
Deep understanding of microservice architecture and best practices.
Experience in coaching and mentoring more junior engineers.
Proven ability to lead cross-functional technical initiatives and collaborate seamlessly with multiple engineering teams.
Hands-on experience with OpenTelemetry (OTel), Vector, or similar frameworks for instrumenting applications.

Responsibilities

Champion observability best practices, acting as an educator who corrects anti-patterns and teaches other engineering teams how to build robust, standardized instrumentation.
Be an expert in running services in production environments.
Contribute to the process of designing services for high growth and high availability.
Provision, configure, and monitor cloud-native infrastructure and services.
Design, build, and maintain scalable observability infrastructure using tools like Terraform.
Troubleshoot performance and operational issues.
Automate operational tasks and improve scripts.
Assist with and provide feedback on performance testing and automation.
Actively participate in major incident response to diagnose root causes and identify critical gaps in our current telemetry tooling.
Act as a technical leader, driving cross-team initiatives to improve instrumentation and observability standards across the broader engineering organization.