Staff SRE - Observability

Focused•Chicago, IL

6d•$160,000 - $200,000•Hybrid

About The Position

We are seeking an experienced Staff Observability Consultant with deep expertise in OpenTelemetry and strong Platform Engineering capabilities to help organizations implement, optimize, and scale their observability infrastructure. This role requires a seasoned consultant who can design comprehensive telemetry strategies, implement distributed tracing solutions, establish robust monitoring practices, and work closely with clients throughout their observability journey.

Requirements

3-7 years of experience in observability, monitoring, and distributed systems
Deep hands-on experience with OpenTelemetry ecosystem, including SDKs, APIs, and specifications
Proficiency with OpenTelemetry Collector configuration, processors, exporters, and receivers
Strong understanding of telemetry data models, semantic conventions, and instrumentation best practices
5+ years of Platform Engineering or DevOps experience with a focus on site reliability, observability, and incident response
Proficiency with Infrastructure as Code tools (Terraform, Pulumi, CloudFormation, CDK)
Strong experience with CI/CD platforms (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
Hands-on experience with major cloud providers (AWS, GCP, Azure) and their observability services
Experience with container technologies (Docker, Podman) and container registries
Knowledge of networking, security, load balancing, and distributed systems concepts
Experience implementing SRE practices including error budgets and toil metrics
Proficiency in incident management, on-call procedures, and post-mortem culture
Experience with capacity planning, performance optimization, and scalability design
Proficiency in multiple programming languages preferred (Go, Python, Java, Node.js, Rust)
Strong scripting and automation skills (Bash, Python, PowerShell)
Understanding of software engineering best practices and testing methodologies

Nice To Haves

Understanding of Large Language Models (LLMs) and their application in DevOps
Knowledge of vector databases, embeddings, and retrieval-augmented generation (RAG)
Experience with AI/ML model deployment and monitoring in production environments
Strong technical writing and documentation skills
Ability to present complex technical concepts to diverse stakeholders
A passion for knowledge sharing
Systems thinking and ability to design holistic observability solutions
Strong analytical and troubleshooting skills for complex distributed systems
Curiosity about emerging technologies, particularly AI applications in operations
Adaptability to rapidly evolving cloud-native and observability technologies
Collaborative mindset with a focus on enabling developer productivity and system reliability
Experience with Honeycomb
Contributions to open-source observability or AI framework projects
Track record of implementing platform engineering solutions that significantly improved developer experience
Experience scaling observability infrastructure to handle high event volume

Responsibilities

Design and implement end-to-end OpenTelemetry solutions across diverse technology stacks
Configure and deploy OpenTelemetry Collectors for efficient data collection, processing, sampling, and routing
Establish telemetry pipelines for metrics, traces, and logs across microservices architectures
Optimize collector configurations for performance, reliability, and cost-effectiveness
Augment existing infrastructure with integrated observability solutions
Implement Infrastructure as Code (IaC) solutions using Terraform, Pulumi, CloudFormation, etc.
Architect and manage Kubernetes clusters with comprehensive monitoring and logging
Build CI/CD pipelines with embedded observability and automated testing
Establish and maintain Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs)
Implement error budgets, toil reduction strategies, and capacity planning
Support incident response procedures and post-mortem processes
Deploy and manage observability infrastructure across AWS, GCP, and Azure
Establish security, compliance, and governance frameworks for telemetry data
Automate agent evaluations in CI/CD pipelines and observability backends

Benefits

This role requires working from the Chicago office three days per week and traveling up to 20% of the time within the United States.
Focused is unable to sponsor, or take over sponsorship of, an employment visa at this time.
