Lead Observability Engineer

Skyline Technology Solutions, LLC • Posted 5 days ago

Full-time • Mid Level

Glen Burnie, MD

251-500 employees

The Lead Observability Engineer serves as the organization’s technical authority for monitoring, telemetry, and reliability insights across all platforms and services. This role owns the architecture, implementation, and operation of the observability ecosystem (including metrics, logging, tracing, dashboards, alerting, and service-level indicators), ensuring that engineering teams have the visibility required to deliver resilient, high-performing systems.

The position combines deep platform engineering expertise with the strategic responsibilities of defining telemetry standards, guiding reliability practices, and driving the adoption of modern observability methodologies. The Lead Observability Engineer partners closely with application, platform, and security teams to establish scalable instrumentation frameworks, operationalize SLOs, and ensure data quality and consistency across environments. This role requires technical leadership, strong architectural judgment, and the ability to translate complex system behavior into actionable insights that elevate operational excellence across the organization.

You can expect to spend your time accomplishing the following:
50% of the time on Objective 1: Observability Platform Ownership
25% of the time on Objective 2: Standards, Instrumentation, and Reliability Practices
25% of the time on Objective 3: Cross-Functional Technical Leadership

Architect, implement, and operate the full observability stack, including metrics, logging, tracing, dashboards, alerting, and telemetry pipelines.
Maintain and optimize Grafana, Loki, Tempo, exporters, agents, and related services to ensure reliability, performance, and scalability.
Ensure high-quality, consistent telemetry across all environments.
Define organizational standards for instrumentation, dashboards, alerts, SLIs, and SLOs.
Partner with engineering teams to guide adoption of reliability and observability best practices.
Improve signal-to-noise ratio in alerts and evolve incident visibility and analysis frameworks.
Collaborate with Platform, Application, Security, and Network Engineering teams to ensure observability is embedded into architecture and operational workflows.
Provide expert guidance on system behavior, failure modes, performance patterns, and telemetry-driven insights.

Bachelor’s degree in Computer Science, Networking, Telecommunications, or related technical field
8+ years of experience in systems engineering, SRE, platform engineering, or infrastructure operations roles in large-scale, high-availability environments
Expertise in observability engineering (metrics, logs, traces, dashboards, alerting, SLOs/SLIs) and Linux systems engineering (OS tuning, benchmarking, and troubleshooting at scale)
Experience with log aggregation and search systems (Splunk, Elasticsearch), message brokers (RabbitMQ, Kafka), and system monitoring tools (Zabbix, Grafana)
Proven hands-on experience operating Linux systems (RHEL, Ubuntu, CentOS) at scale, including performance tuning, benchmarking, hardening, and troubleshooting
Demonstrated experience with observability tooling such as Splunk, Elasticsearch, Graphite, Zabbix, log pipelines, and metrics systems
Proficiency with Kubernetes, Docker, CI/CD, and infrastructure automation frameworks such as Ansible, Chef, or Salt
Background in security operations or tooling such as MS Defender, Nessus, Carbon Black, CrowdStrike, IAM, or FIM solutions
Experience designing or supporting disaster recovery, high-availability, and SLA-driven systems for mission-critical services
Direct experience with distributed systems, Kafka-based architectures, or microservices environments
Strong familiarity with compliance frameworks (SOC2, PCI, HITRUST, FedRAMP, CONMON, C5, GDPR) and implementing technical controls in production environments
Demonstrated ability to collaborate across cross-functional engineering, security, and compliance teams and lead technical initiatives without direct authority
Experience supporting or designing multi-datacenter infrastructure or hybrid cloud environments
Prior leadership experience in SRE, platform engineering, or cloud operations teams within enterprise-scale organizations

Preferred professional certifications: CISSP, CISM, PMP, ITIL, AWS/Azure

Medical Insurance
Vision Insurance
Dental Insurance
FSA Plan
Paid Time Off
401(k) Retirement Savings Plan
Training & Tuition Assistance
Disability & Life Insurance
