Skyline Technology Solutions, LLC • Posted 5 days ago
Full-time • Mid Level
Glen Burnie, MD
251-500 employees

The Lead Observability Engineer serves as the organization’s technical authority for monitoring, telemetry, and reliability insights across all platforms and services. This role owns the architecture, implementation, and operation of the observability ecosystem (metrics, logging, tracing, dashboards, alerting, and service-level indicators), ensuring that engineering teams have the visibility required to deliver resilient, high-performing systems. The position combines deep platform engineering expertise with the strategic responsibilities of defining telemetry standards, guiding reliability practices, and driving the adoption of modern observability methodologies.

The Lead Observability Engineer partners closely with application, platform, and security teams to establish scalable instrumentation frameworks, operationalize SLOs, and ensure data quality and consistency across environments. This role requires technical leadership, strong architectural judgment, and the ability to translate complex system behavior into actionable insights that elevate operational excellence across the organization.

You can expect to spend your time accomplishing the following:

  • 50% of the time on Objective 1: Observability Platform Ownership
  • 25% of the time on Objective 2: Standards, Instrumentation, and Reliability Practices
  • 25% of the time on Objective 3: Cross-Functional Technical Leadership

Objective 1: Observability Platform Ownership
  • Architect, implement, and operate the full observability stack, including metrics, logging, tracing, dashboards, alerting, and telemetry pipelines.
  • Maintain and optimize Grafana, Loki, Tempo, exporters, agents, and related services to ensure reliability, performance, and scalability.
  • Ensure high-quality, consistent telemetry across all environments.

Objective 2: Standards, Instrumentation, and Reliability Practices
  • Define organizational standards for instrumentation, dashboards, alerts, SLIs, and SLOs.
  • Partner with engineering teams to guide adoption of reliability and observability best practices.
  • Improve signal-to-noise ratio in alerts and evolve incident visibility and analysis frameworks.

Objective 3: Cross-Functional Technical Leadership
  • Collaborate with Platform, Application, Security, and Network Engineering teams to ensure observability is embedded into architecture and operational workflows.
  • Provide expert guidance on system behavior, failure modes, performance patterns, and telemetry-driven insights.
  • Bachelor’s degree in Computer Science, Networking, Telecommunications, or related technical field
  • 8+ years of experience in systems engineering, SRE, platform engineering, or infrastructure operations roles in large-scale, high-availability environments
  • Observability engineering (metrics, logs, traces, dashboards, alerting, SLOs/SLIs) and Linux systems engineering (OS tuning, benchmarking, and troubleshooting at scale)
  • Experience with log aggregation and search systems (Splunk, Elasticsearch), message brokers (RabbitMQ, Kafka), and system monitoring tools (Zabbix, Grafana)
  • Proven hands-on experience operating Linux systems (RHEL, Ubuntu, CentOS) at scale, including performance tuning, benchmarking, hardening, and troubleshooting
  • Demonstrated experience with observability tooling such as Splunk, Elasticsearch, Graphite, Zabbix, log pipelines, and metrics systems
  • Proficiency with Kubernetes, Docker, CI/CD, and infrastructure automation frameworks such as Ansible, Chef, or Salt
  • Background in security operations or tooling such as MS Defender, Nessus, Carbon Black, CrowdStrike, IAM, or FIM solutions
  • Experience designing or supporting disaster recovery, high-availability, and SLA-driven systems for mission-critical services
  • Direct experience with distributed systems, Kafka-based architectures, or microservices environments
  • Strong familiarity with compliance frameworks (SOC 2, PCI, HITRUST, FedRAMP, CONMON, C5, GDPR) and with implementing technical controls in production environments
  • Demonstrated ability to collaborate across cross-functional engineering, security, and compliance teams and lead technical initiatives without direct authority
  • Experience supporting or designing multi-datacenter infrastructure or hybrid cloud environments
  • Prior leadership experience in SRE, platform engineering, or cloud operations teams within enterprise-scale organizations
  • Preferred professional certifications: CISSP, CISM, PMP, ITIL, AWS/Azure
  • Medical Insurance
  • Vision Insurance
  • Dental Insurance
  • FSA Plan
  • Paid Time Off
  • 401(k) Retirement Savings Plan
  • Training & Tuition Assistance
  • Disability & Life Insurance