Observability Engineer

Miratech • Yakima, WA

Posted 4 days ago • Remote

About The Position

We are looking for an Observability Engineer to design, implement, and optimize enterprise observability solutions across applications, infrastructure, and cloud environments. This role focuses on monitoring, telemetry, automation, reliability engineering, and AIOps capabilities to improve system visibility, operational efficiency, and service reliability. The ideal candidate will have hands-on experience with observability platforms, cloud technologies, automation, and incident management practices while collaborating with engineering and operations teams to establish observability standards and best practices.

Requirements

4+ years of experience in Observability Engineering, Site Reliability Engineering, or related domains.
Hands-on experience with observability platforms and tools such as Dynatrace, Splunk, Grafana, and OpenTelemetry.
Strong expertise in AWS and GCP, and familiarity with cloud-native architectures.
Proficiency in Python for automation and operational tooling.
Experience implementing metrics, events, logs, and traces (MELT) across distributed systems.
Hands-on experience with Terraform and Infrastructure as Code practices.
Strong understanding of SLIs, SLOs, alerting strategies, and incident response frameworks.
Excellent troubleshooting, communication, and collaboration skills.
Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).

Nice To Haves

Experience with AIOps platforms and intelligent alerting solutions.
Knowledge of Kubernetes and containerized environments.
Experience integrating observability tools with ServiceNow and CI/CD ecosystems.
AWS, GCP, observability, or SRE-related certifications.

Responsibilities

Design and implement end-to-end observability solutions across applications, infrastructure, and cloud environments.
Develop dashboards, alerts, and telemetry frameworks to provide real-time visibility into system health and performance.
Build automation solutions to eliminate repetitive operational tasks and improve efficiency.
Enable runbook automation, self-healing capabilities, and automated incident triage workflows.
Define and implement SLIs, SLOs, and alerting strategies to improve service reliability.
Drive improvements in MTTD and MTTR through actionable alerts and telemetry-driven insights.
Implement proactive monitoring, anomaly detection, and predictive alerting to identify issues before customer impact.
Leverage AIOps capabilities for alert correlation and intelligent incident response.
Integrate observability platforms with CI/CD pipelines, cloud services, and ITSM tools such as ServiceNow.
Collaborate with engineering, product, and operations teams to establish observability standards and operational readiness practices.
Mentor teams and drive adoption of observability best practices across the organization.