Observability Engineer I

HealthEquity

1d•$72,000 - $111,000•Remote

About The Position

Our mission is to SAVE AND IMPROVE LIVES BY EMPOWERING HEALTHCARE CONSUMERS. Come be part of remarkable.

Overview

How you can make a difference

You will directly contribute to our mission of delivering reliable, high-performing, mobile first digital experiences to our members and internal teams. By ensuring issues are detected early, triaged quickly, and prevented from recurring, you support our ability to maintain trust, improve our brand, and reduce incident impact across the business.

You’ll make a difference by:

Enabling faster incident response by improving monitoring coverage, alert accuracy, and root cause visibility
Helping teams shift from reactive to proactive operations by applying telemetry data and AI-driven insights
Empowering service owners with clear dashboards and actionable insights that guide performance improvements
Improving system resilience through continuous feedback and collaboration with other internal teams
Driving data-informed decisions by transforming raw logs and metrics into insights that help drive business outcomes

Help us create a culture where observability extends far beyond the platforms we manage by building a shared practice that elevates everyone's experience, from the developers that build our software to the members that use it.

What you’ll be doing

You will play a foundational role in supporting the reliability, performance, and visibility of our critical IT infrastructure and business systems. You’ll help ensure our applications, infrastructure, and services are observable, measurable, and actionable by assisting with the configuration, maintenance, and continuous improvement of monitoring, logging, and alerting tools such as Dynatrace, LogicMonitor, ThousandEyes, and others.

Working closely with senior engineers and IT operations teams, you will:

Support onboarding of new systems and applications into our observability platforms
Maintain and troubleshoot dashboards, alerts, and telemetry integrations
Collaborate closely with ITSM, application support, and infrastructure teams to improve incident detection and root cause analysis
Follow defined operational procedures and participate in change and incident management workflows
Document technical procedures, runbooks, and monitoring standards
Learn and apply observability best practices while developing skills in automation and data analytics

This entry-level role is ideal for someone with a strong interest in monitoring, data analysis, and IT operations who is eager to grow within a fast-paced, enterprise-scale environment.

Requirements

Foundational knowledge of IT infrastructure, applications, and networking concepts (e.g., servers, databases, APIs, web services, cloud platforms)
Curiosity and attention to detail when investigating alerts, logs, metrics, and performance trends
Basic experience or coursework with monitoring/logging tools (e.g., Dynatrace, Splunk, Prometheus, Grafana, ELK, or similar)
A strong foundational understanding of Dynatrace, LogicMonitor, Prometheus, OpenTelemetry, ThousandEyes, or similar tools
Understanding of performance counters and indicators for both systems and applications and how to interpret them
Familiarity with scripting or query languages (e.g., PowerShell, Python, SQL, or log query languages like DQL or SPL) is required
Interest in leveraging AI-powered observability features (e.g., anomaly detection, root cause analysis, predictive alerts) to improve reliability and reduce noise
Strong communication and collaboration skills—able to work with cross-functional teams in IT, application support, architecture and engineering
Willingness to learn cloud-native observability practices, ITIL workflows, and continuous improvement methodologies
Accountability and a service-oriented mindset are a must. We are a highly motivated, service-oriented team. We care about service availability, resilience, performance, reducing mean time to restore service, and helping teams understand the art of the possible with observability.
We are developing a modern, self-service, cloud-native, AI-assisted observability model. Success in this role includes developing the ability to work alongside AI-generated insights, and to think critically about them, while continuously improving our systems’ resilience and visibility.

Responsibilities

Support onboarding of new systems and applications into our observability platforms
Maintain and troubleshoot dashboards, alerts, and telemetry integrations
Collaborate closely with ITSM, application support, and infrastructure teams to improve incident detection and root cause analysis
Follow defined operational procedures and participate in change and incident management workflows
Document technical procedures, runbooks, and monitoring standards
Learn and apply observability best practices while developing skills in automation and data analytics