Site Reliability Architect

Qode • Arlington, TX

Onsite

About The Position

We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures, together with troubleshooting of complex multi-service systems. It also requires deep hands-on experience with monitoring tools, particularly Dynatrace, and the ability to apply GenAI/LLMs to operational improvements.

Requirements

15+ years in SRE / Production Engineering
Strong Unified Observability background (not infra-only)
Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
SLI/SLO engineering experience in production systems
Experience implementing dynamic thresholds and anomaly detection
Knowledge of AI/ML concepts applied to Ops (AIOps)
Distributed systems troubleshooting expertise
Experience with Kafka or streaming data platforms

Nice To Haves

Experience in financial services or regulated environments
Proven reduction of alert noise and MTTR using AIOps
GenAI / LLM integration into operations workflows

Responsibilities

Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
Build actionable dashboards for operations, engineering, and leadership
Implement alerting strategies using static and dynamic thresholds
Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
Transition monitoring from reactive alerts to proactive insights
Implement noise reduction, alert correlation, and root cause analysis
Apply baseline modeling, seasonality detection, and anomaly scoring
Monitor and troubleshoot multi-service architectures involving microservices, downstream APIs, Kafka / streaming platforms, and cloud infrastructure (Terraform, IaC)
Identify whether issues originate from upstream/downstream dependencies, the streaming platform, infrastructure, or application code
Apply deep hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
Work with OpenTelemetry, Prometheus / Grafana, ELK / EFK, and cloud-native monitoring (AWS/Azure/GCP)
Manipulate and enrich JSON-based telemetry
Apply GenAI / LLMs for incident summarization, root cause explanation, runbook recommendations, and auto-remediation suggestions
Collaborate with platform teams to operationalize GenAI safely