Site Reliability Architect

Qode
Arlington, TX
Onsite

About The Position

We are seeking a Senior SRE with strong expertise in unified observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role involves designing and implementing unified observability solutions, including SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures. It also involves defining and managing SLIs/SLOs, using AI/ML for proactive detection and incident reduction, and troubleshooting complex multi-service architectures. Deep hands-on experience with monitoring tools, particularly Dynatrace, is required, along with the ability to apply GenAI/LLMs to improve operations.

Requirements

  • 15+ years in SRE / Production Engineering
  • Strong Unified Observability background (not infra-only)
  • Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
  • SLI/SLO engineering experience in production systems
  • Experience implementing dynamic thresholds and anomaly detection
  • Knowledge of AI/ML concepts applied to Ops (AIOps)
  • Distributed systems troubleshooting expertise
  • Experience with Kafka or streaming data platforms

Nice To Haves

  • Experience in financial services or regulated environments
  • Proven reduction of alert noise and MTTR using AIOps
  • GenAI / LLM integration into operations workflows

Responsibilities

  • Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
  • Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
  • Build actionable dashboards for operations, engineering, and leadership
  • Implement alerting strategies using static and dynamic thresholds
  • Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
  • Transition monitoring from reactive alerts to proactive insights
  • Implement noise reduction, alert correlation, and root cause analysis
  • Apply baseline modeling, seasonality detection, and anomaly scoring
  • Monitor and troubleshoot multi-service architectures involving microservices, downstream APIs, Kafka / streaming platforms, and cloud infrastructure (Terraform, IaC)
  • Identify whether issues originate from upstream/downstream dependencies, streaming platform, infrastructure, or application code
  • Apply deep hands-on Dynatrace experience to day-to-day monitoring and troubleshooting
  • Work with OpenTelemetry, Prometheus / Grafana, ELK / EFK, and Cloud-native monitoring (AWS/Azure/GCP)
  • Manipulate and enrich JSON-based telemetry
  • Apply GenAI / LLMs for incident summarization, root cause explanation, runbook recommendations, and auto-remediation suggestions
  • Collaborate with platform teams to operationalize GenAI safely