Site Reliability Architect

Qode
Arlington, TX
Onsite

About The Position

We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.

Requirements

  • 15+ years in SRE / Production Engineering
  • Strong Unified Observability background (not infra-only)
  • Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
  • SLI/SLO engineering experience in production systems
  • Experience implementing dynamic thresholds and anomaly detection
  • Knowledge of AI/ML concepts applied to Ops (AIOps)
  • Distributed systems troubleshooting expertise
  • Experience with Kafka or streaming data platforms

Nice To Haves

  • Experience in financial services or regulated environments
  • Proven reduction of alert noise and MTTR using AIOps
  • GenAI / LLM integration into operations workflows

Responsibilities

  • Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
  • Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
  • Build actionable dashboards for operations, engineering, and leadership
  • Implement alerting strategies using static and dynamic thresholds
  • Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
  • Transition monitoring from reactive alerts to proactive insights
  • Implement noise reduction, alert correlation, and root cause analysis
  • Apply baseline modeling, seasonality detection, and anomaly scoring
  • Monitor and troubleshoot multi-service architectures involving microservices, downstream APIs, Kafka / streaming platforms, and cloud infrastructure (Terraform, IaC)
  • Identify whether issues originate from upstream/downstream dependencies, the streaming platform, infrastructure, or application code
  • Deep hands-on experience with Dynatrace (mandatory)
  • Experience with OpenTelemetry, Prometheus / Grafana, ELK / EFK, and cloud-native monitoring (AWS/Azure/GCP)
  • Strong JSON-based telemetry manipulation and enrichment
  • Apply GenAI / LLMs for incident summarization, root cause explanation, runbook recommendations, and auto-remediation suggestions
  • Collaborate with platform teams to operationalize GenAI safely

What This Job Offers

Job Type: Full-time
Career Level: Senior
Education Level: No Education Listed
Number of Employees: 1-10 employees
