Site Reliability Engineering (SRE) Architect

Dell Technologies, Round Rock, TX

About The Position

Join us to do the best work of your career and make a profound social impact as a Site Reliability Engineering (SRE) Architect on our Site Reliability Engineering Team in Austin, Texas.

What you’ll achieve

We are seeking a highly experienced Site Reliability Engineering (SRE) Architect to lead the design, evolution, and reliability of our large-scale distributed systems. The ideal candidate will demonstrate deep expertise in Dynatrace, AIOps platforms, observability engineering, and AI-driven automation, including hands-on development with AI agents and modern coding frameworks. This is a technical leadership role requiring architecture-level thinking, strong coding ability, and the ability to drive enterprise-wide transformation.

Take the first step towards your dream career

Every Dell Technologies team member brings something unique to the table. Here’s what we are looking for with this role:

Requirements

  • Architecture & Reliability Engineering: Design and architect highly reliable, scalable, and self-healing systems across hybrid, multi-cloud, and on-prem environments. Establish reliability patterns, guardrails, and architecture standards including SLIs, SLOs, error budgets, and resiliency patterns. Lead root cause prevention strategies, chaos engineering practices, and resilience validation frameworks.
  • Observability & Dynatrace Expertise: Own end-to-end observability strategy using Dynatrace, including Application Performance Monitoring (APM), infrastructure monitoring, log analytics, real-user monitoring (RUM), and custom instrumentation and dashboards. Architect deterministic and AI-driven alerting, Davis AI configurations, and service-level dependency mapping.
  • AIOps & Automation: Lead adoption and integration of AIOps platforms (Dynatrace Davis AI, ServiceNow AIOps, Moogsoft, or equivalent). Build intelligent automation pipelines for predictive incident detection, auto-remediation, noise reduction and event correlation, and operational anomaly detection. Drive automation-first operations to reduce toil and improve operational efficiency.
  • Coding & AI Agents: Develop and integrate AI agents capable of automated troubleshooting, intelligent runbook execution, workflow automation, and LLM-driven operational insights. Write high-quality code in languages such as Python, Go, TypeScript, or Java. Build internal tools, automation frameworks, and platform APIs.
  • Cross-Functional Leadership: Partner with SRE teams, platform engineering, application engineering, cybersecurity, and infrastructure groups. Provide architectural governance, participate in design reviews, and influence engineering standards. Mentor engineers on reliability, observability, and automation best practices.

Nice To Haves

  • Bachelor’s degree with 12+ years of experience, Master’s or PhD with 8+ years of experience, or an equivalent combination of education and experience

Responsibilities

  • Design and architect highly reliable, scalable, and self-healing systems across hybrid, multi-cloud, and on-prem environments
  • Establish reliability patterns, guardrails, and architecture standards including SLIs, SLOs, error budgets, and resiliency patterns
  • Lead root cause prevention strategies, chaos engineering practices, and resilience validation frameworks
  • Own end-to-end observability strategy using Dynatrace, including Application Performance Monitoring (APM), infrastructure monitoring, log analytics, real-user monitoring (RUM), and custom instrumentation and dashboards
  • Architect deterministic and AI-driven alerting, Davis AI configurations, and service-level dependency mapping
  • Lead adoption and integration of AIOps platforms (Dynatrace Davis AI, ServiceNow AIOps, Moogsoft, or equivalent)
  • Build intelligent automation pipelines for predictive incident detection, auto-remediation, noise reduction and event correlation, and operational anomaly detection
  • Drive automation-first operations to reduce toil and improve operational efficiency
  • Develop and integrate AI agents capable of automated troubleshooting, intelligent runbook execution, workflow automation, and LLM-driven operational insights
  • Write high-quality code in languages such as Python, Go, TypeScript, or Java
  • Build internal tools, automation frameworks, and platform APIs
  • Partner with SRE teams, platform engineering, application engineering, cybersecurity, and infrastructure groups
  • Provide architectural governance, participate in design reviews, and influence engineering standards
  • Mentor engineers on reliability, observability, and automation best practices