Senior SRE (Site Reliability Engineer)

Retail Industry•Dallas, TX

1d•Hybrid

About The Position

We are seeking a high-caliber Senior Site Reliability Engineer (SRE) to join a premier client in Washington, DC, and spearhead the evolution of its enterprise observability platform. This high-impact role is designed for a technical leader with nearly a decade of specialization in Dynatrace SaaS, who will architect and automate large-scale monitoring solutions across complex AWS and Azure environments. You will bridge the gap between infrastructure and applications by leveraging Davis AI and Grail to drive proactive reliability, mentoring cross-functional DevOps teams, and establishing a gold standard for full-stack visibility in a mission-critical, multi-cloud landscape.

Requirements

9+ years of hands-on experience focused on Dynatrace implementation and management at enterprise scale.
5+ years in SRE, DevOps, or Cloud Infrastructure roles, with deep knowledge of Linux systems and networking.
Advanced experience navigating and securing AWS and Azure environments.
Strong proficiency in Python or similar scripting languages for building self-service tooling and automation.
Proven ability to integrate observability stacks with ITSM and communication tools like ServiceNow, PagerDuty, and Microsoft Teams.
Experience working within a SAFe Agile delivery environment and a solid understanding of the ITIL framework.
Bachelor’s degree in Computer Science, Engineering, or a related technical field.
Ability to work on-site in the Washington, DC area as required and provide off-hours support for critical production incidents.

Responsibilities

Lead the design, governance, and rollout of Dynatrace observability for distributed microservices, serverless workloads, and multi-region cloud environments.
Configure deep code-level visibility (PurePath), Smartscape topology mapping, and advanced APM instrumentation to ensure comprehensive system transparency.
Harness Davis AI for causal analysis and root cause identification; develop custom dashboards, alerting profiles, and auto-remediation workflows to minimize MTTR.
Implement Real User Monitoring (RUM) and Synthetic Monitoring to analyze user journeys and establish performance KPIs.
Drive "Observability as Code" by building CI/CD pipelines (GitHub Actions, Jenkins) and automating infrastructure via Terraform, CloudFormation, or AWS CDK.
Manage high-volume log ingest pipelines and processing rules using Dynatrace Grail and Log Management features.
Define and monitor SLIs, SLOs, and error budgets while participating in on-call rotations and developing detailed RCA documentation.
