Site Reliability Engineer

David's Bridal • King of Prussia, PA

Hybrid

About The Position

The Site Reliability Engineer (SRE) is accountable for the availability, scalability, and performance of David's Bridal's cloud compute platform across AWS and Azure, as well as the on-premises infrastructure spanning Hyper-V, VMware, Linux systems, and core network services (Active Directory, DNS, DHCP). This role owns SLO definition, incident response, capacity planning, and automation for production workloads supporting our retail and ecommerce platforms during peak wedding and holiday demand. The SRE also ensures the reliability and lifecycle of on-premises compute and identity infrastructure, maintaining seamless integration between datacenter environments and cloud platforms. The SRE partners with Platform Engineering, Security, and Application teams to drive service health, reduce toil, and embed reliability practices across the SDLC.

Requirements

7+ years of SRE, DevOps, or production engineering experience operating production workloads on AWS and/or Azure; multi-cloud and hybrid experience strongly preferred.
Deep operational expertise with AWS EC2, ECS or EKS, and Lambda, or with Azure VMs, AKS, and Functions, in a regulated or high-traffic ecommerce environment.
Hands-on experience administering Hyper-V and/or VMware vSphere environments including clustering, HA, VM lifecycle, and storage integration.
Demonstrated experience with Linux server administration (RHEL, CentOS, or Ubuntu) including patching, performance tuning, and shell scripting.
Strong Active Directory expertise — OU design, GPO management, user/group lifecycle, replication troubleshooting, and Entra ID / AD Connect synchronization.
Practical experience managing Windows Server DNS and DHCP in multi-site enterprise environments, including DHCP failover, relay configuration, and split-brain DNS.
Strong scripting skills in Python, Bash, and PowerShell; PowerShell DSC and PowerCLI experience is a plus.
Production experience with Terraform or Bicep, and CI/CD with GitHub Actions or Azure DevOps.
Hands-on Kubernetes operations experience on EKS, AKS, or self-managed clusters.
Demonstrated experience defining and operating against SLOs and error budgets.
Strong observability background with Datadog, Splunk, Zabbix, Prometheus, or equivalent platforms.
Experience leading incident response and authoring blameless postmortems.

Nice To Haves

Retail or ecommerce platform experience, especially during peak events such as Black Friday, Cyber Monday, and bridal season.
Experience with Azure Arc for managing on-premises and multi-cloud servers from a unified control plane.
Familiarity with Shopify Plus, headless commerce, CDN edge platforms, and Tealium or equivalent CDP.
Chaos engineering practice using AWS FIS, Azure Chaos Studio, or Gremlin.
FinOps practice and cloud cost optimization at enterprise scale.
Experience with cross-cloud networking (Transit Gateway, Azure Virtual WAN, ExpressRoute, Direct Connect) and Meraki SD-WAN.
Experience with SIEM platforms (Microsoft Sentinel, Gravwell, Splunk) in a hybrid cloud / on-premises environment.
AWS Certified Solutions Architect (Associate or Professional); AWS Certified DevOps Engineer.
Microsoft Certified Azure Administrator (AZ-104); Azure Solutions Architect Expert (AZ-305).
Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD).
Microsoft Certified: Identity and Access Administrator (SC-300) or equivalent Active Directory / Entra ID certification.
VMware Certified Professional (VCP-DCV) or Microsoft Hyper-V certification.

Responsibilities

Own production reliability for compute workloads on AWS (EC2, ECS, EKS, Lambda) and Azure (VMs, AKS, Functions) supporting davidsbridal.com and core retail systems.
Define and operate SLIs, SLOs, and error budgets for tier 1 services; partner with product and engineering on reliability targets and release gates.
Lead capacity planning, autoscaling design, and rightsizing across multiple AWS regions and Azure subscriptions, with explicit readiness for peak bridal and holiday traffic.
Drive cloud compute cost optimization, including Reserved Instance and Savings Plan strategy, Azure Reservations, spot adoption, and workload placement decisions.
Own the operational health of on-premises virtualization platforms including Hyper-V clusters (Windows Server) and VMware vSphere / ESXi environments.
Manage VM lifecycle end-to-end — provisioning, rightsizing, snapshotting, replication, and decommission — across Hyper-V and VMware hosts.
Maintain Hyper-V and VMware host patching, firmware updates, and capacity management; coordinate with hardware vendors and the datacenter team for physical infrastructure.
Administer Linux servers (RHEL, CentOS, Ubuntu) including OS patching, performance monitoring, cron job management, and service hardening.
Design and maintain HA and DR configurations for on-premises compute — Hyper-V Replica, VMware vSphere HA/DRS, and backup integration with Rubrik or equivalent.
Drive hybrid cloud integration between on-premises infrastructure and cloud platforms, including ExpressRoute / Direct Connect connectivity, Azure Arc-managed on-premises VMs, and workload migration planning.
Own the health and operational integrity of Active Directory (AD DS) across all DBI domains — OU structure, Group Policy, site topology, replication health, and AD Connect / Entra ID synchronization.
Manage AD user and group lifecycle in coordination with HR and IT operations — provisioning, modification, offboarding, and access governance aligned with identity management policy.
Administer Windows Server DNS infrastructure including forward/reverse zones, DNS replication, conditional forwarders, and split-brain DNS for internal and store-facing domains (dbistores.com, dbi.com).
Manage DHCP infrastructure across corporate and store environments — scope configuration, failover partnerships (DHCP02/DHCP03), relay/helper IP updates, lease duration management, and exclusion ranges.
Investigate and resolve DNS/DHCP incidents including OFFER=0 conditions, scope exhaustion, relay misconfiguration, and cross-site replication failures.
Maintain DHCP relay configurations on all routers and switches; coordinate with the network team on helper-address updates during infrastructure changes (office moves, new VLANs, server decommissions).
Document and enforce AD change management practices — no production AD changes on Fridays, script exclusions for service accounts and distribution groups, and OU-level protections for store and disabled accounts.
Serve as incident commander for production events across cloud and on-premises environments; lead triage, mitigation, customer impact assessment, and executive communications.
Build and maintain runbooks, automated remediation, and self-healing patterns for high-frequency failure modes across AWS, Azure, Hyper-V, VMware, and AD/DNS/DHCP infrastructure.
Lead blameless postmortems and drive corrective actions to closure; track recurring themes and feed them into the reliability roadmap.
Participate in a 24x7 on-call rotation and continuously reduce on-call burden through automation and toil tracking.
Instrument compute workloads with metrics, logs, traces, and synthetic checks across AWS, Azure, and on-premises using Datadog, Splunk, and Zabbix.
Define golden signals and dashboards for tier 1 services; ensure alerts are actionable, owned, and tied to documented SLOs.
Extend observability coverage to on-premises infrastructure — Hyper-V host health, VMware cluster utilization, AD replication status, DNS query latency, and DHCP lease utilization.
Partner with application teams to embed observability into new services from day one and during production readiness reviews.
Build and maintain Terraform modules and Bicep templates for AWS and Azure compute, networking, and Kubernetes infrastructure.
Automate on-premises operations including Hyper-V VM provisioning via PowerShell/DSC, VMware tasks via PowerCLI, AD bulk operations, DHCP scope management, and DNS zone updates.
Operate GitHub Actions and Azure DevOps pipelines for compute platform changes, including policy as code and drift detection.
Automate routine work including patching, AMI and image baking, certificate rotation, scaling events, and access provisioning across cloud and on-premises environments.
Champion shift-left reliability by integrating reliability and security checks into CI/CD pipelines.
Partner with Security, Platform Engineering, and Application teams on production readiness reviews, change management, and architecture decisions.
Mentor engineers on SRE principles, observability best practices, and cloud and on-premises compute patterns.
Influence architecture toward scalable, resilient, and cost-effective compute patterns across the David's Bridal technology portfolio — cloud, hybrid, and on-premises.