Senior Engineer, SRE

Sephora
San Francisco, CA
Hybrid

About The Position

At Sephora, beauty is about feeling seen, valued, and empowered, individually and collectively. It is connecting deeply with others, celebrating diversity and inclusivity, unlocking your potential, and making a difference every day. Together, we belong to something beautiful.

As Senior Engineer, Site Reliability Engineering - Digital, you'll be ensuring hyper-stable online experiences for millions of Sephora customers. The work you do will impact beauty, as you monitor, optimize, and safeguard the reliability of Sephora's Dotcom platform and OMNI services. You'll be part of a team that's united in beauty, supported by those who are equally passionate about delivering resilient, high-performance digital experiences that connect customers to the products they love.

Requirements

  • 6+ years of hands-on SRE, DevOps, or Production Engineering experience in high-scale digital applications, with a strong understanding of reliability principles and operational excellence.
  • Strong exposure to Azure AKS, Kubernetes, Docker, Service Mesh, and API-driven architectures, with operational support experience for React front-end and Spring Boot microservices in production environments.
  • Hands-on experience with observability tools (Dynatrace, Splunk, Grafana, Prometheus) and strong scripting abilities (Python, Bash, PowerShell, YAML) to build automation that reduces toil and improves incident response.
  • Proven experience in incident management, root cause analysis, and implementing permanent corrective actions that drive long-term reliability improvements.
  • Experience with SRE principles, CI/CD pipelines (Jenkins, GitHub Actions), and cloud platforms (Azure required; AWS/GCP/OCI a plus).
  • Strong analytical and problem-solving abilities with clear communication skills under pressure, a collaborative mindset, and passion for reducing toil while improving developer and operator experiences.

Responsibilities

  • Operate and support the Dotcom and OMNI platform (including BOPIS and Same-Day Delivery), ensuring high availability, resilience, and hyper-stable customer experiences during normal operations and peak traffic events.
  • Triage, diagnose, and resolve L2/L3 production incidents; lead post-incident reviews and partner with engineering teams on permanent corrective actions to eliminate root causes.
  • Build automation solutions, reduce operational toil, and create AI-driven reliability tools and agentic workflows to improve mean time to resolution, productivity, and overall stability.
  • Develop and optimize observability through logs, metrics, traces, dashboards, and anomaly detection; refine alerting and telemetry pipelines to proactively identify and resolve issues.
  • Ensure world-class readiness for releases, seasonal events, feature launches, and traffic spikes through resiliency checks, performance validation, and comprehensive change reviews.
  • Maintain and optimize SLO/SLI frameworks; monitor error budgets and partner with application teams on continuous reliability improvements.

Benefits

  • Healthcare plans including medical, dental, and vision coverage
  • Disability insurance
  • Life insurance
  • Competitive 401(k) with 4% match
  • FSA and HSA programs
  • Student Debt Retirement plan
  • Paid time off
  • Sick paid time off
  • Protected leave
  • Development programs
  • Tuition reimbursement
  • Mentorship
  • 30% discount on all merchandise/services
  • Opportunities for free product or “gratis”
  • Flash sale discounts on LVMH brand products
  • Free mental health and financial coaching resources with 24/7 access to Modern Health and Financial Finesse
  • Volunteer and donation matching