Senior Site Reliability Engineer II

Shutterfly
Fort Mill, SC
78d
$106,000 - $151,000
Remote

About The Position

At Shutterfly, we make life's experiences unforgettable. We believe there is extraordinary power in self-expression. That's why our family of brands helps customers create products and capture moments that reflect who they uniquely are.

Shutterfly is looking for a Senior Site Reliability Engineer to join our team. Shutterfly is undergoing a comprehensive consumer website re-platforming effort, with the Site Reliability Engineering (SRE) team playing a pivotal role in building shared infrastructure and ensuring future efficiency and supportability.

The Senior Site Reliability Engineer II role is responsible for ensuring the reliability, availability, and performance of Shutterfly's consumer systems. This position requires deep technical expertise in performance troubleshooting, system optimization, and automation to help maintain resilient, scalable, and cost-efficient platforms. As a senior member of the SRE team, you will collaborate closely with development and operations teams, contribute to automation and observability solutions, and serve as a subject matter expert during incidents.

Requirements

  • 5-7+ years of experience in software engineering, SRE, or DevOps roles supporting large-scale, highly available systems.
  • Strong skills in performance troubleshooting, root cause analysis, and distributed system optimization.
  • Proficiency in at least one programming language (Python, Go, Java, or similar) with ability to write production-quality code.
  • Hands-on experience with observability platforms (e.g., Splunk, Datadog, SignalFx, Prometheus, OpenTelemetry).
  • Strong knowledge of AWS services, cloud deployment models, and cost optimization strategies.
  • Experience with Infrastructure as Code (Terraform, CloudFormation) and configuration management (Ansible, Chef, Puppet).
  • Solid understanding of distributed systems concepts (scalability, high availability, fault tolerance).
  • Experience in incident management and driving operational improvements.
  • Effective communication skills with ability to work across engineering and business teams.
  • Bachelor's degree in Computer Science, Engineering, or equivalent experience.

Nice To Haves

  • Exposure to AI/ML or AIOps tools for anomaly detection, predictive analytics, or automated incident response.

Responsibilities

  • Perform advanced performance analysis and troubleshooting across distributed systems to ensure optimal availability, scalability, and cost efficiency.
  • Implement and maintain monitoring, alerting, and observability solutions to provide proactive visibility into application and infrastructure health.
  • Partner with development teams to influence service design and architecture so that new features meet high standards for reliability and scalability.
  • Participate in incident response, including root cause analysis and long-term reliability improvements.
  • Contribute to capacity planning, cost optimization, and performance tuning of large-scale systems.
  • Build and maintain automation and tooling that reduces manual effort, accelerates delivery, and minimizes human error.
  • Explore and apply AI/ML technologies (e.g., anomaly detection, predictive scaling, automated alerting) to enhance SRE practices.
  • Share expertise with peers by documenting best practices, solutions, and troubleshooting methodologies.
  • Collaborate across infrastructure, development, and business teams to align on standards and reliability goals.
  • Provide technical depth and decisive action during critical incidents.

Benefits

  • Bonus incentive
  • Health benefits
  • 401K program
  • Employee perks

What This Job Offers

  • Career Level: Senior
  • Industry: Personal and Laundry Services
  • Education Level: Bachelor's degree
  • Number of Employees: 5,001-10,000 employees
