Site Reliability Engineer II

The Walt Disney Company • New York, NY

About The Position

The Streaming SRE squad drives improvements in performance, resiliency, and operational excellence. We take a consultative approach to reliability engineering, partnering with cross-functional teams to provide guidance, automation, education, and best practices that elevate the reliability and scalability of the services that support our products and brands.

We are seeking a Site Reliability Engineer who will contribute to the stability and scalability of critical systems by building automation, improving operational workflows, enhancing observability, and participating in incident response. The ideal candidate has a strong understanding of distributed system fundamentals, cloud-native resources and operations, and performance optimization. This role requires a collaborative mindset and the ability to work closely with engineering teams to implement SRE principles across the organization.

Fostering innovation is critical to success at Disney Entertainment and ESPN Product & Technology, so the ideal candidate must also adapt readily to change and pivot when required.

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
3+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or a related discipline.
Hands-on experience with cloud platforms: AWS (preferred), GCP, or Azure.
Proficiency in Python, Go, JavaScript, Bash, or equivalent programming or scripting languages.
Working knowledge of Linux or Unix-based systems.
Experience with CI/CD systems (e.g., GitHub Actions, GitLab CI, Jenkins).
Familiarity with Infrastructure-as-Code (Terraform, CloudFormation, etc.).
Experience with containerization technologies such as Docker and Kubernetes.
Understanding of networking fundamentals, distributed systems, and system design basics.
Strong analytical and troubleshooting skills, including the ability to diagnose complex system issues.
Ability to work both independently and collaboratively.
Strong communication skills and the ability to collaborate effectively with cross-functional teams.

Nice To Haves

Hands-on experience with observability stacks (Prometheus, Grafana, ELK/EFK, Datadog, Splunk, New Relic).
Exposure to GitOps tooling (Argo CD, Flux).
Experience contributing to SLO/SLI frameworks and implementing error budgets.
Knowledge of service mesh architectures (Istio, Linkerd).
Familiarity with performance testing and load testing tools.
Experience with message queues, event-driven systems, or distributed data platforms.
Cloud or DevOps-related certifications (AWS Associate/Specialty, GCP Professional, Kubernetes CKA/CKS).
Experience working in large-scale enterprise environments or with distributed global teams.
Experience using modern AI-assisted development tools (e.g., Copilot, Cursor, or similar) to improve code quality, accelerate development, and enhance documentation.
Understanding of foundational AI/ML concepts, familiarity with cloud-native AI services such as model hosting, and/or the ability to use AI tools to automate cloud operations tasks.

Responsibilities

Contribute to the design, implementation, and improvement of systems to enhance reliability, scalability, and performance.
Build and maintain automation for deployment, monitoring, alerting, and operational workflows.
Collaborate with software engineering teams to implement SRE best practices, including SLIs, SLOs, error budgets, and automated remediation.
Support CI/CD pipelines and participate in optimizing the software delivery lifecycle.
Develop tools, dashboards, and instrumentation to improve observability across metrics, logs, and distributed tracing.
Participate in incident response, root cause analysis (RCA), and corrective actions to prevent recurrence.
Assist in capacity planning, performance tuning, and scaling strategies for distributed systems.
Maintain and improve Infrastructure-as-Code (IaC) definitions and cloud environment configurations.
Contribute to documentation, runbooks, architectural diagrams, and operational standards.
Collaborate with cross-functional teams to identify reliability risks and recommend improvements.
Participate in incident-based escalations and rotations to support high-availability production systems.
Continuously evaluate system architecture, tools, and practices to drive operational excellence and efficiency.