Amazon • Posted 1 day ago
Full-time • Mid Level
Culver City, CA

Amazon MGM Studios is seeking a Systems Development Engineer II to improve operational excellence and system reliability for our Studio in the Cloud (SITC) platform. SITC is transforming entertainment content creation by moving all major production and post-production workflows from traditional on-premises infrastructure to AWS cloud-based solutions. This initiative encompasses nine core workflows (Dailies, Editorial, VFX, Sound, Conform, Color, Mastering, QC/Delivery, and Remote Production) representing the complete content creation lifecycle from Original Camera Footage (OCF) ingestion through final delivery.

In this role, you will work independently to make SITC systems more resilient, reliable, automated, and easier to operate. You will own your team's engineering and operational excellence, delivering solutions that reduce toil, eliminate risks, improve agility, and enable engineers to deliver products and features to customers with less effort, less cost, and better quality. You may focus on CI/CD automation for media workflow deployments, infrastructure optimization, observability and monitoring improvements, disaster recovery testing, performance tuning, or developer productivity tools that accelerate delivery.

As a SysDE II, you will design and deliver technology solutions solving difficult business and technical problems, using appropriate combinations of software development, infrastructure automation, systems design, and process improvements. You will coach others on identifying and eliminating risk, especially risks to system resilience. You will participate in design reviews, operational readiness reviews, and post-incident analyses to identify contributing causes and deliver permanent fixes that prevent recurrence. Your work is mostly tactical, though you are beginning to participate in strategic planning processes like OP1 and OP2, ensuring appropriate investment in reliability and automation.

This role requires solid technical skills spanning multiple domains including software development, infrastructure automation, cloud operations, and systems design. You will implement solutions for SITC systems that handle terabyte-scale media data, maintain responsive performance for creative workflows, operate across multiple AWS regions, and integrate with proprietary vendor systems like Avid, Adobe, and Blackmagic. Success requires writing high-quality automation and tooling, building robust operational processes, and ensuring systems support Amazon MGM Studios' production slate spanning 150+ titles annually without operational failures that impact creative teams.

Responsibilities:
  • Improve your team's operational health and system resilience by participating in design reviews, operations reviews, and post-incident analyses to identify risks to reliability, then delivering projects that mitigate those risks through automation, architectural improvements, or process changes
  • Design and deliver technology solutions that solve difficult operational and technical problems including: CI/CD pipeline automation for safe media workflow deployments, infrastructure-as-code implementations that reduce manual provisioning, monitoring and alerting systems that detect issues before customer impact, automated testing frameworks for vendor tool integrations, performance optimization for high-bandwidth media transfers, cost optimization tools that reduce infrastructure spending, or disaster recovery procedures that ensure multi-region availability
  • Work independently on automation, infrastructure, and tooling projects, seeking direction from your manager, SDE 3s, or Principal Engineers when facing architectural or technical trade-offs
  • Write high-quality software that meets Amazon Code Bar when building automation scripts, infrastructure tools, deployment systems, monitoring frameworks, or operational utilities. Ensure implementations are logical, maintainable, tested, version-controlled, and can be understood and extended by others
  • Identify and solve ambiguous problems, architectural deficiencies, or areas where your team's infrastructure hinders innovation of other teams. Begin extending this work to other teams in your organization
  • Make appropriate technical trade-offs and reuse or extend existing solutions where possible. Consider the legacy and scalability of the systems you build; avoid short-term workarounds, and escalate when they are being overused
  • Actively participate in review processes across teams including code reviews, operational readiness reviews (ORRs), and corrections of error (COEs), providing meaningful feedback to others including those more senior. Use these as teaching mechanisms to help others identify and eliminate operational risks
  • Consistently write clear, accurate, inclusive, and concise documentation for your solutions including operational runbooks, architecture diagrams, deployment procedures, troubleshooting guides, and system documentation. Improve your team's existing documentation
  • Begin to influence other teams on engineering and operational best practices, helping them apply automation, monitoring, and resilience patterns you've developed
  • Mentor other engineers on your team by training them on how systems are constructed, how they operate, how to troubleshoot issues, and how systems fit into the bigger picture. Participate in hiring processes to attract diverse talent

Qualifications:
  • Experience in automating, deploying, and supporting large-scale infrastructure
  • Experience programming with at least one modern language such as Python, Ruby, Golang, Java, C++, C#, or Rust
  • Experience with Linux/Unix
  • Experience with CI/CD pipelines and build processes
  • Experience with distributed systems at scale

Benefits:
  • Medical
  • Financial
  • Other benefits