Director, Platform SRO

Versant•New York, NY

2d•$180,000 - $210,000•Onsite

About The Position

The Director, Platform SRO is a senior, hands-on technical leader responsible for ensuring the stability, resilience, and operational readiness of mission-critical broadcast linear, live event, and digital media platforms. Operating in high-pressure, real-time environments, the Director leads major incident response efforts, supports on-air and live-event continuity, and partners closely with engineering, broadcast operations, production, and vendor teams to minimize service disruption and audience impact. The role requires deep practical experience with media workflows, rapid troubleshooting during live events, and the ability to make sound technical decisions under tight time constraints.

Beyond reactive incident response, the Director plays a strategic role in improving long-term system reliability and operational maturity. By applying SRO/SRE principles adapted for media environments, the Director identifies systemic risks, drives root cause analysis, strengthens monitoring and observability, and improves operational processes across broadcast and digital ecosystems. This role balances immediate hands-on execution with advisory leadership, helping the organization build more resilient architectures, clearer incident processes, and greater confidence in its ability to support live, always-on media operations.

Requirements

Experience supporting media, broadcast, streaming, digital publishing, or other 24x7 customer-facing platforms.
Experience building or scaling SRE organizations and operational maturity programs.
Hands-on experience with observability platforms such as Datadog, New Relic, Splunk, Grafana, or similar tools.
Familiarity with Infrastructure as Code and automation frameworks including Terraform, CloudFormation, or equivalent technologies.
Experience leading reliability initiatives across hybrid cloud and on-premises environments.
Industry certifications such as AWS Solutions Architect, Google Professional Cloud Engineer, Azure Solutions Architect, ITIL, SRE Foundation, or equivalent.

Nice To Haves

Experience implementing AI-assisted operational intelligence, event correlation, or automated incident response capabilities.

Responsibilities

Lead and coordinate high-severity incident response for broadcast linear channels, live events, and digital media platforms, serving as incident commander when required
Rapidly triage and troubleshoot issues across media workflows, including playout, live production, contribution/distribution, and OTT delivery
Establish, refine, and execute incident management processes, including escalation models, on-call coordination, communications, and severity classification
Produce post-incident reviews, root cause analyses, and corrective action plans to prevent recurrence and reduce operational risk
Assess system reliability, fault tolerance, and operational readiness across on-prem, hybrid, and cloud-based media architectures
Identify single points of failure and recommend architectural, workflow, and operational improvements to enhance availability and resilience
Define and improve monitoring, alerting, and observability strategies tailored to real-time broadcast and live event environments
Support disaster recovery, failover planning, and live-event readiness reviews, including testing and validation
Develop and maintain operational runbooks, standard operating procedures, and incident documentation
Partner with engineering, broadcast operations, production teams, and vendors to align reliability practices with on-air and live-event requirements
Mentor teams on incident response best practices, reliability engineering concepts, and continuous improvement
Advise leadership on operational risk, system health, and reliability priorities for critical media platforms