The position focuses on improving the resilience of application services and infrastructure through self-healing and automated failover, targeting 99.99% uptime for customers. The role involves assisting with planned disruptions of production infrastructure, which hold teams accountable for building resilient systems, and influencing design and development teams to account for failure scenarios. Responsibilities also include identifying manual activities that can be eliminated through automation, improving scalability through capacity management, and monitoring service availability and performance. The role requires participating in post-mortem reviews and optimizing monitoring capabilities so that critical user service journeys are traceable.
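
As a rough illustration of what the 99.99% uptime target implies, the sketch below converts an availability percentage into a downtime budget. The function name `downtime_budget_minutes` and the 30-day and 365-day windows are assumptions made for this example; the posting does not say over what period uptime is measured.

```python
def downtime_budget_minutes(availability_pct: float, window_days: float) -> float:
    """Return the minutes of downtime an availability target permits in a window.

    Hypothetical helper for illustration only; the measurement window is an
    assumption, not something stated in the posting.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    if window_days < 0:
        raise ValueError("window_days must be non-negative")
    window_minutes = window_days * 24 * 60
    # The unavailable fraction of the window is the downtime budget.
    return window_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    for days in (30, 365):
        budget = downtime_budget_minutes(99.99, days)
        print(f"{days}-day window: {budget:.2f} minutes of downtime allowed")
```

Under these assumptions, 99.99% leaves roughly 4.3 minutes of downtime in a 30-day month and about 52.6 minutes in a 365-day year, which is why the role emphasizes automated failover over manual recovery.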