Manager, Engineering Ops (DevOps)

Talent Systems

$170,000 - $180,000•Hybrid

About The Position

We are seeking an experienced Head of Engineering, Engineering Operations to lead our engineering operations, which include areas such as DevOps, Site Reliability Engineering (SRE), CI/CD, and release management, for our cloud-based systems and applications. This role is pivotal in ensuring the reliability, security, scalability, and availability of our systems while driving innovation in automation, CI/CD pipelines, and operational efficiency. You will be responsible for managing crises, improving system performance and cost efficiency, and fostering a culture of operational excellence.

Requirements

10+ years of experience in software engineering, with 5+ years in leadership roles.
Proven track record of improving system reliability, availability, and performance for cloud-based applications.
Extensive experience with CI/CD pipelines and automation tools.
Demonstrated expertise in crisis management and incident response in high-pressure environments.
Deep knowledge of cloud platforms (such as AWS) and container and orchestration tools (Docker, Kubernetes).
Strong proficiency in monitoring and observability tools like Grafana.
Excellent problem-solving and decision-making skills under pressure.
Exceptional communication and collaboration skills, with the ability to influence stakeholders across engineering and business teams.
Proven ability to lead and grow high-performing teams in a fast-paced environment.
A strong focus on fostering a culture of accountability, learning, and operational excellence.
Ability to influence partner engineering teams, such as platform and product engineering.

Responsibilities

Lead and mentor teams in DevOps, SRE, and Engineering Operations, fostering a culture of collaboration, ownership, and innovation.
Develop and execute the strategic roadmap for engineering operations, aligning with business goals and product requirements.
Advocate for and implement industry best practices in system reliability, DevOps, and automation.
Drive initiatives to improve the reliability, availability, and performance of cloud-based applications and infrastructure.
Establish performance measurements for various system health metrics.
Ensure robust incident management and crisis response processes to minimize downtime and customer impact.
Oversee the design, implementation, and optimization of CI/CD pipelines to enable seamless and automated deployment processes.
Leverage automation tools and practices to reduce manual interventions and improve operational efficiency.
Collaborate with product and engineering teams to enable rapid and reliable feature delivery.
Implement and maintain advanced monitoring, logging, and alerting systems to gain deep insights into system health and performance.
Use observability tools (e.g., Grafana) to proactively identify and resolve issues before they impact customers.
Lead crisis management efforts during high-severity incidents, ensuring quick resolution and effective communication with stakeholders.
Conduct root cause analyses and drive post-mortem reviews to identify and address operational gaps.
Build, grow, and retain a high-performing engineering operations team with expertise in DevOps and SRE practices across multiple geographic locations.
Foster close collaboration with development, data, and product teams to align engineering operations with overall business objectives.
Promote a blameless post-mortem culture to encourage continuous learning and improvement.
Optimize cloud infrastructure costs while maintaining system reliability and scalability.
Implement robust security practices in operations to ensure compliance with industry standards and regulations.