Senior Engineering Manager, Cloud Platform

Horizon3 AI

Remote

About The Position

Horizon3.ai is seeking an engineering leader to head its infrastructure platform team. This team builds capabilities that let product development teams rapidly launch new features through self-service environments and infrastructure. The leader will also establish a Site Reliability Engineering (SRE) function to define and drive investments in operational excellence, ensuring Horizon3’s product and service offerings meet customer and business expectations. Horizon3.ai is a fast-growing, remote cybersecurity company focused on helping organizations proactively identify and fix exploitable attack vectors before criminals can exploit them. Its flagship product, NodeZero™, offers autonomous pentests and other assessment operations for internal, external, cloud, and hybrid cloud environments. The team is composed of former U.S. Special Operations cyber operators, startup engineers, and cybersecurity practitioners, united by a commitment to solving common security problems like ineffective tools, alert fatigue, and skills shortages.

Requirements

Demonstrated experience leading teams operating SaaS service infrastructure.
Deep hands-on experience deploying and operating production infrastructure on public cloud platforms (AWS strongly preferred; Azure and GCP familiarity a plus).
Strong command of Infrastructure as Code, including Terraform; experience with Crossplane and GitOps patterns strongly preferred.
Experience managing production Kubernetes environments at scale.
Solid understanding of security best practices including zero trust architecture, secrets management, identity and access management, and software supply chain security.
Experience building and operating self-service infrastructure platforms that enable application development teams, while balancing self-service and developer productivity with maintainability and security.
Experience leading or building SRE functions, including incident management processes, on-call programs, SLO/SLA definition, and operational runbooks.
Deep hands-on experience with observability: application performance management, logs and traces, golden signals, and service-specific metrics.
Expert in leading infrastructure teams to translate business and product requirements into technical requirements and engineering deliverables.
Demonstrated ability to drive a design-before-build engineering culture and factor operational lifecycle and impact on dependent teams into infrastructure architecture decisions.
Experience in design practices like architecture decision records or RFCs in the infrastructure or platform engineering context.
Proven track record of driving engineering rigor through coaching, design reviews, and feedback.
Track record of growing engineers from strong individual contributors to system-level thinkers.
Proven ability to hire, develop, and retain high-performing engineers and engineering managers in remote or distributed environments.
Proven skills in managing a backlog of strategic roadmap initiatives and a competing stream of developer support and operational requests against capacity constraints.
Strong communication skills across technical and non-technical audiences.
Proven ability to build, scale, and retain high-performing engineering teams in remote or distributed environments.

Nice To Haves

Azure and GCP familiarity
Experience with Crossplane and GitOps patterns

Responsibilities

Lead software engineering teams providing infrastructure-as-code to manage cloud infrastructure.
Provide high-quality IaC components and frameworks that application development teams can leverage and extend to self-serve their infrastructure provisioning.
Establish governance and mechanisms for application development teams to self-serve infrastructure provisioning while upholding best practices and controls.
Provide documentation, training, and support to ensure feature development teams are leveraging self-service capabilities.
Hire experienced site reliability staff and a line manager to grow and oversee the SRE team.
Professionalize incident management by defining and documenting incident processes and practices for the SRE team and for application feature teams.
Make tool and vendor decisions to support incident management processes.
Drive incident professionalism across the engineering organization through training and process adoption.
Establish design-before-build discipline.
Facilitate lightweight design documents, architectural decision records, and working group reviews.
Outline operational lifecycles, “Day 2” concerns, and developer experience as part of infrastructure architecture decisions.
Use design reviews, code reviews, and blameless retrospectives to drive a quality and excellence culture in engineering.
Balance providing developer support while also executing on a roadmap of infrastructure engineering initiatives.
Establish intake, allocate resources, provide visibility into backlogs to stakeholders, and manage prioritization against capacity.
Directly manage a growing team of infrastructure engineers.
Hire and develop line managers and staff / principal engineers.
Ensure a strong bench of technical and leadership talent in the group.
Recruit and onboard talented individuals to support organizational goals.
Mentor, coach, equip, and develop the team.
Recognize and retain high performers.
Lead horizontally with peer managers and senior leaders.