Director of DevOps

Kipu Systems • Salt Lake City, UT

About The Position

We’re looking for a Director of DevOps and CloudOps to own the reliability, security, and scalability of Kipu’s platform infrastructure across AWS and Azure. You’ll lead the team responsible for keeping our systems running, our deployments fast and safe, and our infrastructure ready to support the next phase of Kipu’s growth. This is a hands-on leadership role: you write scripts, build automation, maintain and upgrade internal tools, and architect solutions alongside your team every day.

Kipu is the leading technology platform for behavioral health, operating in a HIPAA-regulated environment where uptime and data security are non-negotiable. You’ll manage two established teams and individual contributors, approximately 13 engineers in total across DevOps Engineering, DevOps Production Support, and specialized infrastructure roles.

A critical part of this role is enabling the broader engineering organization: multiple product teams are spinning up new services at a fast pace, and your team is responsible for standing up all CI/CD pipelines, container orchestration (Kubernetes, AWS ECS), cloud infrastructure, and observability for every new service that ships. You’ll partner closely with engineering, product, and security to ensure our cloud infrastructure is a competitive advantage, not a constraint.

Requirements

8+ years of experience in DevOps, CloudOps, SRE, or infrastructure engineering, with at least 3 years leading teams.
Deep expertise in AWS (EC2, ECS/EKS, RDS, S3, Lambda, VPC, IAM, CloudWatch, Secrets Manager), including networking, compute, storage, and cost optimization.
Strong background in CI/CD pipeline design, release engineering, and deployment automation.
Experience operating infrastructure in a HIPAA-compliant or similarly regulated environment.
Proven track record building and maintaining observability stacks (monitoring, alerting, logging, tracing).
Infrastructure-as-code fluency: Terraform (required) and AWS CDK, plus familiarity with CloudFormation or Pulumi. Experience with configuration management tools (Ansible preferred).
Experience managing containerized workloads at scale (Kubernetes, ECS, or similar).
Demonstrated ability to recruit, develop, and retain strong infrastructure engineering talent.
Working experience with Azure cloud services (Azure DevOps, AKS, Azure Monitor, or equivalent).
Strong scripting, coding, and automation skills in Python and Bash—you write code daily, not occasionally.
Experience building, maintaining, and upgrading internal tools and applications.
Experience with security risk management, vulnerability remediation, and compliance-driven patching across cloud infrastructure at scale.
Demonstrated ability to manage managers and lead through others while remaining technically engaged.
High personal integrity, strong work ethic, and a commitment to doing the right thing under pressure.

Nice To Haves

Experience with Azure, particularly in a multi-cloud or hybrid environment.
Healthcare SaaS or multi-tenant platform experience.
SOC 2 or HITRUST audit experience, including evidence collection and control implementation.
Background supporting data-intensive or AI/ML infrastructure workloads.
Experience leading platform migrations or major infrastructure modernization efforts.
Familiarity with FinOps practices and cloud cost governance at scale.
Experience with PostgreSQL administration and performance tuning.
Familiarity with Datadog, Grafana, PagerDuty, and building observability-as-code.
AWS certifications (Solutions Architect Professional, DevOps Engineer Professional) or Azure equivalents.
Experience with service mesh, API gateways, or zero-trust networking models.
Familiarity with Ruby on Rails, Node.js, or Spring Boot application ecosystems (the services your team will support).

Responsibilities

Own Kipu’s cloud infrastructure strategy across AWS and Azure, including architecture decisions, cost optimization, and capacity planning.
Drive reliability and availability targets, establishing and maintaining SLAs/SLOs that align with customer and business expectations.
Lead incident response, root cause analysis, and post-incident review processes to continuously improve system resilience.
Manage infrastructure budgets and optimize cloud spend without sacrificing performance or security.
Write Python, Bash, and other scripts daily to automate operations, solve problems, and improve workflows. Own and evolve infrastructure-as-code (Terraform, CDK, Ansible).
Maintain, upgrade, and develop internal DevOps applications and automation tools used across the organization.
Design and maintain CI/CD pipelines (Jenkins, GitHub Actions) that enable engineering teams to ship with speed and confidence.
Establish release engineering standards, including deployment strategies (blue-green, canary, feature flags) and rollback procedures.
Reduce build times, flaky tests, and deployment friction across the engineering organization.
Serve as the infrastructure partner for product engineering teams spinning up new services—own the process of onboarding each service into CI/CD, container platforms (Kubernetes, ECS), and cloud infrastructure.
Drive standardization of service deployment patterns, infrastructure templates, and operational runbooks across all teams.
Ensure infrastructure meets HIPAA, SOC 2, and other regulatory requirements, partnering with security and compliance teams on audits and remediation.
Implement and enforce infrastructure security best practices, including network segmentation, IAM policies, secrets management, and encryption at rest and in transit.
Maintain disaster recovery and business continuity plans, including regular testing and validation.
Own security risk identification, assessment, and remediation across the infrastructure—proactively identify vulnerabilities and drive fixes across cloud resources.
Manage security patching, hardening, and compliance remediation at scale across AWS and Azure environments.
Build and evolve Kipu’s observability stack: monitoring, alerting, dashboards, logging, and distributed tracing (Datadog, CloudWatch, Azure Monitor).
Establish a data-driven approach to reliability, using SLIs and error budgets to balance velocity with stability.
Proactively identify and address infrastructure risks before they become customer-facing incidents.
Design and enforce observability standards for every new service—ensure teams ship with proper metrics, logging, and alerting from day one.
Provide production support and operational guidance to other engineering teams across the organization.
Lead and mentor two managers and their teams, plus direct IC reports (~13 total headcount), fostering a culture of ownership, accountability, and continuous improvement.
Define team structure, hiring plans, and career development paths as the organization scales.
Collaborate cross-functionally with engineering, product, and security leadership to align infrastructure priorities with business goals.