Manager, Site Reliability Engineering

Applied Systems, Inc.

34d • $115,000 - $175,000 • Remote

About The Position

We're seeking a Site Reliability Engineering (SRE) Manager to lead our founding SRE team. Applied Systems is dedicated to delivering cutting-edge solutions that empower insurance agencies and carriers to optimize operations, enhance customer experiences, and drive business growth. As an SRE Manager, you will play a pivotal role in ensuring the reliability, scalability, and efficiency of our infrastructure while fostering a culture of operational excellence and innovation within the team.

Requirements

8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles, with at least 2 years in a leadership or managerial capacity.
Proven experience managing hybrid infrastructure environments, including on-premises systems and public cloud platforms (GCP, AWS, Azure).
Hands-on experience with tools like Ansible, Terraform, and Kubernetes for automation and orchestration.
Strong foundation in on-premises infrastructure, including VMware, F5 LTM load balancers, and networking.
Experience with distributed systems, microservices architecture, and service discovery tools like HashiCorp Consul.
Proficiency in scripting and programming languages such as Python, Go, Bash, and PowerShell.
Expertise in Windows and Linux system administration.
Advanced knowledge of IaC tools and languages like Terraform, Terraform CDK with TypeScript, Packer, and HCL.
Familiarity with CI/CD pipelines and version control systems (GitLab, GitHub, etc.).
Experience with monitoring tools (Datadog) and security solutions (HashiCorp Vault, Cloud Armor).
Strong knowledge of SQL Server and PostgreSQL for database management.
Kubernetes expertise, including Helm charts.

Responsibilities

Team Leadership & Strategy:
Lead and manage a team of Site Reliability Engineers, providing mentorship and guidance and fostering professional growth.
Define and implement SRE best practices, ensuring alignment with organizational goals and industry standards.
Collaborate with cross-functional teams, including development, product, and platform teams, to drive consensus on technical initiatives and design decisions.
Develop and execute strategies to improve system reliability, scalability, and performance across hybrid environments.
Advocate for a culture of automation, observability, and continuous improvement within the organization.
Infrastructure Management:
Oversee the design, implementation, and maintenance of hybrid infrastructure environments, including on-premises systems and public cloud platforms (GCP, AWS, Azure).
Ensure optimal performance and reliability of infrastructure components such as VMware for on-prem virtualized environments, F5 LTM for on-prem load balancing, and Google Load Balancer for cloud-based traffic management.
System Reliability & Scalability:
Lead efforts to design and implement scalable, fault-tolerant systems with high availability and performance optimization.
Define and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure system reliability.
Automation & Configuration Management:
Drive automation initiatives using tools like Ansible, Ansible Tower (AWX), Terraform, Terraform CDK with TypeScript, Packer, and Python.
Oversee the development and maintenance of Infrastructure as Code (IaC) to streamline provisioning and management of on-prem and cloud environments.
Monitoring & Observability:
Implement and manage monitoring and tracing instrumentation using Datadog to track system performance and adherence to SLIs/SLOs/SLAs.
Promote observability practices to proactively identify and resolve issues before they impact customers.
Disaster Recovery & Security:
Define and implement disaster recovery strategies and high-availability solutions across hybrid environments.
Collaborate with security teams to ensure compliance with security standards and best practices, including the use of tools like HashiCorp Vault and Cloud Armor.
CI/CD Pipelines & Application Deployment:
Oversee the development and optimization of CI/CD pipelines using tools and workflows like GitLab and GitHub Actions to ensure efficient and reliable deployments.
Manage Kubernetes environments, including Helm charts and ArgoCD for application orchestration and deployment.
Database Management:
Ensure the reliability, scalability, and performance of databases, including SQL Server and PostgreSQL.
Documentation & Vendor Collaboration:
Ensure accurate documentation of workflows, procedures, and infrastructure standards to support internal teams and customers.
Collaborate with third-party vendors to evaluate and integrate their products and services into the infrastructure ecosystem.
On-Call Support:
Lead the on-call rotation strategy, ensuring timely resolution of production issues and complex engineering challenges.