Manager, Site Reliability Engineering

Applied Systems, Inc.
5d | $115,000 - $175,000 | Remote

About The Position

We're seeking a Site Reliability Engineering (SRE) Manager to lead our founding SRE team. Applied Systems is dedicated to delivering cutting-edge solutions that empower insurance agencies and carriers to optimize operations, enhance customer experiences, and drive business growth. As an SRE Manager, you will play a pivotal role in ensuring the reliability, scalability, and efficiency of our infrastructure while fostering a culture of operational excellence and innovation within the team.

Requirements

  • 8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles, with at least 2 years in a leadership or managerial capacity.
  • Proven experience managing hybrid infrastructure environments, including on-premises systems and public cloud platforms (GCP, AWS, Azure).
  • Hands-on experience with tools like Ansible, Terraform, and Kubernetes for automation and orchestration.
  • Strong foundation in on-premises infrastructure, including VMware, F5 LTM load balancers, and networking.
  • Experience with distributed systems, microservices architecture, and service discovery tools like HashiCorp Consul.
  • Proficiency in scripting and programming languages such as Python, Go, Bash, and PowerShell.
  • Expertise in Windows and Linux system administration.
  • Advanced knowledge of IaC tools like Terraform, Terraform CDK with TypeScript, Packer, and HCL.
  • Familiarity with CI/CD pipelines and version control systems (GitLab, GitHub, etc.).
  • Experience with monitoring tools (Datadog) and security solutions (HashiCorp Vault, Cloud Armor).
  • Strong knowledge of SQL Server and PostgreSQL for database management.
  • Kubernetes expertise, including Helm charts.

Responsibilities

  • Team Leadership & Strategy:
      • Lead and manage a team of Site Reliability Engineers, providing mentorship and guidance and fostering professional growth.
      • Define and implement SRE best practices, ensuring alignment with organizational goals and industry standards.
      • Collaborate with cross-functional teams, including development, product, and platform teams, to drive consensus on technical initiatives and design decisions.
      • Develop and execute strategies to improve system reliability, scalability, and performance across hybrid environments.
      • Advocate for a culture of automation, observability, and continuous improvement within the organization.
  • Infrastructure Management:
      • Oversee the design, implementation, and maintenance of hybrid infrastructure environments, including on-premises systems and public cloud platforms (GCP, AWS, Azure).
      • Ensure optimal performance and reliability of infrastructure components such as VMware for virtualized environments, F5 LTM for on-prem load balancing, and Google Load Balancer for cloud-based traffic management.
  • System Reliability & Scalability:
      • Lead efforts to design and implement scalable, fault-tolerant systems with high availability and performance optimization.
      • Define and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure system reliability.
  • Automation & Configuration Management:
      • Drive automation initiatives using tools like Ansible, Ansible Tower (AWX), Terraform, Terraform CDK with TypeScript, Packer, and Python.
      • Oversee the development and maintenance of Infrastructure as Code (IaC) to streamline provisioning and management of on-prem and cloud environments.
  • Monitoring & Observability:
      • Implement and manage monitoring solutions using Datadog to ensure system performance, tracing instrumentation, and adherence to SLIs/SLOs/SLAs.
      • Promote observability practices to proactively identify and resolve issues before they impact customers.
  • Disaster Recovery & Security:
      • Define and implement disaster recovery strategies and high-availability solutions across hybrid environments.
      • Collaborate with security teams to ensure compliance with security standards and best practices, including the use of tools like HashiCorp Vault and Cloud Armor.
  • CI/CD Pipelines & Application Deployment:
      • Oversee the development and optimization of CI/CD pipelines using tools and workflows like GitLab and GitHub Actions to ensure efficient and reliable deployments.
      • Manage Kubernetes environments, including Helm charts and ArgoCD for application orchestration and deployment.
  • Database Management:
      • Ensure the reliability, scalability, and performance of databases, including SQL Server and PostgreSQL.
  • Documentation & Vendor Collaboration:
      • Ensure accurate documentation of workflows, procedures, and infrastructure standards to support internal teams and customers.
      • Collaborate with third-party vendors to evaluate and integrate their products and services into the infrastructure ecosystem.
  • On-Call Support:
      • Lead the on-call rotation strategy, ensuring timely resolution of production issues and complex engineering challenges.

Benefits

  • Medical, Dental, and Vision Coverage
  • Holiday and Vacation Time
  • Health & Wellness Days
  • A Bonus Day for Your Birthday