Sr. Site Reliability Engineer

Applied Systems, Inc.

Remote

About The Position

Applied Systems is transforming the insurance industry by building a team dedicated to learning, innovation, and delivering indispensable software and services to customers. With over 40 years of experience in insurtech, the company aims to redefine what's achievable and create a workplace that fosters career growth. The Senior Site Reliability Engineer will join the SRE team and play a critical role in ensuring the reliability, scalability, and efficiency of software applications. This work supports best-in-class services for insurance agencies and carriers, empowering them to streamline operations, improve customer experiences, and drive business growth.

Requirements

5+ years of experience in DevOps, SRE, or Infrastructure Engineering roles
Strong foundation in incident management, troubleshooting, and observability of software applications
Experience with cloud platforms (GCP, AWS, Azure), including traffic management solutions
Familiarity with distributed systems, microservices architecture, and related technologies
Proficiency in Python, Go, Bash, and PowerShell
Expertise in Windows and Linux system administration
Advanced knowledge of IaC tools such as Terraform (HCL and Terraform CDK with TypeScript) and Packer
Knowledge of CI/CD pipelines and version control systems (GitLab, GitHub Actions, etc.)
Familiarity with monitoring tools (Datadog) and security solutions (HashiCorp Vault, Cloud Armor)
Experience with SQL Server and PostgreSQL for database management
Kubernetes expertise, including Helm charts and ArgoCD for application deployment and orchestration
Excellent communication skills to collaborate with engineers, product managers, and business stakeholders
Strong organizational skills and attention to detail
Ability to prioritize tasks and make accurate decisions under pressure
Passion for mentoring and guiding team members

Responsibilities

Develop and maintain IaC using Terraform, Terraform CDK with TypeScript, Packer, and Ansible to automate on-prem and cloud infrastructure provisioning and management
Collaborate with development and platform teams to design scalable, reliable systems with fault tolerance, high availability, and performance optimization
Implement and manage monitoring solutions using Datadog to ensure system performance, tracing instrumentation, and adherence to SLIs, SLOs, and SLAs
Utilize HashiCorp Consul for service discovery, dynamic configuration, and network automation across distributed systems
Define and implement best practices for disaster recovery and high availability across hybrid environments
Build and maintain CI/CD pipelines using tools like GitLab and GitHub Actions to streamline deployments and ensure code quality
Automate repetitive tasks to increase efficiency and reduce human error, using languages such as Python, Go, Bash, and PowerShell
Manage Kubernetes environments, including Helm charts and ArgoCD for application deployment and orchestration
Mentor junior engineers, lead technical discussions, and collaborate across teams to drive consensus on design decisions and technical initiatives
Create and maintain accurate documentation for workflows, procedures, and infrastructure standards to support internal teams and customers
Participate in the on-call rotation to provide production support and resolve complex engineering challenges
Work with third-party vendors to evaluate and integrate their products and services into the infrastructure ecosystem