Site Reliability Engineer (Contractor)

Varo Bank • Salt Lake City, UT

About The Position

Varo's SRE team is well established and designs, builds, and runs the large-scale, distributed, fault-tolerant systems that power most of Varo's operations. We live and breathe AWS and Kubernetes, with an open-source-first, results-oriented mindset. We are an automation- and observability-focused team, and we strive to automate ourselves out of manual and remedial tasks. We monitor our systems and build dashboards to support a data-driven approach to scaling our platform. On a typical day, members of our team are hands-on scaling out production infrastructure, building CI/CD pipelines, and brainstorming with developers on how to make things better. We collectively strive to build and maintain a rapid-feedback platform that helps our engineers accomplish their goals rather than creating friction.

Requirements

3+ years of experience in an SRE, DevOps, or Infrastructure Engineering role, with the ability to work independently and manage multiple workstreams.
Strong hands-on experience with core AWS services, including EKS, EC2, RDS Aurora, MSK, S3, IAM, VPC, and Direct Connect.
Deep production experience with Kubernetes (upgrades, networking, RBAC) alongside Helm and GitOps tools like ArgoCD.
Advanced proficiency with Terraform, including writing modules and managing multi-account/multi-environment states.
Experience supporting and maintaining data platforms such as Airflow, Databricks, EMR, Kafka/MSK, or CDC pipelines.
Solid understanding of networking (VPCs, security groups, Istio, DNS) paired with strong Python scripting skills for tooling and automation.
Experience managing observability stacks (Prometheus, Grafana, ELK) and effectively leveraging AI/LLM tools for automation and incident analysis.

Nice To Haves

Experience with Karpenter and KEDA
GitLab CI/CD pipeline experience
HashiCorp Vault for secrets management

Responsibilities

Manage, upgrade, and autoscale EKS clusters across multiple environments (SIT, UAT, Prod) and AWS accounts.
Write Terraform modules and Helm charts to support GitOps workflows using ArgoCD and GitLab CI/CD pipelines.
Maintain and troubleshoot Kafka (MSK) clusters, including broker health, connectors, and CDC pipelines.
Improve observability using Prometheus, Thanos, Grafana, and ELK while proactively identifying cloud cost-optimization opportunities.
Automate operational tasks with Python and leverage AI/ML techniques for predictive alerting and intelligent runbooks.
Handle Platform Service Desk requests, including Terraform merge request reviews, access management, and deployment support.
Participate in the production on-call rotation, support incident response, and contribute to blameless post-mortems.
