Site Reliability Engineer

Attain•Redwood City, CA

38d•Hybrid

About The Position

As a Site Reliability Engineer, you will play a critical role in building and maintaining the infrastructure that powers all of our systems, along with the supporting tools that keep those systems running smoothly. You will work closely with nearly every engineering team at Attain to ensure that our systems operate at peak efficiency and are prepared to handle the scale of our future growth.

Attain Office Hybrid Schedule (where applicable):
Redwood City, CA: Mondays (in-office for stand-ups, all-hands) and a choice of three days between Tuesday and Friday
Chicago, IL & New York, NY: 4 days in-office; 1 day remote

Requirements

You are comfortable wearing many hats
You have a willingness to learn and teach in a fast-paced, collaborative environment
You have a strong desire to automate things
You readily provide constructive feedback, and also proactively seek feedback to improve yourself
You like to get your hands dirty, tinkering with and stress-testing new technologies

Nice To Haves

4+ years of experience building and maintaining large-scale cloud-native infrastructure (AWS and/or GCP)
Experience working with the containerization technologies Docker, Kubernetes, and Istio or a similar service mesh technology
Experience with SQL database technologies such as MySQL, Google BigQuery, and Google Spanner
Experience with stream technologies such as Kafka and Amazon Kinesis
Experience with pub/sub technologies such as AWS SNS and Google Pub/Sub
Experience with serverless computing technologies such as AWS Lambda and Google Cloud Functions/Google Cloud Run
Experience with infrastructure-as-code tools such as Terraform
Experience with observability tools such as Datadog, Prometheus, and Grafana
Strong computer science and software engineering fundamentals
Experience with SOC 2 compliance processes and requirements

Responsibilities

Write Terraform modules for deploying infrastructure resources via our GitLab pipelines
Develop Helm charts for deploying services and jobs in our Kubernetes cluster
Define metrics, network policies, and routing rules for our Istio service mesh
Monitor and maintain our GCP BigQuery and Spanner databases
Pipe metrics to our Google-managed Prometheus instance and build out Grafana dashboards and alerts to increase visibility into our systems
Experiment with GCP offerings, third-party vendors, and open-source tools to further automate and secure day-to-day operations
Leverage the latest LLMs in developing infrastructure and tooling
Pair with engineering leads to instrument and monitor critical functionality
Add automation to both existing and new systems to reduce our reliance on manual processes
Participate in architecture design and capacity planning discussions to ensure that our systems are scalable, maintainable, reliable, and secure
Build, maintain, and improve our CI/CD pipeline