Senior Site Reliability Engineer

Attain Finance • Greenville, SC

4d • $100,000 - $120,000 • Remote

About The Position

Are you ready to make a difference in the world of consumer finance? At Attain Finance, we bring over 50 years of expertise in providing credit solutions across the U.S. and Canada. Our deep roots in the financial industry have empowered us to develop convenient, easily accessible financial services that meet our customers' growing needs. Join a leading consumer credit lender that thrives on innovation and collaboration, where your contributions are truly valued.

Our portfolio includes distinguished brands like Cash Money®, LendDirect®, Heights Finance, Southern Finance, Covington Credit, Quick Credit, and First Heritage Credit. Each brand is constantly evolving to better serve our customers. Be part of a dynamic team that is shaping the future of consumer finance. Apply today and take the next step in your career with Attain Finance!

We're looking for a Senior Site Reliability Engineer to help drive the reliability and operational excellence of how we build, ship, and run software. You'll work hands-on across AWS, Kubernetes (EKS), ArgoCD, Helm, Terraform, GitHub Actions, Azure DevOps, Grafana, and Python, building and operating the delivery systems that move our applications safely and reliably into production.

This is a deeply hands-on, in-the-trenches role. You'll get into production and delivery problems for the systems you own, find root cause, and fix them, then make them more reliable, observable, and automated. You'll also help shape standards, contribute to incident response, and unblock work within your domain. The expectation is simple: you do the work, and you raise the bar for how the work gets done on the team.

Requirements

Kubernetes, ArgoCD, Helm, Terraform, Python. Deep hands-on production experience.
Hands-on AWS. Operate and debug EKS, ECS, EC2, ECR, IAM/IRSA, VPC networking, ALB/NLB, CloudWatch, Secrets Manager, and KMS.
GitHub Actions and/or Azure DevOps. Build and operate CI/CD at scale.
Grafana and the observability stack. Hands-on with Grafana dashboards and alerting, and the metrics, logs, and traces stack (Prometheus/Mimir, Loki, Tempo, OpenTelemetry).
Strong scripting. Python and Bash, with the ability to grow into systems-level coding.
Production troubleshooting. Comfortable getting into a system under load, finding root cause, and fixing it.
Production ownership. Accountability for uptime and reliability.
Incident response. You respond and help drive postmortems that yield real improvements.
Standards contribution. You contribute to engineering standards and best practices.
Compliance awareness. Experience in regulated or high-rigor environments or implementing audit and access controls in pipelines.
Mentorship. You mentor others through code review, examples, and pairing.
5+ years in site reliability, platform, DevOps, or software engineering, with production ownership of systems or pipelines.

Nice To Haves

Advanced GitOps. ArgoCD (or Flux), reusable Helm patterns, Argo Rollouts.
CI consolidation or migration. Moving between CI systems, such as Azure DevOps to GitHub Actions.
Self-hosted observability at scale. Running Grafana, Mimir, Loki, and Tempo in production.
Supply chain security. SBOMs, artifact signing (Sigstore/cosign), SLSA provenance.
Platform migrations. Contributing to modernization with minimal disruption.
.NET / C#. Enough to containerize and reason about application workloads.
Low-level Kubernetes. Cilium/eBPF, Karpenter, or self-hosted networking and autoscaling.
Resilience testing. Chaos/failure injection or disaster recovery drills.
AI-assisted tooling. Responsible use with output validation.
Certification. AWS Solutions Architect, AWS DevOps Engineer, or CKA/CKAD.
Degree in computer science or equivalent practical experience.

Responsibilities

Build and operate the delivery platform. Work across AWS, EKS, ArgoCD, Helm, GitHub Actions, Azure DevOps, Terraform, and Python.
Fix the problems you own. Find root cause across the AWS and Kubernetes stack, fix it, and harden it so it stays fixed.
Respond to incidents. Help stabilize during outages, drive root-cause analysis, and ship corrective actions for your systems.
Standardize how we build and ship. Define reproducible container builds and GitOps paths on ArgoCD and Helm that replace manual deployment.
Help consolidate the CI estate. Standardize pipelines across GitHub Actions and Azure DevOps for your services — remove brittle steps and silent failures and improve visibility.
Support platform adoption. Build golden-path templates and tooling and help teams move services onto the platform.
Use progressive delivery. Canary and blue-green deploys (Argo Rollouts) and automated rollback for the services you operate.
Build observability in. Wire golden-signal metrics, logs, and traces (Prometheus/Mimir, Loki, Tempo, OpenTelemetry) into your services, surfaced in Grafana with SLOs for your domain.
Operate production systems. Troubleshoot failed deployments, respond to alerts, and improve behavior from real incidents.
Help meet SLOs and carry on-call. Track reliability metrics for the services you operate and share the rotation.
Build across environments. Design dev, test, and prod for safe promotion, recovery from failed deployments, and zero-downtime upgrades.
Help set the standard. Build reference implementations for build, deploy, GitOps, promotion gates, and observability.
Uphold compliance in the pipeline. Support deployment traceability, approval trails, and segregation of duties for PCI DSS, SOC 2, SOX, and GLBA.
Cut toil and cost. Automate repetitive ops work and help tune EKS compute, CI runners, and observability cardinality.
Unblock across teams. Get hands-on with Cloud, Security, Application Engineering, Data, and Product to keep delivery moving.
Kill knowledge silos. Write docs, runbooks, and incident learnings so engineers can operate independently.