Site Reliability Architect

Jobgether
Posted 1d ago · $170,000 - $185,000 · Remote

About The Position

In this role, you will define and drive the reliability and resilience strategy across a complex enterprise SaaS ecosystem. You will shape modern cloud and hybrid architectures designed for scale, security, and fault tolerance, ensuring exceptional availability and performance. Working in a high-impact environment, you will lead by influence, mentoring SRE and engineering teams while embedding best-in-class reliability practices. You will champion proactive operations, automation-first thinking, and continuous improvement. This position offers the opportunity to work remotely within the US, collaborating with distributed teams across time zones. If you thrive on solving complex infrastructure challenges and building systems that scale reliably, this role offers both technical depth and strategic impact.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field, or equivalent practical experience.
  • 10+ years of experience in SRE or DevOps roles, including at least 4 years in enterprise SaaS environments.
  • 4+ years of hands-on software development experience supporting cloud-hosted SaaS platforms.
  • Proven experience operating distributed, multi-cloud or multi-region systems at scale.
  • Deep expertise in modern cloud networking, including DNS, TCP/IP, load balancing, and Zero Trust security models.
  • Strong programming skills in Go, Python, Java, C#, or similar languages for building internal tooling and automation workflows.
  • Expert-level knowledge of Kubernetes architecture, including multi-cluster management and stateful workloads.
  • Experience optimizing cloud infrastructure costs while maintaining high performance and reliability.
  • Background in DevSecOps practices and regulatory compliance frameworks such as GDPR, HIPAA, and HITRUST.
  • Strong leadership, mentoring, and communication skills, with a proactive and solution-oriented mindset.
  • Openness to responsibly adopting AI tools to enhance productivity and innovation.

Responsibilities

  • Architect and implement resiliency-by-design systems focused on self-healing, fault tolerance, and proactive operational readiness.
  • Design and evolve secure, scalable cloud and hybrid environments leveraging advanced networking and compute architectures across AWS and GCP.
  • Lead the shift from reactive firefighting to proactive operations using feature flagging, production readiness reviews, architectural decision records, and chaos engineering.
  • Mentor SREs and software engineers in incident management, observability, monitoring, and advanced troubleshooting practices.
  • Define and manage SLIs, SLOs, and error budgets to balance innovation velocity with platform stability and service ownership.
  • Drive infrastructure automation using Terraform, CDK, and infrastructure-as-code tools to deliver consistent, secure, and audit-ready environments.
  • Contribute to strategic planning, cross-team collaboration, and continuous improvement initiatives across the platform.

Benefits

  • Competitive base salary range of $170,000–$185,000 per year, plus variable compensation.
  • Fully remote role within the US, aligned with EST or CST time zones.
  • Comprehensive healthcare plans, including medical, dental, and vision coverage.
  • Generous paid time off and company-paid holidays.
  • 401(k) retirement plan with company matching.
  • Additional company-sponsored wellness and benefit programs.
  • Supportive, inclusive, and growth-oriented work environment.