SRE Architect

Qode
Austin, TX
Hybrid

About The Position

Job Description: SRE Architect
📍 Location: Austin, TX (Hybrid)
🕒 Employment Type: Full-Time
🎯 Experience Level: Architect

Role Overview

We are seeking an experienced Site Reliability Engineering (SRE) Architect to design, build, and scale highly reliable, resilient, and observable systems. This role is ideal for a hands-on architect who can define SRE strategy, influence engineering practices, and partner closely with development, platform, and security teams. The position requires onsite or hybrid presence in Austin, TX, with collaboration across distributed teams.

Requirements

  • 10+ years of experience in SRE, DevOps, Platform Engineering, or Systems Architecture.
  • Strong experience designing and operating large-scale distributed systems.
  • Deep hands-on expertise with cloud platforms (AWS/GCP/Azure).
  • Advanced experience with Kubernetes and containerized workloads.
  • Strong knowledge of Linux internals, networking, storage, and system performance.
  • Proven experience implementing Infrastructure as Code (IaC) and configuration management.
  • Proficiency in one or more programming/scripting languages (Python, Go, Bash, etc.).
  • Strong understanding of observability, monitoring, and alerting strategies.
  • Excellent communication and stakeholder management skills.

Nice To Haves

  • Experience in multi-cloud or regulated environments.
  • Background supporting high-throughput, high-availability, or data-intensive systems.
  • Experience with Kafka, Spark, or large-scale data platforms.
  • Exposure to fintech, healthcare, enterprise SaaS, or hyperscale platforms.
  • Prior experience as Principal Engineer, Architect, or Lead SRE.

Responsibilities

  • Define and own the SRE architecture strategy, including reliability, availability, scalability, and performance standards.
  • Design resilient, fault-tolerant systems for cloud-native and hybrid environments.
  • Establish and govern SLIs, SLOs, and error budgets across platforms and services.
  • Lead capacity planning, resilience testing, and chaos engineering initiatives.
  • Architect and operate platforms on AWS/GCP/Azure (multi-cloud or hybrid setups).
  • Design and manage Kubernetes-based platforms (EKS/GKE/AKS).
  • Drive Infrastructure as Code (IaC) practices using Terraform, Ansible, or similar tools.
  • Standardize environments, deployment patterns, and runtime configurations.
  • Build and maintain observability frameworks using tools such as Prometheus, Grafana, Datadog, ELK, Splunk, or equivalent.
  • Lead incident management, root cause analysis (RCA), and post-incident reviews.
  • Reduce mean time to recovery (MTTR) through automation, tooling, and process improvements.
  • Participate in and improve on-call models, escalation policies, and runbooks.
  • Partner with engineering teams to embed CI/CD best practices.
  • Drive automation across provisioning, deployments, testing, and operations.
  • Improve system reliability by eliminating manual operational toil.
  • Architect secure platforms aligned with enterprise security standards.
  • Implement best practices for secrets management, access control, compliance, and audits.
  • Collaborate with Security and Compliance teams on governance models.
  • Act as a technical mentor and thought leader within SRE and platform teams.
  • Influence engineering culture toward reliability-focused design.
  • Partner with product, application, and infrastructure teams to deliver business outcomes.

Benefits

  • Architect systems at enterprise scale
  • Influence platform and reliability strategy across teams
  • Work with modern cloud-native technologies
  • High-impact role with strong visibility and ownership