SRE Architect

Qode
Arlington, TX
Hybrid

About The Position

We are seeking an experienced Site Reliability Engineer (SRE) Architect to design, build, and scale highly reliable, resilient, and observable systems. This role is ideal for a hands-on architect who can define SRE strategy, influence engineering practices, and partner closely with development, platform, and security teams. The position requires onsite or hybrid presence in Austin, TX, with collaboration across distributed teams.

Requirements

  • 10+ years of experience in SRE, DevOps, Platform Engineering, or Systems Architecture.
  • Strong experience designing and operating large-scale distributed systems.
  • Deep hands-on expertise with cloud platforms (AWS/GCP/Azure).
  • Advanced experience with Kubernetes and containerized workloads.
  • Strong knowledge of Linux internals, networking, storage, and system performance.
  • Proven experience implementing IaC and configuration management.
  • Proficiency in one or more programming/scripting languages (Python, Go, Bash, etc.).
  • Strong understanding of observability, monitoring, and alerting strategies.
  • Excellent communication and stakeholder management skills.

Nice To Haves

  • Experience in multi-cloud or regulated environments.
  • Background supporting high-throughput, high-availability, or data-intensive systems.
  • Experience with Kafka, Spark, or large-scale data platforms.
  • Exposure to fintech, healthcare, enterprise SaaS, or hyperscale platforms.
  • Prior experience as Principal Engineer, Architect, or Lead SRE.

Responsibilities

  • Define and own the SRE architecture strategy, including reliability, availability, scalability, and performance standards.
  • Design resilient, fault-tolerant systems for cloud-native and hybrid environments.
  • Establish and govern SLIs, SLOs, and error budgets across platforms and services.
  • Lead capacity planning, resilience testing, and chaos engineering initiatives.
  • Architect and operate platforms on AWS/GCP/Azure (multi-cloud or hybrid setups).
  • Design and manage Kubernetes-based platforms (EKS/GKE/AKS).
  • Drive Infrastructure as Code (IaC) practices using Terraform, Ansible, or similar tools.
  • Standardize environments, deployment patterns, and runtime configurations.
  • Build and maintain observability frameworks using tools such as Prometheus, Grafana, Datadog, ELK, Splunk, or equivalent.
  • Lead incident management, root cause analysis (RCA), and post-incident reviews.
  • Reduce MTTR through automation, tooling, and process improvements.
  • Participate in and improve on-call models, escalation policies, and runbooks.
  • Partner with engineering teams to embed CI/CD best practices.
  • Drive automation across provisioning, deployments, testing, and operations.
  • Improve system reliability by eliminating manual operational toil.
  • Architect secure platforms aligned with enterprise security standards.
  • Implement best practices for secrets management, access control, compliance, and audits.
  • Collaborate with Security and Compliance teams on governance models.
  • Act as a technical mentor and thought leader within SRE and platform teams.
  • Influence engineering culture toward reliability-focused design.
  • Partner with product, application, and infrastructure teams to deliver business outcomes.

Benefits

  • Architect systems at enterprise scale
  • Influence platform and reliability strategy across teams
  • Work with modern cloud-native technologies
  • High-impact role with strong visibility and ownership

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 1-10 employees
