SRE Architect

Qode • Arlington, TX

2d • Hybrid

About The Position

We are seeking an experienced Site Reliability Engineer (SRE) Architect to design, build, and scale highly reliable, resilient, and observable systems. This role is ideal for a hands-on architect who can define SRE strategy, influence engineering practices, and partner closely with development, platform, and security teams. The position requires onsite or hybrid presence in Austin, TX, with collaboration across distributed teams.

Requirements

10+ years of experience in SRE, DevOps, Platform Engineering, or Systems Architecture.
Strong experience designing and operating large-scale distributed systems.
Deep hands-on expertise with cloud platforms (AWS/GCP/Azure).
Advanced experience with Kubernetes and containerized workloads.
Strong knowledge of Linux internals, networking, storage, and system performance.
Proven experience implementing Infrastructure as Code (IaC) and configuration management.
Proficiency in one or more programming/scripting languages (Python, Go, Bash, etc.).
Strong understanding of observability, monitoring, and alerting strategies.
Excellent communication and stakeholder management skills.

Nice To Haves

Experience in multi-cloud or regulated environments.
Background supporting high-throughput, high-availability, or data-intensive systems.
Experience with Kafka, Spark, or large-scale data platforms.
Exposure to fintech, healthcare, enterprise SaaS, or hyperscale platforms.
Prior experience as Principal Engineer, Architect, or Lead SRE.

Responsibilities

Define and own the SRE architecture strategy, including reliability, availability, scalability, and performance standards.
Design resilient, fault-tolerant systems for cloud-native and hybrid environments.
Establish and govern SLIs, SLOs, and error budgets across platforms and services.
Lead capacity planning, resilience testing, and chaos engineering initiatives.
Architect and operate platforms on AWS/GCP/Azure, including multi-cloud and hybrid setups.
Design and manage Kubernetes-based platforms (EKS/GKE/AKS).
Drive Infrastructure as Code (IaC) practices using Terraform, Ansible, or similar tools.
Standardize environments, deployment patterns, and runtime configurations.
Build and maintain observability frameworks using tools such as Prometheus, Grafana, Datadog, ELK, Splunk, or equivalent.
Lead incident management, root cause analysis (RCA), and post-incident reviews.
Reduce MTTR through automation, tooling, and process improvements.
Participate in and improve on-call models, escalation policies, and runbooks.
Partner with engineering teams to embed CI/CD best practices.
Drive automation across provisioning, deployments, testing, and operations.
Improve system reliability by eliminating manual operational toil.
Architect secure platforms aligned with enterprise security standards.
Implement best practices for secrets management, access control, compliance, and audits.
Collaborate with Security and Compliance teams on governance models.
Act as a technical mentor and thought leader within SRE and platform teams.
Influence engineering culture toward reliability-focused design.
Partner with product, application, and infrastructure teams to deliver business outcomes.