Sr. Staff Site Reliability Engineer, Factory Infrastructure & Systems

Rivian•Normal, IL

60d•Onsite

About The Position

This Site Reliability Engineer (SRE) role owns reliability outcomes for factory digital systems spanning compute, network, and application layers. The work is split across Platform Engineering, Observability, and Tiger Team incident response. This position will be located in Normal, IL and report to our Sr. Manager, Software Infrastructure/DevOps.

Requirements

Production experience in SRE/Platform/DevOps or Operations, owning availability, performance, and cost for critical services.
Strength in several of: Kubernetes/EKS and container networking; AWS primitives for resilient platforms; vSphere/ESXi and virtualization; Linux (and working Windows Server) administration; service discovery, load balancing, and DNS.
Observability across metrics/logs/traces, SLO/error-budget practice, and alert hygiene with tools like Prometheus/Grafana, Loki/Tempo, Datadog, Splunk.
Production change safety: GitOps, progressive delivery, guardrails in CI/CD (GitLab preferred), automated rollbacks, and policy-as-code.
Infrastructure automation: Terraform/Terragrunt, Ansible, scripting (Python/Bash), secrets management, and least-privilege patterns.
Incident leadership/participation in 24x7 environments; clear comms under pressure and a habit of converting learnings into durable fixes.
Ability to partner across Factory IT, Manufacturing Engineering, Security, Networking, and application teams; communicate tradeoffs simply and drive decisions.

Nice To Haves

Industrial/OT-adjacent experience (lineside HMIs, MES/SCADA integrations, PLC interfaces, ruggedized compute) and shop-floor networking constraints.
Experience building or integrating exporters (e.g., vSphere) or consolidating factory telemetry into plant-wide health views.
DR playbooks, capacity modeling, and cost/performance optimization for hybrid environments.

Responsibilities

Platform Engineering

Design and evolve reliable, scalable, and secure platform foundations across hybrid/on-prem factory environments (e.g., Kubernetes/EKS, vSphere/ESXi, Linux/Windows server, industrial PCs), with clear reliability and cost guardrails.
Codify production-readiness standards and guardrails for factory systems (health checks, runbooks, SLOs/SLIs, deployment safety, failover patterns) aligned to Platform's production readiness checklist.
Advance Infrastructure-as-Code and configuration automation (e.g., Terraform/Terragrunt, Ansible) for factory workloads, including provisioning, secrets, policies, and change safety.
Partner with Manufacturing Engineering, Factory IT, Security, and Networking to land pragmatic, operable designs; contribute to reference architectures and reusable patterns.
Lead or contribute to reliability initiatives (e.g., self-healing automation, safe rollouts/canaries, rollback strategies) appropriate to level.

Observability

Raise the bar on end-to-end telemetry for factory systems: high-signal metrics, logs, traces, and SLO-driven alerts (e.g., Prometheus/Grafana, Loki/Tempo, Datadog, Splunk).
Establish consistent dashboards and service health views for shop/line-level systems, including exporters for hypervisor/VM health and plant endpoints where feasible (e.g., vSphere exporters).
Improve alert quality and ownership: reduce noise, align escalation policies, and ensure actionable runbooks and health checks for critical services.
Build internal tooling (CLI/SDKs, operators/controllers, remediation bots) that turns telemetry into prevention and rapid response.

Tiger Team / Incident Response

Act as a technical incident responder for factory-impacting events; lead fast triage and stabilize services.
Drive post-incident reviews that eliminate repeat failure modes; improve MTTR and availability through durable engineering fixes and process improvements.
Drill on-call readiness, escalation policies, and schedules using established incident tooling and practices (e.g., Rootly/alternatives), tuned for 24x7 manufacturing operations.
Mentor peers through reliability deep dives, failover exercises, and simulation runbooks (breadth of mentorship scales with level).