Staff Site Reliability Engineer/Developer – Vehicle Security Platforms

GM•Warren, MI

Remote

About The Position

Join General Motors’ Vehicle Security Platforms (VSP) teams, where we build resilient, secure, and scalable platforms supporting mission-critical vehicle security communications. We seek a Staff Site Reliability Engineer (SRE) with extensive experience in scaling distributed systems and driving end-to-end reliability strategies. In this role, you will shape the reliability of GM’s next-generation vehicle security platforms, influence cross-organizational architecture decisions, and embed reliability as a first-class product concern. Your leadership will contribute directly to protecting millions of vehicles and customers globally.

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure/platform roles supporting secure, scalable systems.
Proven expertise in designing and scaling cloud infrastructure (Azure) and container orchestration systems (Kubernetes, Docker).
Demonstrated mastery of infrastructure-as-code frameworks (Terraform, Helm, CloudFormation, etc.).
Proficiency in Python and one JVM language (Java or Kotlin), and working knowledge of Go.
Deep architectural understanding of distributed systems, networking, system design, and large-scale security practices.
Track record of architecting and running zero-downtime systems in production.
Experience with modern monitoring and reliability tooling and frameworks (Prometheus, Datadog, OpenTelemetry, etc.).
Experience leading incident response, uptime SLO/SLA management, and operational excellence initiatives across multiple teams.
Capable of influencing architecture and product strategy while maintaining a hands-on approach to systems reliability.
Exceptional communication skills, able to present complex trade-offs and foster alignment across executive, product, and engineering stakeholders.

Nice To Haves

BS/MS/PhD in Computer Science, Engineering, or equivalent industry experience.
Deep understanding of encryption technologies, secure data handling practices, and identity management.
Experience designing and operating IoT or automotive-focused architectures with rigorous availability and safety requirements.
Direct experience in chaos engineering, game-day testing, disaster recovery orchestration, and production load testing.
Ability to grow and mentor engineers into leaders in their domain, building SRE teams that can operate independently at scale.
Demonstrated success in defining and executing reliability strategies with measurable business impact.
Strong product mindset with the ability to balance engineering excellence with speed and business priorities.

Responsibilities

Implement and evolve secure, highly available, and globally distributed systems powering GM’s vehicle security platforms.
Own reliability roadmaps, establishing frameworks and strategies for system hardening, high availability, disaster recovery, and operational scalability.
Develop automation-first solutions to eliminate operational toil, with advanced use of languages such as Python, Go, and Java.
Lead incident response, driving systematic elimination of failure modes through blameless postmortems, production readiness reviews (PRRs), and cross-team preventative initiatives.
Drive observability strategies with best-in-class practices for metrics, logging, and distributed tracing, using Prometheus, Datadog, or similar stacks.
Partner with engineering, platform, and security teams to design for reliability from inception, influencing architecture reviews and CI/CD best practices.
Lead optimization, capacity planning, and performance-tuning strategies for large-scale, security-critical platforms.
Introduce modern SRE practices such as chaos engineering, resilience testing, and progressive delivery to validate support teams' readiness and evolve system safety, along with SLOs, SLIs, and SLAs.
Mentor engineers across disciplines on SRE, platform resilience, secure operational practices, and architectural trade-offs.
Evaluate and adopt technologies (open-source, enterprise, homegrown) for security and reliability at scale.
Influence product strategy in partnership with engineering leads, ensuring operational reliability is prioritized alongside customer and business outcomes.