Senior Backend Software Engineer/SRE - GM Energy

General Motors • Warren, MI

67d•Hybrid

About The Position

The Energy Cloud Platform is a highly scalable, secure cloud platform in production today that connects vehicles, utilities, markets, and IoT systems to enable smart charging, bidirectional energy (V2H/V2G), and data-driven energy services. As part of the Cloud Platform & Smart Charging team, you will help ensure our services remain reliable, observable, and ready to support mission-critical energy programs at scale.

We are looking for a Senior Backend Software Engineer – Site Reliability (SRE) to lead reliability, performance, and operational excellence for the Energy Cloud platform, while also contributing directly to backend services and platform capabilities.

In this role, you will combine strong backend software engineering skills with SRE practices to:

Design and evolve production-ready, observable services
Build and improve CI/CD, infrastructure, and automation
Lead incident response, post-incident reviews, and reliability improvements
Partner closely with product engineering, data, and cloud platform teams

This is a senior individual contributor role: you will drive cross-team initiatives, set reliability patterns others adopt, and mentor engineers across the organization. You will also spend meaningful time on backend feature and platform development, and the role is well-suited for someone who wants to deepen both SRE and software engineering skills.

Requirements

Bachelor’s degree in Computer Science, Software Engineering, Electrical/Computer Engineering or related field, or equivalent practical experience.
8+ years of experience in software engineering, DevOps, or SRE roles, including designing, building, and operating backend or platform services in production.
Hands-on experience with at least one major cloud provider (Azure, GCP, or AWS).
Strong programming skills in one or more languages (e.g., Python, C#, Java, Go) and experience writing production-grade services and automation.
Demonstrated experience with observability and monitoring (e.g., Datadog, Prometheus, Grafana, OpenTelemetry) and implementing meaningful metrics and alerts.
Experience with CI/CD pipelines (e.g., GitHub Actions, Azure DevOps, Jenkins) including automated testing, deployment strategies, and rollback patterns.
Proven track record owning or co-owning on-call, incident response, and post-incident improvement work for production systems.
Ability to lead cross-functional technical efforts, influence without direct authority, and communicate clearly with engineering, product, and operations stakeholders.

Nice To Haves

Experience in energy, utilities, EV charging, or large-scale IoT platforms.
Experience with data platforms (e.g., Snowflake, Databricks, or similar) and designing reliable data ingestion and processing pipelines.
Deep familiarity with SRE principles: error budgets, capacity planning, resilience testing, chaos engineering, and production game days.
Experience designing disaster recovery strategies and running DR drills in collaboration with product and infrastructure teams.
Experience implementing security and compliance practices (e.g., secrets management, vulnerability remediation, secure pipelines) in partnership with security and cloud platform teams.
Demonstrated success mentoring other engineers and raising the bar for reliability and operational excellence across multiple teams.

Responsibilities

Own and improve reliability for key Energy Cloud services that power electric grid programs, V2H/V2G pilots, and enrollment/operations experiences.
Define and maintain SLOs/SLIs (latency, error rate, availability) and partner with engineering and product to ensure they reflect real customer and business needs.
Lead rollout and continuous improvement of production observability (e.g., Datadog or similar): metrics, logs, traces, dashboards, and alerting across services.
Implement and enforce Production Readiness Reviews (PRR) and reliability scorecards so that every new service, integration, and major feature meets our reliability bar before going live.
Drive incident management: participate in and often lead on-call/incident response, perform root-cause analysis, and ensure post-incident actions are prioritized and completed.
Design and implement robust, well-tested backend services and automation to improve system reliability, performance, and data integrity (e.g., telemetry ingestion pipelines, charging session data flows, enrollment workflows).
Build and evolve CI/CD pipelines (e.g., GitHub Actions) to support blue/green or similar deployment strategies, automated rollbacks, and high-confidence releases.
Design and validate disaster recovery and continuity patterns (backups, cross-region failover, runbooks, simulation drills) for critical platform components.
Partner with data engineering and platform teams to ensure data ingestion, storage, and processing patterns support reliability, scalability, and monitoring requirements.
Lead cross-team reliability initiatives that improve how multiple Energy Cloud and related services are built, deployed, and operated.
Define and socialize standard patterns for observability, CI/CD, performance testing, and data quality that other teams can adopt.
Create clear technical documentation: runbooks, design docs, PRR checklists, SLO definitions, and reliability playbooks that make complex systems operable by others.
Mentor engineers (SWE, DevOps, SRE, data) on reliability best practices, debugging techniques, and operational excellence.