Software Engineer - SRE

General Motors•Austin, TX

Hybrid

About The Position

The rapid adoption of advanced software in vehicles marks a new era for automakers and consumers, bringing both advantages and challenges. As part of the Site Reliability Engineering (SRE) database group at General Motors, you'll join a dedicated team focused on enhancing the reliability, efficiency, and scalability of our distributed database systems. We apply engineering principles to manage operations effectively and build solutions that let us grow without sacrificing performance or quality. Our SREs work closely with software development teams, acting as specialists in reliability and production engineering, with a focus on automation, observability, and shared responsibility. We are looking for individuals who are passionate about maintaining the health of our infrastructure while optimizing for reliability and cost-efficiency. This role calls for a blend of database engineering and systems engineering skills to keep our services resilient, robust, and scalable.

The database team within the SRE organization is chartered to provide best-in-class Database Management System (DBMS) project solutions to our application partners worldwide. This role involves modernizing our infrastructure and processes to provide database-as-a-service capability in a highly standardized, reliable, and automated environment. The team participates in all phases of the application development life cycle, designing, developing, and deploying databases on behalf of applications in a way that ensures GM’s data is secure, highly available, current, flexible, and monitored. This individual will work on transforming GM applications and database services into modernized cloud offerings.

Requirements

Bachelor’s degree in computer science or a related field, or equivalent work experience.
Proficiency in PostgreSQL and at least one other database technology (Oracle, SQL Server).
Proficiency in at least one programming language (e.g., Python, Go, Java) and familiarity with multiple language ecosystems.
Solid understanding of operating systems, networking, distributed systems, databases, and storage architectures.
Deep understanding of how code runs on underlying hardware, including operating systems, algorithms, and data structures.
Ability to optimize or troubleshoot code by understanding its execution and the impact on system resources.
Experience handling production incidents, including root cause analysis, mitigation, and working through complex system failures.
Strong communication skills, with an ability to explain technical concepts to both engineering and business stakeholders.
Commitment to collaborative problem-solving and shared ownership of services.
Proven experience in automating manual processes, building deployment pipelines, or managing configuration systems.

Nice To Haves

Experience with Git/source code management, CI/CD development, and open-source development.
Hands-on experience with Infrastructure as Code tools like Terraform, Terragrunt, Azure Resource Manager (ARM) templates, YAML pipelines, or Bicep.
Experience in Fivetran or GoldenGate configuration and operation.
Experience with Cosmos DB or other NoSQL technologies.
Experience with cloud platforms (AWS, GCP, Azure).
Experience with observability using OpenTelemetry, Prometheus, or services such as Datadog.
Familiarity with container orchestration systems like Kubernetes.
A track record of managing or developing distributed systems.

Responsibilities

Develop tools and software to automate operational processes, improve system reliability, and reduce manual intervention.
Lead, implement, and improve monitoring and observability frameworks, enabling proactive detection and resolution of incidents.
Participate in an on-call rotation to diagnose, troubleshoot, and mitigate production incidents, ensuring minimal downtime and swift resolution.
Work alongside developers to ensure the quality, scalability, and reliability of our database services.
Practice shared ownership of services in production, fostering a "You build it, you run it" culture.
Use Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to manage reliability expectations effectively.
Conduct deep-dive analyses of incidents and collaborate on post-incident reviews to derive learnings and prevent recurrence.
Champion a culture of continuous improvement.
Evaluate system performance and advocate for optimizations that reduce infrastructure costs while maintaining service reliability.