The Senior Site Reliability Engineer/Developer is responsible for ensuring the reliability, scalability, and performance of software systems. The role includes:
System Monitoring and Troubleshooting: Monitoring the performance and availability of software systems, identifying and resolving issues, and implementing proactive measures to prevent future incidents.
Automation and Infrastructure: Developing and maintaining automation tools and infrastructure to streamline software deployment, configuration management, and system monitoring.
Performance Optimization: Analyzing system performance, identifying bottlenecks, and implementing optimizations to improve the efficiency and scalability of software systems.
Incident Response and Root Cause Analysis: Responding to incidents, conducting root cause analysis, and implementing corrective actions to prevent similar incidents in the future.
Collaboration with Development Teams: Collaborating with software development teams to ensure that reliability and scalability considerations are incorporated into software design and implementation.
Continuous Improvement: Identifying opportunities for process improvement, implementing best practices, and driving initiatives to enhance the reliability and performance of software systems.
What You’ll Do:
Implement and evolve secure, highly available, and globally distributed systems powering GM’s vehicle security platforms.
Own reliability roadmaps, establishing frameworks and strategies for system hardening, high availability, disaster recovery, and operational scalability.
Develop automation-first solutions to eliminate operational toil, with advanced use of languages such as Python, Go, and Java.
Lead incident response, driving systematic elimination of failure modes through blameless postmortems, PRRs, and cross-team preventative initiatives.
Drive observability strategies with best-in-class practices for metrics, logging, and distributed tracing, using Prometheus, Datadog, or similar stacks.
Partner with engineering, platform, and security teams to design for reliability from inception, influencing architecture reviews and CI/CD best practices.
Lead optimization, capacity planning, and performance-tuning strategies for large-scale, security-critical platforms.
Introduce modern SRE practices such as chaos engineering, resilience testing, and progressive delivery to validate support teams and evolve system safety, along with SLOs, SLIs, and SLAs.
Mentor engineers across disciplines on SRE, platform resilience, secure operational practices, and architectural trade-offs.
Evaluate and adopt technologies (open-source, enterprise, homegrown) for security and reliability at scale.
Influence product strategy in partnership with engineering leads, ensuring operational reliability is prioritized alongside customer and business outcomes.
Job Type
Full-time
Career Level
Mid Level
Number of Employees
5,001-10,000 employees