As a Senior Platform Reliability Engineer, you will monitor, analyze, and optimize software architecture and maintain the software environment to best support testing and deployment in a continuous integration/continuous delivery (CI/CD) environment. This role provides a reliable and scalable platform experience to Global AI Platform users. You will develop self-service capabilities, AIOps/MLOps/GitOps/CI/CD pipelines, and operational automations for provisioning, upgrades, and backups. You will manage clusters, networks, storage, and policies via Terraform/Ansible, preventing configuration drift. You will also enforce identity/RBAC, secrets management, supply chain security, and regulatory controls, collaborating with risk and audit teams. Resource usage optimization, capacity planning, and spending control (rightsizing, autoscaling, reservations/spot) are key aspects of this role, as are safe rollouts, progressive delivery, and policy-as-code guardrails.
You will resolve persistent platform issues surfaced by technical support teams, deliver performance enhancements through automation, and push for greater platform reliability in support of product development. You will deliver resilient and scalable applications, focusing on continuous delivery and operational insight. You are expected to collaborate with platform and software engineers, platform reliability engineers, Product Owners, and engineering leadership to uncover pain points and opportunities to accelerate the delivery of new value through software. You will investigate new platform solutions to improve the service delivery experience and address incidents and problems, with rotational accountability for on-call support.
Job Type
Full-time
Career Level
Senior
Education Level
Associate degree