ROLE SUMMARY

Pfizer is committed to the application of computational science in drug discovery and development. As part of this mission, we have recently embarked on a large-scale migration of our computational infrastructure to the cloud. This role draws on extensive experience in cloud engineering and DevOps and requires a hands-on approach to designing and delivering robust High Performance Computing (HPC) solutions that support computational workloads across the organization. We are seeking an experienced individual to drive architecture, infrastructure automation, migration, and operational excellence. You will collaborate with HPC engineers and scientific computing specialists to develop scalable, cloud-native infrastructure that underpins the modernization of the scientific computing platform.

ROLE RESPONSIBILITIES

Platform Architecture and Engineering

Design, implement, operate, and own robust and dependable infrastructure for HPC and ML/AI workloads in a cloud environment (AWS/GCP).
Lead containerization, deployment, and operation of user- and admin-facing HPC platforms (Slurm, Open OnDemand, Prometheus/Grafana, batch and distributed computing platforms) across cloud environments.
Translate stakeholder input into robust, high-performance, scalable, cost-effective computing platforms.
Partner with HPC specialists (engineers, administrators, and users) to capture institutional knowledge and manual processes in IaC workflows, transforming ad-hoc deployment practices into reproducible, version-controlled, automated procedures.

Automation and DevOps

Develop and maintain infrastructure automation using IaC tools such as Terraform and CloudFormation to ensure repeatable environment provisioning and scaling.
Create reusable Terraform modules.
Develop and enforce standards.
Drive the implementation and maintenance of all cloud infrastructure through IaC tools.
Operationalize containerized solutions using Docker and Kubernetes.
Own the full lifecycle of infrastructure management, from provisioning through operations, support, and updates to teardown of production computing platforms.
Perform troubleshooting, system analysis, and benchmarking to resolve issues and maintain a high-performance environment.

Monitoring and Reliability

Develop and maintain monitoring, logging, and alerting for the infrastructure (e.g., CloudWatch, Prometheus/Grafana).
Design new dashboards, workflows, and utilities to improve observability, cost monitoring, workload efficiency, and the user or administrator experience.
Document architecture, deployment processes, and operational procedures.
Partner closely with team members to support delivery of scientific computing services, including user support, Linux administration, operations, job scheduling, application management, and resource optimization.

Job Type: Full-time
Career Level: Manager
Number of Employees: 5,001-10,000