Staff Engineer, High Performance Computing

Pfizer•New York City, NY

20h•Hybrid

About The Position

Pfizer is committed to the application of computational science in the areas of drug discovery and development and has recently initiated a large-scale migration of computational infrastructure to the cloud. This role provides technical vision and drives the execution of high-performance computing (HPC) solutions that support computational workloads across the organization. We are seeking an experienced individual to lead the technical architecture of the cloud HPC platform. Key responsibilities include establishing go-forward cloud HPC platform computing technologies, implementing robust engineering practices, championing infrastructure as code (IaC), and configuring core services that support HPC at scale in the cloud environment. You will work with HPC engineers and scientific computing specialists to develop robust, scalable, high-performance, cloud-native infrastructure that underpins the modernization of the scientific computing platform.

Requirements

B.S. in computer science, life science, data science or similar fields with 6+ years of experience in cloud infrastructure engineering.
A proven track record of developing and supporting robust HPC frameworks in a cloud environment.
Expert-level experience with AWS or GCP (at least one), including knowledge of core compute and storage services relevant to HPC.
Deep understanding of modern CI/CD practices, observability and monitoring of cloud-based HPC infrastructure.
Strong knowledge of distributed systems and production system reliability.
Familiarity with monitoring and observability frameworks (CloudWatch, Prometheus, Grafana, etc.).
Solid understanding of cloud networking, identity, security controls, and core services.

Nice To Haves

M.S. in computer science, life science, data science or similar fields.
10-15 years of experience in HPC/cloud engineering.
Expertise with distributed computing environments, especially EKS/GKE/Kubernetes.
Deep experience with HPC environments, job schedulers, and NVIDIA GPU compute.
Prior experience with HPC deployment utilities, including AWS ParallelCluster, AWS Parallel Computing Service (PCS), and Google Cloud Cluster Toolkit.
Familiarity with other aspects of managing HPC services in a cloud environment: cloud financial models, cost optimization, user support services, application delivery, Linux administration, job scheduling, resource optimization.
Demonstrated breadth of leadership experience and capabilities, including the ability to influence and collaborate with peers, develop and coach others, and oversee and guide the work of colleagues to achieve meaningful outcomes and create business impact.

Responsibilities

Lead the development and operationalization of cloud-based HPC infrastructure required for research, modeling, and large-scale data processing across multiple cloud environments.
Serve as a primary technical expert; evaluate, advocate for, and drive consensus among senior managers and engineers for the go-forward technology platforms and toolkits used for HPC service delivery.
Collaborate with stakeholders, users, and leaders to develop a long-term technical roadmap for cloud-based HPC services.
Lead deep-dive discussions with technical partners at major cloud providers, defining HPC-related requirements and deliverables for Statements of Work.
Drive a culture of shared ownership, transparency, and engineering excellence through mentoring, coaching, and example setting.
Perform troubleshooting, system analysis, and benchmarking to manage escalated, difficult-to-resolve issues and maintain a high-performance environment.
Design and own robust and dependable high-throughput, parallel, low-latency infrastructure for HPC and ML/AI workloads in multiple cloud environments (AWS/GCP).
Establish technical standards, best practices, architectural frameworks, and implementation guidelines for reproducible HPC platform and application deployments.
Recommend cutting-edge HPC technologies including specialized accelerators, novel storage solutions, managed services, and open-source toolkits that will be integrated into the platform.
Own OS image development, job scheduler configuration, and high-performance storage systems.
Ensure the high performance, reliability, scalability, cost efficiency, and security of the platform.
Drive adoption of infrastructure automation using IaC tools like Terraform and CloudFormation.
Establish, promote, and enforce internal standards (naming, tagging, documentation, version control, and change procedures) to ensure repeatable environment provisioning and scaling.
Establish infrastructure lifecycle management procedures, from provisioning to operations, support, updating, and teardown of production computing platforms.
Determine KPIs to guide monitoring, logging, and alerting strategies for the infrastructure.
Collaborate with stakeholders, users, and senior managers to develop meaningful user-facing dashboards and to drive resource management, cost efficiency, and workload optimization.
Design workflows, alerting systems, and utilities that improve observability and the user and administrator experience.

Benefits

401(k) plan with Pfizer Matching Contributions
Additional Pfizer Retirement Savings Contribution
Paid vacation, holiday, and personal days
Paid caregiver/parental and medical leave
Health benefits, including medical, prescription drug, dental, and vision coverage
Relocation assistance may be available based on business needs and/or eligibility.