HPC on AWS Lead / Specialist / SME - REMOTE

Simple Solutions, Jacksonville, FL
Posted 6 days ago, Remote

About The Position

The AWS HPC Lead and SME is responsible for designing, implementing, and optimizing high-performance computing solutions on the AWS Cloud platform. This role combines deep technical expertise in distributed computing, data-intensive workflows, and AWS HPC services with the ability to lead architecture design sessions, define best practices, and ensure scalability, performance, and cost efficiency across enterprise and research workloads.

Requirements

  • 8-10+ years of experience in high-performance computing, distributed systems, or cloud architecture.
  • US citizenship and an active Secret Service Clearance (SCI) are required.
  • Proven expertise in AWS HPC services (EC2 HPC, ParallelCluster, Batch, FSx for Lustre, EFA).
  • Strong knowledge of Linux systems administration, networking (InfiniBand, EFA, MPI), and job schedulers (Slurm, Torque, PBS Pro).
  • Hands-on experience with automation and IaC (Terraform, Ansible, CloudFormation).
  • Scripting and development proficiency (Python, Bash, or similar).
  • Experience with monitoring tools (CloudWatch, Grafana, Prometheus) and cost-optimization strategies.
  • AWS Certified Solutions Architect – Professional or AWS Certified Advanced Networking preferred.
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related technical field.

Nice To Haves

  • Experience with GPU workloads, containerized HPC (ECS/EKS with ParallelCluster), or hybrid/on-prem to cloud HPC migrations.
  • Strong communication and presentation skills for executive and technical audiences.
  • Demonstrated thought leadership in HPC strategy, performance benchmarking, and AWS innovation.

Responsibilities

  • Lead the Design & Build: Develop scalable, high-performance architectures leveraging AWS HPC services such as AWS ParallelCluster, FSx for Lustre, EFA (Elastic Fabric Adapter), AWS Batch, and EC2 HPC instances.
  • Solution Implementation: Deploy, automate, and optimize HPC clusters and data pipelines for compute- and memory-intensive workloads, including modeling, simulation, genomics, CFD, AI/ML training, and financial risk analysis.
  • Performance Optimization: Benchmark, tune, and monitor system performance for compute, storage, and networking components to achieve optimal throughput and cost efficiency.
  • Infrastructure as Code (IaC): Implement reproducible environments using Terraform, AWS CDK, or CloudFormation to streamline provisioning, CI/CD, and configuration management.
  • Data and Storage Management: Design high-throughput parallel storage solutions using S3, FSx for Lustre, EBS, and EFS; integrate with hybrid and on-prem HPC environments.
  • Security and Compliance: Apply AWS Well-Architected Framework and HPC security best practices to ensure compliance with enterprise, academic, or government standards.
  • Collaboration and Leadership: Partner with application scientists, DevOps teams, and business stakeholders to translate workload requirements into optimized HPC architectures. Provide mentoring and technical leadership across multidisciplinary teams.
  • Documentation and Knowledge Sharing: Develop architecture diagrams, reference implementations, and technical playbooks to support ongoing HPC adoption and operations.