HPC Engineer

Anduril Industries, Costa Mesa, CA

About The Position

Anduril is seeking a High Performance Computing (HPC) System Engineer to directly support our most sensitive programs. You will be part of the team building and maintaining large-scale HPC infrastructure. You will have the opportunity to work with and learn from some of the world’s best engineers and cybersecurity professionals as you help implement cutting-edge systems. You will directly support systems deployed across the globe for national security missions.

Requirements

  • 7+ years of experience designing, developing, and implementing large-scale enterprise compute systems and solutions
  • Strong knowledge of and experience with High Performance Computing concepts, including cluster architecture, file systems, and high-speed InfiniBand/Ethernet interconnects
  • Proven expertise in one or more of the following: Red Hat Enterprise Linux, Ubuntu, HPC, GPU, Azure or AWS cloud services
  • Strong understanding and experience with systems automation tools (Ansible, Salt, Puppet)
  • Experience with HPC technologies such as parallel/distributed file systems (e.g., Lustre, GPFS, Pure, VAST)
  • Working knowledge of HPC batch scheduling software (e.g., PBS Pro, Slurm)
  • Experience building HPC clusters on AWS or Azure
  • Ability to lift 50 lbs
  • Eligible to obtain and maintain a US Top Secret clearance

Responsibilities

  • Work in a fast-paced, customer-focused environment supporting high-profile operational and research requirements.
  • Architect and deploy advanced GPU infrastructure, leading the design, deployment, and lifecycle management of cutting-edge NVIDIA hardware including H100, H200, and B200/B300 systems.
  • Rack, stack, cable, and configure physical servers and multi-node GPU systems from end to end.
  • Configure HPC and AI environments, including job schedulers (e.g., Slurm), multi-user login environments, and cluster management software (e.g., Warewulf, NVIDIA Base Command, RunAI).
  • Implement and fine-tune high-speed interconnects (e.g., NVLink, NVSwitch, InfiniBand/NDR) crucial for large-scale distributed training.
  • Configure and manage large-scale, high-performance storage platforms in the multiple petabytes range, optimized for AI/ML data access patterns.
  • Install, configure, and maintain the application stack on HPC clusters, including traditional simulation software (STAR-CCM+, Ansys, MATLAB) and the core AI/ML software stack (NVIDIA drivers, CUDA, PyTorch, TensorFlow).
  • Implement and manage GPU virtualization and sharing technologies, such as Multi-Instance GPU (MIG), to maximize resource utilization across diverse workloads.
  • Troubleshoot complex, system-wide issues related to application performance, user access, compute nodes, storage, and job queueing services.
  • Utilize NVIDIA Data Center GPU Manager (DCGM) and additional tools to proactively monitor GPU health and performance, diagnosing and resolving training bottlenecks in collaboration with ML engineers.
  • Ensure the security and integrity of the server and cluster infrastructure through regular audits, patching, and proactive security measures.
  • Collaborate closely with engineering and AI/ML research stakeholders to gather requirements and architect robust, scalable solutions.
  • Manage the hardware lifecycle, from quoting and procuring hardware from vendors to creating and executing deployment schedules.
  • Provide technical guidance, mentoring, and architectural leadership to other team members.

Benefits

  • Healthcare Benefits
      • US Roles: Comprehensive medical, dental, and vision plans at little to no cost to you.
      • UK & AUS Roles: We cover the full cost of medical insurance premiums for you and your dependents.
      • IE Roles: We offer an annual contribution toward your private health insurance for you and your dependents.
  • Additional Benefits
      • Income Protection: Anduril covers life and disability insurance for all employees.
      • Generous time off: Highly competitive PTO plans with a holiday hiatus in December. Caregiver & Wellness Leave is available to care for family members, bond with a new baby, or address your own medical needs.
      • Family Planning & Parenting Support: Coverage for fertility treatments (e.g., IVF, preservation), adoption, and gestational carriers, along with resources to support you and your partner from planning to parenting.
      • Mental Health Resources: Access free mental health resources 24/7, including therapy and life coaching. Additional work-life services, such as legal and financial support, are also available.
      • Professional Development: Annual reimbursement for professional development.
      • Commuter Benefits: Company-funded commuter benefits based on your region.
      • Relocation Assistance: Available depending on role eligibility.
  • Retirement Savings Plan
      • US Roles: Traditional 401(k), Roth, and after-tax (mega backdoor Roth) options.
      • UK & IE Roles: Pension plan with employer match.
      • AUS Roles: Superannuation plan.