Hewlett Packard Enterprise - posted about 17 hours ago
Full-time • Mid Level
Remote • San Jose, CA
5,001-10,000 employees

HPE is seeking a motivated and skilled Senior Software Engineer to join the Advanced Programming Team within the HPC & AI Advanced Development Organization. This position is remote within the United States and requires valid U.S. work authorization. In this role, you will work collaboratively to solve challenges in scaling high-fidelity, discrete-event simulations on HPE supercomputers, using distributed memory and resilient execution techniques such as checkpointing. You will also develop workflows for distributed, large-scale analysis of traces, logs, and telemetry data from simulations and HPC systems.

  • Distributed HPC/AI workflow development, experimentation, and testing for enabling interactive processing of large-scale telemetry datasets (terabytes to petabytes).
  • Building solutions by composing existing open-source solutions and using distributed and parallel programming approaches for scaling data and simulation size.
  • Actively participate in a collaborative, consensus-driven design process.
  • Work in an Agile development environment.
  • Create documentation, collaborate with users, and present progress in writing, in slides, and verbally.
  • 6-8 years of industry or comparable experience in software engineering.
  • Proficiency in one or more programming languages such as C, C++, or Python.
  • Exposure to high-performance computing (HPC) or scientific computing.
  • Experience designing, building, or operating distributed large-scale systems in production environments.
  • Experience with software engineering workflows, including version control, code reviews, automated testing, and CI/CD pipelines.
  • Proficient in conveying technical concepts clearly and effectively through documentation, presentations, and design discussions.
  • Strong analytical and problem-solving skills.
  • Experience collaborating with scientists or engineers on data science, data analytics, simulations, or modeling.
  • Experience with distributed-memory parallel programming on supercomputers or large-scale clusters.
  • Background in digital twin software development, including integration with visualization tools and AI/ML workflows.
  • Experience working with containerization and orchestration technologies such as Docker, Podman, Apptainer, Slurm, and Kubernetes.
  • Experience developing or supporting workflows for HPC system design and operation.
  • Experience developing AI surrogates, especially in the context of detecting HPC system errors in real time.
  • Experience incorporating and fine-tuning LLMs to provide a chat interface for any analysis or development environment.
  • Knowledge of parallel and discrete event simulation, especially with SST (https://sst-simulator.org/).
  • Familiarity with checkpointing techniques (efficiency, size optimization, recovery, persistence).
  • Familiarity with performance debugging and optimization at scale.
  • Familiarity with Pandas, NumPy, Dask, Spark, or other data science technologies.
  • Familiarity with Developer Operations, especially AIOps.
  • Health & Wellbeing: We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial, and emotional wellbeing.
  • Personal & Professional Development: We also invest in your career because the better you are, the better we all are. We have specific programs catered to helping you reach any career goals you have, whether you want to become a knowledge expert in your field or apply your skills to another division.
  • Unconditional Inclusion: We are unconditionally inclusive in the way we work and celebrate individual uniqueness. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good.