CoreWeave • Posted 2 months ago
$165,000 - $242,000/Yr
Full-time • Mid Level
Hybrid • New York, NY
Professional, Scientific, and Technical Services

CoreWeave is seeking a highly skilled and motivated HPC Performance Engineer to join our HAVOCK Team, reporting to the Manager of Systems Engineering. In this role, you will play a crucial part in the design, development, and optimization of our bare-metal systems from POST through joining a Kubernetes cluster. The team's primary responsibilities include maintaining a custom Linux kernel, various OS images (Ubuntu-based), the virtualization stack (kubevirt/qemu/vfio), and the container/pod runtime stack (containerd/nydus/kubelet). You will collaborate closely with cross-functional teams, up-stack engineering teams, and stakeholders to ensure our low-level software stack is performant in the context of hardware updates, and you will provide data, metrics, dashboards, and analysis to substantiate performance assertions.

Responsibilities:
  • Develop and maintain tools for establishing systems performance baselines
  • Develop and maintain performance regression analysis testing automation
  • Design and maintain performance regression test pipelines for HPC workloads
  • Debug and tune fabric-level performance to ensure low-latency, high-throughput configurations
  • Develop telemetry for performance analysis across distributed clusters of servers
  • Triage and fix performance issues in Linux
  • Collect data and produce metrics and visualizations that communicate performance relative to benchmarks
  • Define Linux and OS requirements, specifications, and system architecture in relation to systems performance

Qualifications:
  • 5+ years of professional experience in Systems/HPC Performance Engineering, Benchmarking, and/or Validation
  • Strong experience with MPI workloads and distributed system performance analysis
  • Familiarity with RoCE, InfiniBand, GPUDirect/Data Direct I/O, NUMA, etc. in HPC workloads
  • Hands-on use of public HPC benchmarks (HPCC, HPL, OSU, MLPerf-HPC, STREAM, IO500)
  • Extensive, deep experience in Linux internals
  • Fluency with a programming language geared toward automation (Python preferred, but others possible)
  • Experience writing robust, testable code
  • Experience diagnosing and fixing systems performance issues
  • Experience implementing automated testing
  • Ability to effectively prioritize and communicate proposed features and fixes in a remote-employee environment
  • Strong passion for automation, with a commitment to automating processes comprehensively
  • Excellent documentation skills and attention to detail
  • Strong analytical and problem-solving abilities
  • Familiarity with QA/QE best practices
  • Familiarity with Golang
  • Opinions about software version control and team collaboration
  • Experience working in Cloud environments
  • Experience as a software engineer writing large-scale applications
  • Experience in open-source community software development
  • Experience with machine learning is a huge bonus

Benefits:
  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption