Hardware Systems Engineer, NPI

Meta, Menlo Park, CA
$144,000 - $204,000

About The Position

Meta is seeking a highly skilled and experienced Hardware Systems Engineer to join our Release to Production (RTP) team. The RTP team is responsible for the end-to-end hardware lifecycle of all Meta servers, including prototyping, pre-production hands-on system validation, hardware debugging, and stress testing. As a Hardware Systems Engineer, you will work closely with various teams, including HW/SW co-design teams, hardware designers, networking teams, system manufacturers, component vendors, capacity engineering, production engineering, production services, and data center operations teams, to enable new systems that will be deployed in our production data centers.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of work experience in one or more domains such as: ASIC development, compute, AI-ML hardware/software, storage, memory, network, server interconnect technologies, or similar
  • Knowledge of the architecture and components of at least one of the following products: server, PC, or laptop
  • Development or debug experience in one or more areas: hardware fault management, error reporting, error handling on hardware products
  • Experience with Python, C/C++ and/or similar languages, within a Linux environment, for server system management, automation, version control, CI/CD, or similar
  • Demonstrated problem-solving skills, with a track record of troubleshooting and resolving complex technical issues
  • Demonstrated communication and collaboration skills, with a track record of working effectively with cross-functional teams
  • Experience working in a matrix organization

Nice To Haves

  • 7+ years of experience in one or more of the following domains: Compute Systems, Storage Systems, Accelerated Compute Systems/HPC, Kernel/Firmware Development and/or Test, Post-Silicon Bring-up
  • Experience with x86 or ARM-based CPUs and their subsystems (e.g. memory, inter-chiplet communications, RAS/DFT, performance management, power management)
  • Working/functional knowledge of common bus protocols such as I2C, SPI, USB, LP/DDR, and/or PCIe
  • Hands-on experience troubleshooting problems at the system level, spanning multiple components as well as hardware/firmware/software boundaries. Hands-on experience managing and debugging Linux servers
  • Understanding of the hardware development process and how it pertains to test strategy. Experience authoring test plans for complex chipsets for functional, stress and performance testing
  • Familiarity with debugging tools for systems-on-chip (SoCs), e.g., JTAG, GDB, DSTREAM, Trace32
  • Experience integrating lab tools into automated workflows for large-scale deployments. Proficiency in continuous integration/continuous delivery tools
  • 2+ years of experience scripting automation in Python or equivalent

Responsibilities

  • Interface with external vendors and internal teams to understand system architecture and develop Hardware Fault Management for various server products
  • Drive new platform enablement, hardware validation, tooling specification and integration, customer workload testing, and experiment creation to detect and diagnose hardware/firmware/software health issues
  • Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues
  • Leverage understanding of RAS (reliability, availability, serviceability) to improve error reporting and error handling mechanisms for better operation quality and cost/efficiency
  • Develop visibility through data visualization and implement systemic solutions to hardware health issues
  • Troubleshoot, diagnose, and root cause system failures, isolating components/failure scenarios while working with internal & external stakeholders
  • Lead bring-up, validation, and deployment of cutting-edge hardware systems in lab and datacenter environments
  • Design and implement robust system-level test plans, including functional, stress, and performance tests
  • Enhance hardware reliability by creating data visualizations and implementing systemic solutions to address recurring health issues

Benefits

  • bonus
  • equity
  • benefits