Staff Test Engineer, Server Compute Firmware - AI Data Center

Celestica International LP, Austin, TX

About The Position

We're looking for a skilled and experienced Staff Test Engineer, AI Data Center Infrastructure, to help ensure the reliability, performance, and scalability of our AI data center's networking, storage, and server infrastructure. In this role, you'll be a key technical contributor, working with cross-functional teams to develop and execute comprehensive test strategies for our critical hardware, firmware, and software components. Your deep expertise in storage and server systems will be essential to supporting demanding AI/ML workloads. The ideal candidate will be a hands-on technical leader, capable of mentoring junior engineers, driving test automation, and collaborating across engineering teams to deliver robust and high-performing solutions.

Requirements

  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field.
  • 10+ years of experience in hardware and/or software testing, with at least 5 years focused on enterprise-level storage and server systems.
  • 5+ years of experience in a lead or senior technical role, mentoring junior engineers or leading test initiatives.
  • Deep expertise and extensive experience in server architectures (x86, ARM, GPU servers), BIOS, CPU/memory subsystems, PCIe, power management, and Baseboard Management Controller (BMC) functionality.
  • Strong understanding of enterprise software security (e.g., secure boot, Root of Trust, Platform Firmware Resilience).
  • Proficiency in scripting languages (e.g., Python, Bash) for test automation and data analysis.
  • Experience with Linux operating systems (e.g., Ubuntu, CentOS, RHEL) and command-line tools.
  • Familiarity with networking concepts (Ethernet, TCP/IP, InfiniBand) and network testing methodologies.
  • Experience with test methodologies such as performance testing, reliability testing, stress testing, and fault injection.
  • Excellent problem-solving, analytical, and debugging skills.
  • Strong communication and interpersonal skills, with the ability to collaborate effectively across diverse teams.

Nice To Haves

  • Familiarity with OCP (Open Compute Project).
  • Experience with cloud environments (AWS, Azure, GCP) and virtualization technologies.
  • Knowledge of containerization technologies (Docker, Kubernetes).
  • Familiarity with AI/ML frameworks (e.g., TensorFlow, PyTorch) and their infrastructure requirements.
  • Experience with performance profiling tools (e.g., fio, Iometer, Perf, VTune).
  • Contributions to open-source projects related to storage, servers, or testing.
  • Certifications in relevant technologies (e.g., NetApp, Dell EMC, HPE, NVIDIA).

Responsibilities

  • Define and implement test strategies for all storage and server hardware, firmware, and software components within the AI data center environment.
  • Lead the definition and development of holistic test strategies, test plans and test cases for complex data center solutions, including functional, performance, reliability, stress, and endurance testing.
  • Mentor and provide technical guidance to junior test engineers, fostering a culture of technical excellence and continuous improvement.
  • Design and implement automated test frameworks and scripts using languages like Python, Go, or similar, to improve efficiency and coverage of testing.
  • Conduct in-depth performance analysis and bottleneck identification for server platforms (e.g., CPU, GPU, memory, PCIe, networking), security (e.g., secure boot, Root of Trust, Platform Firmware Resilience), and OpenBMC interfaces/features. This includes debugging issues related to BIOS and BMC functionality and their interaction with server hardware.
  • Develop and maintain robust testbeds and infrastructure for continuous integration and validation.
  • Utilize open-source and commercial test tools relevant to server, BIOS, OpenBMC and storage validation.
  • Collaborate closely with hardware design, software development, infrastructure, and AI/ML engineering teams to understand requirements and integrate testing throughout the product lifecycle.
  • Communicate test progress, results, and critical issues effectively to stakeholders, including executive leadership.
  • Develop specialized test methodologies to validate performance and reliability under heavy AI/ML workloads (e.g., large model training, inference at scale, data ingestion).
  • Understand and test the interactions

Benefits

  • All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.