Sr. System Engineer

Super Micro Computer, Inc., San Jose, CA

About The Position

As a global leader in server technologies, Supermicro is growing rapidly in key markets such as cloud computing, big data, HPC, AI, and storage. To meet market demand, Supermicro is developing end-to-end enterprise IT solutions that integrate compute, storage, and networking into full-rack or multi-rack systems. The Senior System Engineer plays an important role in designing, implementing, testing, and deploying rack system solutions for data center and enterprise customers.

Requirements

  • BS/MS in Electrical Engineering, Computer Engineering or a related field, MS preferred
  • 8+ years of work-related experience in server/network/storage hardware configuration, testing, debugging and troubleshooting
  • 8+ years of work-related experience in DevOps or in cloud environments, including but not limited to Docker/Containers and Kubernetes
  • Experience with leading AI/ML frameworks such as PyTorch and TensorFlow
  • Familiar with the TCP/IP protocol stack, UDP, IPv4/IPv6, DNS, DHCP, and other application protocols
  • Familiar with HPC, AI, or cloud benchmark tests and networking architecture
  • Excellent programming skills in Python and shell scripting
  • Strong communication skills and a strong sense of teamwork

Nice To Haves

  • Familiarity with MLPerf Training/Inference benchmarks, LLMs, HPL-AI, or RCCL/NCCL is a plus
  • CCNA, OpenStack, OpenShift, Azure, or AWS experience is a plus

Responsibilities

  • Deploy rack/cluster infrastructure and execute comprehensive system-level testing on the latest GPUs, CPUs, networking, and storage, covering functionality, compatibility, performance, stress, and reliability, using proprietary in-house tools
  • Conduct proof-of-concept design and testing. Build expertise in HPC/AI applications and benchmarks, and deliver optimized benchmark results by fine-tuning system settings and optimizing OS/network configurations. Apply strong problem-solving skills to build robust processes and procedures for HPC/AI solutions
  • Lead day-to-day operational support for cluster, storage, HPC, and cloud infrastructure. Identify and document hardware and software quality issues. Collaborate with product management and other engineering teams to integrate enhancements into future products
  • Write technical documents covering test procedures, test reports, and troubleshooting procedures for server, network, and cluster software and hardware to facilitate knowledge sharing
  • Deliver on-site deployment services to ensure customer acceptance verification and satisfaction
  • Write automation tools for cluster deployment and test environments

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Computer and Electronic Product Manufacturing
  • Number of Employees: 5,001-10,000 employees
