System Engineer

Supermicro, San Jose, CA
$119,000 - $130,000

About The Position

As a System Engineer, you’ll be the go-to person for rolling out and maintaining business-critical applications and services for Supermicro. You are also responsible for resolving escalated service issues, coaching other engineers toward resolutions, and engineering and implementing complex projects. You will work independently, show the leadership to drive technical development, and have excellent communication skills.

Requirements

  • BS / MS in Electrical Engineering, Computer Engineering or Computer Science
  • 3+ years of work-related experience in Deep Learning and Machine Learning
  • 3+ years of Linux/networking debugging/testing or relevant experience preferred
  • Experience with leading AI/ML frameworks such as PyTorch, TensorFlow, ONNX, etc.
  • Experience with DevOps or in cloud environments, including but not limited to Docker/Containers and Kubernetes
  • Hands-on experience with workload managers and schedulers (Slurm) at rack/cluster scale
  • Familiarity with MLPerf Training/Inference benchmarks, LLMs, HPL-AI, or RCCL/NCCL
  • Familiarity with OpenStack, OpenShift, Azure, or AWS
  • Programming experience with Windows and Linux shell scripting
  • Strong sense of teamwork and strong communication skills

Nice To Haves

  • Familiarity with Intel/AMD/NVIDIA development toolkits such as CUDA, oneAPI, and ROCm is a plus
  • Experience with server/network hardware debugging and troubleshooting is a plus

Responsibilities

  • Perform cluster/rack-level testing and software deployment for local/onsite customers
  • Responsible for Cloud, Storage, and AI/Deep Learning benchmarks and testing
  • Responsible for proof-of-concept (PoC) setup and network troubleshooting
  • Test AI applications and workloads such as MLPerf, LLM, and RAG using ML/DL frameworks
  • Conduct functionality, compatibility, performance, stress, and reliability testing
  • Report hardware and software quality issues and work with other teams to solve the issues
  • Document and analyze test data and test logs, and write test reports
  • Contribute to the development of test utilities and test script automation
  • Support internal and external quality issues and drive issue resolution