Senior System Development Engineer – AI Technologies

Dell Technologies · Austin, TX
2d · Onsite

About The Position

Senior Systems Development Engineer

Our customers’ system requirements are usually highly complex. Bringing together hardware and software systems design, Systems Development Engineering operates at the very cutting edge of technology to meet them. We design and develop electronic and electro-mechanical or systems-orientated products, conduct feasibility studies on engineering proposals and prepare installation, operation and maintenance specifications and instructions. We’re proud to deliver programs and products to the highest quality standards, on time and within budget.

Join us to do the best work of your career and make a profound social impact as a Senior Systems Development Engineer on our Systems Development Engineering Team in Austin, Texas.

What you’ll achieve

As a Senior Systems Development Engineer, you will design, define and implement complex system requirements for customers and prepare studies and analyses of existing systems.

Take the first step towards your dream career

Every Dell Technologies team member brings something unique to the table. Here’s what we are looking for with this role:

Requirements

  • Bachelor’s or Master’s degree in Computer Engineering, Computer Science, Electrical Engineering, or related field
  • 5+ years of experience in system engineering, platform development, or hardware–software validation
  • Strong understanding of system architecture, CPU/GPU/accelerator internals, memory systems, and I/O subsystems

Responsibilities

  • System Platform Engineering: Lead bring‑up, configuration, and validation of system platforms supporting AI workloads (servers, GPU racks, accelerators, networking fabrics); work with BIOS/UEFI, BMC, firmware, drivers, and kernel subsystems to ensure system readiness for large‑scale AI deployments; perform hardware–software co-validation of CPUs, GPUs, DPUs, NICs, accelerators, and memory subsystems under AI‑heavy workloads; validate PCIe fabric behavior, NUMA topology, and data‑path efficiency for model training and inference.
  • System Debugging & Hardware–Software Interaction: Diagnose complex issues across BIOS, firmware, OS, driver stack, container runtime, orchestration layer, and AI frameworks; analyze system logs, kernel traces, hardware event telemetry, GPU health signals, and fabric diagnostics; conduct root‑cause analysis of performance bottlenecks, training failures, model divergence, and hardware stability issues; collaborate with silicon, firmware, OS, and AI software teams to resolve issues rapidly.
  • AI Cluster & Rack‑Level Operations: Deploy and manage AI clusters (GPU servers, accelerators, high‑speed networking such as InfiniBand and RoCE, and storage systems); validate cluster readiness for distributed training, including bandwidth, latency, topology checks, and gradient‑sync performance; work with orchestration systems (Kubernetes, Slurm, Ray, Docker, Singularity) to run and optimize AI pipelines; partner with data center teams for rack integration, power/thermal analysis, and capacity planning.
  • AI Benchmarking & Performance Analysis: Execute and analyze standard AI benchmarks (MLPerf Training, MLPerf Inference, SPEC AI Benchmarks); build custom benchmarks for transformer models, LLMs, computer vision, multimodal models, and recommendation systems; interpret results to provide optimization recommendations at the hardware, OS, driver, and framework levels; document findings and drive improvements across the platform and AI software ecosystem.

Benefits

  • Dell is committed to fair and equitable compensation practices.
  • The salary range for this position is $123k - $170k.
  • Your life. Your health. Supported by your benefits. You can explore the overall benefits experience that awaits you as a Dell Technologies team member right now at MyWellatDell.com.