About The Position

DriveNets is a leader in disaggregated, high-scale networking solutions for service providers and AI infrastructures. Founded in December 2015, DriveNets created a radical new way to build networks by adapting the architectural model of the cloud to telco-grade networking. This approach accelerates network deployment, improves the network’s economic model, and radically simplifies network operations. Its customers include Comcast, Orange, and KDDI, and over 80% of AT&T’s network traffic now runs through a disaggregated core powered by DriveNets software. The DriveNets Network Cloud-AI solution, based on the same technology, was introduced to the market in 2023; it provides the highest-performance Ethernet-based AI networking solution and is already deployed by hyperscalers, NeoClouds, and enterprises. Having raised over $587 million across three funding rounds, DriveNets continues to deploy the most innovative network infrastructure and is looking for the most talented people to be part of this journey.

As a Solution Engineer, you will play a pivotal role in designing, deploying, and optimizing DriveNets’ Network Cloud-AI infrastructure solutions. This individual contributor role requires a blend of technical expertise, leadership, and hands-on experience to implement cutting-edge solutions for our customers. You will collaborate with sales engineering teams, customers, and cross-functional teams - including Product Management, Solution Architects, Engineering, and Marketing - to define technical requirements, articulate solution value, and ensure successful deployment on-site.

Requirements

  • 5+ years of experience deploying and administering AI/HPC clusters or general-purpose compute systems.
  • 5+ years of hands-on Linux experience (e.g., RHEL, CentOS, Ubuntu) and production infrastructure support (e.g., networking, storage, monitoring, compute, installation, configuration, maintenance, upgrade, retirement).
  • Proficiency in Cloud, Virtualization, and Container technologies.
  • Deep understanding of operating systems, computer networks, and high-performance applications.
  • Hands-on experience with Bash, Python, and configuration management tools (e.g., Ansible).
  • Established record of leading technical initiatives and delivering results.
  • Ability to write extensive technical content (white papers, technical briefs, test reports, etc.) for external audiences with a balance of technical accuracy, strategy, and clear messaging.
  • Ability to travel domestically and internationally.

Nice To Haves

  • Familiarity with AI-relevant data center infrastructure and networking technologies such as InfiniBand, RoCEv2, lossless Ethernet technologies (PFC, ECN, etc.), accelerated computing, GPUs, and DPUs.
  • Familiarity with GPU resource scheduling managers (Slurm, Kubernetes, etc.).
  • Expertise with NCCL/RCCL, including setting up and tuning GPU environments and collecting benchmark results.
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana, ELK Stack) and Telemetry (gRPC, gNMI, OTLP, etc.).
  • Understanding of data center operations fundamentals in networking, cooling, and power.
  • Proven experience with one or more Tier-1 clouds (AWS, Azure, GCP, or OCI) or emerging NeoClouds, and with cloud-native architectures and software.
  • Understanding of AI workload requirements and how they interact with other parts of the system, such as networking, storage, and deep learning frameworks.
  • Knowledge of AI/ML frameworks (e.g., TensorFlow, PyTorch) and associated tooling is an advantage.

Responsibilities

  • Build robust AI/HPC infrastructure for new and existing customers.
  • Take a hands-on technical role in building and supporting NVIDIA- and AMD-based platforms.
  • Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, training stability, real-time monitoring, logging, and alerting.
  • Administer Linux systems, ranging from powerful GPU-enabled servers to general-purpose compute systems.
  • Design and plan rack layouts and network topologies to support customer requirements.
  • Design and evaluate automation scripts for network operations and for configuring server and switch fabrics.
  • Perform NCCL, RCCL, LLM, and RDMA performance benchmarks as part of the design and evaluation process of the deployment.
  • Benchmark the latest GPU compute and NIC solutions from all major compute vendors over the DriveNets networking fabric.
  • Install and configure DriveNets products, ensuring optimal performance and customer satisfaction.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Engage in and improve the whole lifecycle of services from inception and design through deployment, operation, and refinement.
  • Provide feedback to internal teams such as opening bugs, documenting workarounds, and suggesting improvements.
  • Introduce new products to DriveNets’ sales and support teams and to DriveNets’ customers.
  • Deliver technical training sessions and TOIs for support/sales engineers, partners, and customers.
  • Collaborate on product definition through customer requirement gathering and roadmap planning.