About The Position

We are the AI Infrastructure - Network Operations team at OCI. We support and operate the RDMA/RoCE network fabrics for OCI's largest AI and HPC customers. These fabrics are the foundation underneath OCI's AI, GPU, and HPC services, and support major tier-0 vendors in the generative AI industry. If you're running an AI workload at OCI, we're running the RDMA network underneath it.

A Principal Network Engineer on our team supports the design, deployment, and operations of a large-scale global cloud computing environment, Oracle Cloud Infrastructure (OCI). The role is primarily focused on the operation and support of RDMA/RoCE network fabrics and systems, combining deep network understanding with automation skills to operate a production environment. Because OCI is a cloud-based network with a global footprint, this support spans hundreds of thousands of network devices supporting millions of servers, connected over a mix of dedicated backbone infrastructure and the Internet.

Requirements

  • Extensive experience in network engineering and operations.
  • Strong understanding of RDMA/RoCE network technologies.
  • Experience with large-scale cloud computing environments.
  • Proficiency in automation tools and scripting for network management.

Nice To Haves

  • Experience with Oracle Cloud Infrastructure (OCI).
  • Familiarity with AI and HPC workloads.
  • Knowledge of network security best practices.

Responsibilities

  • Support the design, deployment, and operations of RDMA/RoCE network fabrics.
  • Operate a large-scale global Oracle cloud computing environment.
  • Apply deep network understanding and automation skills to manage production environments.
  • Ensure the reliability and performance of network systems supporting AI and HPC workloads.