Lead Engineer - Network Operations Centre

Core42 US Services LLC
$133,200 - $199,800

About The Position

We are seeking a highly skilled Lead Engineer – Network Operations to oversee the daily operations and support of the network infrastructure underpinning our global high-performance computing (HPC) environments. This role is responsible for ensuring high availability, security, and optimal performance of switches, firewalls, and network fabrics that support large-scale AI and ML workloads across geographically distributed data centers. The ideal candidate brings deep hands-on experience with enterprise-grade network technologies, low-latency HPC fabrics (e.g., InfiniBand), and automation of network operations.

Requirements

  • Bachelor’s degree in Network Engineering, Computer Science, or a related field; or equivalent hands-on experience.
  • Minimum of 8 years of experience in enterprise network operations or engineering roles, with at least 2 years in a lead or ownership capacity.
  • Extensive hands-on experience with data center networking equipment (e.g., Cisco, Arista, Juniper, Mellanox, or NVIDIA Networking).
  • Deep understanding of Layer 2/3 protocols, the TCP/IP stack, multicast, QoS, and VLAN/VXLAN/EVPN technologies.
  • Proficiency in configuring and managing firewalls (e.g., Palo Alto, Fortinet, Cisco ASA) and VPN solutions to ensure secure network operations.
  • Proven experience in supporting low-latency, high-throughput networks in HPC, AI/ML, or cloud-scale environments.
  • Hands-on experience with InfiniBand or RoCE technologies in HPC network environments.
  • Familiarity with Kubernetes networking (e.g., CNI plugins, network policies, service meshes) in cloud-native environments.
  • Exposure to CI/CD, Git, and modern DevNet practices for automating and optimizing network infrastructure.

Responsibilities

  • Lead the daily operational support of HPC network infrastructure, including Layer 2/3 switches, routers, firewalls, and RDMA-based fabrics (e.g., InfiniBand, RoCE), ensuring network performance and reliability.
  • Troubleshoot and resolve complex network issues affecting HPC workloads and services, minimizing downtime and maximizing throughput.
  • Configure, upgrade, and maintain enterprise-grade firewalls, VPNs, ACLs, and routing protocols (e.g., BGP, OSPF), ensuring network security and performance.
  • Provide network integration support for HPC platforms, including Slurm, Kubernetes, and bare-metal provisioning systems.
  • Design and manage IP address planning, VLAN configurations, network segmentation, and security zones in alignment with operational and compliance requirements.
  • Develop and maintain network automation scripts and infrastructure-as-code solutions (e.g., Ansible, Python, Terraform) to optimize processes and reduce human error.
  • Collaborate closely with compute, storage, security, and site reliability teams to design and implement scalable, resilient, and high-performance network solutions for AI workloads.
  • Document network architecture, configurations, runbooks, and change management procedures in accordance with ITIL/ISO standards.
  • Participate in on-call rotations, providing support for incident response, change management, and root cause analysis (RCA) processes.
  • Lead RCA for operational network issues, contribute to post-mortem documentation, and drive continuous improvement efforts.
  • Provide mentorship and technical guidance to junior engineers, helping to build skills and foster a collaborative environment.
  • Ensure strict adherence to security and operational policies, and assist with audits and documentation related to change and incident management processes.

Benefits

  • bonus
  • benefits