Lead Engineer - Network Operations Centre

Core42 US Services LLC

$133,200 - $199,800

About The Position

We are seeking a highly skilled Lead Engineer – Network Operations to oversee the daily operations and support of the network infrastructure underpinning our global high-performance computing (HPC) environments. This role is responsible for ensuring high availability, security, and optimal performance of switches, firewalls, and network fabrics that support large-scale AI and ML workloads across geographically distributed data centers. The ideal candidate brings deep hands-on experience with enterprise-grade network technologies, low-latency HPC fabrics (e.g., InfiniBand), and automation of network operations.

Requirements

Bachelor’s degree in Network Engineering, Computer Science, or a related field; or equivalent hands-on experience.
Minimum of 8 years of experience in enterprise network operations or engineering roles, with at least 2 years in a lead or ownership capacity.
Extensive hands-on experience with data center networking equipment (e.g., Cisco, Arista, Juniper, Mellanox, or NVIDIA Networking).
Deep understanding of Layer 2/3 protocols, the TCP/IP stack, multicast, QoS, and VLAN/VXLAN/EVPN technologies.
Proficiency in configuring and managing firewalls (e.g., Palo Alto, Fortinet, Cisco ASA) and VPN solutions to ensure secure network operations.
Proven experience in supporting low-latency, high-throughput networks in HPC, AI/ML, or cloud-scale environments.
Hands-on experience with InfiniBand or RoCE technologies for HPC network environments.
Familiarity with Kubernetes networking (e.g., CNI plugins, network policies, service meshes) in cloud-native environments.
Exposure to CI/CD, Git, and modern DevNet practices for automating and optimizing network infrastructure.

Responsibilities

Lead the daily operational support of HPC network infrastructure, including Layer 2/3 switches, routers, firewalls, and RDMA-based fabrics (e.g., InfiniBand, RoCE), ensuring network performance and reliability.
Troubleshoot and resolve complex network issues affecting HPC workloads and services, minimizing downtime and maximizing throughput.
Configure, upgrade, and maintain enterprise-grade firewalls, VPNs, ACLs, and routing protocols (e.g., BGP, OSPF), ensuring network security and performance.
Provide network integration support for HPC platforms, including Slurm, Kubernetes, and bare-metal provisioning systems.
Design and manage IP addressing plans, VLAN configurations, network segmentation, and security zones in alignment with operational and compliance requirements.
Develop and maintain network automation scripts and infrastructure-as-code solutions (e.g., Ansible, Python, Terraform) to optimize processes and reduce human error.
Collaborate closely with compute, storage, security, and site reliability teams to design and implement scalable, resilient, and high-performance network solutions for AI workloads.
Document network architecture, configurations, runbooks, and change management procedures in accordance with ITIL/ISO standards.
Participate in on-call rotations, providing support for incident response, change management, and root cause analysis (RCA) processes.
Lead RCA for operational network issues, contribute to post-mortem documentation, and drive continuous improvement efforts.
Provide mentorship and technical guidance to junior engineers, helping them build skills and fostering a collaborative environment.
Ensure strict adherence to security and operational policies and assist with audits and documentation related to change and incident management processes.