About The Position

Are you passionate about cutting-edge technology and its implementation on a global scale? This role within the Edge and Network Services (ENS) Foundation team offers an opportunity to tackle the challenges of introducing innovative compute and networking technologies across Meta's global data centers. You will collaborate with cross-functional teams such as Production AI engineers, Network and Hardware Engineering, Data Center Connectivity, Facility Engineering/Operations, and SiteOps to carry out ENS's repair and operational support for the largest AI clusters. This collaboration ensures that new network technologies can be deployed and managed at scale. In this role, your focus will be on network hardware within integrated AI system rack infrastructure and related IP systems. You will ensure that ENS operations processes and tooling are well-defined and executable for these new technologies.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 10+ years of work experience with designing and deploying large-scale data center network infrastructure
  • Experience with data center design, structured cabling, and fiber optic network infrastructure
  • Demonstrated knowledge of NICs, optical transceivers, AOC, and DAC for high-speed interconnects
  • Demonstrated knowledge of TCP, IPv4/6, Routing Protocols, and related network services (DHCP, DNS)
  • Experience with implementing tooling and automation for network configuration and monitoring
  • Track record of solving complex problems, executing tactically, and delivering on infrastructure projects
  • Ability to work independently, stay organized, multitask, prioritize, and communicate effectively

Nice To Haves

  • Demonstrated experience working with scaled AI network solutions for training and inference use cases
  • Experience operating HPC/AI systems across global locations

Responsibilities

  • Work cross-functionally to maintain AI and DC network health while leading long-term initiatives that drive better repair outcomes and greater efficiency
  • Contribute to organizational level strategy and establish team roadmaps and goals that align with current business priorities and organizational strategy
  • Take accountability for driving improvements in technical references, NPI process, and deployment/operations documentation standards in support of continuous improvement initiatives
  • Facilitate clear communication of technical requirements, risks, and escalations to leadership and cross-functional partners
  • Integrate new networking technologies into ENS operations processes to efficiently scale Meta’s AI, Compute, and Network capabilities
  • Develop new operational support models for deploying and operating new data center infrastructure
  • Influence the design of data centers, networks, servers, and applications to ensure seamless integration
  • Publish technical reference, process, and training documentation for global network deployment and operations teams
  • Build and nurture business relationships with key stakeholders, partners, and vendors