Network Architect

Cerebras Systems, Sunnyvale, CA

About The Position

Cerebras Systems is seeking a Network Architect to join its Cluster Engineering Team. This role will focus on shaping the front-end datacenter and interconnect fabric for current and future AI clusters. The architect will collaborate with hardware vendors, internal networking teams, and industry peers to design resilient, reliable, and high-throughput network architectures for large-scale AI workloads. This is a deeply technical, full-stack role requiring expertise in host-side networking, NIC behavior, cluster-level coordination, and the hardware components involved (network switches, NICs, Accelerator Compute Engine) along with their software layers. The position involves owning proof-of-concept work for new network designs, acting as the central technical voice for network reliability, and leading cross-functional technical projects.

Requirements

  • Ph.D. in Computer Science or Electrical Engineering with 5+ years of industry experience, OR Master's in CS/EE with 10+ years of industry experience.
  • 3+ years designing large-scale networks in datacenter and cloud environments.
  • Extensive hands-on experience debugging networking issues in large distributed systems with multiple platforms and protocols.
  • Demonstrated track record leading multi-phase, multi-team technical projects to completion.
  • Deep expertise across networking platforms: Juniper, Arista, Cisco, and open-box / disaggregated NOS architectures (SONiC).
  • Strong working knowledge of networking protocols and fabric technologies: VXLAN, EVPN, RoCEv2, BGP, DCQCN, PFC, ECN, and streaming telemetry.
  • Programming and automation: proficiency in Python (and/or Go) for building network automation, validation, and tooling.
  • Comfort with config generation frameworks (Ansible, Jinja2), gNMI, and CI/CD pipelines for network infrastructure.
  • SRE and observability: hands-on experience with streaming telemetry pipelines, time-series databases (Prometheus, InfluxDB), visualization (Grafana), log aggregation, and modern incident management.
  • Ability to define SLIs/SLOs and instrument the network for proactive reliability.
  • Familiarity with network visibility, management, and packet-capture/analysis tools.

Nice To Haves

  • Prior experience at hyperscalers or cloud service providers.
  • Experience with AI/ML or HPC cluster networking, including lossless Ethernet design, rail-optimized topologies, and collective-communication traffic patterns.
  • Track record of contributions to open-source networking projects, standards bodies, or industry conferences.

Responsibilities

  • Design and architect front-end network fabrics for AI/ML and HPC clusters, optimizing for high resource utilization, low latency, and high-throughput communication.
  • Build proof-of-concept implementations of new network designs and features, and drive them from prototype through production rollout.
  • Identify and resolve performance and efficiency bottlenecks across the host-NIC-fabric path.
  • Automate the deployment, configuration, and validation of network infrastructure using Python, including topology provisioning, fabric bring-up, config generation, and regression testing.
  • Stand up and operate SRE-grade telemetry and observability for the cluster network: streaming telemetry (gNMI, OpenConfig, sFlow/IPFIX), metrics pipelines, alerting, and incident workflows.
  • Define the SLIs/SLOs that govern network reliability and drive blameless post-incident analysis.
  • Lead network debugging in large distributed-systems environments spanning multiple platforms and protocols, including deep dives into RoCEv2, PFC/DCQCN, ECMP hashing, congestion behavior, and packet-level forensics.
  • Lead cross-functional, multi-phase technical projects spanning hardware, firmware, host networking, and cluster software.
  • Collaborate with vendors and industry partners to shape network hardware and feature roadmaps.
  • Represent the company in industry forums, standards bodies, and technical communities.
  • Serve as the central point of contact for network reliability issues across the cluster.

Benefits

  • Direct impact on training and inference performance for some of the largest AI systems in the world.
  • Autonomy to shape architecture decisions.
  • Resources to prototype real hardware.
  • Peer group that takes engineering rigor and operational reliability seriously.
  • Build a breakthrough AI platform beyond the constraints of the GPU.
  • Publish and open-source cutting-edge AI research.
  • Work on one of the fastest AI supercomputers in the world.
  • Job stability with startup vitality.
  • Simple, non-corporate work culture that respects individual beliefs.